Bio x AI Hackathon

Important Datasets and Code Repositories

Resource Guide

AI Models for Science & Therapeutics

  • TxGemma:

    • Collection of open models for therapeutics development.

    • Built on Gemma 2, trained with 7 million examples.

    • Available in three sizes (2B, 9B, 27B) with specialized versions.

    • 27B model reportedly matches or outperforms specialized single-task models on 50 of 66 tasks.

    • Designed for further fine-tuning with proprietary data.

  • Evo2: Foundational Model for Genome Modeling:

    • Trained on 9.3 trillion nucleotides from 128,000 genomes.

    • Reported 90% accuracy in detecting disease-causing mutations.

    • Can process sequences up to 1 million nucleotides.

    • Applications: genetic analysis, disease mutation detection, gene therapy design.

    • Open-source with publicly available training data, code, and weights.

AI Agents & Frameworks for Scientific Discovery

  • AgentRxiv: Collaborative Autonomous Research:

    • Centralized platform for autonomous research agents.

    • Enables knowledge sharing through similarity-based search.

    • Reported 78.2% accuracy on the MATH-500 benchmark (vs 73.8% without access to the platform).

    • Demonstrated generalization across different benchmarks and models.

  • AI Scientist: Automated Scientific Discovery:

    • Framework for hypothesis generation, experiment design, and paper writing.

    • Produces research papers with minimal human intervention.

    • Reported cost of roughly $15 per generated paper.

    • Applications: diffusion models, language modeling, learning dynamics.

  • Curie: Rigorous Scientific Experimentation:

    • Features Architect Agent for planning and Technician Agents for execution.

    • Reported 3.4× improvement in answering experimental questions.

    • Enforces experimental discipline while maintaining creativity.

  • The Virtual Lab: Nanobody Design:

    • Multi-agent collaboration with minimal human input (reported 1.3% of total words).

    • Designed 92 nanobodies with >90% expressing as soluble proteins.

    • Combines agents with different expertise to solve complex challenges.

  • Aviary: Training Language Agents for Scientific Tasks:

    • Open-source agents aiming to match frontier LLMs at lower cost for specific tasks.

    • Handles molecular cloning, literature research, protein engineering.

    • Uses stochastic computation graph framework.

  • Google AI Co-scientist:

    • Multi-agent AI system built with Gemini 2.0.

    • Generates and evaluates research hypotheses through iterative reasoning.

    • Has shown promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.

    • Available to research organizations via a Trusted Tester Programme.

  • Popper: Automated Hypothesis Validation:

    • Sequential falsification approach using LLM agents.

    • Designs experiments, executes tests, and analyzes results.

    • Claimed to be 10× faster than human scientists at comparable accuracy.

    • Maintains strict Type-I error control.

  • YesNoError: Scientific Literature Auditing:

    • Multi-agent system for detecting errors in scientific papers.

    • Checks mathematics, methodology, references, and logical consistency.

    • Uses synthetic data pipeline to improve detection accuracy.

    • Mentions a token-based economy ($YNE) for requesting audits.

  • Talk2Biomodels: Conversational Biological Modeling:

    • Natural language interface for exploring biological models.

    • Supports SBML format and BioModels database.

    • Features time-course simulations, steady-state analysis, parameter scanning.

    • Uses retrieval-augmented generation to reduce hallucinations.

  • BioChatter: Biomedical LLM Platform:

    • Open-source framework for biomedical LLM applications.

    • Integrates knowledge, retrieval-augmented generation, model chaining.

    • Designed for privacy-preserving use with local open-source LLMs.

    • Connects to BioCypher knowledge graphs.

  • AMIE: Medical Reasoning System:

    • Two-agent architecture: Dialogue Agent and Mx Agent.

    • Grounds recommendations in clinical guidelines.

    • Performed well on RxQA medication reasoning benchmark.

    • Focuses on longitudinal disease management.

  • Elicit:

    • AI-powered research tool for extracting information from academic papers.

    • Summarizes key findings, identifies related research.

    • Helps users grasp core concepts of complex scientific literature.

    • Offers a free trial.

  • PaperQA:

    • Retrieval Augmented Generation (RAG) tool by Future House.

    • Answers questions based on scientific paper collections.

    • Helps researchers process large amounts of scientific literature.

  • BeeARD:

    • Aims for automated hypothesis generation and validation at scale.

    • Powered by multi-agent AI systems and knowledge graphs.

    • Has an associated project token (link provided).

  • KGARevion: Knowledge Graph-Based AI Agent:

    • Multi-step process for biomedical question answering using KGs.

    • Reported 5.2% improvement over baseline approaches.

    • Shows strong zero-shot generalization to underrepresented medical contexts.
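
The Popper entry above mentions sequential falsification with Type-I error control. One standard way to get that kind of control when tests are run one after another is to turn each p-value into an e-value and multiply them, rejecting the null once the product reaches 1/alpha. The sketch below assumes the calibrator e = kappa * p^(kappa - 1); the function names, the default kappa of 0.5, and the example p-values are illustrative assumptions and are not taken from the Popper code base.

```python
from typing import Iterable, Tuple


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value with e = kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must be in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(
    p_values: Iterable[float], alpha: float = 0.1, kappa: float = 0.5
) -> Tuple[bool, int, float]:
    """Multiply e-values test by test and stop once the product reaches 1/alpha.

    Returns (rejected, tests_used, accumulated_e_value).
    """
    threshold = 1.0 / alpha
    accumulated = 1.0
    used = 0
    for p in p_values:
        used += 1
        accumulated *= p_to_e(p, kappa)
        if accumulated >= threshold:
            # Enough evidence against the null; later tests are not needed.
            return True, used, accumulated
    return False, used, accumulated


if __name__ == "__main__":
    # Hypothetical p-values from three falsification experiments.
    print(sequential_falsification([0.25, 0.01, 0.04], alpha=0.1))
```

Because the running product is compared against a fixed threshold, the procedure can stop at any point without inflating the false-positive rate, provided each e-value is valid given the earlier tests.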

Knowledge Graphs: Platforms & Concepts

  • HALD: Human Aging and Longevity Knowledge Graph:

    • Contains 12,000+ entities, 115,000+ relations (as reported).

    • Extracted from 340,000 PubMed articles.

    • Provides structured exploration of aging and longevity biomarkers.

  • NebulaGraph: Rhinitis Knowledge Graph Example:

    • Demonstrates leveraging LLMs to extract knowledge from Chinese medical records for KG building.

    • Outlines a five-step methodology for knowledge extraction.

    • ChatGPT-4 achieved 82.75% F1 score in knowledge extraction in this case study.

  • BioCypher: Knowledge Graph Framework:

    • Framework for building KGs; integrates with BioChatter and biomedical datasets.

    • Features flexible ontology structures and adaptable output formats.

    • Supports multiple formats (RDF, SQL, NetworkX).

    • Customizable via YAML configuration.

  • Nanopublication Network:

    • System for publishing and sharing self-contained units of scientific information.

    • Each "nanopublication" contains assertions, provenance, and publication information.

    • Aims to improve transparency and reproducibility of scientific research.

  • Graphiti:

    • Builds temporally-aware knowledge graphs for AI agents.

    • Models relationships and context that change over time.

  • Memary:

    • Aims to give AI Agents human-like memory capabilities.

    • Tracks entity knowledge, preferences, and chat history in an automatically updating knowledge graph.

  • Cognee:

    • Python library combining knowledge graphs and RAG.

    • Builds evolving semantic memory for AI agents/apps using dynamic KGs.

  • Wikidata:

    • Free and open knowledge base.

    • Queryable via SPARQL endpoint.

  • OpenAIRE Graph:

    • A large-scale research information knowledge graph.

    • Connects publications, datasets, software, authors, institutions.

  • Open Research Knowledge Graph (ORKG):

    • Platform for structured representation of scientific knowledge.

    • Aims to make research contributions machine-actionable.

  • Connected Papers:

    • Visual tool for exploring research paper connections and discovering related literature.

    • Creates graph visualizations based on citations and similarity.

  • Research Graph:

    • Focuses on building a global research collaboration and citation network.

    • Connects researchers, institutions, publications, grants, and datasets.

  • UniProt:

    • Comprehensive, high-quality resource for protein sequence and functional information.

    • Widely used database in bioinformatics and life sciences.

  • Self-Organizing Graphs Reasoning:

    • Research suggesting graph reasoning systems evolve towards a critical state.

    • Observed a stable Critical Discovery Parameter (-0.03).

    • Found 12% of connections form between semantically distant concepts.

    • Demonstrates scale-free and small-world properties in emergent graphs.
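
The Wikidata entry above notes that the knowledge base is queryable via a SPARQL endpoint. As a minimal sketch, the code below builds a request URL for the public query service and flattens the standard SPARQL JSON result format into plain dictionaries. The example query, the helper names, and the choice of item Q7187 (believed to be "gene") are assumptions for illustration; verify identifiers in the query service before relying on them.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Example query: five items that are an instance of (P31) gene (Q7187, assumed).
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
""".strip()


def build_request_url(query: str, endpoint: str = WIKIDATA_SPARQL_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def rows_from_sparql_json(payload: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON result document into one dict per result row."""
    data = json.loads(payload)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


if __name__ == "__main__":
    print(build_request_url(EXAMPLE_QUERY))
```

When sending the request, the query service expects a descriptive User-Agent header, so set one explicitly in whatever HTTP client is used.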

Knowledge Graphs: Tools & Databases

  • KGLab:

    • Python library for knowledge graph development and machine learning.

    • Provides tools for graph construction, analysis, and ML integration.

  • ReLiK:

    • Knowledge graph construction and linking tool (entity and relation linking).

    • Developed by Sapienza NLP group.

  • GLiNER:

    • Named Entity Recognition (NER) tool suitable for knowledge graph construction.

    • Designed for general-purpose NER with flexible entity type definitions.

  • KuzuDB:

    • High-performance, embeddable graph database system.

    • Designed for knowledge graph applications, offering Cypher query language support.

  • pyOxigraph (Oxigraph):

    • Python bindings for Oxigraph, an RDF graph database written in Rust.

    • Supports SPARQL 1.1 query and update standards.

  • QLever:

    • SPARQL query engine optimized for large knowledge graphs.

    • Developed by the University of Freiburg. Known for efficient completion-based querying.

  • SHACL via pySHACL:

    • Python library for validating RDF graphs against SHACL (Shapes Constraint Language) shapes.

    • Used for ensuring data quality and consistency in KGs.

  • Morph-KGC:

    • Tool for constructing knowledge graphs from diverse data sources (CSV, JSON, RDBs) using mapping rules (RML, YARRRML).
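
To make the idea of mapping-rule-based KG construction concrete, here is a small standard-library sketch that turns CSV rows into N-Triples lines from a declarative mapping. It is not RML or YARRRML and does not use the Morph-KGC API; the mapping dictionary layout, the example.org IRIs, and the sample rows are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterator

# Hypothetical mapping: a subject IRI template plus predicate -> column pairs.
MAPPING = {
    "subject": "https://example.org/protein/{accession}",
    "predicates": {
        "http://www.w3.org/2000/01/rdf-schema#label": "name",
        "https://example.org/vocab/organism": "organism",
    },
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_ntriples(csv_text: str, mapping: Dict) -> Iterator[str]:
    """Yield one N-Triples line per non-empty mapped cell."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject"].format(**row)
        for predicate, column in mapping["predicates"].items():
            value = row.get(column, "")
            if value:  # skip empty cells instead of emitting empty literals
                yield f'<{subject}> <{predicate}> "{escape_literal(value)}" .'


if __name__ == "__main__":
    sample = (
        "accession,name,organism\n"
        "P69905,Hemoglobin subunit alpha,Homo sapiens\n"
        "P01308,Insulin,\n"
    )
    for line in csv_to_ntriples(sample, MAPPING):
        print(line)
```

The output can be loaded into any RDF store, for example through the Oxigraph bindings listed above. A real mapping language additionally handles datatypes, IRI encoding, and joins across sources, which this sketch leaves out.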

Data Repositories & Platforms

  • Dutch Life Sciences Data Portal:

    • Comprehensive repository of open Dutch life sciences data hosted by DANS.

    • Provides access to datasets for research use.

  • Tahoe-100M: Single-Cell Perturbation Atlas:

    • Large dataset mapping drug-cell interactions (reportedly ~60,000) across ~50 cancer cell lines.

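    • Contains ~100 million single-cell profiles, as the name suggests; the ~300 million figure appears to refer to the wider Arc Virtual Cell Atlas that includes it.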

    • Combines natural cell states with deliberately perturbed cells.

    • Open-source resource for biological modeling.

  • Croissant Dataverse Metadata Extraction:

    • Export of public metadata records from Harvard Dataverse in Croissant format.

    • Aims to provide ML-ready dataset descriptions in JSON-LD.

  • BioXYZ Hackathon Kickoff Presentation - Links to Dataverse Datasets:

    • Presentation containing links to FAIR datasets within the Dataverse network.

    • Presented by Slava Tykhonov (DANS-KNAW).

  • Dataverse Installations Directory:

    • Crowdsourced spreadsheet listing global Dataverse installations.

    • Contains metadata about repositories (location, launch year, contact).

    • Resource for understanding the Dataverse network ecosystem.

  • General Index:

    • Index of n-grams and keywords extracted from over 100 million journal articles, hosted on the Internet Archive.

    • Intended to support text and data mining across the scholarly literature.

  • Papers with Code Datasets:

    • A large, community-curated collection of datasets used in machine learning research.

    • Links datasets to papers, code, and benchmarks.

Data Access APIs

  • bioRxiv API:

    • Provides programmatic access to pre-print articles from bioRxiv.

    • Enables data mining, trend analysis, and application development for biological sciences literature.

  • OpenAlex API:

    • Provides access to a comprehensive, open-source index of scholarly works (publications, authors, institutions, concepts, etc.).

    • Free alternative to proprietary citation databases.

  • arXiv API:

    • Provides programmatic access to the arXiv repository of electronic preprints (physics, math, CS, etc.).

    • Enables retrieval of metadata and full-text articles.

  • Crossref API:

    • Provides access to metadata for scholarly publications via DOIs (citations, abstracts, funding data, etc.).

    • Enables retrieval of publication information for research and analysis.

  • Medline API (PubMed API / E-utilities):

    • Provides access to the Medline database of biomedical literature citations and abstracts via NCBI E-utilities.

    • Allows retrieval of abstracts, author information, MeSH terms, etc.

  • PMC API (PubMed Central APIs / OAI):

    • Provides access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.

    • Enables retrieval of full-text articles and metadata, often via OAI-PMH or E-utilities.

  • DataCite API:

    • Provides access to DataCite's metadata registry for research data via DOIs.

    • Supports REST, GraphQL, and OAI-PMH interfaces.

  • ORCID API:

    • Provides access to ORCID registry data (researcher identifiers and profile information).

    • Offers public and member APIs for retrieving and (for members) updating ORCID records.
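
Several of the APIs above are plain HTTP endpoints that take query parameters, so a first integration step is often just building the right URL. The sketch below does that for arXiv, Crossref, OpenAlex, and the ORCID public API using only the standard library. The endpoint paths reflect the public documentation as best understood at the time of writing; the helper names and example arguments are assumptions, and each service's current docs should be checked for rate limits and etiquette (for example, contact headers).

```python
from urllib.parse import quote, urlencode


def arxiv_query_url(search: str, max_results: int = 5) -> str:
    """Search all arXiv metadata fields; the response is an Atom XML feed."""
    params = {"search_query": f"all:{search}", "start": 0, "max_results": max_results}
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def crossref_work_url(doi: str) -> str:
    """Look up the metadata record for a single DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def openalex_works_url(search: str, per_page: int = 5) -> str:
    """Full-text search over OpenAlex works."""
    return "https://api.openalex.org/works?" + urlencode(
        {"search": search, "per-page": per_page}
    )


def orcid_record_url(orcid_id: str) -> str:
    """Public record for one ORCID iD (request JSON via the Accept header)."""
    return f"https://pub.orcid.org/v3.0/{quote(orcid_id)}/record"


if __name__ == "__main__":
    print(arxiv_query_url("knowledge graph"))
    # DOI of the HALD paper linked from this page.
    print(crossref_work_url("10.1038/s41597-023-02781-0"))
    print(openalex_works_url("nanobody design"))
    print(orcid_record_url("0000-0002-1825-0097"))  # ORCID's documented sample iD
```

Keeping URL construction separate from the HTTP call makes it easy to log, cache, or test requests before any network traffic is sent.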

Data & Metadata Standards

  • ISCC Codes:

    • International Standard Content Code: a decentralized standard for content identification.

    • Creates digital fingerprints based on content similarity for various digital media.

  • Croissant Specifications:

    • A standard metadata format (JSON-LD based) for describing datasets, especially for machine learning.

    • Aims to simplify sharing and usage by providing a common structure.

    • Enables interoperability between tools and platforms.

  • MLCommons - Croissant and GeoCroissant:

    • Organization developing standards for dataset description (Croissant) and geospatial data (GeoCroissant).

    • GeoCroissant is noted as being under development.

  • ESIP Science on Schema.org:

    • Guidelines and extensions for using Schema.org vocabulary to describe scientific datasets and research artifacts.

    • Promotes FAIR principles for data discovery.

  • CODATA CDIF:

    • Cross-Domain Interoperability Framework: A framework being developed by CODATA for indexing scientific data across disciplines.

  • Bioschemas:

    • Community effort extending Schema.org for marking up life sciences data on the web.

    • Aims to improve the findability and interoperability of biological data.
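
Croissant, Science on Schema.org, and Bioschemas all build on describing a dataset as a schema.org `Dataset` in JSON-LD. The sketch below assembles a minimal record of that kind and checks it for a handful of fields. Every value is a placeholder, the list of "recommended" fields is an assumption made for this example, and the record is a plain schema.org description rather than a complete, validated Croissant file.

```python
import json
from typing import Dict, List

# Assumed minimum for this example; the actual guidelines list more properties.
RECOMMENDED_FIELDS = ["name", "description", "url", "license", "distribution"]


def make_dataset_record() -> Dict:
    """Return a minimal schema.org Dataset description with placeholder values."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example single-cell perturbation screen",
        "description": "Placeholder description of what was measured and how.",
        "url": "https://example.org/datasets/perturbation-screen",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "keywords": ["single-cell", "perturbation"],
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": "https://example.org/datasets/perturbation-screen/counts.csv",
                "encodingFormat": "text/csv",
            }
        ],
    }


def missing_fields(record: Dict, required: List[str] = RECOMMENDED_FIELDS) -> List[str]:
    """List required fields that are absent or empty in the record."""
    return [field for field in required if not record.get(field)]


if __name__ == "__main__":
    record = make_dataset_record()
    print(json.dumps(record, indent=2))
    print("missing:", missing_fields(record))
```

A check like `missing_fields` is a cheap first gate for agent-generated metadata before running a full validator for the target standard.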

Supporting Tools & Infrastructure

  • MinIO:

    • High-performance, distributed object storage system.

    • API compatible with Amazon S3, suitable for large datasets (e.g., ML, backups).

  • LanceDB:

    • Embeddable vector database designed for AI applications.

    • Optimized for similarity search, vector embeddings, and managing ML data.

  • BAML:

    • Boundary AI Markup Language: a domain-specific language for defining LLM prompts as typed functions with structured outputs, which can then be called from application code.
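
The similarity search that vector databases such as LanceDB are built around can be illustrated with a brute-force version in plain Python. This is a sketch of the underlying technique only, not of the LanceDB API; the tiny in-memory index and its two-dimensional "embeddings" are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical document embeddings.
    index = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], index, k=2))
```

Scanning every vector is fine for a hackathon-sized corpus; dedicated vector stores typically add approximate indexes so that search stays fast as the collection grows.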

Benchmarking & Evaluation

  • BixBench:

    • Comprehensive benchmark for evaluating LLM-based agents in bioinformatics tasks.

    • Contains 53 real-world analytical scenarios with nearly 300 open-answer questions.

    • Tests agents' abilities in complex multi-step analyses.

    • Available as a Hugging Face dataset.

Research Highlights & Organizations

  • Sakana.ai: First AI-Generated Peer-Reviewed Publication:

    • Paper generated by AI Scientist framework passed peer review at an ICLR workshop without human modifications.

    • Paper topic: compositional regularization challenges.

    • Reported an average reviewer score of 6.33, above the workshop's acceptance threshold.

    • Emphasized full transparency with IRB approval.

  • Future House Research:

    • Organization focused on AI applications in scientific discovery.

    • Works on the automation of scientific processes.

    • Aims to accelerate scientific breakthroughs using AI.

  • Lab-in-the-loop: Therapeutic Antibody Design:

    • Research combining generative ML models with experimental feedback cycles.

    • Reported 3-100× improvement in binding affinity for lead antibody molecules.

    • Balances exploration and exploitation of the antibody sequence space.

(Note on the Medline API entry: the link points to PubMed's general API page; E-utilities are the primary access method.)
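
Since E-utilities are the primary access method, a typical flow is to call `esearch` for matching PubMed IDs and then `efetch` for the records. The sketch below builds both URLs and extracts the ID list from an `esearch` JSON response. The helper names, the search term, and the sample payload are illustrative assumptions; no request is actually sent.

```python
import json
from typing import List
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 5, db: str = "pubmed") -> str:
    """URL that searches a database and returns matching IDs as JSON."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode(params)


def efetch_url(pmids: List[str], db: str = "pubmed") -> str:
    """URL that fetches plain-text abstracts for the given IDs."""
    params = {"db": db, "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(params)


def pmids_from_esearch(payload: str) -> List[str]:
    """Pull the list of IDs out of an esearch JSON response body."""
    return json.loads(payload)["esearchresult"]["idlist"]


if __name__ == "__main__":
    print(esearch_url("liver fibrosis"))
    # Made-up response body with placeholder IDs, shaped like esearch JSON output.
    sample = '{"esearchresult": {"count": "2", "idlist": ["12345678", "23456789"]}}'
    print(efetch_url(pmids_from_esearch(sample)))
```

NCBI throttles anonymous clients to a few requests per second, so batch IDs into a single `efetch` call where possible and consider registering an API key for heavier use.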

https://developers.googleblog.com/en/introducing-txgemma-open-models-improving-therapeutics-development/
https://arcinstitute.org/news/blog/evo2
https://agentrxiv.github.io/
https://arxiv.org/abs/2408.06292
https://arxiv.org/abs/2502.16069
https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
https://arxiv.org/html/2412.21154v1
https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf
https://arxiv.org/abs/2502.09858
https://yesnoerror.com/
https://www.biorxiv.org/content/10.1101/2025.03.11.642548v1
https://arxiv.org/abs/2305.06488
https://research.google/blog/from-diagnosis-to-treatment-advancing-amie-for-longitudinal-disease-management/
https://elicit.com/
https://github.com/Future-House/paper-qa
https://beeard.ai/
https://dexscreener.com/base/0xa02567fc557C6a409464EC40480b9F5660a991B3
https://arxiv.org/abs/2410.04660
https://www.nature.com/articles/s41597-023-02781-0
https://www.nebula-graph.io/posts/Rhinitis%20Knowledge%20Graph
https://biocypher.org/latest/
https://nanopub.net/
https://github.com/getzep/graphiti
https://github.com/kingjulio8238/Memary
https://www.cognee.ai/
https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
https://query.wikidata.org/
https://graph.openaire.eu/
https://orkg.org/
https://www.connectedpapers.com/
https://researchgraph.org/
https://www.uniprot.org/
https://arxiv.org/abs/2503.18852
https://derwen.ai/docs/kgl/
https://github.com/SapienzaNLP/relik
https://github.com/urchade/GLiNER
https://kuzudb.com/
https://pypi.org/project/pyoxigraph/
https://github.com/oxigraph/oxigraph
https://github.com/ad-freiburg/qlever
https://github.com/RDFLib/pySHACL
https://morph-kgc.readthedocs.io/en/stable/
https://lifesciences.datastations.nl
https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DSGAVS
https://docs.google.com/presentation/d/104ZlLGRPhpEWN1KjDamatXY42r1ph7Z3AePZfJ6LKsc/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit?gid=0#gid=0
https://archive.org/details/GeneralIndex
https://paperswithcode.com/datasets
https://api.biorxiv.org/
https://docs.openalex.org/how-to-use-the-api/api-overview
http://arxiv.org/help/api
https://www.crossref.org/documentation/retrieve-metadata/rest-api/
https://pubmed.ncbi.nlm.nih.gov/api/
https://www.ncbi.nlm.nih.gov/pmc/tools/oai/
https://support.datacite.org/docs/api
https://info.orcid.org/what-is-orcid/services/public-api/
https://iscc.codes/
https://docs.mlcommons.org/croissant/docs/croissant-spec.html
https://mlcommons.org/working-groups/data/croissant/
https://github.com/ESIPFed/science-on-schema.org
https://cdif.codata.org/
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05258-4
https://min.io/
https://lancedb.github.io/lancedb/
https://docs.boundaryml.com/home
https://huggingface.co/datasets/futurehouse/BixBench
https://sakana.ai/ai-scientist-first-publication/
https://www.futurehouse.org/research
https://www.biorxiv.org/content/10.1101/2025.02.19.639050v1.full.pdf