Bio x AI Hackathon

Important Datasets and Code Repositories


Last updated 28 days ago

Resource Guide

AI Models for Science & Therapeutics

  • TxGemma:

    • Collection of open models for therapeutics development.

    • Built on Gemma 2, trained with 7 million examples.

    • Available in three sizes (2B, 9B, 27B) with specialized versions.

    • 27B model reported to match or outperform specialized single-task models on 50 of 66 tasks.

    • Designed for further fine-tuning with proprietary data.

  • Evo2: Foundational Model for Genome Modeling:

    • Trained on 9.3 trillion nucleotides from 128,000 genomes.

    • Reported over 90% accuracy in classifying BRCA1 variants as benign or potentially disease-causing.

    • Can process sequences up to 1 million nucleotides.

    • Applications: genetic analysis, disease mutation detection, gene therapy design.

    • Open-source with publicly available training data, code, and weights.

AI Agents & Frameworks for Scientific Discovery

  • AgentRxiv: Collaborative Autonomous Research:

    • Centralized platform for autonomous research agents.

    • Enables knowledge sharing through similarity-based search.

    • Reported 78.2% accuracy on the MATH-500 benchmark (vs 73.8% without the platform).

    • Demonstrated generalization across different benchmarks and models.

  • AI Scientist: Automated Scientific Discovery:

    • Framework for hypothesis generation, experiment design, and paper writing.

    • Produces research papers with minimal human intervention.

    • Reported to be cost-effective (about $15 per generated paper).

    • Applications: diffusion models, language modeling, learning dynamics.

  • Curie: Rigorous Scientific Experimentation:

    • Features an Architect Agent for planning and Technician Agents for execution.

    • Reported 3.4× improvement in answering experimental questions.

    • Enforces experimental discipline while maintaining creativity.

  • The Virtual Lab: Nanobody Design:

    • Multi-agent collaboration with minimal human input (reported 1.3% of total words).

    • Designed 92 nanobodies with >90% expressing as soluble proteins.

    • Combines agents with different expertise to solve complex challenges.

  • Aviary: Training Language Agents for Scientific Tasks:

    • Open-source agents aiming to match frontier LLMs at lower cost for specific tasks.

    • Handles molecular cloning, literature research, protein engineering.

    • Uses a stochastic computation graph framework.

  • Google AI Co-scientist:

    • Multi-agent AI system built with Gemini 2.0.

    • Generates and evaluates research hypotheses through iterative reasoning.

    • Has shown promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.

    • Available to research organizations via a Trusted Tester Programme.

  • Popper: Automated Hypothesis Validation:

    • Sequential falsification approach using LLM agents.

    • Designs experiments, executes tests, and analyzes results.

    • Reported to be 10× faster than human scientists at comparable accuracy.

    • Maintains strict Type-I error control.

  • YesNoError: Scientific Literature Auditing:

    • Multi-agent system for detecting errors in scientific papers.

    • Checks mathematics, methodology, references, and logical consistency.

    • Uses a synthetic data pipeline to improve detection accuracy.

    • Describes a token-based economy ($YNE) for requesting audits.

  • Talk2Biomodels: Conversational Biological Modeling:

    • Natural language interface for exploring biological models.

    • Supports SBML format and BioModels database.

    • Features time-course simulations, steady-state analysis, parameter scanning.

    • Uses retrieval-augmented generation to reduce hallucinations.

  • BioChatter: Biomedical LLM Platform:

    • Open-source framework for biomedical LLM applications.

    • Integrates knowledge, retrieval-augmented generation, model chaining.

    • Designed for privacy-preserving use with local open-source LLMs.

    • Connects to BioCypher knowledge graphs.

  • AMIE: Medical Reasoning System:

    • Two-agent architecture: Dialogue Agent and Mx Agent.

    • Grounds recommendations in clinical guidelines.

    • Performed well on the RxQA medication reasoning benchmark.

    • Focuses on longitudinal disease management.

  • Elicit:

    • AI-powered research tool for extracting information from academic papers.

    • Summarizes key findings, identifies related research.

    • Helps users grasp core concepts of complex scientific literature.

    • Offers a free trial.

  • PaperQA:

    • Retrieval-augmented generation (RAG) tool by FutureHouse.

    • Answers questions based on scientific paper collections.

    • Helps researchers process large amounts of scientific literature.

  • BeeARD:

    • Aims for automated hypothesis generation and validation at scale.

    • Powered by multi-agent AI systems and knowledge graphs.

    • Has an associated project token (see the DEX Screener link in the link list).

  • KGARevion: Knowledge Graph-Based AI Agent:

    • Multi-step process for biomedical question answering using KGs.

    • Reported 5.2% improvement over baseline approaches.

    • Shows strong zero-shot generalization to underrepresented medical contexts.
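
The sequential falsification idea listed under Popper can be illustrated with a small sketch. This is not Popper's code: it shows one standard way to combine a sequence of test results while keeping the Type-I error rate at or below a chosen level, by converting each p-value into an e-value and multiplying the e-values. The function names, the calibrator parameter `kappa`, and the default `alpha` are assumptions made for this example, and the guarantee only holds if each p-value is valid given the tests that came before it.

```python
def p_to_e(p, kappa=0.5):
    """Convert a p-value into an e-value with the calibrator kappa * p**(kappa - 1).

    kappa must lie strictly between 0 and 1 (assumed default: 0.5).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must be in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(p_values, alpha=0.1, kappa=0.5):
    """Multiply e-values test by test and stop once the evidence reaches 1/alpha.

    Hypothetical helper for illustration only. Stopping at the first crossing
    keeps the false-rejection probability at or below alpha.
    """
    evidence = 1.0
    tests_used = 0
    for p in p_values:
        tests_used += 1
        evidence *= p_to_e(p, kappa)
        if evidence >= 1.0 / alpha:
            return {"rejected": True, "tests_used": tests_used, "evidence": evidence}
    return {"rejected": False, "tests_used": tests_used, "evidence": evidence}


if __name__ == "__main__":
    # Two moderately strong falsification tests are enough to cross 1/alpha = 10.
    print(sequential_falsification([0.01, 0.04]))
    # Weak tests shrink the accumulated evidence instead of growing it.
    print(sequential_falsification([0.5, 0.5]))
```

A design note: because the evidence is a running product, an agent can decide after every experiment whether to continue, which is what makes the procedure usable inside an iterative agent loop.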

Knowledge Graphs: Platforms & Concepts

  • HALD: Human Aging and Longevity Knowledge Graph:

    • Contains 12,000+ entities, 115,000+ relations (as reported).

    • Extracted from 340,000 PubMed articles.

    • Provides structured exploration of aging and longevity biomarkers.

  • NebulaGraph: Rhinitis Knowledge Graph Example:

    • Demonstrates leveraging LLMs to extract knowledge from Chinese medical records for KG building.

    • Outlines a five-step methodology for knowledge extraction.

    • ChatGPT-4 achieved 82.75% F1 score in knowledge extraction in this case study.

  • BioCypher: Knowledge Graph Framework:

    • Framework for building KGs, integrates with BioChatter and biomedical datasets.

    • Features flexible ontology structures and adaptable output formats.

    • Supports multiple formats (RDF, SQL, NetworkX).

    • Customizable via YAML configuration.

  • Nanopublication Network:

    • System for publishing and sharing self-contained units of scientific information.

    • Each "nanopublication" contains assertions, provenance, and publication information.

    • Aims to improve transparency and reproducibility of scientific research.

  • Graphiti:

    • Builds temporally-aware knowledge graphs for AI agents.

    • Models relationships and context that change over time.

  • Memary:

    • Aims to give AI Agents human-like memory capabilities.

    • Tracks entity knowledge, preferences, and chat history in an automatically updating knowledge graph.

  • Cognee:

    • Python library combining knowledge graphs and RAG.

    • Builds evolving semantic memory for AI agents/apps using dynamic KGs.

  • Wikidata:

    • Free and open knowledge base.

    • Queryable via SPARQL endpoint.

  • OpenAIRE Graph:

    • A large-scale research information knowledge graph.

    • Connects publications, datasets, software, authors, institutions.

  • Open Research Knowledge Graph (ORKG):

    • Platform for structured representation of scientific knowledge.

    • Aims to make research contributions machine-actionable.

  • Connected Papers:

    • Visual tool for exploring research paper connections and discovering related literature.

    • Creates graph visualizations based on citations and similarity.

  • Research Graph:

    • Focuses on building a global research collaboration and citation network.

    • Connects researchers, institutions, publications, grants, and datasets.

  • UniProt:

    • Comprehensive, high-quality resource for protein sequence and functional information.

    • Widely used database in bioinformatics and life sciences.

  • Self-Organizing Graphs Reasoning:

    • Research suggesting graph reasoning systems evolve towards a critical state.

    • Observed a stable Critical Discovery Parameter (-0.03).

    • Found 12% of connections form between semantically distant concepts.

    • Demonstrates scale-free and small-world properties in emergent graphs.
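
For the Wikidata entry above, the following is a minimal sketch of querying the public SPARQL endpoint using only the Python standard library. The example query, the `User-Agent` string, and the helper names are assumptions made for illustration; the identifiers `P31` (instance of) and `Q7187` (gene) should be checked against Wikidata before relying on them.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query (assumption): five items that are instances of "gene".
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_request(query, user_agent="bio-x-ai-hackathon-example/0.1"):
    """Build a GET request for the endpoint, asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            "User-Agent": user_agent,  # a descriptive agent string is good practice
            "Accept": "application/sparql-results+json",
        },
    )


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        return parse_bindings(json.load(response))


if __name__ == "__main__":
    for row in run_query(EXAMPLE_QUERY):
        print(row["item"], row.get("itemLabel", ""))
```

Keeping `parse_bindings` separate from the network call makes it easy to test against a saved response, and the same parser works for any endpoint that returns the standard SPARQL JSON results format.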

Knowledge Graphs: Tools & Databases

  • KGLab:

    • Python library for knowledge graph development and machine learning.

    • Provides tools for graph construction, analysis, and ML integration.

  • ReLiK:

    • Knowledge graph construction and linking tool (entity and relation linking).

    • Developed by the Sapienza NLP group.

  • GLiNER:

    • Named Entity Recognition (NER) tool suitable for knowledge graph construction.

    • Designed for general-purpose NER with flexible entity type definitions.

  • KuzuDB:

    • High-performance, embeddable graph database system.

    • Designed for knowledge graph applications, offering Cypher query language support.

  • pyOxigraph (Oxigraph):

    • Python bindings for Oxigraph, an RDF graph database written in Rust.

    • Supports SPARQL 1.1 query and update standards.

  • QLever:

    • SPARQL query engine optimized for large knowledge graphs.

    • Developed at the University of Freiburg. Known for fast query processing and SPARQL autocompletion.

  • SHACL via pySHACL:

    • Python library for validating RDF graphs against SHACL (Shapes Constraint Language) shapes.

    • Used for ensuring data quality and consistency in KGs.

  • Morph-KGC:

    • Tool for constructing knowledge graphs from diverse data sources (CSV, JSON, relational databases) using mapping rules (RML, YARRRML).

Data Repositories & Platforms

  • Dutch Life Sciences Data Portal:

    • Comprehensive repository of open Dutch life sciences data hosted by DANS.

    • Provides access to datasets for research use.

  • Tahoe-100M: Single-Cell Perturbation Atlas:

    • Large dataset mapping drug-cell interactions (reportedly ~60,000) across ~50 cancer cell lines.

    • Contains data from ~100 million cells.

    • Combines natural cell states with deliberately perturbed cells.

    • Open-source resource for biological modeling.

  • Croissant Dataverse Metadata Extraction:

    • Export of public metadata records from Harvard Dataverse in Croissant format.

    • Aims to provide ML-ready dataset descriptions in JSON-LD.

  • BioXYZ Hackathon Kickoff Presentation - Links to Dataverse Datasets:

    • Presentation containing links to FAIR datasets within the Dataverse network.

    • Presented by Slava Tykhonov (DANS-KNAW).

  • Dataverse Installations Directory:

    • Crowdsourced spreadsheet listing global Dataverse installations.

    • Contains metadata about repositories (location, launch year, contact).

    • Resource for understanding the Dataverse network ecosystem.

  • General Index:

    • Vast collection of metadata (n-grams) from digitized materials in the Internet Archive.

    • Searchable index facilitating research across the archive's library.

  • Papers with Code Datasets:

    • A large, community-curated collection of datasets used in machine learning research.

    • Links datasets to papers, code, and benchmarks.

Data Access APIs

  • bioRxiv API:

    • Provides programmatic access to preprint articles from bioRxiv.

    • Enables data mining, trend analysis, and application development for biological sciences literature.

  • OpenAlex API:

    • Provides access to a comprehensive, open-source index of scholarly works (publications, authors, institutions, concepts, etc.).

    • Free alternative to proprietary citation databases.

  • arXiv API:

    • Provides programmatic access to the arXiv repository of electronic preprints (physics, math, CS, etc.).

    • Enables retrieval of metadata and full-text articles.

  • Crossref API:

    • Provides access to metadata for scholarly publications via DOIs (citations, abstracts, funding data, etc.).

    • Enables retrieval of publication information for research and analysis.

  • Medline API (PubMed API / E-utilities):

    • Provides access to the Medline database of biomedical literature citations and abstracts via NCBI E-utilities.

    • Allows retrieval of abstracts, author information, MeSH terms, etc.

  • PMC API (PubMed Central APIs / OAI):

    • Provides access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.

    • Enables retrieval of full-text articles and metadata, often via OAI-PMH or E-utilities.

  • DataCite API:

    • Provides access to DataCite's metadata registry for research data via DOIs.

    • Supports REST, GraphQL, and OAI-PMH interfaces.

  • ORCID API:

    • Provides access to ORCID registry data (researcher identifiers and profile information).

    • Offers public and member APIs for retrieving and (for members) updating ORCID records.
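
As a starting point for the APIs listed above, here is a minimal sketch that builds request URLs for four of them (Crossref, OpenAlex, arXiv, and NCBI E-utilities) with the Python standard library. The helper names and default page sizes are assumptions made for this example. Each service has its own rate limits and etiquette (for example contact e-mail parameters or API keys), so the linked documentation should be consulted before running anything at scale.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def crossref_work_url(doi):
    """Metadata record for a single DOI from the Crossref REST API."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def openalex_search_url(text, per_page=5):
    """Full-text search over scholarly works in OpenAlex."""
    return "https://api.openalex.org/works?" + urlencode(
        {"search": text, "per-page": per_page}
    )


def arxiv_query_url(text, max_results=5):
    """Search all arXiv fields; the response is an Atom XML feed, not JSON."""
    return "http://export.arxiv.org/api/query?" + urlencode(
        {"search_query": f"all:{text}", "start": 0, "max_results": max_results}
    )


def pubmed_esearch_url(term, retmax=5):
    """PubMed ID search through the E-utilities ESearch endpoint."""
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
        {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    )


def fetch_json(url, user_agent="bio-x-ai-hackathon-example/0.1"):
    """Fetch a URL and decode a JSON body (use only with the JSON endpoints)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # DOI derived from the HALD article link in this guide.
    print(crossref_work_url("10.1038/s41597-023-02781-0"))
    print(openalex_search_url("nanobody design"))
    print(arxiv_query_url("knowledge graph"))
    print(pubmed_esearch_url("aging biomarkers"))
```

Building the URLs in small pure functions keeps the network call in one place (`fetch_json`), which makes it straightforward to add caching or retries later.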

Data & Metadata Standards

  • ISCC Codes:

    • International Standard Content Code (ISCC): A decentralized standard for content identification.

    • Creates digital fingerprints based on content similarity for various digital media.

  • Croissant Specifications:

    • A standard metadata format (JSON-LD based) for describing datasets, especially for machine learning.

    • Aims to simplify sharing and usage by providing a common structure.

    • Enables interoperability between tools and platforms.

  • MLCommons - Croissant and GeoCroissant:

    • Organization developing standards for dataset description (Croissant) and geospatial data (GeoCroissant).

    • GeoCroissant is noted as being under development.

  • ESIP Science on Schema.org:

    • Guidelines and extensions for using Schema.org vocabulary to describe scientific datasets and research artifacts.

    • Promotes FAIR principles for data discovery.

  • CODATA CDIF:

    • Cross-Domain Interoperability Framework: A framework being developed by CODATA for indexing scientific data across disciplines.

  • Bioschemas:

    • Community effort extending Schema.org for marking up life sciences data on the web.

    • Aims to improve the findability and interoperability of biological data.
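
To make the Croissant entries above more concrete, the sketch below assembles a heavily simplified, Croissant-style JSON-LD description for a single CSV file. It is an approximation written from memory of the specification, not a validated document: the `@context` is abbreviated, every column is typed as text, required details such as file checksums are omitted, and the function name and example URL are hypothetical. The official specification and validator linked in this guide remain the authority.

```python
import json


def minimal_croissant(name, description, csv_url, columns):
    """Return a simplified Croissant-style description of one CSV file.

    Hypothetical helper for illustration; the output is not guaranteed
    to pass the official Croissant validator.
    """
    file_id = "data.csv"
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"records/{column}",
            "name": column,
            "dataType": "sc:Text",  # assumption: treat every column as text
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column in columns
    ]
    return {
        # Abbreviated context (assumption); the spec defines a longer one.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
            "source": "cr:source",
            "fileObject": "cr:fileObject",
            "extract": "cr:extract",
            "column": "cr:column",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {"@type": "cr:RecordSet", "@id": "records", "name": "records", "field": fields}
        ],
    }


if __name__ == "__main__":
    doc = minimal_croissant(
        "demo-dataset",
        "Toy example for illustration only.",
        "https://example.org/data.csv",  # placeholder URL
        ["gene", "score"],
    )
    print(json.dumps(doc, indent=2))
```

The structure mirrors the three layers the format separates: dataset-level metadata, the files in `distribution`, and the logical records and fields in `recordSet` that point back at those files.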

Supporting Tools & Infrastructure

  • MinIO:

    • High-performance, distributed object storage system.

    • API compatible with Amazon S3, suitable for large datasets (e.g., ML, backups).

  • LanceDB:

    • Embeddable vector database designed for AI applications.

    • Optimized for similarity search, vector embeddings, and managing ML data.

  • BAML:

    • A domain-specific language from BoundaryML for defining LLM prompts as typed functions with structured outputs, aiming to simplify AI application development.
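
The similarity search that vector databases such as LanceDB are built to accelerate can be shown in a few lines of plain Python. The sketch below is a brute-force baseline, not LanceDB's API: it scores every stored vector against the query with cosine similarity and returns the best matches. The function names and the toy vectors are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the keys of the k stored vectors most similar to the query.

    `index` maps a document key to its embedding vector. Every entry is
    scanned, so the cost grows linearly with the number of vectors.
    """
    scored = sorted(
        ((cosine(query, vector), key) for key, vector in index.items()),
        reverse=True,
    )
    return [key for _, key in scored[:k]]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings" (real ones have hundreds of dimensions).
    index = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0], "paper_c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], index, k=2))
```

A dedicated vector database replaces the linear scan with approximate nearest-neighbor indexes and on-disk storage, which is what makes the same operation practical for millions of embeddings.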

Benchmarking & Evaluation

  • BixBench:

    • Comprehensive benchmark for evaluating LLM-based agents in bioinformatics tasks.

    • Contains 53 real-world analytical scenarios with nearly 300 open-answer questions.

    • Tests agents' abilities in complex multi-step analyses.

    • Available as a Hugging Face dataset.

Research Highlights & Organizations

  • Sakana.ai: First AI-Generated Peer-Reviewed Publication:

    • Paper generated by AI Scientist framework passed peer review at an ICLR workshop without human modifications.

    • Paper topic: compositional regularization challenges.

    • Reported average score above acceptance threshold (6.33).

    • Emphasized full transparency with IRB approval.

  • FutureHouse Research:

    • Organization focused on AI applications in scientific discovery.

    • Works on the automation of scientific processes.

    • Aims to accelerate scientific breakthroughs using AI.

  • Lab-in-the-loop: Therapeutic Antibody Design:

    • Research that combines generative ML models with experimental feedback cycles.

    • Reported 3× to 100× improvement in binding affinity for lead antibody molecules.

    • Balances exploration and exploitation of the antibody sequence space.

(Note: The Medline API link below points to the general PubMed API page; E-utilities are the primary access method.)

https://developers.googleblog.com/en/introducing-txgemma-open-models-improving-therapeutics-development/
https://arcinstitute.org/news/blog/evo2
https://agentrxiv.github.io/
https://arxiv.org/abs/2408.06292
https://arxiv.org/abs/2502.16069
https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
https://arxiv.org/html/2412.21154v1
https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf
https://arxiv.org/abs/2502.09858
https://yesnoerror.com/
https://www.biorxiv.org/content/10.1101/2025.03.11.642548v1
https://arxiv.org/abs/2305.06488
https://research.google/blog/from-diagnosis-to-treatment-advancing-amie-for-longitudinal-disease-management/
https://elicit.com/
https://github.com/Future-House/paper-qa
https://beeard.ai/
https://dexscreener.com/base/0xa02567fc557C6a409464EC40480b9F5660a991B3
https://arxiv.org/abs/2410.04660
https://www.nature.com/articles/s41597-023-02781-0
https://www.nebula-graph.io/posts/Rhinitis%20Knowledge%20Graph
https://biocypher.org/latest/
https://nanopub.net/
https://github.com/getzep/graphiti
https://github.com/kingjulio8238/Memary
https://www.cognee.ai/
https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
https://query.wikidata.org/
https://graph.openaire.eu/
https://orkg.org/
https://www.connectedpapers.com/
https://researchgraph.org/
https://www.uniprot.org/
https://arxiv.org/abs/2503.18852
https://derwen.ai/docs/kgl/
https://github.com/SapienzaNLP/relik
https://github.com/urchade/GLiNER
https://kuzudb.com/
https://pypi.org/project/pyoxigraph/
https://github.com/oxigraph/oxigraph
https://github.com/ad-freiburg/qlever
https://github.com/RDFLib/pySHACL
https://morph-kgc.readthedocs.io/en/stable/
https://lifesciences.datastations.nl
https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DSGAVS
https://docs.google.com/presentation/d/104ZlLGRPhpEWN1KjDamatXY42r1ph7Z3AePZfJ6LKsc/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit?gid=0#gid=0
https://archive.org/details/GeneralIndex
https://paperswithcode.com/datasets
https://api.biorxiv.org/
https://docs.openalex.org/how-to-use-the-api/api-overview
http://arxiv.org/help/api
https://www.crossref.org/documentation/retrieve-metadata/rest-api/
https://pubmed.ncbi.nlm.nih.gov/api/
https://www.ncbi.nlm.nih.gov/pmc/tools/oai/
https://support.datacite.org/docs/api
https://info.orcid.org/what-is-orcid/services/public-api/
https://iscc.codes/
https://docs.mlcommons.org/croissant/docs/croissant-spec.html
https://mlcommons.org/working-groups/data/croissant/
https://github.com/ESIPFed/science-on-schema.org
https://cdif.codata.org/
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05258-4
https://min.io/
https://lancedb.github.io/lancedb/
https://docs.boundaryml.com/home
https://huggingface.co/datasets/futurehouse/BixBench
https://sakana.ai/ai-scientist-first-publication/
https://www.futurehouse.org/research
https://www.biorxiv.org/content/10.1101/2025.02.19.639050v1.full.pdf