Important Datasets and Code Repositories

Resource Guide

AI Models for Science & Therapeutics

TxGemma:
- Collection of open models for therapeutics development.
- Built on Gemma 2, trained with 7 million examples.
- Available in three sizes (2B, 9B, 27B) with specialized versions.
- 27B model reported to match or outperform single-task models on 50 of 66 tasks.
- Designed for further fine-tuning with proprietary data.
- https://developers.googleblog.com/en/introducing-txgemma-open-models-improving-therapeutics-development/
Evo2: Foundational Model for Genome Modeling:
- Trained on 9.3 trillion nucleotides from 128,000 genomes.
- Reported 90% accuracy in detecting disease-causing mutations.
- Can process sequences up to 1 million nucleotides.
- Applications: genetic analysis, disease mutation detection, gene therapy design.
- Open-source with publicly available training data, code, and weights.
- https://arcinstitute.org/news/blog/evo2

AI Agents & Frameworks for Scientific Discovery

AgentRxiv: Collaborative Autonomous Research:
- Centralized platform for autonomous research agents.
- Enables knowledge sharing through similarity-based search.
- Reported 78.2% accuracy on the MATH-500 benchmark (vs 73.8% without the platform).
- Demonstrated generalization across different benchmarks and models.
- https://agentrxiv.github.io/
AI Scientist: Automated Scientific Discovery:
- Framework for hypothesis generation, experiment design, and paper writing.
- Produces research papers with minimal human intervention.
- Reported cost of under $15 per paper.
- Applications: diffusion models, language modeling, learning dynamics.
- https://arxiv.org/abs/2408.06292
Curie: Rigorous Scientific Experimentation:
- Features Architect Agent for planning and Technician Agents for execution.
- Reported 3.4× improvement in answering experimental questions.
- Enforces experimental discipline while maintaining creativity.
- https://arxiv.org/abs/2502.16069
The Virtual Lab: Nanobody Design:
- Multi-agent collaboration with minimal human input (reported 1.3% of total words).
- Designed 92 nanobodies with >90% expressing as soluble proteins.
- Combines agents with different expertise to solve complex challenges.
- https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
Aviary: Training Language Agents for Scientific Tasks:
- Open-source agents aiming to match frontier LLMs at lower cost for specific tasks.
- Handles molecular cloning, literature research, protein engineering.
- Uses stochastic computation graph framework.
- https://arxiv.org/html/2412.21154v1
Google AI Co-scientist:
- Multi-agent AI system built with Gemini 2.0.
- Generates and evaluates research hypotheses through iterative reasoning.
- Has shown promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.
- Available to research organizations via a Trusted Tester Programme.
- https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf
Popper: Automated Hypothesis Validation:
- Sequential falsification approach using LLM agents.
- Designs experiments, executes tests, and analyzes results.
- Reported to be 10× faster than human scientists at comparable accuracy.
- Maintains strict Type-I error control.
- https://arxiv.org/abs/2502.09858
YesNoError: Scientific Literature Auditing:
- Multi-agent system for detecting errors in scientific papers.
- Checks mathematics, methodology, references, and logical consistency.
- Uses synthetic data pipeline to improve detection accuracy.
- Mentions a token-based economy ($YNE) for requesting audits.
- https://yesnoerror.com/
Talk2Biomodels: Conversational Biological Modeling:
- Natural language interface for exploring biological models.
- Supports SBML format and BioModels database.
- Features time-course simulations, steady-state analysis, parameter scanning.
- Uses retrieval-augmented generation to prevent hallucinations.
- https://www.biorxiv.org/content/10.1101/2025.03.11.642548v1
BioChatter: Biomedical LLM Platform:
- Open-source framework for biomedical LLM applications.
- Integrates knowledge, retrieval-augmented generation, model chaining.
- Designed for privacy-preserving use with local open-source LLMs.
- Connects to BioCypher knowledge graphs.
- https://arxiv.org/abs/2305.06488
AMIE: Medical Reasoning System:
- Two-agent architecture: Dialogue Agent and Mx Agent.
- Grounds recommendations in clinical guidelines.
- Performed well on RxQA medication reasoning benchmark.
- Focuses on longitudinal disease management.
- https://research.google/blog/from-diagnosis-to-treatment-advancing-amie-for-longitudinal-disease-management/
Elicit:
- AI-powered research tool for extracting information from academic papers.
- Summarizes key findings, identifies related research.
- Helps users grasp core concepts of complex scientific literature.
- Offers a free trial.
- https://elicit.com/
PaperQA:
- Retrieval-augmented generation (RAG) tool by FutureHouse.
- Answers questions based on scientific paper collections.
- Helps researchers process large amounts of scientific literature.
- https://github.com/Future-House/paper-qa
BeeARD:
- Aims for automated hypothesis generation and validation at scale.
- Powered by multi-agent AI systems and knowledge graphs.
- Has an associated project token (see the DEX Screener link below).
- https://beeard.ai/
- https://dexscreener.com/base/0xa02567fc557C6a409464EC40480b9F5660a991B3
KGARevion: Knowledge Graph-Based AI Agent:
- Multi-step process for biomedical question answering using KGs.
- Reported 5.2% improvement over baseline approaches.
- Shows strong zero-shot generalization to underrepresented medical contexts.
- https://arxiv.org/abs/2410.04660

Knowledge Graphs: Platforms & Concepts

HALD: Human Aging and Longevity Knowledge Graph:
- Contains 12,000+ entities, 115,000+ relations (as reported).
- Extracted from 340,000 PubMed articles.
- Provides structured exploration of aging and longevity biomarkers.
- https://www.nature.com/articles/s41597-023-02781-0
NebulaGraph: Rhinitis Knowledge Graph Example:
- Demonstrates leveraging LLMs to extract knowledge from Chinese medical records for KG building.
- Outlines a five-step methodology for knowledge extraction.
- ChatGPT-4 achieved 82.75% F1 score in knowledge extraction in this case study.
- https://www.nebula-graph.io/posts/Rhinitis%20Knowledge%20Graph
BioCypher: Knowledge Graph Framework:
- Framework for building KGs, integrates with BioChatter and biomedical datasets.
- Features flexible ontology structures and adaptable output formats.
- Supports multiple formats (RDF, SQL, NetworkX).
- Customizable via YAML configuration.
- https://biocypher.org/latest/
Nanopublication Network:
- System for publishing and sharing self-contained units of scientific information.
- Each "nanopublication" contains assertions, provenance, and publication information.
- Aims to improve transparency and reproducibility of scientific research.
- https://nanopub.net/
Graphiti:
- Builds temporally-aware knowledge graphs for AI agents.
- Models relationships and context that change over time.
- https://github.com/getzep/graphiti
Memary:
- Aims to give AI Agents human-like memory capabilities.
- Tracks entity knowledge, preferences, and chat history in an automatically updating knowledge graph.
- https://github.com/kingjulio8238/Memary
Cognee:
- Python library combining knowledge graphs and RAG.
- Builds evolving semantic memory for AI agents/apps using dynamic KGs.
- https://www.cognee.ai/
Wikidata:
- Free and open knowledge base.
- Queryable via SPARQL endpoint.
- https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
- https://query.wikidata.org/
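
As a rough illustration of querying the SPARQL endpoint, the sketch below builds a request against `https://query.wikidata.org/sparql` and extracts labels from the JSON result format. The helper names (`build_request`, `extract_labels`, `run_query`) and the contact string in the User-Agent header are placeholders, not part of any Wikidata tooling. The query itself is the common "instances of house cat (Q146)" example.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) house cat (Q146), with English labels.
CAT_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder contact string; replace with your own.
        "User-Agent": "resource-guide-example/0.1 (mailto:you@example.org)",
    }
    return urllib.request.Request(f"{WDQS_ENDPOINT}?{params}", headers=headers)


def extract_labels(payload: dict) -> list[str]:
    """Pull the ?itemLabel values out of a SPARQL JSON result document."""
    return [row["itemLabel"]["value"] for row in payload["results"]["bindings"]]


def run_query(query: str) -> list[str]:
    """Send the query over the network and return the labels."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        return extract_labels(json.load(response))
```

Calling `run_query(CAT_QUERY)` performs the network request; the other two helpers work offline.
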
OpenAIRE Graph:
- A large-scale research information knowledge graph.
- Connects publications, datasets, software, authors, institutions.
- https://graph.openaire.eu/
Open Research Knowledge Graph (ORKG):
- Platform for structured representation of scientific knowledge.
- Aims to make research contributions machine-actionable.
- https://orkg.org/
Connected Papers:
- Visual tool for exploring research paper connections and discovering related literature.
- Creates graph visualizations based on citations and similarity.
- https://www.connectedpapers.com/
Research Graph:
- Focuses on building a global research collaboration and citation network.
- Connects researchers, institutions, publications, grants, and datasets.
- https://researchgraph.org/
UniProt:
- Comprehensive, high-quality resource for protein sequence and functional information.
- Widely used database in bioinformatics and life sciences.
- https://www.uniprot.org/
Self-Organizing Graphs Reasoning:
- Research suggesting graph reasoning systems evolve towards a critical state.
- Observed a stable Critical Discovery Parameter (-0.03).
- Found 12% of connections form between semantically distant concepts.
- Demonstrates scale-free and small-world properties in emergent graphs.
- https://arxiv.org/abs/2503.18852

Knowledge Graphs: Tools & Databases

KGLab:
- Python library for knowledge graph development and machine learning.
- Provides tools for graph construction, analysis, and ML integration.
- https://derwen.ai/docs/kgl/
ReLiK:
- Knowledge graph construction and linking tool (entity and relation linking).
- Developed by Sapienza NLP group.
- https://github.com/SapienzaNLP/relik
GLiNER:
- Named Entity Recognition (NER) tool suitable for knowledge graph construction.
- Designed for general-purpose NER with flexible entity type definitions.
- https://github.com/urchade/GLiNER
KuzuDB:
- High-performance, embeddable graph database system.
- Designed for knowledge graph applications, offering Cypher query language support.
- https://kuzudb.com/
pyOxigraph (Oxigraph):
- Python bindings for Oxigraph, an RDF graph database written in Rust.
- Supports SPARQL 1.1 query and update standards.
- https://pypi.org/project/pyoxigraph/
- https://github.com/oxigraph/oxigraph
QLever:
- SPARQL query engine optimized for large knowledge graphs.
- Developed at the University of Freiburg; known for fast query processing and context-sensitive query autocompletion.
- https://github.com/ad-freiburg/qlever
SHACL via pySHACL:
- Python library for validating RDF graphs against SHACL (Shapes Constraint Language) shapes.
- Used for ensuring data quality and consistency in KGs.
- https://github.com/RDFLib/pySHACL
Morph-KGC:
- Tool for constructing knowledge graphs from diverse data sources (CSV, JSON, RDBs) using mapping rules (RML, YARRRML).
- https://morph-kgc.readthedocs.io/en/stable/

Data Repositories & Platforms

Dutch Life Sciences Data Portal:
- Comprehensive repository of open Dutch life sciences data hosted by DANS.
- Provides access to datasets for research use.
- https://lifesciences.datastations.nl
Tahoe-100M: Single-Cell Perturbation Atlas:
- Large dataset mapping drug-cell interactions (reportedly ~60,000) across ~50 cancer cell lines.
- Contains ~100 million single-cell transcriptomic profiles.
- Combines natural cell states with deliberately perturbed cells.
- Open-source resource for biological modeling.
- https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1
Croissant Dataverse Metadata Extraction:
- Export of public metadata records from Harvard Dataverse in Croissant format.
- Aims to provide ML-ready dataset descriptions in JSON-LD.
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DSGAVS
BioXYZ Hackathon Kickoff Presentation - Links to Dataverse Datasets:
- Presentation containing links to FAIR datasets within the Dataverse network.
- Presented by Slava Tykhonov (DANS-KNAW).
- https://docs.google.com/presentation/d/104ZlLGRPhpEWN1KjDamatXY42r1ph7Z3AePZfJ6LKsc/edit?usp=sharing
Dataverse Installations Directory:
- Crowdsourced spreadsheet listing global Dataverse installations.
- Contains metadata about repositories (location, launch year, contact).
- Resource for understanding the Dataverse network ecosystem.
- https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit?gid=0#gid=0
General Index:
- Vast collection of metadata (n-grams) from digitized materials in the Internet Archive.
- Searchable index facilitating research across the archive's library.
- https://archive.org/details/GeneralIndex
Papers with Code Datasets:
- A large, community-curated collection of datasets used in machine learning research.
- Links datasets to papers, code, and benchmarks.
- https://paperswithcode.com/datasets

Data Access APIs

bioRxiv API:
- Provides programmatic access to pre-print articles from bioRxiv.
- Enables data mining, trend analysis, and application development for biological sciences literature.
- https://api.biorxiv.org/
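
A minimal sketch of how a date-range request to this API could be assembled, assuming the `details/{server}/{start}/{end}/{cursor}` URL pattern and a JSON response whose records sit under a `collection` key with `title` and `category` fields. The function names are illustrative; check the API page for the current response layout.

```python
from datetime import date

BASE_URL = "https://api.biorxiv.org"


def details_url(start: date, end: date, cursor: int = 0, server: str = "biorxiv") -> str:
    """Build a 'details' URL for preprints posted between two dates."""
    if end < start:
        raise ValueError("end date must not precede start date")
    return f"{BASE_URL}/details/{server}/{start.isoformat()}/{end.isoformat()}/{cursor}"


def titles_by_category(payload: dict, category: str) -> list[str]:
    """Filter an already-decoded response for one subject category.

    Assumes each record has 'title' and 'category' keys.
    """
    wanted = category.lower()
    return [
        item["title"]
        for item in payload.get("collection", [])
        if item.get("category", "").lower() == wanted
    ]
```

Results are paged, so a full harvest would repeat the request with an increasing `cursor` until no records are returned.
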
OpenAlex API:
- Provides access to a comprehensive, open-source index of scholarly works (publications, authors, institutions, concepts, etc.).
- Free alternative to proprietary citation databases.
- https://docs.openalex.org/how-to-use-the-api/api-overview
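
The sketch below shows one way to build a works search against `https://api.openalex.org/works`, using the `search`, `filter`, `per-page`, and `mailto` query parameters. The helper names are illustrative, and the response fields read in `summarize` (`results`, `display_name`, `publication_year`) should be checked against the current documentation, as should any authentication requirements.

```python
from typing import Optional
from urllib.parse import urlencode

WORKS_ENDPOINT = "https://api.openalex.org/works"


def works_url(search: str, year: Optional[int] = None, per_page: int = 5,
              mailto: Optional[str] = None) -> str:
    """Build a search URL, optionally restricted to one publication year."""
    params = {"search": search, "per-page": per_page}
    if year is not None:
        params["filter"] = f"publication_year:{year}"
    if mailto:
        # Contact address identifying the caller to the service.
        params["mailto"] = mailto
    return f"{WORKS_ENDPOINT}?{urlencode(params)}"


def summarize(payload: dict) -> list[tuple]:
    """Return (title, year) pairs from a decoded response."""
    return [
        (work.get("display_name"), work.get("publication_year"))
        for work in payload.get("results", [])
    ]
```
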
arXiv API:
- Provides programmatic access to the arXiv repository of electronic preprints (physics, math, CS, etc.).
- Enables retrieval of metadata and full-text articles.
- http://arxiv.org/help/api
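
The arXiv API returns Atom XML rather than JSON. As a sketch, the code below builds a query URL for the `export.arxiv.org/api/query` endpoint and parses entry IDs and titles from an Atom feed with the standard library. The function names are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

QUERY_ENDPOINT = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def query_url(search_query: str, start: int = 0, max_results: int = 5) -> str:
    """Build a query URL, e.g. search_query='all:electron'."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{QUERY_ENDPOINT}?{urlencode(params)}"


def parse_feed(xml_text: str) -> list[dict]:
    """Extract the id and whitespace-normalized title of each Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append({"id": raw_id.strip(), "title": " ".join(raw_title.split())})
    return entries
```

Titles in the feed are often wrapped across lines, which is why `parse_feed` collapses internal whitespace.
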
Crossref API:
- Provides access to metadata for scholarly publications via DOIs (citations, abstracts, funding data, etc.).
- Enables retrieval of publication information for research and analysis.
- https://www.crossref.org/documentation/retrieve-metadata/rest-api/
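
A minimal sketch of a DOI lookup through the Crossref REST API, assuming the `https://api.crossref.org/works/{doi}` route and a response that wraps the record in a `message` object. The example DOI is the HALD article listed earlier in this guide; `work_url` and `summarize` are illustrative names.

```python
from urllib.parse import quote

WORKS_ENDPOINT = "https://api.crossref.org/works"


def work_url(doi: str) -> str:
    """Build the metadata URL for a single DOI (slashes kept, other characters escaped)."""
    return f"{WORKS_ENDPOINT}/{quote(doi.strip(), safe='/')}"


def summarize(payload: dict) -> dict:
    """Reduce a decoded response to DOI, first title, and issue year.

    Crossref returns 'title' as a list and dates as nested 'date-parts'.
    """
    message = payload["message"]
    titles = message.get("title") or [""]
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    return {"doi": message.get("DOI"), "title": titles[0], "year": date_parts[0][0]}


HALD_DOI = "10.1038/s41597-023-02781-0"
```

`work_url(HALD_DOI)` yields the URL to fetch; `summarize` can then be applied to the decoded JSON body.
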
Medline API (PubMed API / E-utilities):
- Provides access to the Medline database of biomedical literature citations and abstracts via NCBI E-utilities.
- Allows retrieval of abstracts, author information, MeSH terms, etc.
- https://pubmed.ncbi.nlm.nih.gov/api/ (Note: this links to the general API page; E-utilities are the primary access method.)
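
E-utilities access typically takes two steps: `esearch` to find PubMed IDs, then `efetch` to retrieve the records. The sketch below builds both URLs and reads the ID list from an `esearch` JSON response. Helper names are illustrative; NCBI rate limits and API-key options are not covered here.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 5) -> str:
    """Step 1: search PubMed and ask for a JSON list of matching PMIDs."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def pmids_from_esearch(payload: dict) -> list[str]:
    """Read the PMID list out of a decoded esearch JSON response."""
    return list(payload.get("esearchresult", {}).get("idlist", []))


def efetch_url(pmids: list[str]) -> str:
    """Step 2: fetch plain-text abstracts for the given PMIDs."""
    if not pmids:
        raise ValueError("at least one PMID is required")
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```
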
PMC API (PubMed Central APIs / OAI):
- Provides access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.
- Enables retrieval of full-text articles and metadata, often via OAI-PMH or E-utilities.
- https://www.ncbi.nlm.nih.gov/pmc/tools/oai/
DataCite API:
- Provides access to DataCite's metadata registry for research data via DOIs.
- Supports REST, GraphQL, and OAI-PMH interfaces.
- https://support.datacite.org/docs/api
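
As a sketch of the REST interface, the code below builds a single-DOI lookup URL under `https://api.datacite.org/dois/` and flattens the JSON:API-style response (`data` → `attributes`). The example DOI is the Croissant export dataset listed earlier in this guide. The helper names and the selection of attributes are choices made for this example.

```python
from urllib.parse import quote

DOIS_ENDPOINT = "https://api.datacite.org/dois"


def doi_url(doi: str) -> str:
    """Build the lookup URL for one DOI (slashes kept, other characters escaped)."""
    return f"{DOIS_ENDPOINT}/{quote(doi.strip(), safe='/')}"


def summarize(payload: dict) -> dict:
    """Flatten a decoded response to a few commonly used attributes."""
    attributes = payload["data"]["attributes"]
    titles = attributes.get("titles") or [{}]
    return {
        "doi": attributes.get("doi"),
        "title": titles[0].get("title"),
        "year": attributes.get("publicationYear"),
        "creators": [c.get("name") for c in attributes.get("creators", [])],
    }


CROISSANT_EXPORT_DOI = "10.7910/DVN/DSGAVS"
```
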
ORCID API:
- Provides access to ORCID registry data (researcher identifiers and profile information).
- Offers public and member APIs for retrieving and (for members) updating ORCID records.
- https://info.orcid.org/what-is-orcid/services/public-api/
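
ORCID iDs end in a check character computed with ISO 7064 MOD 11-2, so an identifier can be validated locally before any request is made. The sketch below does that and then builds a public API record URL, assuming the `https://pub.orcid.org/v3.0/{id}/record` route. Function names are illustrative, and the public API may require credentials depending on usage.

```python
PUBLIC_API_BASE = "https://pub.orcid.org/v3.0"


def check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length, digits, and the final check character of a hyphenated iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return check_character(compact[:15]) == compact[15].upper()


def record_url(orcid: str) -> str:
    """Build the public record URL; request it with 'Accept: application/json'."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid!r}")
    return f"{PUBLIC_API_BASE}/{orcid}/record"
```

The iD `0000-0002-1825-0097`, which ORCID uses for its fictitious sample researcher, passes this check.
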

Data & Metadata Standards

ISCC Codes:
- International Standard Content Code: a decentralized standard for content identification.
- Creates digital fingerprints based on content similarity for various digital media.
- https://iscc.codes/
Croissant Specifications:
- A standard metadata format (JSON-LD based) for describing datasets, especially for machine learning.
- Aims to simplify sharing and usage by providing a common structure.
- Enables interoperability between tools and platforms.
- https://docs.mlcommons.org/croissant/docs/croissant-spec.html
MLCommons - Croissant and GeoCroissant:
- Organization developing standards for dataset description (Croissant) and geospatial data (GeoCroissant).
- GeoCroissant is noted as being under development.
- https://mlcommons.org/working-groups/data/croissant/
ESIP Science on Schema.org:
- Guidelines and extensions for using Schema.org vocabulary to describe scientific datasets and research artifacts.
- Promotes FAIR principles for data discovery.
- https://github.com/ESIPFed/science-on-schema.org
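
To make the idea concrete, the sketch below assembles a small Schema.org `Dataset` description as JSON-LD. It uses only core Schema.org properties (`name`, `description`, `url`, `identifier`, `license`, `keywords`). The function name is hypothetical, and the ESIP guidelines recommend more fields and specific patterns that are not reproduced here.

```python
import json


def dataset_jsonld(name: str, description: str, url: str, identifier: str,
                   license_url: str, keywords: list[str]) -> dict:
    """Build a minimal Schema.org Dataset record as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "license": license_url,
        "keywords": list(keywords),
    }


def to_script_tag(record: dict) -> str:
    """Wrap the record in the script element used to embed JSON-LD in a web page."""
    body = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'
```

Embedding the output of `to_script_tag` in a dataset landing page is the usual way such markup is exposed to crawlers.
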
CODATA CDIF:
- Cross-Domain Interoperability Framework: A framework being developed by CODATA for indexing scientific data across disciplines.
- https://cdif.codata.org/
Bioschemas:
- Community effort extending Schema.org for marking up life sciences data on the web.
- Aims to improve the findability and interoperability of biological data.
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05258-4

Supporting Tools & Infrastructure

MinIO:
- High-performance, distributed object storage system.
- API compatible with Amazon S3, suitable for large datasets (e.g., ML, backups).
- https://min.io/
LanceDB:
- Embeddable vector database designed for AI applications.
- Optimized for similarity search, vector embeddings, and managing ML data.
- https://lancedb.github.io/lancedb/
BAML:
- Boundary AI Markup Language: a domain-specific language for defining LLM prompts as typed functions with structured outputs, callable from application code.
- https://docs.boundaryml.com/home

Benchmarking & Evaluation

BixBench:
- Comprehensive benchmark for evaluating LLM-based agents in bioinformatics tasks.
- Contains 53 real-world analytical scenarios with nearly 300 open-answer questions.
- Tests agents' abilities in complex multi-step analyses.
- Available as a Hugging Face dataset.
- https://huggingface.co/datasets/futurehouse/BixBench

Research Highlights & Organizations

Sakana.ai: First AI-Generated Peer-Reviewed Publication:
- Paper generated by AI Scientist framework passed peer review at an ICLR workshop without human modifications.
- Paper topic: compositional regularization challenges.
- Reported average reviewer score of 6.33, above the workshop's acceptance threshold.
- Emphasized full transparency with IRB approval.
- https://sakana.ai/ai-scientist-first-publication/
FutureHouse Research:
- Organization focused on AI applications in scientific discovery.
- Works on the automation of scientific processes.
- Aims to accelerate scientific breakthroughs using AI.
- https://www.futurehouse.org/research
Lab-in-the-loop: Therapeutic Antibody Design:
- Research demonstrating combining generative ML models with experimental feedback cycles.
- Reported 3-100× improvement in binding affinity for lead antibody molecules.
- Balances exploration and exploitation of the antibody sequence space.
- https://www.biorxiv.org/content/10.1101/2025.02.19.639050v1.full.pdf

