Important Datasets and Code Repositories
Last updated
Resource Guide
TxGemma:
Collection of open models for therapeutics development.
Built on Gemma 2, trained with 7 million examples.
Available in three sizes (2B, 9B, 27B) with specialized versions.
27B model reportedly matches or outperforms single-task models on 50 of 66 tasks.
Designed for further fine-tuning with proprietary data.
Evo2: Foundational Model for Genome Modeling:
Trained on 9.3 trillion nucleotides from 128,000 genomes.
Reported 90% accuracy in detecting disease-causing mutations.
Can process sequences up to 1 million nucleotides.
Applications: genetic analysis, disease mutation detection, gene therapy design.
Open-source with publicly available training data, code, and weights.
AgentRxiv: Collaborative Autonomous Research:
Centralized platform for autonomous research agents.
Enables knowledge sharing through similarity-based search.
Reported 78.2% accuracy on MATH-500 benchmarks (vs 73.8% without platform).
Demonstrated generalization across different benchmarks and models.
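The similarity-based search mentioned in this entry can be pictured as ranking stored papers by cosine similarity between embedding vectors. The following is a minimal sketch under that assumption; the function names, titles, and three-dimensional vectors are hypothetical and do not reflect AgentRxiv's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, corpus, k=2):
    """Return the k (title, score) pairs most similar to the query vector."""
    scored = [(title, cosine(query, vec)) for title, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings, for illustration only.
corpus = {
    "prompting study": [0.9, 0.1, 0.0],
    "retrieval study": [0.1, 0.9, 0.0],
    "ablation notes": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], corpus)
```

In practice the vectors would come from a text-embedding model, and a vector index would replace the linear scan.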
AI Scientist: Automated Scientific Discovery:
Framework for hypothesis generation, experiment design, and paper writing.
Produces research papers with minimal human intervention.
Reportedly cost-effective, at about $15 per paper.
Applications: diffusion models, language modeling, learning dynamics.
Curie: Rigorous Scientific Experimentation:
Features Architect Agent for planning and Technician Agents for execution.
Reported 3.4× improvement in answering experimental questions.
Enforces experimental discipline while maintaining creativity.
The Virtual Lab: Nanobody Design:
Multi-agent collaboration with minimal human input (reported 1.3% of total words).
Designed 92 nanobodies with >90% expressing as soluble proteins.
Combines agents with different expertise to solve complex challenges.
Aviary: Training Language Agents for Scientific Tasks:
Open-source agents aiming to match frontier LLMs at lower cost for specific tasks.
Handles molecular cloning, literature research, protein engineering.
Uses stochastic computation graph framework.
Google AI Co-scientist:
Multi-agent AI system built with Gemini 2.0.
Generates and evaluates research hypotheses through iterative reasoning.
Has shown promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.
Available to research organizations via a Trusted Tester Programme.
Popper: Automated Hypothesis Validation:
Sequential falsification approach using LLM agents.
Designs experiments, executes tests, and analyzes results.
Claimed to be 10× faster than human scientists with comparable accuracy.
Maintains strict Type-I error control.
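One standard way to keep Type-I error controlled while testing sequentially is to convert each test's p-value into an e-value and multiply the e-values as evidence accumulates, rejecting once the product reaches 1/alpha. The sketch below illustrates that general technique; the calibrator e = kappa * p^(kappa - 1), the default parameters, and the function names are assumptions for illustration and may differ from Popper's exact procedure.

```python
def p_to_e(p, kappa=0.5):
    """Calibrate a p-value into an e-value: e = kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return kappa * p ** (kappa - 1.0)

def sequential_test(p_values, alpha=0.1, kappa=0.5):
    """Multiply e-values test by test; reject once the product reaches 1/alpha.

    Returns (rejected, number_of_tests_used, final_evidence).
    """
    evidence = 1.0
    for used, p in enumerate(p_values, start=1):
        evidence *= p_to_e(p, kappa)
        if evidence >= 1.0 / alpha:
            return True, used, evidence
    return False, len(p_values), evidence
```

Because the running product is a nonnegative supermartingale under the null hypothesis (given valid, sequentially obtained p-values), the chance that it ever reaches 1/alpha is at most alpha, so stopping at any point does not inflate the false-positive rate.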
YesNoError: Scientific Literature Auditing:
Multi-agent system for detecting errors in scientific papers.
Checks mathematics, methodology, references, and logical consistency.
Uses synthetic data pipeline to improve detection accuracy.
Mentions a token-based economy ($YNE) for requesting audits.
Talk2Biomodels: Conversational Biological Modeling:
Natural language interface for exploring biological models.
Supports SBML format and BioModels database.
Features time-course simulations, steady-state analysis, parameter scanning.
Uses retrieval-augmented generation to prevent hallucinations.
BioChatter: Biomedical LLM Platform:
Open-source framework for biomedical LLM applications.
Integrates knowledge, retrieval-augmented generation, model chaining.
Designed for privacy-preserving use with local open-source LLMs.
Connects to BioCypher knowledge graphs.
AMIE: Medical Reasoning System:
Two-agent architecture: Dialogue Agent and Mx Agent.
Grounds recommendations in clinical guidelines.
Performed well on RxQA medication reasoning benchmark.
Focuses on longitudinal disease management.
Elicit:
AI-powered research tool for extracting information from academic papers.
Summarizes key findings, identifies related research.
Helps users grasp core concepts of complex scientific literature.
Offers a free trial.
PaperQA:
Retrieval Augmented Generation (RAG) tool by Future House.
Answers questions based on scientific paper collections.
Helps researchers process large amounts of scientific literature.
BeeARD:
Aims for automated hypothesis generation and validation at scale.
Powered by multi-agent AI systems and knowledge graphs.
Has an associated project token.
KGARevion: Knowledge Graph-Based AI Agent:
Multi-step process for biomedical question answering using KGs.
Reported 5.2% improvement over baseline approaches.
Shows strong zero-shot generalization to underrepresented medical contexts.
HALD: Human Aging and Longevity Knowledge Graph:
Contains 12,000+ entities, 115,000+ relations (as reported).
Extracted from 340,000 PubMed articles.
Provides structured exploration of aging and longevity biomarkers.
NebulaGraph: Rhinitis Knowledge Graph Example:
Demonstrates leveraging LLMs to extract knowledge from Chinese medical records for KG building.
Outlines a five-step methodology for knowledge extraction.
ChatGPT-4 achieved 82.75% F1 score in knowledge extraction in this case study.
BioCypher: Knowledge Graph Framework:
Framework for building KGs, integrates with BioChatter and biomedical datasets.
Features flexible ontology structures and adaptable output formats.
Supports multiple formats (RDF, SQL, NetworkX).
Customizable via YAML configuration.
Nanopublication Network:
System for publishing and sharing self-contained units of scientific information.
Each "nanopublication" contains assertions, provenance, and publication information.
Aims to improve transparency and reproducibility of scientific research.
Graphiti:
Builds temporally-aware knowledge graphs for AI agents.
Models relationships and context that change over time.
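The idea of relationships that change over time can be sketched by attaching a validity interval to every edge and filtering by date. This is a conceptual illustration only: the `TemporalEdge` class, the `facts_as_of` helper, and the sample data are hypothetical and are not Graphiti's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TemporalEdge:
    source: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact still holds

def facts_as_of(edges, when):
    """Return the edges whose validity interval [valid_from, valid_to) contains `when`."""
    return [
        e for e in edges
        if e.valid_from <= when and (e.valid_to is None or when < e.valid_to)
    ]

# Hypothetical facts: the same relation points to different targets over time.
edges = [
    TemporalEdge("alice", "works_at", "lab_a", date(2020, 1, 1), date(2023, 6, 1)),
    TemporalEdge("alice", "works_at", "lab_b", date(2023, 6, 1)),
]
```

Keeping superseded edges instead of overwriting them lets an agent answer both "what is true now" and "what was true then".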
Memary:
Aims to give AI Agents human-like memory capabilities.
Tracks entity knowledge, preferences, and chat history in an automatically updating knowledge graph.
Cognee:
Python library combining knowledge graphs and RAG.
Builds evolving semantic memory for AI agents/apps using dynamic KGs.
Wikidata:
Free and open knowledge base.
Queryable via SPARQL endpoint.
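A minimal sketch of querying the SPARQL endpoint over HTTP with the standard library. The example query (items that are an instance of protein, Q8054) and the User-Agent string are illustrative assumptions; a real client should send a descriptive User-Agent and respect the service's usage limits.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_sparql_url(query):
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = {"query": query, "format": "json"}
    return WIKIDATA_SPARQL + "?" + urllib.parse.urlencode(params)

def run_sparql(query, user_agent="resource-guide-example/0.1 (hypothetical contact)"):
    """Send the query and return the decoded JSON (performs a network call)."""
    request = urllib.request.Request(
        build_sparql_url(query), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

# Five items that are instances of (P31) protein (Q8054).
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q8054 } LIMIT 5"
url = build_sparql_url(QUERY)
```

Calling `run_sparql(QUERY)` would return a dictionary whose rows are expected under `results` and then `bindings`, following the SPARQL JSON results format.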
OpenAIRE Graph:
A large-scale research information knowledge graph.
Connects publications, datasets, software, authors, institutions.
Open Research Knowledge Graph (ORKG):
Platform for structured representation of scientific knowledge.
Aims to make research contributions machine-actionable.
Connected Papers:
Visual tool for exploring research paper connections and discovering related literature.
Creates graph visualizations based on citations and similarity.
Research Graph:
Focuses on building a global research collaboration and citation network.
Connects researchers, institutions, publications, grants, and datasets.
UniProt:
Comprehensive, high-quality resource for protein sequence and functional information.
Widely used database in bioinformatics and life sciences.
Self-Organizing Graphs Reasoning:
Research suggesting graph reasoning systems evolve towards a critical state.
Observed a stable Critical Discovery Parameter (-0.03).
Found 12% of connections form between semantically distant concepts.
Demonstrates scale-free and small-world properties in emergent graphs.
KGLab:
Python library for knowledge graph development and machine learning.
Provides tools for graph construction, analysis, and ML integration.
ReLiK:
Knowledge graph construction and linking tool (entity and relation linking).
Developed by Sapienza NLP group.
GLiNER:
Named Entity Recognition (NER) tool suitable for knowledge graph construction.
Designed for general-purpose NER with flexible entity type definitions.
KuzuDB:
High-performance, embeddable graph database system.
Designed for knowledge graph applications, offering Cypher query language support.
pyOxigraph (Oxigraph):
Python bindings for Oxigraph, an RDF graph database written in Rust.
Supports SPARQL 1.1 query and update standards.
QLever:
SPARQL query engine optimized for large knowledge graphs.
Developed by the University of Freiburg. Known for fast query processing and autocompletion support.
SHACL via pySHACL:
Python library for validating RDF graphs against SHACL (Shapes Constraint Language) shapes.
Used for ensuring data quality and consistency in KGs.
Morph-KGC:
Tool for constructing knowledge graphs from diverse data sources (CSV, JSON, RDBs) using mapping rules (RML, YARRRML).
Dutch Life Sciences Data Portal:
Comprehensive repository of open Dutch life sciences data hosted by DANS.
Provides access to datasets for research use.
Tahoe-100M: Single-Cell Perturbation Atlas:
Large dataset mapping drug-cell interactions (reportedly ~60,000) across ~50 cancer cell lines.
Contains data from ~100 million cells.
Combines natural cell states with deliberately perturbed cells.
Open-source resource for biological modeling.
Croissant Dataverse Metadata Extraction:
Export of public metadata records from Harvard Dataverse in Croissant format.
Aims to provide ML-ready dataset descriptions in JSON-LD.
BioXYZ Hackathon Kickoff Presentation - Links to Dataverse Datasets:
Presentation containing links to FAIR datasets within the Dataverse network.
Presented by Slava Tykhonov (DANS-KNAW).
Dataverse Installations Directory:
Crowdsourced spreadsheet listing global Dataverse installations.
Contains metadata about repositories (location, launch year, contact).
Resource for understanding the Dataverse network ecosystem.
General Index:
Vast collection of metadata (n-grams) from digitized materials in the Internet Archive.
Searchable index facilitating research across the archive's library.
Papers with Code Datasets:
A large, community-curated collection of datasets used in machine learning research.
Links datasets to papers, code, and benchmarks.
bioRxiv API:
Provides programmatic access to pre-print articles from bioRxiv.
Enables data mining, trend analysis, and application development for biological sciences literature.
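A minimal sketch of building requests for the API's date-range "details" route, which returns preprint metadata in pages addressed by a cursor. The helper names are hypothetical, and the page size of 100 is an assumption to check against the current API documentation.

```python
from datetime import date

BIORXIV_API = "https://api.biorxiv.org/details"

def details_url(start, end, cursor=0, server="biorxiv"):
    """URL for preprint metadata posted between two dates, starting at `cursor`."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{BIORXIV_API}/{server}/{start.isoformat()}/{end.isoformat()}/{cursor}"

def page_urls(start, end, total, page_size=100):
    """URLs covering `total` records, assuming `page_size` records per response."""
    return [details_url(start, end, cursor) for cursor in range(0, total, page_size)]
```

A trend analysis would typically request the first page, read the reported total from the response, and then walk the remaining cursors.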
OpenAlex API:
Provides access to a comprehensive, open-source index of scholarly works (publications, authors, institutions, concepts, etc.).
Free alternative to proprietary citation databases.
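A minimal sketch of a full-text search against the works endpoint. The `search`, `per-page`, and `mailto` parameters are used here as commonly documented; the helper name is hypothetical, and authentication or rate-limit requirements should be checked against the current OpenAlex documentation.

```python
import urllib.parse

OPENALEX_WORKS = "https://api.openalex.org/works"

def works_search_url(search, per_page=5, mailto=None):
    """Build a search URL for scholarly works; `mailto` identifies the caller."""
    params = {"search": search, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return OPENALEX_WORKS + "?" + urllib.parse.urlencode(params)

url = works_search_url("knowledge graph")
```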
ArXiv API:
Provides programmatic access to the arXiv repository of electronic preprints (physics, math, CS, etc.).
Enables retrieval of metadata and full-text articles.
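The arXiv API answers queries with an Atom XML feed. A minimal sketch of building a query URL and pulling entry titles out of such a feed follows; the helper names are hypothetical, and fetching the URL is left to the caller.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def arxiv_query_url(search_query, start=0, max_results=5):
    """Build a query URL, e.g. search_query='all:knowledge graph'."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)

def entry_titles(atom_xml):
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(atom_xml)
    return [
        " ".join(entry.findtext("atom:title", default="", namespaces=ATOM).split())
        for entry in root.findall("atom:entry", ATOM)
    ]
```

Titles in the feed can contain line breaks, which is why the whitespace is normalized before use.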
Crossref API:
Provides access to metadata for scholarly publications via DOIs (citations, abstracts, funding data, etc.).
Enables retrieval of publication information for research and analysis.
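A minimal sketch of the two most common lookups: metadata for a single DOI and a free-text search. The DOI shown is a placeholder, the helper names are hypothetical, and `mailto` is the optional contact parameter Crossref suggests for well-behaved clients.

```python
import urllib.parse

CROSSREF_API = "https://api.crossref.org/works"

def work_url(doi):
    """URL for the metadata record of one DOI (slashes in the DOI are kept)."""
    return CROSSREF_API + "/" + urllib.parse.quote(doi, safe="/")

def search_url(query, rows=5, mailto=None):
    """URL for a bibliographic search returning at most `rows` results."""
    params = {"query": query, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_API + "?" + urllib.parse.urlencode(params)
```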
Medline API (PubMed API / E-utilities):
Provides access to the Medline database of biomedical literature citations and abstracts via NCBI E-utilities.
Allows retrieval of abstracts, author information, MeSH terms, etc.
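E-utilities access usually takes two steps: ESearch to turn a query into PubMed IDs, then EFetch to retrieve records for those IDs. A minimal sketch of building both requests follows; the helper names are hypothetical, and an API key plus the usage guidelines apply to real workloads.

```python
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=5):
    """Step 1: search PubMed and get matching PMIDs back as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(params)

def efetch_url(pmids):
    """Step 2: fetch plain-text abstracts for a list of PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?" + urllib.parse.urlencode(params)
```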
PMC API (PubMed Central APIs / OAI):
Provides access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.
Enables retrieval of full-text articles and metadata, often via OAI-PMH or E-utilities.
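OAI-PMH requests are plain HTTP GETs with a `verb` and a few standard arguments. The sketch below builds such requests for any OAI-PMH service; the base URL and identifier in the usage line are placeholders, so take PMC's actual base URL and metadata prefixes from its current documentation. The helper name is hypothetical.

```python
import urllib.parse

# The six verbs defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH request URL; `arguments` holds keys such as
    'identifier', 'metadataPrefix', 'set', 'from', 'until'."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urllib.parse.urlencode(params)

# Placeholder endpoint and identifier, for illustration only.
example = oai_request_url(
    "https://example.org/oai",
    "GetRecord",
    {"identifier": "oai:example.org:1", "metadataPrefix": "oai_dc"},
)
```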
DataCite API:
Provides access to DataCite's metadata registry for research data via DOIs.
Supports REST, GraphQL, and OAI-PMH interfaces.
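A minimal sketch of the REST interface: one URL for a single DOI record and one for a paginated search. The DOI is a placeholder and the helper names are hypothetical; `page[size]` follows the JSON:API-style pagination the REST API uses.

```python
import urllib.parse

DATACITE_API = "https://api.datacite.org"

def doi_url(doi):
    """URL for the metadata record of one DataCite DOI."""
    return f"{DATACITE_API}/dois/" + urllib.parse.quote(doi, safe="/")

def search_url(query, page_size=5):
    """URL for a DOI search returning `page_size` records per page."""
    params = {"query": query, "page[size]": page_size}
    return f"{DATACITE_API}/dois?" + urllib.parse.urlencode(params)
```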
ORCID API:
Provides access to ORCID registry data (researcher identifiers and profile information).
Offers public and member APIs for retrieving and (for members) updating ORCID records.
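Before calling the API it helps to validate an ORCID iD locally: the last character is a check digit computed with ISO 7064 MOD 11-2. The sketch below implements that check and builds a public-API record URL. The helper names are hypothetical, and the v3.0 path and any access requirements should be confirmed in the ORCID documentation.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid):
    """True if the iD has 16 characters (hyphens ignored) and a correct check digit."""
    chars = orcid.replace("-", "")
    base = chars[:15]
    if len(chars) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == chars[15].upper()

def public_record_url(orcid):
    """URL of the public record; request it with 'Accept: application/json'."""
    if not is_valid_orcid(orcid):
        raise ValueError("invalid ORCID iD")
    return f"https://pub.orcid.org/v3.0/{orcid}/record"
```

The iD 0000-0002-1825-0097, used in ORCID's own examples, passes this check.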
ISCC Codes:
International Standard Content Code: A decentralized standard for content identification.
Creates digital fingerprints based on content similarity for various digital media.
Croissant Specifications:
A standard metadata format (JSON-LD based) for describing datasets, especially for machine learning.
Aims to simplify sharing and usage by providing a common structure.
Enables interoperability between tools and platforms.
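To make the common structure concrete, here is a sketch that assembles a Croissant-style JSON-LD description for a single CSV file. It is deliberately abbreviated: the `@context` is shortened, required properties may be missing, and the dataset name, URL, and column names are hypothetical, so the output should be checked against the specification before use.

```python
import json

def minimal_croissant(name, description, csv_url, columns):
    """Describe one CSV file and its columns in a Croissant-like layout."""
    file_id = "data.csv"
    fields = [
        {
            "@type": "cr:Field",
            "name": column,
            "dataType": "sc:Text",
            "source": {"fileObject": {"@id": file_id}, "extract": {"column": column}},
        }
        for column in columns
    ]
    return {
        # Abbreviated context; the specification defines a longer one.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [{"@type": "cr:RecordSet", "name": "records", "field": fields}],
    }

doc = minimal_croissant(
    "toy-dataset", "Hypothetical example.", "https://example.org/data.csv", ["gene", "score"]
)
text = json.dumps(doc, indent=2)
```

The split between `distribution` (where the files are) and `recordSet` (how to read records out of them) is what lets different tools load the same dataset without custom code.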
MLCommons - Croissant and GeoCroissant:
Organization developing standards for dataset description (Croissant) and geospatial data (GeoCroissant).
GeoCroissant is noted as being under development.
ESIP Science on Schema.org:
Guidelines and extensions for using Schema.org vocabulary to describe scientific datasets and research artifacts.
Promotes FAIR principles for data discovery.
CODATA CDIF:
Cross-Domain Interoperability Framework: A framework being developed by CODATA for indexing scientific data across disciplines.
Bioschemas:
Community effort extending Schema.org for marking up life sciences data on the web.
Aims to improve the findability and interoperability of biological data.
MinIO:
High-performance, distributed object storage system.
API compatible with Amazon S3, suitable for large datasets (e.g., ML, backups).
LanceDB:
Embeddable vector database designed for AI applications.
Optimized for similarity search, vector embeddings, and managing ML data.
BAML:
Boundary AI Markup Language: A domain-specific language for defining LLM prompts as typed functions that return structured output.
BixBench:
Comprehensive benchmark for evaluating LLM-based agents in bioinformatics tasks.
Contains 53 real-world analytical scenarios with nearly 300 open-answer questions.
Tests agents' abilities in complex multi-step analyses.
Available as a Hugging Face dataset.
Sakana.ai: First AI-Generated Peer-Reviewed Publication:
Paper generated by AI Scientist framework passed peer review at an ICLR workshop without human modifications.
Paper topic: compositional regularization challenges.
Reported average score above acceptance threshold (6.33).
Emphasized full transparency with IRB approval.
Future House Research:
Organization focused on AI applications in scientific discovery.
Works on the automation of scientific processes.
Aims to accelerate scientific breakthroughs using AI.
Lab-in-the-loop: Therapeutic Antibody Design:
Research demonstrating combining generative ML models with experimental feedback cycles.
Reported 3-100× improvement in binding affinity for lead antibody molecules.
Balances exploration and exploitation of the antibody sequence space.
(Note on the Medline API entry: the link points to the general API page; E-utilities are the primary access method.)