Important Datasets and Code Repositories
Resource Guide
Open-Source ML Models
txGemma:
Collection of open models for therapeutics development.
Built on Gemma 2, trained with 7 million examples.
Available in three sizes (2B, 9B, 27B) with specialized versions.
27B model outperforms single-task models in 50 of 66 tasks.
Designed for further fine-tuning with proprietary data.
Evo2: Foundational Model for Genome Modeling:
Trained on 9.3 trillion nucleotides from 128,000 genomes.
Over 90% accuracy in separating benign from potentially pathogenic BRCA1 mutations.
Can process sequences up to 1 million nucleotides.
Applications: genetic analysis, disease mutation detection, gene therapy design.
Open-source with publicly available training data, code, and weights.
Multi-Agent Frameworks
AgentRxiv: Collaborative Autonomous Research:
Centralized platform for autonomous research agents.
Enables knowledge sharing through similarity-based search.
Reached 78.2% accuracy on the MATH-500 benchmark (vs 73.8% without the platform).
Demonstrated generalization across different benchmarks and models.
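The similarity-based search mentioned above can be illustrated with a small sketch. It ranks stored paper summaries against a query using bag-of-words cosine similarity. This is an assumption made for illustration: AgentRxiv's actual retrieval method is not described here, and production systems typically use learned embeddings. The function names and the `papers` entries are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, papers, top_k=2):
    """Return up to top_k titles with non-zero similarity, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), title) for title, text in papers.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_k] if score > 0]


# Hypothetical shared archive: title -> summary written by an agent.
papers = {
    "prompt-voting": "majority voting over sampled reasoning chains improves math accuracy",
    "tool-use": "agents call a calculator tool for arithmetic steps",
    "protein-design": "language models propose nanobody sequences",
}

results = search("reasoning chains for math problems", papers)
print(results)
```

An agent would run such a search before starting a new experiment, so that it builds on earlier reports instead of repeating them.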
AI Scientist: Automated Scientific Discovery:
Framework for hypothesis generation, experiment design, and paper writing.
Produces research papers with minimal human intervention.
Cost-effective ($15 per paper).
Applications: diffusion models, language modeling, learning dynamics.
Curie: Rigorous Scientific Experimentation:
Features Architect Agent for planning and Technician Agents for execution.
3.4× improvement in answering experimental questions.
Enforces experimental discipline while maintaining creativity.
The Virtual Lab: Nanobody Design:
Multi-agent collaboration with minimal human input (1.3% of total words).
Designed 92 nanobodies with >90% expressing as soluble proteins.
Combines agents with different expertise to solve complex challenges.
Aviary: Training Language Agents for Scientific Tasks:
Open-source agents matching frontier LLMs at lower cost.
Handles molecular cloning, literature research, protein engineering.
Uses stochastic computation graph framework.
Google AI Co-scientist:
Multi-agent AI system built with Gemini 2.0.
Generates and evaluates research hypotheses through iterative reasoning.
Promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.
Available to research organizations via a Trusted Tester Programme.
Popper: Automated Hypothesis Validation:
Sequential falsification approach using LLM agents.
Designs experiments, executes tests, and analyzes results.
10× faster than human scientists with comparable accuracy.
Maintains strict Type-I error control.
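One standard way to keep Type-I error under control while running falsification tests one after another is to accumulate evidence as e-values. The sketch below is a minimal illustration under stated assumptions, not Popper's implementation: each test is assumed to yield a valid p-value given the earlier tests, the p-to-e calibrator `kappa * p**(kappa - 1)` is a textbook choice, and the names `p_to_e`, `sequential_falsification`, `kappa`, and `alpha` are hypothetical.

```python
def p_to_e(p, kappa=0.5):
    """Calibrate a p-value into an e-value: e = kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(p_values, alpha=0.1):
    """Multiply e-values test by test; stop once the product reaches 1/alpha.

    Under the null, the running product is a non-negative supermartingale,
    so the chance it ever reaches 1/alpha is at most alpha (Ville's inequality).
    """
    evidence = 1.0
    for i, p in enumerate(p_values, start=1):
        evidence *= p_to_e(p)
        if evidence >= 1.0 / alpha:
            return {"rejected_null": True, "tests_used": i, "evidence": evidence}
    return {"rejected_null": False, "tests_used": len(p_values), "evidence": evidence}


# Illustrative p-values from three falsification experiments.
strong = sequential_falsification([0.01, 0.04, 0.5])
weak = sequential_falsification([0.5, 0.5])
print(strong["rejected_null"], strong["tests_used"])
print(weak["rejected_null"])
```

Because the stopping rule is built into the evidence product, the agent can stop as soon as the threshold is crossed without inflating the false-positive rate.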
YesNoError: Scientific Literature Auditing:
Multi-agent system for detecting errors in scientific papers.
Checks mathematics, methodology, references, and logical consistency.
Uses synthetic data pipeline to improve detection accuracy.
Token-based economy ($YNE) for requesting audits.
Talk2Biomodels: Conversational Biological Modeling:
Natural language interface for exploring biological models.
Supports SBML format and BioModels database.
Features time-course simulations, steady-state analysis, parameter scanning.
Uses retrieval-augmented generation to prevent hallucinations.
Knowledge Graphs and Databases
HALD: Human Aging and Longevity Knowledge Graph:
Contains 12,000+ entities, 115,000+ relations.
Extracted from 340,000 PubMed articles.
Provides structured exploration of aging and longevity biomarkers.
NebulaGraph: Rhinitis Knowledge Graph:
Leverages LLMs to extract knowledge from Chinese medical records.
Five-step methodology for knowledge extraction.
ChatGPT-4 achieved 82.75% F1 score in knowledge extraction.
BioCypher: Knowledge Graph Framework:
Integrates with BioChatter and biomedical datasets.
Flexible ontology structures and adaptable output formats.
Supports multiple formats (RDF, SQL, NetworkX).
Customizable via YAML configuration.
KGARevion: Knowledge Graph-Based AI Agent:
Multi-step process for biomedical question answering.
5.2% improvement over baseline approaches.
Strong zero-shot generalization to underrepresented medical contexts.
Nanopublication Network:
System for publishing and sharing self-contained units of scientific information.
Each nanopublication bundles an assertion, its provenance, and publication information.
Improves transparency and reproducibility of scientific research.
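The three-part structure described above can be sketched as a small data model. Real nanopublications are RDF documents with separate named graphs for each part; the class below is only a simplified stand-in, and every identifier prefixed with `ex:` is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Nanopublication:
    assertion: List[Triple]   # the scientific claim itself
    provenance: List[Triple]  # where the claim came from
    pubinfo: List[Triple]     # who published the nanopublication, and when

    def is_complete(self) -> bool:
        """A nanopublication needs all three parts to be non-empty."""
        return bool(self.assertion and self.provenance and self.pubinfo)


example = Nanopublication(
    assertion=[("ex:geneX", "ex:isAssociatedWith", "ex:diseaseY")],
    provenance=[("ex:assertion1", "prov:wasDerivedFrom", "ex:study42")],
    pubinfo=[("ex:nanopub1", "dcterms:creator", "ex:alice")],
)
print(example.is_complete())
```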
Graphiti:
Builds temporally aware knowledge graphs for AI agents, tracking how relationships and context change over time.
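To make the idea of a temporally aware graph concrete, the sketch below stores a validity interval on each edge and answers "what was true on a given day". It does not use Graphiti's API; `TemporalEdge`, `facts_on`, and the sample facts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalEdge:
    source: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact still holds

    def holds_on(self, day: date) -> bool:
        """True if the edge is valid on the given day (end date exclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def facts_on(edges, day):
    """Return the (source, relation, target) triples valid on a given day."""
    return [(e.source, e.relation, e.target) for e in edges if e.holds_on(day)]


edges = [
    TemporalEdge("alice", "works_at", "lab_a", date(2022, 1, 1), date(2024, 1, 1)),
    TemporalEdge("alice", "works_at", "lab_b", date(2024, 1, 1)),
]

print(facts_on(edges, date(2023, 6, 1)))
print(facts_on(edges, date(2024, 6, 1)))
```

Closing an old edge instead of deleting it keeps the history available, which is what lets an agent reason about how a relationship changed.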
Memary:
Gives AI Agents human-like memory capabilities. Tracks entity knowledge, preferences, and chat history in a knowledge graph that automatically updates as your agent interacts with users.
Cognee:
Python library that brings together knowledge graphs and RAG to build evolving semantic memory for AI agents and apps. Uses dynamic knowledge graphs to maintain relationships between different pieces of information.
AI Platforms & Tools
BioChatter: Biomedical LLM Platform:
Open-source framework for biomedical LLM applications.
Integrates knowledge, retrieval-augmented generation, model chaining.
Privacy-preserving with local open-source LLMs.
Connects to BioCypher knowledge graphs.
AMIE: Medical Reasoning System:
Two-agent architecture: Dialogue Agent and Mx Agent.
Grounds recommendations in clinical guidelines.
Performed well on RxQA medication reasoning benchmark.
Elicit:
AI-powered research tool for extracting information from academic papers.
Summarizes key findings, identifies related research.
Helps users grasp core concepts of complex scientific literature.
Offers free trial.
PaperQA:
Retrieval-augmented generation (RAG) tool by FutureHouse.
Answers questions based on scientific paper collections.
Helps researchers process large amounts of scientific literature.
BixBench:
Comprehensive benchmark for LLM-based agents in bioinformatics.
53 real-world analytical scenarios with nearly 300 open-answer questions.
Tests agents' abilities to perform complex multi-step analyses.
Available on Hugging Face.
BeeARD:
Automated hypothesis generation and validation at scale.
Powered by multi-agent AI systems and knowledge graphs.
Project token can be found at https://dexscreener.com/base/0xa02567fc557C6a409464EC40480b9F5660a991B3
Datasets
Tahoe-100M: Single-Cell Perturbation Atlas:
Maps 60,000 drug-cell interactions across 50 cancer cell lines.
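About 100 million single-cell profiles, as the name indicates (larger than existing public resources); the 300 million figure refers to the broader Arc Virtual Cell Atlas that includes it.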
Combines natural cell states with deliberately perturbed cells.
Open-source to catalyze biological modeling.
Croissant Dataverse Metadata Extraction:
Export of public metadata records in Croissant format for ML-ready datasets.
JSON-LD format.
Peer-Reviewed AI Research
Sakana.ai: First AI-Generated Peer-Reviewed Publication:
Passed peer review at an ICLR workshop without human modifications.
Paper on compositional regularization challenges.
Scored above acceptance threshold (6.33 average).
Full transparency with IRB approval.
Self-Organizing Graphs Reasoning:
Graph reasoning systems evolve towards a critical state.
Critical Discovery Parameter stabilizes at -0.03.
12% of connections form between semantically distant concepts.
Demonstrates scale-free and small-world properties.
Future House Research:
Organization dedicated to AI applications in scientific discovery.
Focuses on automation of scientific processes.
Aims to accelerate scientific breakthroughs through AI.
Drug Discovery & Therapeutic Design
Lab-in-the-loop: Therapeutic Antibody Design:
Combines generative ML models with experimental feedback.
3-100× improvement in binding affinity for lead molecules.
Balances exploration and exploitation of antibody sequence space.
APIs and Data Access
bioRxiv API:
Programmatic access to pre-print articles in biological sciences.
Enables data mining, trend analysis, and application development.
Helps researchers stay current in rapidly evolving fields.
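As a rough sketch of programmatic access, the snippet below builds a date-range request against the bioRxiv `details` endpoint and pulls titles out of a response. The URL pattern (server, start date, end date, cursor) and the `collection` key reflect the public API as commonly documented, but treat them as assumptions and check the current API page before relying on them. The sample payload is made up, and `fetch` is defined but not called here.

```python
import json
from urllib.request import urlopen

BASE = "https://api.biorxiv.org/details"


def details_url(server, start, end, cursor=0):
    """Build a URL for preprints posted between two ISO dates.

    `cursor` is the paging offset; increase it to walk through long result sets.
    """
    return f"{BASE}/{server}/{start}/{end}/{cursor}"


def titles(payload):
    """Extract preprint titles from a decoded JSON response."""
    return [item["title"] for item in payload.get("collection", [])]


def fetch(url):
    """Download and decode one page of results (performs a network call)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


url = details_url("biorxiv", "2024-01-01", "2024-01-07")
sample = {"collection": [{"title": "Placeholder preprint", "doi": "10.1101/placeholder"}]}
print(url)
print(titles(sample))
```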
OpenAlex API:
Comprehensive, open-source index of scholarly works.
Metadata about publications, authors, institutions, and concepts.
Free alternative to proprietary citation databases.
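A minimal sketch of querying the index: the snippet builds a full-text search URL for the OpenAlex `works` endpoint and summarizes a response. The `search`, `per-page`, and `mailto` parameters and the `results`, `display_name`, `publication_year`, and `cited_by_count` fields follow the OpenAlex documentation as I understand it; the sample payload and email address are invented for illustration.

```python
from urllib.parse import urlencode


def works_search_url(query, per_page=5, mailto=None):
    """Build a search URL for scholarly works.

    Passing `mailto` identifies the caller, which OpenAlex recommends.
    """
    params = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return "https://api.openalex.org/works?" + urlencode(params)


def summarize(payload):
    """Return (title, year, citation count) for each work in a response."""
    return [
        (w["display_name"], w["publication_year"], w["cited_by_count"])
        for w in payload["results"]
    ]


url = works_search_url("single-cell perturbation")
sample = {
    "results": [
        {"display_name": "Placeholder work", "publication_year": 2024, "cited_by_count": 3}
    ]
}
print(url)
print(summarize(sample))
```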
General Index:
Index of words and n-grams extracted from over 100 million journal articles.
Hosted on the Internet Archive.
Facilitates research across diverse topics.
ArXiv API:
Programmatic access to the arXiv repository of electronic preprints.
Enables retrieval of metadata and full-text articles.
Useful for research in physics, mathematics, computer science, and related fields.
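The arXiv API returns an Atom XML feed, so retrieval comes down to building a query URL and parsing entries. The sketch below uses only the standard library; the sample feed is a trimmed, made-up stand-in for a real response, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def query_url(terms, max_results=5):
    """Build a search URL that matches the terms against all fields."""
    params = {"search_query": f"all:{terms}", "start": 0, "max_results": max_results}
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_feed(xml_text):
    """Return (title, id) pairs from an Atom feed, collapsing whitespace in titles."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.findtext("atom:id", default="", namespaces=ATOM)
        entries.append((" ".join(title.split()), link))
    return entries


SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A placeholder
      title</title>
  </entry>
</feed>"""

url = query_url("knowledge graphs")
parsed = parse_feed(SAMPLE)
print(url)
print(parsed)
```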
Crossref API:
Provides metadata for scholarly publications, including DOI, citations, and abstracts.
Enables retrieval of publication information for research and analysis.
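Looking up a single record by DOI is the simplest Crossref request. The sketch below builds the URL for the `works` route and reads the title from a response. The `message` and `title` keys follow the Crossref REST API's JSON layout as I understand it; the DOI and payload shown are placeholders.

```python
from urllib.parse import quote


def work_url(doi):
    """Build the metadata URL for one DOI, percent-encoding the slash."""
    return "https://api.crossref.org/works/" + quote(doi, safe="")


def first_title(payload):
    """Return the first title in a Crossref work record, or None if absent."""
    found = payload.get("message", {}).get("title", [])
    return found[0] if found else None


url = work_url("10.1234/example.doi")
sample = {"status": "ok", "message": {"DOI": "10.1234/example.doi", "title": ["Placeholder title"]}}
print(url)
print(first_title(sample))
```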
Medline API (PubMed API):
Access to the Medline database of biomedical literature citations.
Allows retrieval of abstracts, author information, and MeSH terms.
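PubMed records are usually retrieved through the NCBI E-utilities in two steps: `esearch` returns matching PubMed IDs, and `efetch` returns the records for those IDs. The sketch below only builds the URLs and parses a search response; the sample payload and IDs are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=5):
    """Step 1: search PubMed and ask for matching IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """Step 2: fetch plain-text abstracts for a list of PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


def pmids_from(payload):
    """Pull the ID list out of a decoded esearch JSON response."""
    return payload["esearchresult"]["idlist"]


sample = {"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}
search_url = esearch_url("longevity biomarkers")
fetch_url = efetch_url(pmids_from(sample))
print(search_url)
print(fetch_url)
```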
PMC API:
Access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.
Enables retrieval of full-text articles and metadata.
Standards and Specifications
ISCC:
International Standard Content Code.
Decentralized standard for content addressing built on content similarity.
Creates digital fingerprints for any type of digital content.
Croissant Specifications:
Standard format for describing datasets, particularly for machine learning.
Simplifies sharing and using datasets through common metadata structure.
Enables interoperability between different tools and platforms.
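To show what a common metadata structure looks like in practice, the sketch below assembles a heavily simplified Croissant-style JSON-LD description of a single CSV file. The `@context` is abbreviated and the output has not been validated against the specification, so treat it as an outline of the shape (dataset, file object, record set, fields) and not as a conforming document. The dataset name, URL, and column names are placeholders.

```python
import json


def croissant_sketch(name, csv_url, columns):
    """Describe one CSV file and its columns in a Croissant-like layout."""
    file_id = "data.csv"
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "name": name,
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        # distribution: the raw files that make up the dataset
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # recordSet: how the file contents map to typed fields
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "rows",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"rows/{col}",
                        "name": col,
                        "dataType": "sc:Text",
                        "source": {"fileObject": {"@id": file_id}, "extract": {"column": col}},
                    }
                    for col in columns
                ],
            }
        ],
    }


doc = croissant_sketch("placeholder-dataset", "https://example.org/data.csv", ["gene", "drug"])
print(json.dumps(doc, indent=2))
```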
Terms and Ontologies
ICD-10:
International Classification of Diseases, 10th Revision.
Standard diagnostic tool for epidemiology, health management and clinical purposes.
LOINC:
Logical Observation Identifiers Names and Codes.
Universal identifiers for medical laboratory observations.
SNOMED CT:
Systematized Nomenclature of Medicine - Clinical Terms.
Comprehensive clinical healthcare terminology.
OMOP:
Observational Medical Outcomes Partnership.
Common Data Model for observational health data.
PCORnet:
Patient-Centered Outcomes Research Network.
National network of patient-centered clinical research networks.
Bioontology:
NCBO BioPortal, a repository for browsing and searching biomedical ontologies and terminologies.
Tools
Neo4j:
Graph database management system.
Good for storing and querying connected data.
GQL:
Graph Query Language, standardized as ISO/IEC 39075.
Used for querying property graph databases.
GraphRAG:
Builds a knowledge graph from text with LLMs and uses it for retrieval-augmented generation.
igraph:
Network analysis library with R, Python, and C interfaces.
Commonly used for graph manipulation in R.
NetworkX:
Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.