Important Datasets and Code Repositories

Resource Guide

AI Models for Science & Therapeutics

  • txGemma:

  • Evo2: Foundational Model for Genome Modeling:

    • Trained on 9.3 trillion nucleotides from 128,000 genomes.

    • Reported 90% accuracy in detecting disease-causing mutations.

    • Can process sequences up to 1 million nucleotides.

    • Applications: genetic analysis, disease mutation detection, gene therapy design.

    • Open-source with publicly available training data, code, and weights.

AI Agents & Frameworks for Scientific Discovery

  • AgentRxiv: Collaborative Autonomous Research:

    • Centralized platform for autonomous research agents.

    • Enables knowledge sharing through similarity-based search.

    • Reported 78.2% accuracy on the MATH-500 benchmark (vs. 73.8% without the platform).

    • Demonstrated generalization across different benchmarks and models.

  • AI Scientist: Automated Scientific Discovery:

    • Framework for hypothesis generation, experiment design, and paper writing.

    • Produces research papers with minimal human intervention.

    • Reported cost of roughly $15 per paper.

    • Applications: diffusion models, language modeling, learning dynamics.

  • Curie: Rigorous Scientific Experimentation:

    • Features Architect Agent for planning and Technician Agents for execution.

    • Reported 3.4× improvement in answering experimental questions.

    • Enforces experimental discipline while maintaining creativity.

  • The Virtual Lab: Nanobody Design:

  • Aviary: Training Language Agents for Scientific Tasks:

    • Open-source agents aiming to match frontier LLMs at lower cost for specific tasks.

    • Handles molecular cloning, literature research, protein engineering.

    • Uses stochastic computation graph framework.

  • Google AI Co-scientist:

    • Multi-agent AI system built with Gemini 2.0.

    • Generates and evaluates research hypotheses through iterative reasoning.

    • Has shown promising results in drug repurposing, liver fibrosis treatment, and antimicrobial resistance.

    • Available to research organizations via a Trusted Tester Programme.

  • Popper: Automated Hypothesis Validation:

    • Sequential falsification approach using LLM agents.

    • Designs experiments, executes tests, and analyzes results.

    • Reported to be 10× faster than human scientists at comparable accuracy.

    • Maintains strict Type-I error control.

  • YesNoError: Scientific Literature Auditing:

    • Multi-agent system for detecting errors in scientific papers.

    • Checks mathematics, methodology, references, and logical consistency.

    • Uses synthetic data pipeline to improve detection accuracy.

    • Describes a token-based economy ($YNE) through which users request audits.

  • Talk2Biomodels: Conversational Biological Modeling:

    • Natural language interface for exploring biological models.

    • Supports SBML format and BioModels database.

    • Features time-course simulations, steady-state analysis, parameter scanning.

    • Uses retrieval-augmented generation to prevent hallucinations.

  • BioChatter: Biomedical LLM Platform:

    • Open-source framework for biomedical LLM applications.

    • Combines knowledge integration, retrieval-augmented generation, and model chaining.

    • Designed for privacy-preserving use with local open-source LLMs.

    • Connects to BioCypher knowledge graphs.

  • Aime: Medical Reasoning System:

  • Elicit:

    • AI-powered research tool for extracting information from academic papers.

    • Summarizes key findings, identifies related research.

    • Helps users grasp core concepts of complex scientific literature.

    • Offers a free trial.

  • PaperQA:

    • Retrieval-augmented generation (RAG) tool by FutureHouse.

    • Answers questions based on scientific paper collections.

    • Helps researchers process large amounts of scientific literature.

  • BeeARD:

  • KGARevion: Knowledge Graph-Based AI Agent:

    • Multi-step process for biomedical question answering using KGs.

    • Reported 5.2% improvement over baseline approaches.

    • Shows strong zero-shot generalization to underrepresented medical contexts.
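The Popper entry above describes sequential falsification with strict Type-I error control. One standard way to get that guarantee is to convert each test's p-value into an e-value and stop once the running product reaches 1/alpha. The sketch below illustrates that idea only. The function names, the calibrator e = kappa * p^(kappa - 1), and the defaults for `kappa` and `alpha` are assumptions made for illustration, not Popper's actual code.

```python
from typing import Iterable, Tuple


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value via e = kappa * p**(kappa - 1).

    This calibrator integrates to 1 over (0, 1], so the result is a valid
    e-value under the null hypothesis. kappa is an illustrative default.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(
    p_values: Iterable[float], alpha: float = 0.1, kappa: float = 0.5
) -> Tuple[bool, int, float]:
    """Multiply e-values test by test and stop once the product reaches 1/alpha.

    Returns (rejected, tests_used, final_e_product). Rejecting the null only
    when the product is at least 1/alpha keeps the Type-I error at or below
    alpha, regardless of when the sequence of tests stops.
    """
    threshold = 1.0 / alpha
    e_product = 1.0
    used = 0
    for p in p_values:
        used += 1
        e_product *= p_to_e(p, kappa)
        if e_product >= threshold:
            return True, used, e_product
    return False, used, e_product
```

With `kappa = 0.5`, p-values of 0.01 and 0.04 give e-values of 5 and 2.5. Their product, 12.5, exceeds the threshold 1/alpha = 10 at alpha = 0.1, so the procedure stops after two tests.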

Knowledge Graphs: Platforms & Concepts

  • HALD: Human Aging and Longevity Knowledge Graph:

  • NebulaGraph: Rhinitis Knowledge Graph Example:

  • BioCypher: Knowledge Graph Framework:

    • Framework for building KGs that integrates with BioChatter and biomedical datasets.

    • Features flexible ontology structures and adaptable output formats.

    • Supports multiple output formats (RDF, SQL, NetworkX).

    • Customizable via YAML configuration.

  • Nanopublication Network:

    • System for publishing and sharing self-contained units of scientific information.

    • Each "nanopublication" contains assertions, provenance, and publication information.

    • Aims to improve transparency and reproducibility of scientific research.

  • Graphiti:

  • Memary:

  • Cognee:

    • Python library combining knowledge graphs and RAG.

    • Builds evolving semantic memory for AI agents/apps using dynamic KGs.

  • Wikidata:

  • OpenAIRE Graph:

    • A large-scale research information knowledge graph.

    • Connects publications, datasets, software, authors, institutions.

  • Open Research Knowledge Graph (ORKG):

    • Platform for structured representation of scientific knowledge.

    • Aims to make research contributions machine-actionable.

  • Connected Papers:

    • Visual tool for exploring research paper connections and discovering related literature.

    • Creates graph visualizations based on citations and similarity.

  • Research Graph:

    • Focuses on building a global research collaboration and citation network.

    • Connects researchers, institutions, publications, grants, and datasets.

  • UniProt:

    • Comprehensive, high-quality resource for protein sequence and functional information.

    • Widely used database in bioinformatics and life sciences.

  • Self-Organizing Graphs Reasoning:

    • Research suggesting graph reasoning systems evolve towards a critical state.

    • Observed a stable Critical Discovery Parameter (-0.03).

    • Found 12% of connections form between semantically distant concepts.

    • Demonstrates scale-free and small-world properties in emergent graphs.
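The Nanopublication Network entry above names three parts of a nanopublication: assertions, provenance, and publication information. As a rough sketch of that structure, the three parts can be modeled as separate lists of subject-predicate-object triples. The class name, field names, and `example.org` identifiers below are hypothetical and chosen for illustration. Real nanopublications are serialized as RDF named graphs, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A triple is (subject, predicate, object), written here as plain strings.
Triple = Tuple[str, str, str]


@dataclass
class Nanopublication:
    """Hypothetical in-memory model of one nanopublication."""

    uri: str
    assertion: List[Triple] = field(default_factory=list)   # the scientific claim
    provenance: List[Triple] = field(default_factory=list)  # where the claim came from
    pubinfo: List[Triple] = field(default_factory=list)     # who published it and when

    def missing_parts(self) -> List[str]:
        """Return the names of any of the three parts that are still empty."""
        parts = {
            "assertion": self.assertion,
            "provenance": self.provenance,
            "pubinfo": self.pubinfo,
        }
        return [name for name, triples in parts.items() if not triples]


# Illustrative instance; all identifiers are placeholders.
example = Nanopublication(
    uri="http://example.org/np/1",
    assertion=[("ex:GeneA", "ex:isAssociatedWith", "ex:DiseaseB")],
    provenance=[("ex:assertion1", "prov:wasDerivedFrom", "ex:Study42")],
    pubinfo=[("ex:np1", "dcterms:created", "2024-01-01")],
)
```

Keeping the three parts separate is what makes each unit self-contained: a consumer can check the claim, its origin, and its publication metadata without consulting any other record.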

Knowledge Graphs: Tools & Databases

Data Repositories & Platforms

Data Access APIs

  • bioRxiv API:

    • Provides programmatic access to pre-print articles from bioRxiv.

    • Enables data mining, trend analysis, and application development for biological sciences literature.

  • OpenAlex API:

  • ArXiv API:

    • Provides programmatic access to the arXiv repository of electronic preprints (physics, math, CS, etc.).

    • Enables retrieval of metadata and full-text articles.

  • Crossref API:

  • Medline API (PubMed API / E-utilities):

    • Provides access to the Medline database of biomedical literature citations and abstracts via NCBI E-utilities.

    • Allows retrieval of abstracts, author information, MeSH terms, etc.

    • https://pubmed.ncbi.nlm.nih.gov/api/ (Note: this links to the general API page; E-utilities are the primary access method.)

  • PMC API (PubMed Central APIs / OAI):

    • Provides access to the PubMed Central (PMC) full-text archive of biomedical and life sciences journal literature.

    • Enables retrieval of full-text articles and metadata, often via OAI-PMH or E-utilities.

  • DataCite API:

  • ORCID API:
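Two of the entries above, the arXiv API and the PubMed E-utilities, are plain HTTP query interfaces, so a request is just a URL with query parameters. The sketch below builds such URLs with the standard library only. The helper names and default values are illustrative assumptions. The endpoints and parameter names (`db`, `term`, `retmax`, `retmode` for ESearch; `search_query`, `start`, `max_results` for arXiv) follow the public documentation as far as I know and are worth checking against the current docs before use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ARXIV_BASE = "http://export.arxiv.org/api/query"


def pubmed_search_url(term: str, retmax: int = 20) -> str:
    """Build an E-utilities ESearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def arxiv_query_url(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Build an arXiv API query URL; the response is an Atom XML feed."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_BASE}?{urlencode(params)}"


def fetch(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return the body as text (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch(pubmed_search_url("knowledge graph", retmax=5))` would request up to five PubMed IDs. NCBI asks clients without an API key to stay at or below three requests per second, so bulk use should throttle its calls.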

Data & Metadata Standards

Supporting Tools & Infrastructure

  • Minio:

    • High-performance, distributed object storage system.

    • API compatible with Amazon S3, suitable for large datasets (e.g., ML, backups).

  • LanceDB:

  • BAML:

    • A domain-specific language from Boundary for writing LLM prompts as typed functions with structured outputs, aiming to simplify AI application development.

Benchmarking & Evaluation

  • BixBench:

    • Comprehensive benchmark for evaluating LLM-based agents in bioinformatics tasks.

    • Contains 53 real-world analytical scenarios with nearly 300 open-answer questions.

    • Tests agents' abilities in complex multi-step analyses.

    • Available as a Hugging Face dataset.

Research Highlights & Organizations

  • Sakana.ai: First AI-Generated Peer-Reviewed Publication:

    • A paper generated by the AI Scientist framework passed peer review at an ICLR workshop without human modifications.

    • Paper topic: compositional regularization challenges.

    • Reported average score above acceptance threshold (6.33).

    • Emphasized full transparency with IRB approval.

  • FutureHouse Research:

    • Organization focused on AI applications in scientific discovery.

    • Works on the automation of scientific processes.

    • Aims to accelerate scientific breakthroughs using AI.

  • Lab-in-the-loop: Therapeutic Antibody Design:
