🧬

Under Review

Bioinformatics

Use computation to extract meaningful information from biological data—where analysis and interpretation are the bottleneck, not data generation.

Part 1

Industry Landscape

At its core, bioinformatics is simple: use computation to extract meaningful information from biological data. In a modern genomics lab, wet-lab work takes roughly 20% of the project timeline, while computational analysis takes roughly 80%.

$14–18B

Global market size (2024)

$40–60B

Projected size (2033)

11–15%

CAGR (typical estimates)

36+ PB

NCBI SRA total scale
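
The market figures above can be cross-checked with a little arithmetic. The sketch below computes the compound annual growth rate implied by the quoted 2024 and 2033 ranges, assuming a 9-year horizon (2024 to 2033); the function name and the pairing of range endpoints are illustrative choices, not part of any cited estimate.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, end value, and horizon."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted above, in billions of USD.
start_low, start_high = 14.0, 18.0  # 2024 market size range
end_low, end_high = 40.0, 60.0      # 2033 projected range
years = 2033 - 2024                 # assumed 9-year horizon

# Pair like with like (low to low, high to high), then the two extreme pairings.
low_to_low = implied_cagr(start_low, end_low, years)
high_to_high = implied_cagr(start_high, end_high, years)
slowest = implied_cagr(start_high, end_low, years)
fastest = implied_cagr(start_low, end_high, years)

for label, rate in [
    ("low -> low", low_to_low),
    ("high -> high", high_to_high),
    ("slowest pairing", slowest),
    ("fastest pairing", fastest),
]:
    print(f"{label}: {rate:.1%}")
```

Pairing like with like gives roughly 12% and 14% per year, which sits inside the quoted 11–15% band; the extreme pairings (about 9% and 18%) bracket it.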

Three-Layer Industry Structure

Application Layer

Who uses it

Pharma/Biotech (~42%): Target discovery, companion Dx, trial stratification

Clinical Diagnostics (~14% CAGR): Rare disease, tumor genotyping, pharmacogenomics

Academic Research (~30%): Basic discovery, method development

Agrigenomics (Emerging): Breeding, resistance gene screening

Public Health (Growing): Pathogen surveillance, microbiome

Analysis Layer

What is done

Genomics: WGS/WES variant calling

Transcriptomics: RNA-seq, differential expression

Proteomics: Mass spec, protein quantification

Single-cell: scRNA-seq, spatial omics

Metagenomics: Microbial communities

Epigenomics: ChIP-seq, methylation

Infrastructure Layer

What is used

Sequencers: Illumina, PacBio, ONT

Compute: HPC clusters, cloud (AWS/GCP)

Databases: NCBI, EMBL-EBI, Ensembl

Tools: GATK, STAR, Scanpy, DESeq2

Workflow Engines: Nextflow, Snakemake, WDL

Bioinformatics is a technology layer that spans many application domains—nearly every step is software-mediated.

Key Players by Layer

🧬

Sequencing Platforms

Illumina: Dominant short reads; NovaSeq X; DRAGEN
PacBio: HiFi long reads; Revio/Vega systems
Oxford Nanopore: Ultra-long reads (100 kb+), real-time
MGI (BGI): DNBSEQ platform; global expansion

🔬

Analysis Ecosystems

Broad Institute: GATK + Terra/FireCloud + Picard
EMBL-EBI: Ensembl + VEP (European infrastructure)
NCBI: SRA + GenBank + ClinVar + PubMed
nf-core: 70+ production-grade Nextflow pipelines

☁️

Cloud Platforms

Terra/FireCloud: Broad, Google Cloud; WDL workflows
DNAnexus: AWS; FedRAMP compliance; clinical
Seven Bridges: AWS, CWL; NCI Cancer Genomics Cloud

The Database Ecosystem: Bioinformatics' "Knowledge Graph"

A key difference from many other industries: core work depends on a large public database network. These databases are both inputs (annotation sources) and verification references.

Database	Operator	Scale	Role
NCBI SRA	NIH/NCBI	36+ PB, 9M+ experiments	Largest raw sequencing repository
GenBank	NCBI	250M+ sequences	Nucleotide sequence reference
UniProt	EBI/SIB/PIR	250M+ proteins	Protein sequences + annotation
PDB	RCSB	~220k structures	3D structure reference
ClinVar	NCBI	2M+ submissions	Clinical variant interpretation
gnomAD	Broad	800k+ individuals	Population allele frequencies
Ensembl	EMBL-EBI	Many organisms	Genome annotation + VEP
GEO	NCBI	1M+ samples	Gene expression datasets

Many bioinformatics tasks are fundamentally "tool execution + database querying." An agent must orchestrate APIs and cross-validate outputs against authoritative references.

Part 2

Why Bioinformatics Fits Agents' Last Exam

A shared trait across bioinformatics roles: nearly all work happens on a computer—terminal commands, Python/R scripts, notebooks, and visualization tools. This makes the domain naturally amenable to agent simulation and evaluation.

Software-Mediated

Nearly every step is software-mediated: CLI tools, scripts, pipelines, APIs

Gold Standards

GIAB, SEQC/MAQC, reference atlases provide authoritative benchmarks

Massive Public Data

36+ PB in NCBI SRA alone; unlimited synthetic generation possible

Standardized Pipelines

nf-core provides 70+ production-grade reference implementations

Deterministic Evaluation

Precision/recall/F1 against truth sets; correlation metrics; ARI scores

API Orchestration

Tasks combine tool execution + database querying + rule application

Part 3

Core Roles & Daily Work

Representative roles in bioinformatics—each with distinct skill profiles and workflow patterns.

Role Profiles

Bioinformatics Scientist

The most general title with broadest coverage

Typical Work

•Design and run NGS pipelines (variant calling, RNA-seq, ChIP-seq)
•Write Python/R scripts for processing and visualization
•Use CLI tools (GATK, STAR, SAMtools)

Environment: Linux servers / HPC / cloud; terminal + Jupyter/RStudio

Computational Biologist

Method-development and modeling oriented (academia-leaning)

Typical Work

•Develop new algorithms and statistical models
•Build ML models for biological prediction
•Large-scale integration (multi-omics)

Environment: HPC / cloud; Python/R + academic tools

Genomic Data Analyst

Execution and delivery focused; processes the largest data volume

Typical Work

•Run standardized pipelines at scale: FASTQ → VCF / expression matrices
•Execute community-standard pipelines (nf-core/sarek, nf-core/rnaseq)
•QC: FastQC → MultiQC → decide pass/fail

Environment: HPC / cloud; workflow engines (Nextflow/Snakemake)

Clinical Bioinformatician

Clinical-facing with strict compliance and validation constraints

Typical Work

•Run validated clinical-grade pipelines (parameters often locked)
•Annotate and filter variants: VEP/ANNOVAR + ClinVar/gnomAD
•Generate clinical reports using ACMG/AMP standards

Environment: Validated pipelines; compliance-focused infrastructure

Single-cell / Spatial Omics Analyst

Emerging specialization driven by single-cell technologies

Typical Work

•10x workflows: Cell Ranger → Scanpy/Seurat
•Clustering, dimensionality reduction (UMAP/t-SNE), differential expression (DE)
•Cell type annotation via reference atlases

Environment: Python (Scanpy) / R (Seurat); GPU for large datasets

Metagenomics Analyst

Microbial community and environmental genomics specialist

Typical Work

•FASTQ → taxonomy profiling → diversity analysis
•Shotgun metagenomics and 16S rRNA analysis
•Functional annotation and pathway analysis

Environment: QIIME2, MetaPhlAn, Kraken; HPC for large cohorts

Roles ↔ Workflows Mapping

Role	Most Typical Daily Workflow	Frequency
Genomic data analyst	FASTQ → alignment → variant calling → VCF	Daily
Transcriptomics analyst	FASTQ → alignment → differential expression → enrichment	Multiple times/week
Single-cell analyst	FASTQ → preprocessing → clustering → cell annotation	Multiple times/week
Clinical bioinformatician	VCF → annotation → pathogenic filtering → report	Daily
Computational biologist	Integration / modeling / new tool development	Project-based
Metagenomics analyst	FASTQ → taxonomy profiling → diversity analysis	Weekly

Typical Team Structure

Typical Genomics Center Team

PI / Director: Sets direction and strategy

2–3 senior bioinformaticians: Design pipelines, solve hard failures

3–5 data analysts: Run pipelines, deliver results (daily throughput engine)

1–2 software engineers: Maintain infrastructure, deployments

1–2 clinical bioinformaticians: Compliance-bound execution (if clinical)

The data-analyst layer is the most standardized, high-frequency, and benchmarkable.

Part 4

Representative Workflows

Four benchmarkable workflows representing the core logic of the industry. Each occurs frequently in practice and covers distinct capability dimensions.

Selection Criteria

Representativeness

Is this the "bread-and-butter" of the field?

Software Dependence

Must use multiple professional tools?

Deterministic Eval

Gold standard / objective scoring?

Data Scalability

Can generate large-scale test cases?

Coverage

Spans distinct skill types?

Workflow Overview

#	Workflow	What it Represents	Skills Covered
1	WGS Variant Calling	Genomics core; highest-frequency	CLI orchestration; large-file handling
2	RNA-seq Differential Expression	Transcriptomics core; second highest	Alignment + counting + statistics; R programming
3	scRNA-seq Clustering	Fastest-growing frontier	Python data science; dimensionality reduction and clustering; visualization
4	Clinical Variant Annotation	Clinical translation surface	API orchestration; database querying; rule-based reasoning

Genomics ★★★

WGS Variant Calling Pipeline

The "Hello World" of bioinformatics and the most common mature production workflow. Many labs run this daily. GATK Best Practices is widely adopted, and Genome in a Bottle (GIAB) provides gold standards. If you can pick only one workflow to represent the industry, this is it.
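
To make the scoring idea concrete, here is a minimal sketch of precision/recall/F1 for a call set against a truth set. It assumes variants can be compared as exact `(chrom, pos, ref, alt)` tuples; real comparison tools such as hap.py also handle differing variant representations, genotypes, and confident regions, none of which is modeled here. The names `score_calls` and `CallMetrics` and the toy variants are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f1: float


def score_calls(called: Iterable[Variant], truth: Iterable[Variant]) -> CallMetrics:
    """Score a call set against a truth set using exact-match comparison (a simplification)."""
    called_set, truth_set = set(called), set(truth)
    tp = len(called_set & truth_set)   # called and in truth
    fp = len(called_set - truth_set)   # called but not in truth
    fn = len(truth_set - called_set)   # in truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return CallMetrics(tp, fp, fn, precision, recall, f1)


# Toy data for illustration only.
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")]

metrics = score_calls(called, truth)
print(metrics)
```

In this toy case three of four truth variants are recovered and one call is spurious, so precision, recall, and F1 all come out at 0.75.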

Transcriptomics ★★★

RNA-seq Differential Expression

The second most common workflow after variant calling. Nearly any study about gene function uses RNA-seq. Key differences: it requires cross-language orchestration (CLI alignment + R statistics), and evaluation focuses on agreement/association rather than exact matching.
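
As an illustration of agreement-based evaluation, the sketch below computes a Spearman rank correlation between pipeline estimates and reference measurements, using average ranks for ties. It is a plain-Python stand-in for a library routine; the function names and the six log2 fold-change values are made up for the example and do not come from any benchmark.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical log2 fold changes for six genes (illustrative values only).
pipeline_lfc = [2.1, -1.3, 0.4, 3.0, -0.2, 1.1]
reference_lfc = [1.8, -1.0, 0.9, 2.7, 0.1, 0.6]

rho = spearman_rho(pipeline_lfc, reference_lfc)
print(f"Spearman rho = {rho:.3f}")
```

Rank correlation rewards getting the ordering of effects right even when absolute magnitudes differ between platforms, which is why it suits a comparison against an orthogonal measurement.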

Single-cell ★★☆

Single-cell RNA-seq Analysis

The fastest-growing area in bioinformatics (single-cell and spatial methods have been named Nature Methods "Method of the Year" more than once). The toolchain differs substantially from bulk RNA-seq (Scanpy/Seurat vs DESeq2). Tests Python-centric data science capabilities: dimensionality reduction, clustering, visualization.
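
Clustering output is typically compared with reference labels using the Adjusted Rand Index (ARI), which ignores how clusters are named. Below is a minimal plain-Python sketch of the standard pair-counting formula; the function name and the eight-cell toy example are assumptions for illustration, and in practice a library implementation would normally be used.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable],
                        labels_pred: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Pair counts from the contingency table and its margins.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)


# Toy example: one B cell is grouped with the NK cluster.
reference = ["T", "T", "T", "B", "B", "B", "NK", "NK"]
clusters = [0, 0, 0, 1, 1, 2, 2, 2]

ari = adjusted_rand_index(reference, clusters)
print(f"ARI = {ari:.3f}")
```

With only eight cells, a single misassigned cell already pulls the score to about 0.62, so small toy examples should not be read against thresholds meant for full datasets.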

Clinical ★★★

Clinical Variant Annotation & Interpretation

The most important clinical translation surface—turning sequencing outputs into actionable interpretation. Unlike other workflows, the core is database querying + rule application. Tests API orchestration and structured rule-based decision making (ACMG/AMP reasoning).
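
The "database querying + rule application" pattern can be sketched as follows. Annotations are held in memory rather than fetched from ClinVar or gnomAD, and the scoring weights and rarity cutoff are hypothetical values chosen for illustration; they are not the ACMG/AMP criteria. All class, function, and variant names are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AnnotatedVariant:
    variant_id: str
    gnomad_af: Optional[float]  # population allele frequency; None = not observed
    clinvar: Optional[str]      # clinical significance label; None = no record
    consequence: str            # predicted consequence term


# Hypothetical weights for illustration only; NOT the ACMG/AMP rules.
CLINVAR_SCORE = {
    "Pathogenic": 3,
    "Likely pathogenic": 2,
    "Uncertain significance": 0,
    "Likely benign": -2,
    "Benign": -3,
}
HIGH_IMPACT = {"stop_gained", "frameshift_variant",
               "splice_donor_variant", "splice_acceptor_variant"}
RARE_AF = 0.001  # assumed rarity cutoff


def priority(variant: AnnotatedVariant) -> int:
    """Combine database evidence into a single illustrative priority score."""
    score = CLINVAR_SCORE.get(variant.clinvar, 0)
    if variant.consequence in HIGH_IMPACT:
        score += 2
    if variant.gnomad_af is None or variant.gnomad_af < RARE_AF:
        score += 1  # rare or unobserved in the population
    else:
        score -= 2  # common variants are unlikely to explain a rare disease
    return score


def rank_candidates(variants: List[AnnotatedVariant]) -> List[AnnotatedVariant]:
    """Sort candidates from highest to lowest priority (stable for ties)."""
    return sorted(variants, key=priority, reverse=True)


def in_top_n(ranked: List[AnnotatedVariant], target_id: str, n: int) -> bool:
    """Check whether the target variant appears among the first n candidates."""
    return any(v.variant_id == target_id for v in ranked[:n])


candidates = [
    AnnotatedVariant("common_benign", 0.12, "Benign", "synonymous_variant"),
    AnnotatedVariant("rare_vus", 0.0002, "Uncertain significance", "missense_variant"),
    AnnotatedVariant("implanted", None, "Pathogenic", "stop_gained"),
    AnnotatedVariant("rare_unannotated", None, None, "missense_variant"),
]
ranked = rank_candidates(candidates)
print([v.variant_id for v in ranked])
print(in_top_n(ranked, "implanted", 3))
```

Keeping the ranking and the top-N check as separate functions means the same check can score any ranking method, whether rule-based as here or produced by an agent.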

Part 5

Review Agent Architecture

A two-layer validation system enables automated, reproducible evaluation.

Two-Layer Validation

Layer 1: Deterministic Rules

Automated (No LLM Required)

•File format validation: VCF → vcf-validator; BAM → ValidateSamFile
•Output completeness: required columns present, files not truncated
•QC threshold checks: mapping rate > 80%, duplication rate < 20%
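
The deterministic checks above can be sketched in a few lines. This is a simplified illustration, not a substitute for vcf-validator or ValidateSamFile: it only checks the VCF file-format line, the eight fixed header columns, and per-line field counts (which catches a truncated record). The function names and the metric keys `mapping_rate` and `duplication_rate` are assumptions; the thresholds are the ones listed above.

```python
from typing import Dict, List

VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_structure(vcf_text: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    lines = vcf_text.splitlines()
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        problems.append("missing ##fileformat line")
    header = next((line for line in lines if line.startswith("#CHROM")), None)
    if header is None:
        problems.append("missing #CHROM header line")
        return problems
    columns = header.split("\t")
    if columns[:8] != VCF_FIXED_COLUMNS:
        problems.append("fixed columns missing or out of order")
    # Every data line should have as many fields as the header.
    for lineno, line in enumerate(lines, start=1):
        if line and not line.startswith("#"):
            found = len(line.split("\t"))
            if found != len(columns):
                problems.append(
                    f"line {lineno}: expected {len(columns)} fields, found {found}")
    return problems


def check_qc(metrics: Dict[str, float]) -> List[str]:
    """Apply the QC thresholds: mapping rate > 80%, duplication rate < 20%."""
    problems: List[str] = []
    if metrics["mapping_rate"] <= 0.80:
        problems.append("mapping rate not above 80%")
    if metrics["duplication_rate"] >= 0.20:
        problems.append("duplication rate not below 20%")
    return problems


good_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
])
truncated_vcf = good_vcf + "\nchr1\t200\t.\tC"  # last record cut off mid-line

print(check_vcf_structure(good_vcf))
print(check_vcf_structure(truncated_vcf))
print(check_qc({"mapping_rate": 0.95, "duplication_rate": 0.12}))
```

Returning a list of problems rather than a single boolean lets a reviewer see every failed rule in one pass.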

Layer 2: Gold-Standard Comparison

Primary Scoring

•Workflow 1: hap.py → precision/recall/F1
•Workflow 2: Spearman correlation vs qPCR truth
•Workflow 3: ARI vs reference annotation
•Workflow 4: Pathogenic variant in top-N candidates

Primary Metrics by Workflow

Workflow	Primary Metric	Target
WGS Variant Calling Pipeline	SNVs: F1 > 99.5% (precision > 99.9%, recall > 99.5%)	★★★★★
RNA-seq Differential Expression	Spearman correlation ρ > 0.85 vs SEQC/MAQC qPCR truth	★★★★★
Single-cell RNA-seq Analysis	Adjusted Rand Index (ARI) > 0.7 vs reference cell-type labels	★★★★★
Clinical Variant Annotation & Interpretation	Implanted known pathogenic variant appears in top 10 candidates (ideally top 3)	★★★★★

Core Tools & Infrastructure

GATK, BWA-MEM2, SAMtools, STAR, DESeq2, Scanpy, Cell Ranger, Nextflow, nf-core, Ensembl VEP, ClinVar, gnomAD, hap.py

Contribute to Bioinformatics

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Submit Landscape Understanding

Help us map roles, workflows, and tools in bioinformatics. Share your perspective on the industry structure.

Submit a Workflow

Describe a specific professional task with tools, inputs, outputs, and how success is verified.

Our Commitments to Contributors

Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
Partner Review: Industry partners can review and approve task specifications before public release.
Data Control: Contributors can exclude sensitive or proprietary data from submissions.