Under Review

Bioinformatics

Use computation to extract meaningful information from biological data—where analysis and interpretation are the bottleneck, not data generation.

Part 1

Industry Landscape

At its core, bioinformatics is simple: use computation to extract meaningful information from biological data. In a modern genomics lab, wet-lab work typically accounts for roughly 20% of a project timeline, while computational analysis accounts for the remaining ~80%.

Global market size (2024): $14–18B
Projected size (2033): $40–60B
CAGR (typical estimates): 11–15%
NCBI SRA total scale: 36+ PB
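As a quick consistency check on the market figures above, the growth rate implied by the 2024 and 2033 estimates can be computed directly. The sketch below uses the midpoints of the quoted ranges, which is an assumption made here for illustration (the source gives only ranges); the helper name `cagr` is likewise illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.12 means 12%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Midpoints of the quoted ranges, in billions of USD.
# Assumption: midpoints are used only because the source gives ranges.
start_mid = (14 + 18) / 2  # 2024 estimate
end_mid = (40 + 60) / 2    # 2033 projection
years = 2033 - 2024

rate = cagr(start_mid, end_mid, years)
print(f"Implied CAGR at range midpoints: {rate:.1%}")
```

At the midpoints the implied rate falls inside the quoted 11–15% band. Pairing the extremes of the two ranges gives values outside it, so the CAGR figure is best read as a typical estimate rather than a bound.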

Three-Layer Industry Structure

Application Layer

Who uses it

Pharma/Biotech (~42%): Target discovery, companion Dx, trial stratification
Clinical Diagnostics (~14% CAGR): Rare disease, tumor genotyping, pharmacogenomics
Academic Research (~30%): Basic discovery, method development
Agrigenomics (Emerging): Breeding, resistance gene screening
Public Health (Growing): Pathogen surveillance, microbiome

Analysis Layer

What is done

Genomics: WGS/WES variant calling
Transcriptomics: RNA-seq, differential expression
Proteomics: Mass spec, protein quantification
Single-cell: scRNA-seq, spatial omics
Metagenomics: Microbial communities
Epigenomics: ChIP-seq, methylation

Infrastructure Layer

What is used

Sequencers: Illumina, PacBio, ONT
Compute: HPC clusters, cloud (AWS/GCP)
Databases: NCBI, EMBL-EBI, Ensembl
Tools: GATK, STAR, Scanpy, DESeq2
Workflow Engines: Nextflow, Snakemake, WDL

Bioinformatics is a technology layer that spans many application domains—nearly every step is software-mediated.

Key Players by Layer

Sequencing Platforms

  • Illumina: Dominant short reads; NovaSeq X; DRAGEN
  • PacBio: HiFi long reads; Revio/Vega systems
  • Oxford Nanopore: Ultra-long reads (100kb+), real-time
  • MGI (BGI): DNBSEQ platform; global expansion

Analysis Ecosystems

  • Broad Institute: GATK + Terra/Firecloud + Picard
  • EMBL-EBI: Ensembl + VEP (European infrastructure)
  • NCBI: SRA + GenBank + ClinVar + PubMed
  • nf-core: 70+ production-grade Nextflow pipelines

Cloud Platforms

  • Terra/Firecloud: Broad, Google Cloud; WDL workflows
  • DNAnexus: AWS; FedRAMP compliance; clinical
  • Seven Bridges: AWS, CWL; NCI Cancer Genomics Cloud

The Database Ecosystem: Bioinformatics' "Knowledge Graph"

A key difference from many other industries: core work depends on a large public database network. These databases are both inputs (annotation sources) and verification references.

Database | Operator | Scale | Role
NCBI SRA | NIH/NCBI | 36+ PB, 9M+ experiments | Largest raw sequencing repository
GenBank | NCBI | 250M+ sequences | Nucleotide sequence reference
UniProt | EBI/SIB/PIR | 250M+ proteins | Protein sequences + annotation
PDB | RCSB | ~220k structures | 3D structure reference
ClinVar | NCBI | 2M+ submissions | Clinical variant interpretation
gnomAD | Broad | 800k+ individuals | Population allele frequencies
Ensembl | EMBL-EBI | Many organisms | Genome annotation + VEP
GEO | NCBI | 1M+ samples | Gene expression datasets

Many bioinformatics tasks are fundamentally "tool execution + database querying." An agent must orchestrate APIs and cross-validate outputs against authoritative references.

Part 2

Why Bioinformatics Fits AgentHLE

A shared trait across bioinformatics roles: nearly all work happens on a computer—terminal commands, Python/R scripts, notebooks, and visualization tools. This makes the domain naturally amenable to agent simulation and evaluation.

Software-Mediated

Nearly every step is software-mediated: CLI tools, scripts, pipelines, APIs

Gold Standards

GIAB, SEQC/MAQC, reference atlases provide authoritative benchmarks

Massive Public Data

36+ PB in NCBI SRA alone; additional synthetic test data can be generated on demand

Standardized Pipelines

nf-core provides 70+ production-grade reference implementations

Deterministic Evaluation

Precision/recall/F1 against truth sets; correlation metrics; ARI scores
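To make the precision/recall/F1 idea concrete, here is a minimal sketch that scores a call set against a truth set by exact matching on (chromosome, position, ref, alt). The exact-match rule, the `score_calls` name, and the toy variants are all assumptions for illustration; dedicated comparison tools such as hap.py handle differing variant representations, which this sketch does not attempt.

```python
from typing import NamedTuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_calls(called: set[Variant], truth: set[Variant]) -> Scores:
    """Score a call set against a truth set using exact key matching."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


# Toy data, invented for illustration only.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr3", 10, "A", "T"),  # false positive
}

scores = score_calls(called, truth)
print(scores)
```

Because the inputs are plain sets, the same result is produced on every run, which is the property that makes this style of evaluation deterministic.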

API Orchestration

Tasks combine tool execution + database querying + rule application

Part 3

Core Roles & Daily Work

Representative roles in bioinformatics—each with distinct skill profiles and workflow patterns.

Role Profiles

Bioinformatics Scientist

The most general title with broadest coverage

Typical Work
  • Design and run NGS pipelines (variant calling, RNA-seq, ChIP-seq)
  • Write Python/R scripts for processing and visualization
  • Use CLI tools (GATK, STAR, SAMtools)
Environment: Linux servers / HPC / cloud; terminal + Jupyter/RStudio

Computational Biologist

Method-development and modeling oriented (academia-leaning)

Typical Work
  • Develop new algorithms and statistical models
  • Build ML models for biological prediction
  • Large-scale integration (multi-omics)
Environment: HPC / cloud; Python/R + academic tools

Genomic Data Analyst

Execution and delivery focused; processes the largest data volume

Typical Work
  • Run standardized pipelines at scale: FASTQ → VCF / expression matrices
  • Execute community-standard pipelines (nf-core/sarek, nf-core/rnaseq)
  • QC: FastQC → MultiQC → decide pass/fail
Environment: HPC / cloud; workflow engines (Nextflow/Snakemake)

Clinical Bioinformatician

Clinical-facing with strict compliance and validation constraints

Typical Work
  • Run validated clinical-grade pipelines (parameters often locked)
  • Annotate and filter variants: VEP/ANNOVAR + ClinVar/gnomAD
  • Generate clinical reports using ACMG/AMP standards
Environment: Validated pipelines; compliance-focused infrastructure

Single-cell / Spatial Omics Analyst

Emerging specialization driven by single-cell technologies

Typical Work
  • 10x workflows: Cell Ranger → Scanpy/Seurat
  • Clustering, dimensionality reduction (UMAP/t-SNE), DE
  • Cell type annotation via reference atlases
Environment: Python (Scanpy) / R (Seurat); GPU for large datasets

Metagenomics Analyst

Microbial community and environmental genomics specialist

Typical Work
  • FASTQ → taxonomy profiling → diversity analysis
  • Shotgun metagenomics and 16S rRNA analysis
  • Functional annotation and pathway analysis
Environment: QIIME2, MetaPhlAn, Kraken; HPC for large cohorts

Roles ↔ Workflows Mapping

Role | Most Typical Daily Workflow | Frequency
Genomic data analyst | FASTQ → alignment → variant calling → VCF | Daily
Transcriptomics analyst | FASTQ → alignment → differential expression → enrichment | Multiple times/week
Single-cell analyst | FASTQ → preprocessing → clustering → cell annotation | Multiple times/week
Clinical bioinformatician | VCF → annotation → pathogenic filtering → report | Daily
Computational biologist | Integration / modeling / new tool development | Project-based
Metagenomics analyst | FASTQ → taxonomy profiling → diversity analysis | Weekly

Typical Team Structure

Typical Genomics Center Team

  • PI / Director (1): Sets direction and strategy
  • Senior bioinformaticians (2–3): Design pipelines, solve hard failures
  • Data analysts (3–5): Run pipelines, deliver results (daily throughput engine)
  • Software engineers (1–2): Maintain infrastructure, deployments
  • Clinical bioinformaticians (1–2): Compliance-bound execution (if clinical)

The data-analyst layer is the most standardized, high-frequency, and benchmarkable.

Part 4

Representative Workflows

Four benchmarkable workflows representing the core logic of the industry. Each occurs frequently in practice and covers distinct capability dimensions.

Selection Criteria

Representativeness

Is this the "bread-and-butter" of the field?

Software Dependence

Does it require multiple professional tools?

Deterministic Eval

Is there a gold standard or other objective scoring?

Data Scalability

Can large-scale test cases be generated?

Coverage

Does it span distinct skill types?

Workflow Overview

# | Workflow | What it Represents | Skills Covered
1 | WGS Variant Calling | Genomics core; highest-frequency | CLI orchestration; large-file handling
2 | RNA-seq Differential Expression | Transcriptomics core; second highest | Alignment + counting + statistics; R programming
3 | scRNA-seq Clustering | Fastest-growing frontier | Python data science; DR/clustering; visualization
4 | Clinical Variant Annotation | Clinical translation surface | API orchestration; database querying; rule-based reasoning

Genomics ★★★

WGS Variant Calling Pipeline

The "Hello World" of bioinformatics and the most common mature production workflow. Many labs run this daily. GATK Best Practices is widely adopted, and Genome in a Bottle (GIAB) provides gold standards. If you can pick only one workflow to represent the industry, this is it.

Transcriptomics ★★★

RNA-seq Differential Expression

The second most common workflow after variant calling. Nearly any study about gene function uses RNA-seq. Two things set it apart from variant calling: it requires cross-language orchestration (CLI alignment + R statistics), and its evaluation focuses on agreement/association rather than exact matching.
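Scoring by agreement rather than exact matching can be illustrated with a rank correlation between pipeline estimates and reference measurements. The sketch below is a pure-Python Spearman correlation (Pearson correlation of average ranks); the function names and the fold-change values are invented for illustration, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from collections.abc import Sequence
from math import sqrt


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical log2 fold changes for five genes (not real measurements).
pipeline_lfc = [2.1, -1.3, 0.4, 3.0, -0.2]
reference_lfc = [1.8, -1.0, 0.1, 2.5, 0.3]

rho = spearman(pipeline_lfc, reference_lfc)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure tolerates differences in scale between the two measurement methods, which is why it suits this workflow better than an exact-match comparison.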

Single-cell ★★

Single-cell RNA-seq Analysis

The fastest-growing area in bioinformatics (single-cell methods were named Nature Methods "Method of the Year" in both 2013 and 2019). The toolchain differs substantially from bulk RNA-seq (Scanpy/Seurat vs DESeq2). This workflow tests Python-centric data science capabilities: dimensionality reduction, clustering, visualization.

Clinical ★★★

Clinical Variant Annotation & Interpretation

The most important clinical translation surface: turning sequencing outputs into actionable interpretation. Unlike the other workflows, its core is database querying plus rule application. It tests API orchestration and structured rule-based decision making (ACMG/AMP reasoning).
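The "database querying + rule application" pattern can be sketched as a filter-then-rank step over already-annotated variants, followed by a check that a known variant lands near the top. Everything here is a simplified assumption for illustration: the record fields, the point values, and the 1% allele-frequency cutoff are hypothetical, and the scoring is not an implementation of the ACMG/AMP guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedVariant:
    """Hypothetical record combining annotations from several sources."""
    variant_id: str
    gnomad_af: float   # population allele frequency
    clinvar: str       # clinical significance, "" if not in ClinVar
    consequence: str   # predicted consequence term


# Illustrative point values only; not derived from any guideline.
CLINVAR_POINTS = {
    "Pathogenic": 3,
    "Likely pathogenic": 2,
    "Uncertain significance": 1,
}
CONSEQUENCE_POINTS = {
    "stop_gained": 2,
    "frameshift_variant": 2,
    "missense_variant": 1,
}
MAX_AF = 0.01  # assumed rarity cutoff for this sketch


def score(variant: AnnotatedVariant) -> int:
    return (CLINVAR_POINTS.get(variant.clinvar, 0)
            + CONSEQUENCE_POINTS.get(variant.consequence, 0))


def prioritise(variants, max_af: float = MAX_AF) -> list[AnnotatedVariant]:
    """Drop common variants, then rank by score (ties: rarer first)."""
    rare = [v for v in variants if v.gnomad_af <= max_af]
    return sorted(rare, key=lambda v: (-score(v), v.gnomad_af))


def in_top_n(ranked, target_id: str, n: int) -> bool:
    """True if the target variant is among the first n ranked candidates."""
    return any(v.variant_id == target_id for v in ranked[:n])


# Invented example records.
variants = [
    AnnotatedVariant("var_common", 0.15, "Benign", "missense_variant"),
    AnnotatedVariant("var_implanted", 0.00001, "Pathogenic", "stop_gained"),
    AnnotatedVariant("var_vus", 0.0005, "Uncertain significance",
                     "missense_variant"),
    AnnotatedVariant("var_silent", 0.002, "", "synonymous_variant"),
]

ranked = prioritise(variants)
print([v.variant_id for v in ranked])
print(in_top_n(ranked, "var_implanted", n=3))
```

Keeping the rules in plain lookup tables makes each ranking decision traceable to a specific annotation, which matters when the output feeds a clinical report.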

Part 5

Review Agent Architecture

A two-layer validation system enables automated, reproducible evaluation.

Two-Layer Validation

Layer 1: Deterministic Rules

Automated (No LLM Required)

  • File format validation: VCF → vcf-validator; BAM → ValidateSamFile (Picard)
  • Output completeness: required columns present, files not truncated
  • QC threshold checks: mapping rate > 80%, duplication rate < 20%
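The Layer 1 checks above can be expressed as plain rule tables with no model in the loop. The sketch below applies the two quoted QC thresholds and a very light VCF header check; the metric key names and function names are assumptions for illustration, and the header check is a stand-in that does not replace a full validator.

```python
import operator

# The eight fixed columns that open the VCF column header line.
VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT",
                     "QUAL", "FILTER", "INFO"]

# metric name -> (comparison, symbol for messages, threshold as a fraction)
QC_RULES = {
    "mapping_rate": (operator.gt, ">", 0.80),
    "duplication_rate": (operator.lt, "<", 0.20),
}


def check_qc(metrics: dict[str, float]) -> list[str]:
    """Return one message per failed or missing QC metric."""
    failures = []
    for name, (compare, symbol, threshold) in QC_RULES.items():
        if name not in metrics:
            failures.append(f"{name}: missing")
        elif not compare(metrics[name], threshold):
            failures.append(
                f"{name}: {metrics[name]} does not satisfy {symbol} {threshold}"
            )
    return failures


def check_vcf_header(lines: list[str]) -> list[str]:
    """Minimal structural check on the header of a VCF given as lines."""
    failures = []
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        failures.append("first line is not a ##fileformat=VCF declaration")
    header = [ln for ln in lines if ln.startswith("#CHROM")]
    if len(header) != 1:
        failures.append("expected exactly one #CHROM header line")
    elif header[0].rstrip("\n").split("\t")[:8] != VCF_FIXED_COLUMNS:
        failures.append("fixed columns missing or out of order")
    return failures


# Invented example inputs.
qc_failures = check_qc({"mapping_rate": 0.95, "duplication_rate": 0.27})
vcf_failures = check_vcf_header([
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
])
print(qc_failures)
print(vcf_failures)
```

Returning a list of failure messages rather than a single boolean lets a reviewer see every rule that failed in one pass.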

Layer 2: Gold-Standard Comparison

Primary Scoring

  • Workflow 1: hap.py → precision/recall/F1
  • Workflow 2: Spearman correlation vs qPCR truth
  • Workflow 3: ARI vs reference annotation
  • Workflow 4: Pathogenic variant in top-N candidates
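For Workflow 3, the Adjusted Rand Index compares a predicted clustering with reference labels while ignoring how clusters are named. The following pure-Python sketch makes the computation explicit; the function name and the toy labels are illustrative assumptions, and a library routine such as scikit-learn's `adjusted_rand_score` would normally be used in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two items are required")

    # Contingency counts: items sharing both a true and a predicted label.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together under each view.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        # Degenerate case: both labelings are trivial (one cluster each,
        # or every item alone), so they agree perfectly.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Invented toy labels: four cells, reference types vs. predicted clusters.
reference_types = ["T cell", "T cell", "B cell", "B cell"]
predicted_clusters = [0, 0, 1, 2]

ari = adjusted_rand_index(reference_types, predicted_clusters)
print(f"ARI = {ari:.3f}")
```

Because the index depends only on which items are grouped together, a clustering that matches the reference exactly scores 1.0 even if its cluster IDs are numbered differently.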

Primary Metrics by Workflow

Workflow | Primary Metric | Target
WGS Variant Calling Pipeline | SNVs: F1 > 99.5% (precision > 99.9%, recall > 99.5%) | ★★★★
RNA-seq Differential Expression | Spearman correlation ρ > 0.85 vs SEQC/MAQC qPCR truth | ★★★★
Single-cell RNA-seq Analysis | Adjusted Rand Index (ARI) > 0.7 vs reference cell-type labels | ★★★★
Clinical Variant Annotation & Interpretation | Implanted known pathogenic variant appears in top 10 candidates (ideally top 3) | ★★★★

Core Tools & Infrastructure

GATK, BWA-MEM2, SAMtools, STAR, DESeq2, Scanpy, Cell Ranger, Nextflow, nf-core, Ensembl VEP, ClinVar, gnomAD, hap.py

Contribute to Bioinformatics

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.