Bioinformatics
Use computation to extract meaningful information from biological data—where analysis and interpretation are the bottleneck, not data generation.
Industry Landscape
At its core, bioinformatics is simple: use computation to extract meaningful information from biological data. In a modern genomics lab, wet lab work is ~20% of the timeline, while ~80% is computational analysis.
Three-Layer Industry Structure
- Application Layer: who uses it
- Analysis Layer: what is done
- Infrastructure Layer: what is used
Bioinformatics is a technology layer that spans many application domains—nearly every step is software-mediated.
Key Players by Layer
Sequencing Platforms
- Illumina: dominant in short reads; NovaSeq X; DRAGEN
- PacBio: HiFi long reads; Revio/Vega systems
- Oxford Nanopore: ultra-long reads (100kb+), real-time
- MGI (BGI): DNBSEQ platform; global expansion
Analysis Ecosystems
- Broad Institute: GATK + Terra/Firecloud + Picard
- EMBL-EBI: Ensembl + VEP (European infrastructure)
- NCBI: SRA + GenBank + ClinVar + PubMed
- nf-core: 70+ production-grade Nextflow pipelines
Cloud Platforms
- Terra/Firecloud: Broad, Google Cloud; WDL workflows
- DNAnexus: AWS; FedRAMP compliance; clinical
- Seven Bridges: AWS, CWL; NCI Cancer Genomics Cloud
The Database Ecosystem: Bioinformatics' "Knowledge Graph"
A key difference from many other industries: core work depends on a large public database network. These databases are both inputs (annotation sources) and verification references.
| Database | Operator | Scale | Role |
|---|---|---|---|
| NCBI SRA | NIH/NCBI | 36+ PB, 9M+ experiments | Largest raw sequencing repository |
| GenBank | NCBI | 250M+ sequences | Nucleotide sequence reference |
| UniProt | EBI/SIB/PIR | 250M+ proteins | Protein sequences + annotation |
| PDB | RCSB | ~220k structures | 3D structure reference |
| ClinVar | NCBI | 2M+ submissions | Clinical variant interpretation |
| gnomAD | Broad | 800k+ individuals | Population allele frequencies |
| Ensembl | EMBL-EBI | Many organisms | Genome annotation + VEP |
| GEO | NCBI | 1M+ samples | Gene expression datasets |
Many bioinformatics tasks are fundamentally "tool execution + database querying." An agent must orchestrate APIs and cross-validate outputs against authoritative references.
Why Bioinformatics Fits AgentHLE
A shared trait across bioinformatics roles: nearly all work happens on a computer—terminal commands, Python/R scripts, notebooks, and visualization tools. This makes the domain naturally amenable to agent simulation and evaluation.
- Software-Mediated: nearly every step is software-mediated (CLI tools, scripts, pipelines, APIs)
- Gold Standards: GIAB, SEQC/MAQC, and reference atlases provide authoritative benchmarks
- Massive Public Data: 36+ PB in NCBI SRA alone; unlimited synthetic generation possible
- Standardized Pipelines: nf-core provides 70+ production-grade reference implementations
- Deterministic Evaluation: precision/recall/F1 against truth sets; correlation metrics; ARI scores
- API Orchestration: tasks combine tool execution + database querying + rule application
Core Roles & Daily Work
Representative roles in bioinformatics—each with distinct skill profiles and workflow patterns.
Role Profiles
Bioinformatics Scientist
The most general title with broadest coverage
- Design and run NGS pipelines (variant calling, RNA-seq, ChIP-seq)
- Write Python/R scripts for processing and visualization
- Use CLI tools (GATK, STAR, SAMtools)
Computational Biologist
Method-development and modeling oriented (academia-leaning)
- Develop new algorithms and statistical models
- Build ML models for biological prediction
- Large-scale integration (multi-omics)
Genomic Data Analyst
Execution and delivery focused; processes the largest data volume
- Run standardized pipelines at scale: FASTQ → VCF / expression matrices
- Execute community-standard pipelines (nf-core/sarek, nf-core/rnaseq)
- QC: FastQC → MultiQC → decide pass/fail
Clinical Bioinformatician
Clinical-facing with strict compliance and validation constraints
- Run validated clinical-grade pipelines (parameters often locked)
- Annotate and filter variants: VEP/ANNOVAR + ClinVar/gnomAD
- Generate clinical reports using ACMG/AMP standards
Single-cell / Spatial Omics Analyst
Emerging specialization driven by single-cell technologies
- 10x workflows: Cell Ranger → Scanpy/Seurat
- Clustering, dimensionality reduction (UMAP/t-SNE), differential expression (DE)
- Cell type annotation via reference atlases
Metagenomics Analyst
Microbial community and environmental genomics specialist
- FASTQ → taxonomy profiling → diversity analysis
- Shotgun metagenomics and 16S rRNA analysis
- Functional annotation and pathway analysis
Roles ↔ Workflows Mapping
| Role | Most Typical Daily Workflow | Frequency |
|---|---|---|
| Genomic data analyst | FASTQ → alignment → variant calling → VCF | Daily |
| Transcriptomics analyst | FASTQ → alignment → differential expression → enrichment | Multiple times/week |
| Single-cell analyst | FASTQ → preprocessing → clustering → cell annotation | Multiple times/week |
| Clinical bioinformatician | VCF → annotation → pathogenic filtering → report | Daily |
| Computational biologist | Integration / modeling / new tool development | Project-based |
| Metagenomics analyst | FASTQ → taxonomy profiling → diversity analysis | Weekly |
Typical Team Structure
Typical Genomics Center Team
The data-analyst layer is the most standardized, high-frequency, and benchmarkable.
Representative Workflows
Four benchmarkable workflows representing the core logic of the industry. Each occurs frequently in practice and covers distinct capability dimensions.
Selection Criteria
- Representativeness: is this the "bread-and-butter" of the field?
- Software Dependence: must it use multiple professional tools?
- Deterministic Eval: is there a gold standard or objective scoring?
- Data Scalability: can large-scale test cases be generated?
- Coverage: does it span distinct skill types?
Workflow Overview
| # | Workflow | What it Represents | Skills Covered |
|---|---|---|---|
| 1 | WGS Variant Calling | Genomics core; highest-frequency | CLI orchestration; large-file handling |
| 2 | RNA-seq Differential Expression | Transcriptomics core; second highest | Alignment + counting + statistics; R programming |
| 3 | scRNA-seq Clustering | Fastest-growing frontier | Python data science; DR/clustering; visualization |
| 4 | Clinical Variant Annotation | Clinical translation surface | API orchestration; database querying; rule-based reasoning |
WGS Variant Calling Pipeline
The "Hello World" of bioinformatics and the most common mature production workflow. Many labs run this daily. GATK Best Practices is widely adopted, and Genome in a Bottle (GIAB) provides gold standards. If you can pick only one workflow to represent the industry, this is it.
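As a rough illustration of how a variant-calling run is scored against a truth set, the sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. It covers only the final arithmetic; matching calls to truth variants (representation differences, confident regions) is the hard part and is left to a dedicated comparison tool. The function name and the example counts are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from variant-comparison counts.

    tp: calls that match a truth variant
    fp: calls with no matching truth variant
    fn: truth variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=9960, fp=5, fn=40)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

F1 is the harmonic mean of precision and recall, so a run cannot reach a high F1 by trading one heavily for the other.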
RNA-seq Differential Expression
The second most common workflow after variant calling. Nearly any study about gene function uses RNA-seq. Key differences: it requires cross-language orchestration (CLI alignment + R statistics), and evaluation focuses on agreement and association rather than exact matching.
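To make "agreement rather than exact matching" concrete, here is a minimal pure-Python sketch of Spearman rank correlation, computed as the Pearson correlation of average ranks so that ties are handled. The function names are illustrative, and in practice an established statistics library would normally be used instead. The sketch assumes neither input is constant (a constant input makes the correlation undefined).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(pipeline_log2fc, reference_log2fc)` would compare per-gene fold changes from a pipeline with reference measurements; both argument names are hypothetical.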
Single-cell RNA-seq Analysis
The fastest-growing area in bioinformatics (single-cell methods have been named Nature Methods "Method of the Year" more than once). The toolchain differs substantially from bulk RNA-seq (Scanpy/Seurat vs DESeq2). It tests Python-centric data science capabilities: dimensionality reduction, clustering, visualization.
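Clustering output is typically compared with reference labels using the Adjusted Rand Index (ARI), which does not depend on how clusters are numbered. The following is a minimal pure-Python sketch of the standard pair-counting formula; the function name is illustrative, and it assumes at least two cells with one label each.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    if len(labels_true) != len(labels_pred) or len(labels_true) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    n = len(labels_true)
    # Count pairs of cells grouped together in both, in truth, and in prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put all cells in one cluster
    return (sum_cells - expected) / (max_index - expected)


# Cluster IDs are arbitrary: swapping 0 and 1 still gives a perfect score.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Values near 0 indicate chance-level agreement, and negative values indicate agreement worse than chance.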
Clinical Variant Annotation & Interpretation
The most important clinical translation surface—turning sequencing outputs into actionable interpretation. Unlike other workflows, the core is database querying + rule application. Tests API orchestration and structured rule-based decision making (ACMG/AMP reasoning).
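The "database querying + rule application" pattern can be sketched as a filter-and-rank step over annotated candidates. Everything below is a simplified assumption for illustration: the `Candidate` fields, the 1% allele-frequency cutoff, and the label ordering are hypothetical and are not the ACMG/AMP criteria, which involve many more evidence types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    variant_id: str        # hypothetical identifier
    population_af: float   # population allele frequency (e.g. looked up in gnomAD)
    clinvar_label: str     # clinical significance label (e.g. looked up in ClinVar)


# Lower number = higher priority; labels not listed here are dropped.
LABEL_PRIORITY = {"Pathogenic": 0, "Likely pathogenic": 1, "Uncertain significance": 2}


def rank_candidates(candidates, max_af=0.01):
    """Drop common or benign-labelled variants, then sort by label and rarity."""
    kept = [
        c for c in candidates
        if c.population_af <= max_af and c.clinvar_label in LABEL_PRIORITY
    ]
    return sorted(kept, key=lambda c: (LABEL_PRIORITY[c.clinvar_label], c.population_af))


def in_top_n(ranked, target_id, n=10):
    """True if the target variant is among the first n ranked candidates."""
    return any(c.variant_id == target_id for c in ranked[:n])
```

Scoring then reduces to checking whether the implanted variant survives the filter and lands near the top of the ranked list.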
Review Agent Architecture
A two-layer validation system enables automated, reproducible evaluation.
Two-Layer Validation
Automated (No LLM Required)
- File format validation: VCF → vcf-validator; BAM → ValidateSamFile
- Output completeness: required columns present, files not truncated
- QC threshold checks: mapping rate > 80%, dup rate < 20%
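The QC threshold check in the last bullet needs no LLM and can be a few lines of code. A minimal sketch, assuming the two metrics have already been parsed from a QC report into fractions between 0 and 1 (the `QcMetrics` type and function name are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QcMetrics:
    mapping_rate: float      # fraction of reads mapped, 0..1
    duplication_rate: float  # fraction of reads marked duplicate, 0..1


def qc_failures(metrics, min_mapping_rate=0.80, max_duplication_rate=0.20):
    """Return a list of failure messages; an empty list means QC passed."""
    failures = []
    # Thresholds are strict: exactly 80% mapping or 20% duplication fails.
    if metrics.mapping_rate <= min_mapping_rate:
        failures.append(
            f"mapping rate {metrics.mapping_rate:.1%} is not above {min_mapping_rate:.0%}"
        )
    if metrics.duplication_rate >= max_duplication_rate:
        failures.append(
            f"duplication rate {metrics.duplication_rate:.1%} is not below {max_duplication_rate:.0%}"
        )
    return failures
```

Returning messages rather than a bare boolean keeps the reason for a failed run visible in the evaluation log.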
Primary Scoring
- Workflow 1: hap.py → precision/recall/F1
- Workflow 2: Spearman correlation vs qPCR truth
- Workflow 3: ARI vs reference annotation
- Workflow 4: Pathogenic variant in top-N candidates
Primary Metrics by Workflow
| Workflow | Primary Metric | Target |
|---|---|---|
| WGS Variant Calling Pipeline | SNVs: F1 > 99.5% (precision > 99.9%, recall > 99.5%) | ★★★★★ |
| RNA-seq Differential Expression | Spearman correlation ρ > 0.85 vs SEQC/MAQC qPCR truth | ★★★★★ |
| Single-cell RNA-seq Analysis | Adjusted Rand Index (ARI) > 0.7 vs reference cell-type labels | ★★★★★ |
| Clinical Variant Annotation & Interpretation | Implanted known pathogenic variant appears in top 10 candidates (ideally top 3) | ★★★★★ |
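The targets in the table can be encoded as a small lookup that a scoring script checks observed metrics against. This is a sketch only: the workflow keys, metric names, and function names are hypothetical, while the threshold values are taken from the table above.

```python
# Minimum values (exclusive) per workflow, expressed as fractions.
MIN_TARGETS = {
    "wgs_variant_calling": {"precision": 0.999, "recall": 0.995, "f1": 0.995},
    "rnaseq_differential_expression": {"spearman_rho": 0.85},
    "scrnaseq_analysis": {"ari": 0.7},
}

# Workflow 4 is rank-based: the implanted variant must appear in the top 10.
CLINICAL_TOP_N = 10


def meets_targets(workflow, observed):
    """True if every target metric is present and strictly above its minimum."""
    targets = MIN_TARGETS[workflow]
    return all(
        name in observed and observed[name] > minimum
        for name, minimum in targets.items()
    )


def clinical_hit(rank, top_n=CLINICAL_TOP_N):
    """rank is the 1-based position of the implanted variant, or None if absent."""
    return rank is not None and 1 <= rank <= top_n


print(meets_targets("scrnaseq_analysis", {"ari": 0.72}))
print(clinical_hit(3))
```

A missing metric counts as a failure, so an incomplete output cannot pass by omission.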
Core Tools & Infrastructure
Contribute to Bioinformatics
We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map roles, workflows, and tools in bioinformatics. Share your perspective on the industry structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.