Under Review: Xinyang Han (UCB), Yiyou Sun (UCB)

Neuroimaging

Standardized artifacts (BIDS/NIfTI), repeatable QC failure modes, and audit-ready deliverables make neuroimaging ideal for execution-focused agent benchmarks.

Part 1

Neuroimaging Overview

Neuroimaging work is a closed loop: a scientific question becomes a protocol, protocols produce data, and data becomes decisions through computation. The most scalable work is downstream on computers: preprocessing, QC, analysis, and reporting can be standardized, versioned, and audited.

Two Layers of Analysis

Layer A: Neuroimaging Core

Standardized artifacts (DICOM → NIfTI, BIDS datasets, derivatives) with repeatable failure modes.

Layer B: Broader Neuroscience

Adjacent data families (ephys, behavior, omics) following the same engineering logic: standardization → core ops → QC + evidence.

Operational Maturity vs. Sensitivity

Neuroimaging sits at the intersection of radiology, neuroscience, pharma/clinical trials, and medical AI. A practical lens for "what is benchmarkable" is:

Benchmarkability: maturity high enough + sensitivity real enough = a clear task contract

Structural MRI Pipelines

Maturity: High
Sensitivity: Manageable

Segmentation, regional volumetrics, cortical thickness

Known Failure Modes
  • Over- or under-aggressive skull stripping
  • Surface reconstruction failures

fMRI / Connectomics

Maturity: Moderate
Sensitivity: Higher

Sensitive to motion, denoising, confound regression, statistical assumptions

Known Failure Modes
  • Same dataset → different connectivity matrices under different preprocessing

Diffusion MRI / Tractography

Maturity: Mixed
Sensitivity: Method-dependent

Tensor metrics (FA/MD) stable under QC; tractography varies by algorithm

Known Failure Modes
  • Tractography-derived connectomes vary substantially by algorithm and parameters

PET Quantification

Maturity: Mixed
Sensitivity: Protocol-sensitive

Results depend on reconstruction, registration quality, reference region choices

Known Failure Modes
  • SUVR normalization makes reference-region selection a source of variance

For benchmarking, we want maturity high enough to write a clear task contract, but sensitivity real enough that evidence and QC matter.

End-to-End Processing Chain

From problem definition to deliverables—showing where computer-completable work emerges.

Each step is tagged as Computer (computer-completable) or On-Site (physical).

1. Problem & Study Design (Computer): endpoints, confounds, sample size
2. Acquisition (On-Site): protocol execution and participant workflow
3. Data Intake & Standardization (Computer): de-ID and DICOM→BIDS conversion
4. Preprocessing + QC (Computer): correction, registration, segmentation
5. Analysis & Modeling (Computer): GLM, connectivity, prediction
6. Deliverables (Computer): figures, tables, audit packages

Modality View: What "Correct" Looks Like

Each modality has characteristic outputs and well-known QC failure modes.

🧠 Structural MRI
  Correct output: skull stripping, segmentation
  QC failures: WM/GM boundary errors

📊 fMRI (rest/task)
  Correct output: motion correction, denoising
  QC failures: high motion; carpet-plot artifacts

🔗 Diffusion MRI
  Correct output: eddy correction, tensor fitting
  QC failures: gradient table errors

☢️ PET
  Correct output: partial volume correction, normalization
  QC failures: MRI↔PET registration issues

EEG/MEG
  Correct output: artifact removal, epoching
  QC failures: channel dropout; muscle/eye artifacts

Key Roles

Understanding who owns which decisions—and where handoff failures occur.

Role handoffs are a common source of hidden failures. Good practice shows up as explicit artifacts (schemas, manifests, configs).

1. PI / Clinical Researcher: problem framing and interpretation boundaries
   Key artifacts: study protocols, analysis plans, interpretation guidelines
   Handoff risk: ambiguous endpoints propagate through the pipeline

2. MR Physics / Scanning Team: protocol design and harmonization
   Key artifacts: acquisition protocols, phantom QC, site harmonization specs
   Handoff risk: protocol drift across sites and time

3. Neuroinformatics / Data Engineering: de-ID, BIDS metadata, provenance, access control
   Key artifacts: BIDS datasets, data dictionaries, provenance logs, access manifests
   Handoff risk: metadata inconsistencies, broken provenance

4. Imaging Analyst / Scientist: preprocessing, QC, ROI/statistics, visualization
   Key artifacts: derivatives, QC reports, statistical maps, figures
   Handoff risk: undocumented parameter choices

5. ML / Software Engineer: automation, platform pipelines, productization
   Key artifacts: pipeline configs, container images, CI/CD workflows
   Handoff risk: version/environment drift

Part 2

Where LLM Agents Fit

Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks—while respecting the "human-last" boundary for high-stakes decisions.

The "Human-Last" Boundary in Neuroimaging

Neuroimaging has two hard constraints:

  • Outputs are sensitive to engineering choices (preprocessing/model configs), so correctness must be evidenced.
  • Many settings are audit- and privacy-sensitive (clinical trials, regulated environments), so provenance matters.
Agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous interpretation and high-stakes scientific/clinical decisions.
High-Confidence Agent Surfaces (v1): Execution-Heavy Tasks

Data Intake + Standardization

  • De-ID checks
  • DICOM→BIDS conversion
  • BIDS validation
  • Metadata consistency
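As a concrete sketch of the metadata-consistency check, the snippet below scans a BIDS-style tree for BOLD runs whose JSON sidecars are missing or incomplete. It uses only the standard library; the required-field set is an illustrative subset of the BIDS sidecar fields, not a full validator, and the directory layout in the demo is synthetic.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset of required BOLD sidecar fields (not the full BIDS spec).
REQUIRED_BOLD_FIELDS = {"RepetitionTime", "TaskName"}

def check_sidecars(bids_root) -> list:
    """Return human-readable metadata problems found under bids_root."""
    problems = []
    for nii in sorted(Path(bids_root).rglob("*_bold.nii.gz")):
        sidecar = nii.with_name(nii.name.replace(".nii.gz", ".json"))
        if not sidecar.exists():
            problems.append(f"missing sidecar: {sidecar.name}")
            continue
        meta = json.loads(sidecar.read_text())
        for field in sorted(REQUIRED_BOLD_FIELDS - meta.keys()):
            problems.append(f"{sidecar.name}: missing field {field}")
    return problems

# Tiny synthetic dataset: one run whose sidecar omits TaskName.
root = Path(tempfile.mkdtemp())
func = root / "sub-01" / "func"
func.mkdir(parents=True)
(func / "sub-01_task-rest_bold.nii.gz").write_bytes(b"")
(func / "sub-01_task-rest_bold.json").write_text(
    json.dumps({"RepetitionTime": 2.0}))
problems = check_sidecars(root)  # flags the missing TaskName
```

In a real task, the problem list itself becomes part of the evidence pack rather than a pass/fail bit.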

Preprocessing Wrappers + QC Triage

  • Run standardized pipelines
  • Detect common failures
  • Produce evidence packs

Analysis Extraction

  • ROI summaries
  • Mask resampling and header consistency
  • Basic report tables/plots

Deliverable Packaging

  • Structured results
  • Reproducible configs
  • Minimal, reviewable figures
What to Avoid in v1 Benchmarks
  • Operating scanners or on-site workflows (physical and institutional variability)
  • Open-ended scientific interpretation (e.g., causal claims without narrow acceptance criteria)

Review Agents: Deterministic + Evidence-Based

Benchmarking should focus on raw input and raw output, not narrative.

Layer 1: Deterministic Validator
  • File existence
  • NIfTI header consistency (shape/affine)
  • Numeric tolerances
  • JSON/CSV schema checks

Layer 2: Evidence-Based Reviewer
  • Screenshot consistency (view instructions honored)
  • Visual artifact detection
  • Coherence between tables and figures
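A minimal sketch of the Layer 1 grid check, assuming the shape and affine have already been read (with nibabel these would come from `img.shape` and `img.affine`); the function name and tolerance are illustrative choices:

```python
import numpy as np

def check_grid_match(shape_a, affine_a, shape_b, affine_b, atol=1e-4):
    """Deterministic grid check: identical voxel dimensions and, within a
    numeric tolerance, the same voxel-to-world affine."""
    if tuple(shape_a) != tuple(shape_b):
        return False, "shape mismatch"
    if not np.allclose(np.asarray(affine_a), np.asarray(affine_b), atol=atol):
        return False, "affine mismatch beyond tolerance"
    return True, "ok"

# Identical 2 mm isotropic grids pass; a 1 mm origin shift fails.
aff = np.diag([2.0, 2.0, 2.0, 1.0])
shifted = aff.copy()
shifted[0, 3] += 1.0  # translate the origin by 1 mm along x
ok, _ = check_grid_match((91, 109, 91), aff, (91, 109, 91), aff)
bad, reason = check_grid_match((91, 109, 91), aff, (91, 109, 91), shifted)
```

Returning a reason string, not just a boolean, keeps the validator's output auditable.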
Part 3

Example Tasks

Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw files and constraints, the agent must use real tools to produce verifiable deliverables (Raw Input → Raw Output).

Design Principles

1. Must use software: tasks require real tools/scripts; pure LLM reasoning is not acceptable.
2. Raw Input → Raw Output: define I/O and acceptance criteria; the agent chooses the path.
3. Operational scoring: prefer deterministic validation (headers, tolerances, schemas) plus evidence-based checks.
4. Scalable data: prioritize public datasets (BIDS) and synthetic perturbations.
5. Evidence packs: a task is only useful if a reviewer can audit it quickly.

Core Tasks (5)

Analysis Extraction

ROI Statistics Summary from a Voxelwise Map

Convert a voxelwise statistical map into a compact ROI-level summary for downstream reporting and sanity checks.
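The core of this task can be sketched with NumPy alone, assuming the map and label mask are already on the same grid (grid matching is the separate task below); the toy arrays and summary fields are illustrative:

```python
import numpy as np

def roi_summary(stat_map, labels) -> dict:
    """Summarize a voxelwise map per integer ROI label (0 = background).
    Assumes both arrays share the same voxel grid."""
    out = {}
    for label in np.unique(labels):
        if label == 0:
            continue
        vals = stat_map[labels == label]
        out[int(label)] = {
            "n_voxels": int(vals.size),
            "mean": float(vals.mean()),
            "peak": float(vals.max()),
        }
    return out

# Toy example: a 2x2x1 map with two ROIs (labels 1 and 2).
stat = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])
mask = np.array([[[1], [1]], [[2], [0]]])
summary = roi_summary(stat, mask)
```

The dictionary serializes directly to the JSON/CSV outputs the deterministic validator checks.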

Data Preprocessing

Resample an ROI Mask to Match a Target Grid (Label-Safe)

A common failure in ROI analysis is grid mismatch; label masks must be resampled with nearest-neighbor interpolation to preserve integer labels.
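The label-safety point can be shown with a small NumPy sketch that resamples by index rounding. This assumes aligned fields of view; a real pipeline would resample through the affines (e.g., nilearn's `resample_to_img` with `interpolation="nearest"`), and the function name here is hypothetical:

```python
import numpy as np

def resample_labels_nn(labels, target_shape):
    """Nearest-neighbor resampling of an integer label volume onto a
    target grid, assuming aligned fields of view."""
    src = np.asarray(labels)
    idx = [np.clip(np.round(np.linspace(0, s - 1, t)).astype(int), 0, s - 1)
           for s, t in zip(src.shape, target_shape)]
    return src[np.ix_(*idx)]

# Upsample a 2x2 label patch to 4x4: every output voxel keeps one of the
# original integer labels, whereas linear interpolation would invent
# non-label values such as 1 or 4 at ROI boundaries.
labels = np.array([[0, 3], [5, 7]])
up = resample_labels_nn(labels, (4, 4))
```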

QC Triage

Skull-Stripping QC (Select Best Mask + Evidence)

Skull-stripping errors corrupt downstream segmentation and statistics; logs are insufficient, and a decision requires visual QC.
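Visual QC still makes the final call, but deterministic metrics can pre-rank candidate masks before review. A sketch using Dice overlap against a rough prior mask; the candidate names and toy arrays are hypothetical:

```python
import numpy as np

def dice(a, b) -> float:
    """Dice overlap between two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def rank_masks(candidates, prior):
    """Order candidate brain masks by Dice against a rough prior mask,
    best first; the top candidates then go to visual review."""
    return sorted(candidates,
                  key=lambda name: dice(candidates[name], prior),
                  reverse=True)

# Toy 4x4 "volumes": candidate A matches the prior, B barely overlaps.
prior = np.ones((4, 4), bool)
cand_a = prior.copy()
cand_b = np.zeros((4, 4), bool)
cand_b[0, 0] = True
order = rank_masks({"bet_default": cand_a, "bet_aggressive": cand_b}, prior)
```

The ranking plus the per-candidate scores is exactly the kind of evidence pack the task asks for.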

Deliverable Packaging

Surface Visualization Export (Workbench Scene → PNG)

Surface workflows (HCP/fsLR) require reproducible exports (same map index, view angle, color scale).

QC Triage

Registration/Normalization QC (Tri-Planar Overlay Screenshots)

Misregistration can "pass" computationally but invalidates results; overlay QC catches gross failures.

Recommended Tool Stack

Standards-compliant tools for neuroimaging data handling, processing, and visualization.

BIDS
DICOM
NIfTI
MRIQC
fMRIPrep
FreeSurfer
FSL
AFNI
ANTs
nibabel
nilearn
ITK-SNAP
Connectome Workbench

Layer B (Optional Extension): Broader Neuroscience Tasks

The same engineering pattern applies to adjacent modalities with different standard artifacts:

Ephys / Calcium Imaging

Standardization via NWB; tasks like spike-sorting curation based on objective metrics.

Behavior Video / Kinematics

Pose-tracking QC (confidence drops, swaps) and timestamp alignment.

Omics / Spatial Transcriptomics

QC of read depth/mitochondrial fraction and reproducible cell-type label transfer.

Task Comparison Summary

| Task | Required Software | Industry Rep. | Scoring | Category |
| --- | --- | --- | --- | --- |
| ROI Statistics Summary | nibabel/nilearn | ★★★★★ | ★★★★★ | Analysis |
| Resample ROI Mask | nibabel + ANTs/FSL | ★★★★☆ | ★★★★★ | Preprocessing |
| Skull-Stripping QC | Viewer/screenshot | ★★★★★ | ★★★★☆ | QC |
| Surface Visualization Export | Workbench | ★★★☆☆ | ★★★★★ | Deliverable |
| Registration QC | FSLeyes/ITK-SNAP | ★★★★★ | ★★★★☆ | QC |

Contribute to Neuroimaging

We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of the areas above.

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.