Under Review: Xinyang Han (UCB), Yiyou Sun (UCB)

Neuroimaging

Standardized artifacts (BIDS/NIfTI), repeatable QC failure modes, and audit-ready deliverables make neuroimaging ideal for execution-focused agent benchmarks.

Part 1

Neuroimaging Overview

Neuroimaging work is a closed loop: a scientific question becomes a protocol, protocols produce data, and data becomes decisions through computation. The most scalable work is downstream on computers: preprocessing, QC, analysis, and reporting can be standardized, versioned, and audited.

Two Layers of Analysis

Layer A: Neuroimaging Core

Standardized artifacts (DICOM → NIfTI, BIDS datasets, derivatives) with repeatable failure modes.

Layer B: Broader Neuroscience

Adjacent data families (ephys, behavior, omics) following the same engineering logic: standardization → core ops → QC + evidence.

Operational Maturity vs. Sensitivity

Neuroimaging sits at the intersection of radiology, neuroscience, pharma/clinical trials, and medical AI. A practical lens for "what is benchmarkable" is:

Benchmarkability: maturity high enough + sensitivity real enough = a clear task contract

Structural MRI Pipelines

Maturity: High
Sensitivity: Manageable

Segmentation, regional volumetrics, cortical thickness

Known Failure Modes
  • Over- or under-aggressive skull stripping
  • Surface reconstruction failures

fMRI / Connectomics

Maturity: Moderate
Sensitivity: Higher

Sensitive to motion, denoising, confound regression, statistical assumptions

Known Failure Modes
  • Same dataset → different connectivity matrices under different preprocessing

Diffusion MRI / Tractography

Maturity: Mixed
Sensitivity: Method-dependent

Tensor metrics (FA/MD) stable under QC; tractography varies by algorithm

Known Failure Modes
  • Tractography-derived connectomes vary substantially by algorithm and parameters

PET Quantification

Maturity: Mixed
Sensitivity: Protocol-sensitive

Results depend on reconstruction, registration quality, reference region choices

Known Failure Modes
  • SUVR normalization makes reference-region selection a source of variance

For benchmarking, we want maturity high enough to write a clear task contract, but sensitivity real enough that evidence and QC matter.

End-to-End Processing Chain

From problem definition to deliverables—showing where computer-completable work emerges.

Each step is tagged as Computer (computer-completable) or On-Site (physical).

1. Problem & Study Design (Computer): endpoints, confounds, sample size
2. Acquisition (On-Site): protocol execution and participant workflow
3. Data Intake & Standardization (Computer): de-ID and DICOM→BIDS conversion
4. Preprocessing + QC (Computer): correction, registration, segmentation
5. Analysis & Modeling (Computer): GLM, connectivity, prediction
6. Deliverables (Computer): figures, tables, audit packages

Modality View: What "Correct" Looks Like

Each modality has characteristic outputs and well-known QC failure modes.

🧠 Structural MRI
  Correct output: skull stripping, segmentation
  QC failures: WM/GM boundary errors

📊 fMRI (rest/task)
  Correct output: motion correction, denoising
  QC failures: high motion; carpet-plot artifacts

🔗 Diffusion MRI
  Correct output: eddy correction, tensor fitting
  QC failures: gradient table errors

☢️ PET
  Correct output: partial volume correction, normalization
  QC failures: MRI↔PET registration issues

EEG/MEG
  Correct output: artifact removal, epoching
  QC failures: channel dropout; muscle/eye artifacts

Key Roles

Understanding who owns which decisions—and where handoff failures occur.

Role handoffs are a common source of hidden failures. Good practice shows up as explicit artifacts (schemas, manifests, configs).

1. PI / Clinical Researcher: problem framing and interpretation boundaries
   Key artifacts: study protocols, analysis plans, interpretation guidelines
   Handoff risk: ambiguous endpoints propagate through the pipeline

2. MR Physics / Scanning Team: protocol design and harmonization
   Key artifacts: acquisition protocols, phantom QC, site harmonization specs
   Handoff risk: protocol drift across sites and time

3. Neuroinformatics / Data Engineering: de-ID, BIDS metadata, provenance, access control
   Key artifacts: BIDS datasets, data dictionaries, provenance logs, access manifests
   Handoff risk: metadata inconsistencies, broken provenance

4. Imaging Analyst / Scientist: preprocessing, QC, ROI/statistics, visualization
   Key artifacts: derivatives, QC reports, statistical maps, figures
   Handoff risk: undocumented parameter choices

5. ML / Software Engineer: automation, platform pipelines, productization
   Key artifacts: pipeline configs, container images, CI/CD workflows
   Handoff risk: version/environment drift

Part 2

Where LLM Agents Fit

Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks—while respecting the "human-last" boundary for high-stakes decisions.

The "Human-Last" Boundary in Neuroimaging

Neuroimaging has two hard constraints:

  • Outputs are sensitive to engineering choices (preprocessing/model configs), so correctness must be evidenced.
  • Many settings are audit- and privacy-sensitive (clinical trials, regulated environments), so provenance matters.
Agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous interpretation and high-stakes scientific/clinical decisions.
High-Confidence Agent Surfaces (v1): Execution-Heavy Tasks

Data Intake + Standardization

  • De-ID checks
  • DICOM→BIDS conversion
  • BIDS validation
  • Metadata consistency
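As a concrete sketch of the metadata-consistency check, the snippet below scans a BIDS-style tree for BOLD runs whose JSON sidecars are missing or incomplete. It uses only the standard library; the required-field set is an illustrative subset of the BIDS sidecar fields, not a full validator, and the directory layout in the demo is synthetic.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset of required BOLD sidecar fields (not the full BIDS spec).
REQUIRED_BOLD_FIELDS = {"RepetitionTime", "TaskName"}

def check_sidecars(bids_root) -> list:
    """Return human-readable metadata problems found under bids_root."""
    problems = []
    for nii in sorted(Path(bids_root).rglob("*_bold.nii.gz")):
        sidecar = nii.with_name(nii.name.replace(".nii.gz", ".json"))
        if not sidecar.exists():
            problems.append(f"missing sidecar: {sidecar.name}")
            continue
        meta = json.loads(sidecar.read_text())
        for field in sorted(REQUIRED_BOLD_FIELDS - meta.keys()):
            problems.append(f"{sidecar.name}: missing field {field}")
    return problems

# Tiny synthetic dataset: one run whose sidecar omits TaskName.
root = Path(tempfile.mkdtemp())
func = root / "sub-01" / "func"
func.mkdir(parents=True)
(func / "sub-01_task-rest_bold.nii.gz").write_bytes(b"")
(func / "sub-01_task-rest_bold.json").write_text(
    json.dumps({"RepetitionTime": 2.0}))
problems = check_sidecars(root)  # flags the missing TaskName
```

In a real task, the problem list itself becomes part of the evidence pack rather than a pass/fail bit.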

Preprocessing Wrappers + QC Triage

  • Run standardized pipelines
  • Detect common failures
  • Produce evidence packs

Analysis Extraction

  • ROI summaries
  • Mask resampling and header consistency
  • Basic report tables/plots

Deliverable Packaging

  • Structured results
  • Reproducible configs
  • Minimal, reviewable figures
What to Avoid in v1 Benchmarks
  • Operating scanners or on-site workflows (physical and institutional variability)
  • Open-ended scientific interpretation (e.g., causal claims without narrow acceptance criteria)

Review Agents: Deterministic + Evidence-Based

Benchmarking should focus on raw input and raw output, not narrative.

Layer 1: Deterministic Validator
  • File existence
  • NIfTI header consistency (shape/affine)
  • Numeric tolerances
  • JSON/CSV schema checks

Layer 2: Evidence-Based Reviewer
  • Screenshot consistency (view instructions honored)
  • Visual artifact detection
  • Coherence between tables and figures
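A minimal sketch of the Layer 1 grid check, assuming the shape and affine have already been read (with nibabel these would come from `img.shape` and `img.affine`); the function name and tolerance are illustrative choices:

```python
import numpy as np

def check_grid_match(shape_a, affine_a, shape_b, affine_b, atol=1e-4):
    """Deterministic grid check: identical voxel dimensions and, within a
    numeric tolerance, the same voxel-to-world affine."""
    if tuple(shape_a) != tuple(shape_b):
        return False, "shape mismatch"
    if not np.allclose(np.asarray(affine_a), np.asarray(affine_b), atol=atol):
        return False, "affine mismatch beyond tolerance"
    return True, "ok"

# Identical 2 mm isotropic grids pass; a 1 mm origin shift fails.
aff = np.diag([2.0, 2.0, 2.0, 1.0])
shifted = aff.copy()
shifted[0, 3] += 1.0  # translate the origin by 1 mm along x
ok, _ = check_grid_match((91, 109, 91), aff, (91, 109, 91), aff)
bad, reason = check_grid_match((91, 109, 91), aff, (91, 109, 91), shifted)
```

Returning a reason string, not just a boolean, keeps the validator's output auditable.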
Part 3

Example Tasks

Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw files and constraints, the agent must use real tools to produce verifiable deliverables (Raw Input → Raw Output).

Design Principles

1. Must use software: tasks require real tools/scripts; pure LLM reasoning is not acceptable.
2. Raw Input → Raw Output: define I/O and acceptance criteria; the agent chooses the path.
3. Operational scoring: prefer deterministic validation (headers, tolerances, schemas) plus evidence-based checks.
4. Scalable data: prioritize public datasets (BIDS) and synthetic perturbations.
5. Evidence packs: a task is only useful if a reviewer can audit it quickly.

Core Tasks (5)

Analysis Extraction

ROI Statistics Summary from a Voxelwise Map

Convert a voxelwise statistical map into a compact ROI-level summary for downstream reporting and sanity checks.
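The core of this task can be sketched with NumPy alone, assuming the map and label mask are already on the same grid (grid matching is the separate task below); the toy arrays and summary fields are illustrative:

```python
import numpy as np

def roi_summary(stat_map, labels) -> dict:
    """Summarize a voxelwise map per integer ROI label (0 = background).
    Assumes both arrays share the same voxel grid."""
    out = {}
    for label in np.unique(labels):
        if label == 0:
            continue
        vals = stat_map[labels == label]
        out[int(label)] = {
            "n_voxels": int(vals.size),
            "mean": float(vals.mean()),
            "peak": float(vals.max()),
        }
    return out

# Toy example: a 2x2x1 map with two ROIs (labels 1 and 2).
stat = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])
mask = np.array([[[1], [1]], [[2], [0]]])
summary = roi_summary(stat, mask)
```

The dictionary serializes directly to the JSON/CSV outputs the deterministic validator checks.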

Data Preprocessing

Resample an ROI Mask to Match a Target Grid (Label-Safe)

A common failure in ROI analysis is grid mismatch; label masks must be resampled with nearest-neighbor interpolation to preserve integer labels.
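The label-safety point can be shown with a small NumPy sketch that resamples by index rounding. This assumes aligned fields of view; a real pipeline would resample through the affines (e.g., nilearn's `resample_to_img` with `interpolation="nearest"`), and the function name here is hypothetical:

```python
import numpy as np

def resample_labels_nn(labels, target_shape):
    """Nearest-neighbor resampling of an integer label volume onto a
    target grid, assuming aligned fields of view."""
    src = np.asarray(labels)
    idx = [np.clip(np.round(np.linspace(0, s - 1, t)).astype(int), 0, s - 1)
           for s, t in zip(src.shape, target_shape)]
    return src[np.ix_(*idx)]

# Upsample a 2x2 label patch to 4x4: every output voxel keeps one of the
# original integer labels, whereas linear interpolation would invent
# non-label values such as 1 or 4 at ROI boundaries.
labels = np.array([[0, 3], [5, 7]])
up = resample_labels_nn(labels, (4, 4))
```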

QC Triage

Skull-Stripping QC (Select Best Mask + Evidence)

Skull-stripping errors corrupt downstream segmentation and statistics; logs are insufficient, and a decision requires visual QC.
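Visual QC still makes the final call, but deterministic metrics can pre-rank candidate masks before review. A sketch using Dice overlap against a rough prior mask; the candidate names and toy arrays are hypothetical:

```python
import numpy as np

def dice(a, b) -> float:
    """Dice overlap between two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def rank_masks(candidates, prior):
    """Order candidate brain masks by Dice against a rough prior mask,
    best first; the top candidates then go to visual review."""
    return sorted(candidates,
                  key=lambda name: dice(candidates[name], prior),
                  reverse=True)

# Toy 4x4 "volumes": candidate A matches the prior, B barely overlaps.
prior = np.ones((4, 4), bool)
cand_a = prior.copy()
cand_b = np.zeros((4, 4), bool)
cand_b[0, 0] = True
order = rank_masks({"bet_default": cand_a, "bet_aggressive": cand_b}, prior)
```

The ranking plus the per-candidate scores is exactly the kind of evidence pack the task asks for.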

Deliverable Packaging

Surface Visualization Export (Workbench Scene → PNG)

Surface workflows (HCP/fsLR) require reproducible exports (same map index, view angle, color scale).

QC Triage

Registration/Normalization QC (Tri-Planar Overlay Screenshots)

Misregistration can "pass" computationally but invalidates results; overlay QC catches gross failures.

Recommended Tool Stack

Standards-compliant tools for neuroimaging data handling, processing, and visualization.

BIDS
DICOM
NIfTI
MRIQC
fMRIPrep
FreeSurfer
FSL
AFNI
ANTs
nibabel
nilearn
ITK-SNAP
Connectome Workbench

Layer B (Optional Extension): Broader Neuroscience Tasks

The same engineering pattern applies to adjacent modalities with different standard artifacts:

Ephys / Calcium Imaging

Standardization via NWB; tasks like spike-sorting curation based on objective metrics.

Behavior Video / Kinematics

Pose-tracking QC (confidence drops, swaps) and timestamp alignment.

Omics / Spatial Transcriptomics

QC of read depth/mitochondrial fraction and reproducible cell-type label transfer.

Task Comparison Summary

| Task | Required Software | Industry Rep. | Scoring | Category |
| --- | --- | --- | --- | --- |
| ROI Statistics Summary | nibabel/nilearn | ★★★★★ | ★★★★★ | Analysis |
| Resample ROI Mask | nibabel + ANTs/FSL | ★★★★☆ | ★★★★★ | Preprocessing |
| Skull-Stripping QC | Viewer/screenshot | ★★★★★ | ★★★★☆ | QC |
| Surface Visualization Export | Workbench | ★★★☆☆ | ★★★★★ | Deliverable |
| Registration QC | FSLeyes/ITK-SNAP | ★★★★★ | ★★★★☆ | QC |

Contribute to Neuroimaging

We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of the areas above.

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.