Physics
Pipeline-driven research with verification-heavy workflows and professional software toolchains.
Physics Research Overview
Physics research is one of the cleanest "agent-evaluable" scientific domains because much of the work is pipeline-driven, produces digital artifacts, and has strong norms around verification and reproducibility.
Why "physics-capable" is not yet "research-capable"
Despite rapid progress, current AI systems are not yet reliable at autonomous, frontier physics research. Three failure modes dominate:
Tool Correctness
The system describes a method but fails to run the correct toolchain or produces invalid artifacts.
Example: the agent claims to use FeynRules but outputs a malformed UFO model that won't load in MadGraph.
Convention Discipline
Small scheme/normalization/branch mistakes invalidate conclusions.
Example: a correct amplitude formula combined with a mismatched regularization scheme (FDH vs. CDR) produces a ~10% error in the cross-section.
Verification Gap
Research demands reproducible checks (symbolic equivalence, phase-space sampling, data–MC comparison), not plausible-looking expressions.
Example: the agent produces a plausible-looking integral reduction with no cross-check step, so errors surface only at publication review.
Key insight: These failure modes matter more than "knowing physics facts." The bottleneck is not access to knowledge; it is the reliability of end-to-end scientific execution.
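The phase-space-sampling check named above can be made mechanical. The sketch below is a minimal stand-in: `candidate` and `reference` are placeholders for an agent-produced closed form and an independent evaluation of the same quantity, and equivalence is accepted only if they agree at many random points within a tight relative tolerance.

```python
import random

def candidate(x):
    # Placeholder for an agent-produced expression, e.g. a reduced amplitude.
    return (1 - x**2) / (1 - x)

def reference(x):
    # Placeholder for an independent evaluation of the same quantity.
    return 1 + x

def sampled_equivalence(f, g, n=1000, rel_tol=1e-10, lo=-0.9, hi=0.9):
    """Compare f and g at n random points: a cheap one-variable stand-in
    for sampling over phase-space invariants."""
    for _ in range(n):
        x = random.uniform(lo, hi)
        scale = max(1.0, abs(g(x)))
        if abs(f(x) - g(x)) > rel_tol * scale:
            return False
    return True

assert sampled_equivalence(candidate, reference)
```

The point is not the toy functions but the contract: a submission without a machine-checkable step like this is exactly the "verification gap" failure mode.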
Where LLM Agents Fit
Analysis of roles and representative workflows in physics research, with focus on where AI agents can reliably contribute to execution and evidence packaging.
The "human-last" boundary in physics
"Agents can reliably execute what humans already know how to specify and verify, but humans retain judgment for ambiguous, open-ended discovery decisions."
This makes execution-heavy, tool-driven, reviewable workflows the best v1 slice.
High-confidence agent surfaces (v1)
The most benchmarkable work product in physics is often not "derive a new theory" but a reviewable execution artifact: expressions, model files, event samples, and histograms that can be checked against ground truth.
A practical decomposition (boards)
Many physics workflows can be understood as handoffs across three "boards":
Compute
Symbolic/algebraic derivations, solvers, numerical pipelines
Simulate
Event generation, sampling, sweeps, parameter scans
Validate
Regression-to-reference, cross-tool checks, uncertainty/stability checks
Raw input → Professional software → Verifiable output
This aligns with benchmark design where tasks require software use, evaluation compares outputs to ground truth, and criteria are operational.
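A minimal sketch of that handoff, with toy physics content (the function bodies and file names below are illustrative stand-ins, not a real pipeline):

```python
import json
import math
import pathlib
import random

def compute(coupling):
    """Compute board: derive a quantity symbolically/numerically (toy stand-in)."""
    return {"weight": coupling**2 / (16 * math.pi)}

def simulate(model, n_events=10_000, seed=7):
    """Simulate board: seeded sampling so the run is exactly reproducible."""
    rng = random.Random(seed)
    total = sum(model["weight"] * rng.random() for _ in range(n_events))
    return {"estimate": total / n_events, "n_events": n_events}

def validate(result, ref_path, rel_tol=1e-8):
    """Validate board: regression against a committed reference artifact."""
    ref = json.loads(pathlib.Path(ref_path).read_text())
    return math.isclose(result["estimate"], ref["estimate"], rel_tol=rel_tol)

# Chained: the artifact from each board feeds the next, e.g.
#   result = simulate(compute(coupling=0.3))
#   ok = validate(result, "reference/estimate.json")
```

Each board emits an artifact the next board consumes, and the final board is a check, which is what makes the whole chain gradeable.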
Example Tasks (Canonical Workflows)
Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools to produce deliverables that a reviewer can verify.
Design Principles
Tasks require tools/software/scripts; pure LLM reasoning is not acceptable.
Define I/O and acceptance criteria; the agent chooses the path.
Validate deterministically or with tight numeric tolerances.
Explicitly specify scheme/normalization/branches when relevant.
Require a check step (cross-tool, regression, stability) as part of completion.
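As a concrete illustration of these principles, a task spec might pin the scheme, observable, and tolerance up front and grade only artifacts. Everything below (process, names, ground-truth value) is hypothetical:

```python
import json
import math
import pathlib

# Illustrative task spec: scheme and tolerance are pinned in the spec itself.
TASK_SPEC = {
    "process": "e+ e- -> q qbar g",
    "scheme": "CDR",                     # explicit, per the design principles
    "observable": "sigma_tot_pb",
    "rel_tolerance": 1e-6,
    "required_artifacts": ["result.json", "crosscheck.log"],
    "ground_truth": 42.0,                # placeholder value
}

def grade(submission_dir):
    d = pathlib.Path(submission_dir)
    # The check step is part of completion: missing artifacts fail the task.
    for name in TASK_SPEC["required_artifacts"]:
        if not (d / name).exists():
            return False
    # Deterministic numeric comparison within the stated tolerance.
    result = json.loads((d / "result.json").read_text())
    return math.isclose(result[TASK_SPEC["observable"]],
                        TASK_SPEC["ground_truth"],
                        rel_tol=TASK_SPEC["rel_tolerance"])
```

Note that the agent's solution path never appears in the grader: only the declared I/O contract does.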
Canonical Tasks (6)
High-energy process calculation (analytic amplitudes)
One-loop helicity amplitudes for e⁺e⁻ → qq̄g, including scheme-specific decomposition into universal singular operators plus finite remainders.
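The "universal singular operators plus finite remainders" structure is standard; in one common (Catani) notation the renormalized one-loop amplitude decomposes as

```latex
\left| \mathcal{M}^{(1)} \right\rangle
  = \boldsymbol{I}^{(1)}(\epsilon) \left| \mathcal{M}^{(0)} \right\rangle
  + \left| \mathcal{M}^{(1),\mathrm{fin}} \right\rangle
```

where the operator I⁽¹⁾(ε) carries the universal 1/ε² and 1/ε poles and the finite remainder is the scheme-dependent object a grader can compare numerically.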
BSM model-to-simulation chain (Lagrangian → events)
A simplified dark matter model (spin-1 mediator) exported to a UFO model file, simulated through an event chain, and analyzed with standard frameworks.
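A sketch of driving such a chain programmatically, assuming MadGraph5_aMC@NLO is installed with the exported UFO model on its model path; the model name, process, and output directory below are illustrative placeholders:

```python
import pathlib
import subprocess
import textwrap

# Illustrative MadGraph5_aMC@NLO command card for a simplified spin-1
# mediator model; 'DMsimp_spin1' and the particle names are placeholders.
commands = textwrap.dedent("""\
    import model DMsimp_spin1
    generate p p > xd xd~ j
    output dm_monojet_run
    launch
""")

if __name__ == "__main__":
    pathlib.Path("mg5_commands.txt").write_text(commands)
    subprocess.run(["./bin/mg5_aMC", "mg5_commands.txt"], check=True)
```

A scripted driver like this is what makes the run reproducible: the command card is itself an inspectable artifact.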
Multi-loop integral reduction (IBP → masters)
Reduce a two-loop QCD integral family to master integrals using integration-by-parts (IBP) relations.
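The mechanism is easiest to see one loop down. For the massless bubble family I(a₁, a₂), with propagators k² and (k+p)² raised to powers a₁ and a₂, inserting ∂/∂kᵘ acting on kᵘ times the integrand under the integral sign gives the seed identity

```latex
% Massless one-loop bubble: I(a_1,a_2) = \int d^dk \, (k^2)^{-a_1} ((k+p)^2)^{-a_2}
(d - 2a_1 - a_2)\, I(a_1, a_2)
  = a_2 \left[ I(a_1 - 1,\, a_2 + 1) - p^2\, I(a_1,\, a_2 + 1) \right]
```

Together with its mirror identity, this reduces every member of the family to the single master I(1,1); two-loop families work the same way but at a scale that requires automated solvers such as FIRE, Kira, or LiteRed.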
EFT operator basis and running (SMEFT-like)
Dimension-six operator analysis (basis construction, redundancy removal, and RG evolution).
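"Running" here means solving the standard one-loop renormalization-group equation for the Wilson coefficients,

```latex
\mu \frac{\mathrm{d} C_i(\mu)}{\mathrm{d}\mu}
  = \frac{1}{16\pi^2} \sum_j \gamma_{ij}\, C_j(\mu)
```

where the anomalous-dimension matrix γᵢⱼ mixes operators within the chosen basis (conventionally the Warsaw basis at dimension six), so basis bookkeeping and running are inseparable parts of the same task.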
Conformal bootstrap (crossing → SDP solve)
Numerical bootstrap bounds for the 3D Ising universality class via large semidefinite programs.
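The input to the SDP is the crossing equation for the four-point function of identical scalars φ:

```latex
\sum_{\mathcal{O}} \lambda_{\mathcal{O}}^2 \,
  F^{\Delta_\phi}_{\Delta_{\mathcal{O}}, \ell_{\mathcal{O}}}(u, v) = 0,
\qquad
F^{\Delta_\phi}_{\Delta, \ell}(u, v)
  \equiv v^{\Delta_\phi}\, g_{\Delta, \ell}(u, v)
       - u^{\Delta_\phi}\, g_{\Delta, \ell}(v, u)
```

The sum runs over exchanged operators including the identity, and positivity of the squared OPE coefficients turns exclusion bounds into semidefinite-programming feasibility problems, solvable with tools like SDPB.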
Data/MC validation and reinterpretation
Reproduce an experimental analysis at particle level and compare theory predictions across generator/systematic settings.
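The verifiable core of such a task is a per-bin comparison. A minimal sketch using NumPy, with hypothetical histogram arrays standing in for a published measurement and a generator prediction:

```python
import numpy as np

def chi2_per_dof(data, data_err, mc, mc_err):
    """Per-bin chi^2/ndf between a measured histogram and an MC prediction,
    treating data and MC uncertainties as uncorrelated."""
    data, mc = np.asarray(data, float), np.asarray(mc, float)
    err2 = np.asarray(data_err, float)**2 + np.asarray(mc_err, float)**2
    pulls = (data - mc) / np.sqrt(err2)
    return float(np.sum(pulls**2) / pulls.size)

# Hypothetical usage: accept the reinterpretation only if chi2/ndf is sane.
assert chi2_per_dof([10, 20, 15], [3, 4, 4], [11, 18, 16], [1, 1, 1]) < 3.0
```

In a real task the thresholds and correlation treatment would come from the analysis being reproduced; the point is that agreement is a number, not an impression.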
Implication for Evaluation
The tasks above share a common structure: raw input → professional software → verifiable output. This enables strict evaluation by checking artifacts (expressions, model files, event samples, histograms) against ground truth or tool-mediated equivalence.
Professional Tools in Physics Research
Contribute to Physics
We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map subfields, tools, and workflows in physics research. Share your perspective on the domain structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.