Physics
Pipeline-driven research with verification-heavy workflows and professional software toolchains.
Physics Research Overview
Physics research is one of the cleanest "agent-evaluable" scientific domains because much of the work is pipeline-driven, produces digital artifacts, and has strong norms around verification and reproducibility.
Why "physics-capable" is not yet "research-capable"
Despite rapid progress, current AI systems are not yet reliable at autonomous, frontier physics research. Three failure modes dominate:
Tool Correctness
The system describes a method but fails to run the correct toolchain or produces invalid artifacts.
Example: the agent claims to use FeynRules but outputs a malformed UFO model that won't load in MadGraph.
Convention Discipline
Small scheme/normalization/branch mistakes invalidate conclusions.
Example: a correct amplitude formula combined with a mismatched regularization scheme (FDH vs. CDR) produces a ~10% error in the cross-section.
Verification Gap
Research demands reproducible checks (symbolic equivalence, phase-space sampling, data–MC comparison), not plausible-looking expressions.
Example: the agent produces a plausible-looking integral reduction with no cross-check step, so errors surface only at publication review.
Key insight: These failure modes matter more than "knowing physics facts." The bottleneck is not access to knowledge; it is the reliability of end-to-end scientific execution.
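The phase-space-sampling check named above can be made mechanical. The sketch below is a minimal stand-in: `candidate` and `reference` are placeholders for an agent-produced closed form and an independent evaluation of the same quantity, and equivalence is accepted only if they agree at many random points within a tight relative tolerance.

```python
import random

def candidate(x):
    # Placeholder for an agent-produced expression, e.g. a reduced amplitude.
    return (1 - x**2) / (1 - x)

def reference(x):
    # Placeholder for an independent evaluation of the same quantity.
    return 1 + x

def sampled_equivalence(f, g, n=1000, rel_tol=1e-10, lo=-0.9, hi=0.9):
    """Compare f and g at n random points: a cheap one-variable stand-in
    for sampling over phase-space invariants."""
    for _ in range(n):
        x = random.uniform(lo, hi)
        scale = max(1.0, abs(g(x)))
        if abs(f(x) - g(x)) > rel_tol * scale:
            return False
    return True

assert sampled_equivalence(candidate, reference)
```

The point is not the toy functions but the contract: a submission without a machine-checkable step like this is exactly the "verification gap" failure mode.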
Where LLM Agents Fit
Analysis of roles and representative workflows in physics research, with focus on where AI agents can reliably contribute to execution and evidence packaging.
The "human-last" boundary in physics
"Agents can reliably execute what humans already know how to specify and verify, but humans retain judgment for ambiguous, open-ended discovery decisions."
This makes execution-heavy, tool-driven, reviewable workflows the best v1 slice.
High-confidence agent surfaces (v1)
The most benchmarkable work product in physics is often not "derive a new theory" but a reviewable execution artifact: expressions, model files, event samples, and histograms that can be checked against ground truth.
A practical decomposition (boards)
Many physics workflows can be understood as handoffs across three "boards":
Compute
Symbolic/algebraic derivations, solvers, numerical pipelines
Simulate
Event generation, sampling, sweeps, parameter scans
Validate
Regression-to-reference, cross-tool checks, uncertainty/stability checks
Raw input → Professional software → Verifiable output
This aligns with benchmark design where tasks require software use, evaluation compares outputs to ground truth, and criteria are operational.
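A minimal sketch of that handoff, with toy physics content (the function bodies and file names below are illustrative stand-ins, not a real pipeline):

```python
import json
import math
import pathlib
import random

def compute(coupling):
    """Compute board: derive a quantity symbolically/numerically (toy stand-in)."""
    return {"weight": coupling**2 / (16 * math.pi)}

def simulate(model, n_events=10_000, seed=7):
    """Simulate board: seeded sampling so the run is exactly reproducible."""
    rng = random.Random(seed)
    total = sum(model["weight"] * rng.random() for _ in range(n_events))
    return {"estimate": total / n_events, "n_events": n_events}

def validate(result, ref_path, rel_tol=1e-8):
    """Validate board: regression against a committed reference artifact."""
    ref = json.loads(pathlib.Path(ref_path).read_text())
    return math.isclose(result["estimate"], ref["estimate"], rel_tol=rel_tol)

# Chained: the artifact from each board feeds the next, e.g.
#   result = simulate(compute(coupling=0.3))
#   ok = validate(result, "reference/estimate.json")
```

Each board emits an artifact the next board consumes, and the final board is a check, which is what makes the whole chain gradeable.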
Example Tasks (Canonical Workflows)
Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools to produce deliverables that a reviewer can verify.
Design Principles
Tasks require tools/software/scripts; pure LLM reasoning is not acceptable.
Define I/O and acceptance criteria; the agent chooses the path.
Validate deterministically or with tight numeric tolerances.
Explicitly specify scheme/normalization/branches when relevant.
Require a check step (cross-tool, regression, stability) as part of completion.
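As a concrete illustration of these principles, a task spec might pin the scheme, observable, and tolerance up front and grade only artifacts. Everything below (process, names, ground-truth value) is hypothetical:

```python
import json
import math
import pathlib

# Illustrative task spec: scheme and tolerance are pinned in the spec itself.
TASK_SPEC = {
    "process": "e+ e- -> q qbar g",
    "scheme": "CDR",                     # explicit, per the design principles
    "observable": "sigma_tot_pb",
    "rel_tolerance": 1e-6,
    "required_artifacts": ["result.json", "crosscheck.log"],
    "ground_truth": 42.0,                # placeholder value
}

def grade(submission_dir):
    d = pathlib.Path(submission_dir)
    # The check step is part of completion: missing artifacts fail the task.
    for name in TASK_SPEC["required_artifacts"]:
        if not (d / name).exists():
            return False
    # Deterministic numeric comparison within the stated tolerance.
    result = json.loads((d / "result.json").read_text())
    return math.isclose(result[TASK_SPEC["observable"]],
                        TASK_SPEC["ground_truth"],
                        rel_tol=TASK_SPEC["rel_tolerance"])
```

Note that the agent's solution path never appears in the grader: only the declared I/O contract does.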
Canonical Tasks (6)
High-energy process calculation (analytic amplitudes)
One-loop helicity amplitudes for e⁺e⁻ → qq̄g, including scheme-specific decomposition into universal singular operators plus finite remainders.
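The "universal singular operators plus finite remainders" structure is standard; in one common (Catani) notation the renormalized one-loop amplitude decomposes as

```latex
\left| \mathcal{M}^{(1)} \right\rangle
  = \boldsymbol{I}^{(1)}(\epsilon) \left| \mathcal{M}^{(0)} \right\rangle
  + \left| \mathcal{M}^{(1),\mathrm{fin}} \right\rangle
```

where the operator I⁽¹⁾(ε) carries the universal 1/ε² and 1/ε poles and the finite remainder is the scheme-dependent object a grader can compare numerically.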
BSM model-to-simulation chain (Lagrangian → events)
A simplified dark matter model (spin-1 mediator) exported to a UFO model file, simulated through an event chain, and analyzed with standard frameworks.
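A sketch of driving such a chain programmatically, assuming MadGraph5_aMC@NLO is installed with the exported UFO model on its model path; the model name, process, and output directory below are illustrative placeholders:

```python
import pathlib
import subprocess
import textwrap

# Illustrative MadGraph5_aMC@NLO command card for a simplified spin-1
# mediator model; 'DMsimp_spin1' and the particle names are placeholders.
commands = textwrap.dedent("""\
    import model DMsimp_spin1
    generate p p > xd xd~ j
    output dm_monojet_run
    launch
""")

if __name__ == "__main__":
    pathlib.Path("mg5_commands.txt").write_text(commands)
    subprocess.run(["./bin/mg5_aMC", "mg5_commands.txt"], check=True)
```

A scripted driver like this is what makes the run reproducible: the command card is itself an inspectable artifact.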
Multi-loop integral reduction (IBP → masters)
Reduce a two-loop QCD integral family to master integrals using integration-by-parts (IBP) relations.
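The mechanism is easiest to see one loop down. For the massless bubble family I(a₁, a₂), with propagators k² and (k+p)² raised to powers a₁ and a₂, inserting ∂/∂kᵘ acting on kᵘ times the integrand under the integral sign gives the seed identity

```latex
% Massless one-loop bubble: I(a_1,a_2) = \int d^dk \, (k^2)^{-a_1} ((k+p)^2)^{-a_2}
(d - 2a_1 - a_2)\, I(a_1, a_2)
  = a_2 \left[ I(a_1 - 1,\, a_2 + 1) - p^2\, I(a_1,\, a_2 + 1) \right]
```

Together with its mirror identity, this reduces every member of the family to the single master I(1,1); two-loop families work the same way but at a scale that requires automated solvers such as FIRE, Kira, or LiteRed.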
EFT operator basis and running (SMEFT-like)
Dimension-six operator analysis (basis construction, redundancy removal, and RG evolution).
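"Running" here means solving the standard one-loop renormalization-group equation for the Wilson coefficients,

```latex
\mu \frac{\mathrm{d} C_i(\mu)}{\mathrm{d}\mu}
  = \frac{1}{16\pi^2} \sum_j \gamma_{ij}\, C_j(\mu)
```

where the anomalous-dimension matrix γᵢⱼ mixes operators within the chosen basis (conventionally the Warsaw basis at dimension six), so basis bookkeeping and running are inseparable parts of the same task.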
Conformal bootstrap (crossing → SDP solve)
Numerical bootstrap bounds for the 3D Ising universality class via large semidefinite programs.
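The input to the SDP is the crossing equation for the four-point function of identical scalars φ:

```latex
\sum_{\mathcal{O}} \lambda_{\mathcal{O}}^2 \,
  F^{\Delta_\phi}_{\Delta_{\mathcal{O}}, \ell_{\mathcal{O}}}(u, v) = 0,
\qquad
F^{\Delta_\phi}_{\Delta, \ell}(u, v)
  \equiv v^{\Delta_\phi}\, g_{\Delta, \ell}(u, v)
       - u^{\Delta_\phi}\, g_{\Delta, \ell}(v, u)
```

The sum runs over exchanged operators including the identity, and positivity of the squared OPE coefficients turns exclusion bounds into semidefinite-programming feasibility problems, solvable with tools like SDPB.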
Data/MC validation and reinterpretation
Reproduce an experimental analysis at particle level and compare theory predictions across generator/systematic settings.
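The verifiable core of such a task is a per-bin comparison. A minimal sketch using NumPy, with hypothetical histogram arrays standing in for a published measurement and a generator prediction:

```python
import numpy as np

def chi2_per_dof(data, data_err, mc, mc_err):
    """Per-bin chi^2/ndf between a measured histogram and an MC prediction,
    treating data and MC uncertainties as uncorrelated."""
    data, mc = np.asarray(data, float), np.asarray(mc, float)
    err2 = np.asarray(data_err, float)**2 + np.asarray(mc_err, float)**2
    pulls = (data - mc) / np.sqrt(err2)
    return float(np.sum(pulls**2) / pulls.size)

# Hypothetical usage: accept the reinterpretation only if chi2/ndf is sane.
assert chi2_per_dof([10, 20, 15], [3, 4, 4], [11, 18, 16], [1, 1, 1]) < 3.0
```

In a real task the thresholds and correlation treatment would come from the analysis being reproduced; the point is that agreement is a number, not an impression.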
Implication for Evaluation
The tasks above share a common structure: raw input → professional software → verifiable output. This enables strict evaluation by checking artifacts (expressions, model files, event samples, histograms) against ground truth or tool-mediated equivalence.
Professional Tools in Physics Research
Contribute to Physics
We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map subfields, tools, and workflows in physics research. Share your perspective on the domain structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.