Supply Chain
From procurement to delivery: workflow-driven ERP operations with deterministic system-state verification.
Supply Chain Overview
Supply chain management spans the full lifecycle from raw-material sourcing to finished-goods delivery. It is the operational backbone of manufacturing, retail, and technology companies.
Why Supply Chain is Ideal for AI Agent Evaluation
Repeating operational loops (daily/weekly) rather than one-off research.
ERP records, shipment documents, invoices, receipts—standardized and auditable.
Many tasks reduce to rule checks, reconciliation, or system-state validation.
Real work happens in ERP/WMS/TMS tools; "just reasoning" is not sufficient.
The Three Core Roles (and ERP Alignment)
Supply chain operations center on three role clusters that map cleanly to ERP modules. In North America, Oracle ERP is dominant; SAP is more common in Europe.
Procurement
Buyer / Supply Chain Manager
- Supplier sourcing
- PO creation & receiving
- Invoice matching & payment
Planning
Planner / Demand Planner
- Forecasting & MRP runs
- Production scheduling
- Order release
Logistics
Logistics Coordinator
- Shipment execution
- Document creation
- Carrier booking & tracking
How the Roles Connect (Shared-System Closed Loop)
Execution Environment: Odoo vs Enterprise ERP
For a public, reproducible benchmark, Odoo Community is a pragmatic choice: deployable, scriptable, conceptually aligned with Oracle/SAP, and amenable to deterministic verification.
Odoo Community: Free, Docker-friendly, JSON-RPC + REST
Oracle ERP: Enterprise license, REST + auth
SAP: Enterprise license, OData v2/v4
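To make "scriptable" concrete, the sketch below builds the JSON-RPC request body for reading confirmed purchase orders from Odoo. It assumes Odoo's external API conventions (the `object` service and its `execute_kw` method, posted to the `/jsonrpc` endpoint). The database name, user id, and password are placeholders, and the helper name `build_search_read` is hypothetical. Nothing is sent over the network.

```python
import json


def build_search_read(db, uid, password, model, domain, fields):
    """Build a JSON-RPC 2.0 body for an Odoo execute_kw search_read call."""
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": {
            "service": "object",
            "method": "execute_kw",
            # Positional args: credentials, model, method, method args, kwargs.
            "args": [db, uid, password, model, "search_read",
                     [domain], {"fields": fields}],
        },
        "id": 1,
    }


# Placeholder credentials; a real run would first authenticate to obtain uid.
payload = build_search_read(
    "demo_db", 2, "placeholder-password",
    "purchase.order", [["state", "=", "purchase"]], ["name", "amount_total"],
)
body = json.dumps(payload)  # POST this to <odoo-url>/jsonrpc
```

Because the same call shape works for any model, a review agent can read back system state with the same mechanism the task agent uses to change it.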
Where LLM Agents Fit
Supply chain has a strong "human-last" boundary: agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous trade-offs and negotiation-heavy decisions.
This naturally favors benchmarks where success is based on ERP state transitions, deterministic rule checks, and auditable artifacts.
High-Confidence v1 Surfaces
P2P: 3-way matching, approvals, exception handling
MRP: scheduler execution, planned-order comparison, release actions
Shipping docs: cross-document consistency and compliance completeness
Design Principles for Workflow Tasks
Tasks require real tool/API/ERP operations; pure "read JSON and answer" is not acceptable.
Define only inputs, outputs, and acceptance criteria; the agent chooses how to execute.
Review Agent validates via deterministic rules and system-state checks.
Prefer demo data + parameterized synthesis; avoid hand-labeling.
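One possible shape for a task that follows these principles is sketched below: the specification names only inputs, expected outputs, and acceptance checks over final system state, and says nothing about the steps in between. The `TaskSpec` class, its field names, and the example state keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TaskSpec:
    """A task defined by inputs, outputs, and acceptance criteria only."""
    task_id: str
    inputs: Dict[str, object]
    outputs: List[str]
    acceptance: Dict[str, Callable[[dict], bool]]

    def evaluate(self, state: dict) -> Dict[str, bool]:
        # Each criterion is a deterministic check on observed system state.
        return {name: check(state) for name, check in self.acceptance.items()}


spec = TaskSpec(
    task_id="p2p-demo-001",
    inputs={"po_ref": "PO-DEMO-1"},
    outputs=["vendor_bill"],
    acceptance={
        "bill_exists": lambda s: s.get("bill_state") is not None,
        "bill_posted": lambda s: s.get("bill_state") == "posted",
    },
)

result = spec.evaluate({"bill_state": "posted"})
```

Keeping the checks as plain functions over a state snapshot means the same specification can score any agent, regardless of which tools it used to reach that state.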
Representative Workflows (Example Tasks)
Three canonical workflows cover the three core roles. Each is defined in AgentHLE style: end-to-end execution on a computer with real ERP tools.
Procure-to-Pay (P2P) — A Buyer's Day
P2P is the operational loop from requisition to payment. The critical control point is 3-way matchin...
MRP Run + Order Release — A Planner's Day
MRP is the planning engine in manufacturing. The daily loop: receive demand signals → run MRP in ERP...
International Shipping Execution + Documents — A Logistics Coordinator's Day
For logistics coordinators, the core work is execution and document correctness. In global trade, a ...
Data Collection & Scaling
All three workflows can produce ground truth by system computation and rules, not manual labeling.
Four Data Strategies (AgentHLE Framing)
| Data Type | Definition | Supply Chain Fit |
|---|---|---|
| Sea-level public data | Massive public corpora online | Not applicable (ERP data is private) |
| Parameterized synthetic | Define knobs; generate many cases | MRP ★★★★★ (BOM + inventory + params) |
| Template + synthetic | Standard templates with controlled variation | Shipping docs ★★★★ (HS code + templates) |
| ERP testing data | Demo/test datasets + synthetic perturbations | P2P ★★★★ (demo POs + injected exceptions) |
P2P Scaling
Demo ERP data + synthetic exception injection
MRP Scaling
Parameterized synthetic (system-computed ground truth)
Shipping Docs Scaling
Template + synthetic with standards
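As a minimal sketch of the parameterized-synthetic strategy, the generator below draws a few knobs from a seeded random source and computes the expected planned quantity from them. The knob names and ranges are illustrative assumptions, and the simple lot-size rounding of net requirements stands in for the ERP scheduler that would compute ground truth in a real setup.

```python
import math
import random


def make_case(seed):
    """Generate one reproducible MRP case with computed ground truth."""
    rng = random.Random(seed)  # seeded so every case can be regenerated
    demand = rng.randint(50, 500)
    on_hand = rng.randint(0, 300)
    lot_size = rng.choice([1, 10, 25, 50])

    # Stand-in for the scheduler: net requirement rounded up to a lot multiple.
    net = max(0, demand - on_hand)
    planned = math.ceil(net / lot_size) * lot_size

    return {
        "seed": seed,
        "demand": demand,
        "on_hand": on_hand,
        "lot_size": lot_size,
        "expected_planned_qty": planned,
    }


cases = [make_case(seed) for seed in range(100)]
```

Since the expected value is derived from the knobs, adding cases costs only compute, not labeling effort.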
Key Scaling Principle
All three workflows can produce ground truth by system computation and rules, not manual labeling:
P2P: ERP matching outputs and record states
MRP: Scheduler output
Shipping docs: Consistency math + compliance rules + system lookup
Review Agent Design
Three-layer validation architecture for automated evaluation.
Validation Pipeline
Data Completeness
Required fields present; ERP records created successfully.
Rule Correctness
Matching rules, MRP outputs, cross-document consistency checks.
Domain Compliance
Approvals, HS code validity, dangerous goods completeness.
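The three layers can be sketched as an ordered list of checks that stops at the first failure, so later layers never run against incomplete data. The record layout, field names, and the specific rule in each layer are illustrative assumptions, not the benchmark's actual checks.

```python
def check_completeness(rec):
    """Layer 1: the required records are present."""
    missing = [k for k in ("po", "receipt", "invoice") if k not in rec]
    return (not missing, "missing records: " + ", ".join(missing))


def check_rules(rec):
    """Layer 2: a sample matching rule on quantities."""
    ok = rec["invoice"]["qty"] <= rec["receipt"]["qty"]
    return (ok, "invoiced quantity exceeds received quantity")


def check_compliance(rec):
    """Layer 3: a sample domain control (an approver is recorded)."""
    return (bool(rec.get("approved_by")), "no approver recorded")


LAYERS = [
    ("data_completeness", check_completeness),
    ("rule_correctness", check_rules),
    ("domain_compliance", check_compliance),
]


def review(rec):
    """Run layers in order and report the first one that fails."""
    for name, check in LAYERS:
        ok, reason = check(rec)
        if not ok:
            return {"passed": False, "failed_layer": name, "reason": reason}
    return {"passed": True, "failed_layer": None, "reason": None}


good = {"po": {"qty": 10}, "receipt": {"qty": 10},
        "invoice": {"qty": 10}, "approved_by": "manager"}
verdict = review(good)
```

Reporting the failing layer by name gives a score that is both deterministic and easy to audit.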
Review Agent Operations by Workflow
P2P Review Agent
- Query ERP to confirm PO/Receipt/Invoice records exist
- Read matching status and compute line-level deltas
- Validate tolerance application and approval routing
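The line-level delta and tolerance checks might look like the sketch below for a single PO line. The 2% price tolerance, the field names, and the exact matching rule are assumptions chosen for illustration. Real tolerances are configured per company in the ERP.

```python
from decimal import Decimal


def three_way_match(po, receipt, invoice, price_tol=Decimal("0.02")):
    """Compare one PO line against its receipt and invoice lines."""
    deltas = {
        "qty_vs_receipt": invoice["qty"] - receipt["qty"],
        "price_vs_po": invoice["unit_price"] - po["unit_price"],
    }
    # Quantity: never bill more than received, never receive more than ordered.
    qty_ok = invoice["qty"] <= receipt["qty"] <= po["qty"]
    # Price: deviation must stay within a fraction of the PO unit price.
    price_ok = abs(deltas["price_vs_po"]) <= po["unit_price"] * price_tol
    return {"matched": qty_ok and price_ok, "deltas": deltas}


po_line = {"qty": 100, "unit_price": Decimal("10.00")}
receipt_line = {"qty": 100}
invoice_line = {"qty": 100, "unit_price": Decimal("10.10")}

result = three_way_match(po_line, receipt_line, invoice_line)
```

`Decimal` is used for prices so that tolerance comparisons are not affected by binary floating-point rounding.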
MRP Review Agent
- Run MRP scheduler independently for reference output
- Compare planned orders item-by-item (qty and date)
- Verify released MO/PO records exist in system
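The item-by-item comparison can be sketched as a diff between two mappings from product code to planned quantity and due date. The product codes, the mapping layout, and the function name are hypothetical. In practice both sides would be read from the ERP's planned-order records.

```python
from datetime import date


def compare_planned_orders(reference, candidate):
    """Return (item, kind) pairs where the candidate plan deviates.

    Both arguments map a product code to a (quantity, due_date) tuple.
    """
    diffs = []
    for item in sorted(set(reference) | set(candidate)):
        expected = reference.get(item)
        actual = candidate.get(item)
        if expected is None:
            diffs.append((item, "unexpected"))
        elif actual is None:
            diffs.append((item, "missing"))
        elif expected != actual:
            diffs.append((item, "mismatch"))
    return diffs


reference = {
    "BOLT-M8": (500, date(2025, 3, 10)),
    "FRAME-A": (40, date(2025, 3, 12)),
}
candidate = {
    "BOLT-M8": (500, date(2025, 3, 10)),
    "FRAME-A": (40, date(2025, 3, 14)),  # right quantity, wrong date
    "PANEL-X": (10, date(2025, 3, 15)),  # not in the reference plan
}

diffs = compare_planned_orders(reference, candidate)
```

An empty result means the agent's plan matches the reference run on every item for both quantity and date.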
Shipping Docs Review Agent
- Cross-check consistency (weight, pieces, value)
- Validate HS codes against reference DB
- Validate DG fields completeness when applicable
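A minimal sketch of these checks for one commercial invoice and packing list pair follows. The field names, the 0.5 kg weight tolerance, the required dangerous-goods fields, and the one-entry HS reference set are illustrative assumptions. A real reviewer would look codes up in a maintained tariff database.

```python
DG_REQUIRED = ("un_number", "hazard_class", "packing_group")


def check_shipment_docs(invoice, packing_list, hs_reference, weight_tol_kg=0.5):
    """Return a list of issues found across the two documents."""
    issues = []
    if invoice["pieces"] != packing_list["pieces"]:
        issues.append("pieces mismatch")
    if abs(invoice["gross_weight_kg"] - packing_list["gross_weight_kg"]) > weight_tol_kg:
        issues.append("gross weight mismatch")

    # Declared value must equal the sum of the invoice lines.
    line_total = sum(line["qty"] * line["unit_value"] for line in invoice["lines"])
    if round(line_total, 2) != round(invoice["declared_value"], 2):
        issues.append("declared value does not equal line total")

    for line in invoice["lines"]:
        if line["hs_code"] not in hs_reference:
            issues.append("unknown HS code: " + line["hs_code"])

    # Dangerous-goods fields are only required when the shipment is flagged.
    if packing_list.get("dangerous_goods"):
        for field in DG_REQUIRED:
            if not packing_list.get(field):
                issues.append("missing DG field: " + field)
    return issues


hs_reference = {"847130"}  # tiny stand-in for a reference database
invoice = {
    "pieces": 4, "gross_weight_kg": 12.0, "declared_value": 900.0,
    "lines": [{"qty": 2, "unit_value": 450.0, "hs_code": "847130"}],
}
packing_list = {"pieces": 4, "gross_weight_kg": 12.2, "dangerous_goods": False}

issues = check_shipment_docs(invoice, packing_list, hs_reference)
```

Returning every issue rather than stopping at the first makes partial-credit scoring possible if a benchmark wants it.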
Summary
Why These Three Workflows
To evaluate whether an AI agent can perform real supply chain work, we test the three daily "core loops":
| | P2P | MRP | Shipping Docs |
|---|---|---|---|
| Primary Role | Buyer | Planner | Logistics Coordinator |
| ERP Modules | Purchase + Accounting | Manufacturing + MRP | Inventory + Shipping |
| Scoring Determinism | High (rules) | High (deterministic compute) | High (consistency + compliance) |
| Must Use Software | ✓ ERP full loop | ✓ Scheduler + release | ✓ Master data + doc gen |
| Scaling Potential | ★★★★ | ★★★★★ | ★★★★ |
Immediate Next Steps
- Stand up an Odoo Docker environment with demo data and validate the three workflows end-to-end.
- Generate 20–30 representative cases per workflow and record ground truth.
- Implement the Review Agent checks and integrate scoring.
Professional Tools in Supply Chain
Contribute to Supply Chain
We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map ERP modules, roles, and workflows in supply chain operations. Share your perspective on the domain structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.