Under Review

Supply Chain

From procurement to delivery—workflow-driven ERP operations with deterministic, system-state verification.

Part 1

Supply Chain Overview

Supply chain management spans the full lifecycle from raw-material sourcing to finished-goods delivery. It is the operational backbone of manufacturing, retail, and technology companies.

~$2.3T: U.S. logistics costs (~8.7% of GDP)
$7B → $192B: AI in supply chain (2024 → 2034)
~39%: Projected CAGR
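The growth rate can be sanity-checked from the two market-size figures. A minimal sketch, assuming the 2024 → 2034 window counts as ten compounding years (the function name is illustrative):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.39 means 39%)."""
    return (end_value / start_value) ** (1 / years) - 1


# $7B in 2024 growing to $192B in 2034, assumed to be 10 compounding years.
rate = cagr(7e9, 192e9, 10)
print(f"{rate:.1%}")
```

The result lands at roughly 39%, consistent with the projected CAGR quoted above.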

Why Supply Chain is Ideal for AI Agent Evaluation

Highly Workflow-Driven

Repeating operational loops (daily/weekly) rather than one-off research.

Artifact-Rich

ERP records, shipment documents, invoices, receipts—standardized and auditable.

Deterministic-Scoring Friendly

Many tasks reduce to rule checks, reconciliation, or system-state validation.

Software-Native

Real work happens in ERP/WMS/TMS tools; "just reasoning" is not sufficient.

The Three Core Roles (and ERP Alignment)

Supply chain operations center on three role clusters that map cleanly to ERP modules. Oracle ERP is widely deployed in North America, while SAP is more common in Europe.

Procurement

Buyer / Supply Chain Manager

  • Supplier sourcing
  • PO creation & receiving
  • Invoice matching & payment
ERP: Procurement + Accounting

Planning

Planner / Demand Planner

  • Forecasting & MRP runs
  • Production scheduling
  • Order release
ERP: Manufacturing + MRP

Logistics

Logistics Coordinator

  • Shipment execution
  • Document creation
  • Carrier booking & tracking
ERP: Inventory + Shipping

How the Roles Connect (Shared-System Closed Loop)

Demand / Sales
Planner: MRP → Schedule
Buyer: PO → Receive → Pay
Logistics: Ship → Track
Factory / Customer
All three roles operate against the same ERP database, so inventory and financial state must stay consistent across modules.

Execution Environment: Odoo vs Enterprise ERP

For a public, reproducible benchmark, Odoo Community is a pragmatic choice: deployable, scriptable, conceptually aligned with Oracle/SAP, and amenable to deterministic verification.

Odoo Community

Free, Docker-friendly, XML-RPC + JSON-RPC

Oracle ERP Cloud

Enterprise license, REST + auth

SAP S/4HANA

Enterprise license, OData v2/v4
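For the Odoo option, "scriptable" in practice means calling model methods through the external API. The sketch below only builds the JSON-RPC request body for Odoo's generic `execute_kw` entry point and sends nothing over the network. The database name, user id, password, and host are hypothetical placeholders for a local demo instance.

```python
import json


def execute_kw_payload(db, uid, password, model, method, args, kwargs=None, request_id=1):
    """Build a JSON-RPC 2.0 body that calls `model.method` via Odoo's `execute_kw`."""
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": {
            "service": "object",
            "method": "execute_kw",
            "args": [db, uid, password, model, method, args, kwargs or {}],
        },
        "id": request_id,
    }


# Hypothetical credentials; a real run would first obtain `uid` by logging in.
payload = execute_kw_payload(
    db="odoo_demo",
    uid=2,
    password="admin",
    model="purchase.order",
    method="search_read",
    args=[[["state", "=", "purchase"]]],  # domain: confirmed purchase orders
    kwargs={"fields": ["name", "partner_id", "amount_total"], "limit": 5},
)

# Assumed target: POST this body to http://localhost:8069/jsonrpc
body = json.dumps(payload)
```

The same read-back call is what a verifier would use to inspect system state after an agent finishes.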

Part 2

Where LLM Agents Fit

Supply chain has a strong "human-last" boundary between routine execution and judgment:

"Agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous trade-offs and negotiation-heavy decisions."

This naturally favors benchmarks where success is based on ERP state transitions, deterministic rule checks, and auditable artifacts.

High-Confidence v1 Surfaces

Procurement Controls

3-way matching, approvals, exception handling

Planning Runs

MRP scheduler execution, planned-order comparison, release actions

Logistics Documents

Cross-document consistency and compliance completeness

Design Principles for Workflow Tasks

1. Must use software

Tasks require real tool/API/ERP operations; pure "read JSON and answer" is not acceptable.

2. Raw Input → Raw Output

Define only inputs, outputs, and acceptance criteria; the agent chooses how to execute.

3. Operational scoring

Review Agent validates via deterministic rules and system-state checks.

4. Scalable data

Prefer demo data + parameterized synthesis; avoid hand-labeling.
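One way to read principles 2 and 3 together is as a task record that carries only inputs, required outputs, and acceptance checks over the final system state. The sketch below is an illustration, not the benchmark's actual schema: `TaskSpec`, its field names, and the example PO number are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

State = dict[str, Any]  # snapshot of ERP records read back after the agent finishes


@dataclass
class TaskSpec:
    task_id: str
    inputs: dict[str, Any]                 # raw inputs handed to the agent
    required_outputs: list[str]            # record types that must exist afterwards
    checks: list[Callable[[State], bool]] = field(default_factory=list)

    def score(self, final_state: State) -> bool:
        """Pass only if every required record exists and every check holds."""
        outputs_present = all(final_state.get(key) for key in self.required_outputs)
        return outputs_present and all(check(final_state) for check in self.checks)


spec = TaskSpec(
    task_id="p2p-001",
    inputs={"po_number": "P00012"},  # hypothetical PO reference
    required_outputs=["receipt", "vendor_bill"],
    checks=[
        lambda s: s["vendor_bill"]["amount_total"] == s["purchase_order"]["amount_total"],
    ],
)

final_state = {
    "purchase_order": {"amount_total": 1250.0},
    "receipt": {"state": "done"},
    "vendor_bill": {"amount_total": 1250.0},
}
result = spec.score(final_state)
```

Nothing in the record says how the agent should reach the final state; only the state itself is scored.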

Part 3

Representative Workflows (Example Tasks)

Three canonical workflows that cover the three core roles. Each is defined in AgentHLE style: end-to-end execution on a computer with real ERP tools.

Procurement

Procure-to-Pay (P2P) — A Buyer's Day

P2P is the operational loop from requisition to payment. The critical control point is 3-way matchin...

Planning

MRP Run + Order Release — A Planner's Day

MRP is the planning engine in manufacturing. The daily loop: receive demand signals → run MRP in ERP...

Logistics

International Shipping Execution + Documents — A Logistics Coordinator's Day

For logistics coordinators, the core work is execution and document correctness. In global trade, a ...

Part 4

Data Collection & Scaling

All three workflows can produce ground truth by system computation and rules, not manual labeling.

Four Data Strategies (AgentHLE Framing)

Data Type | Definition | Supply Chain Fit
Sea-level public data | Massive public corpora online | Not applicable (ERP data is private)
Parameterized synthetic | Define knobs; generate many cases | MRP ★★★★★ (BOM + inventory + params)
Template + synthetic | Standard templates with controlled variation | Shipping docs ★★★★ (HS code + templates)
ERP testing data | Demo/test datasets + synthetic perturbations | P2P ★★★★ (demo POs + injected exceptions)

P2P Scaling

Demo ERP data + synthetic exception injection

Match type: perfect / qty diff / price diff / both
Delta size: within tol / out of tol / boundary
PO line count: 1 line / multi-line (5–20)
Exception type: partial / over / return / replacement
Ground truth: pre-execute operations in ERP; record match outputs
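The four knobs above define a small case grid. A minimal sketch of enumerating it and drawing a reproducible starter sample follows. It rests on two assumptions of this example: a "perfect" match carries no delta size, and 25 cases is an arbitrary sample size within the 20–30 range.

```python
import itertools
import random

MATCH_TYPES = ["perfect", "qty diff", "price diff", "both"]
DELTA_SIZES = ["within tol", "out of tol", "boundary"]
LINE_COUNTS = ["1 line", "multi-line (5-20)"]
EXCEPTION_TYPES = ["partial", "over", "return", "replacement"]


def enumerate_cases():
    """Cross the four knobs, collapsing delta size for perfect matches."""
    cases = []
    grid = itertools.product(MATCH_TYPES, DELTA_SIZES, LINE_COUNTS, EXCEPTION_TYPES)
    for match, delta, lines, exception in grid:
        if match == "perfect":
            if delta != DELTA_SIZES[0]:
                continue  # keep one copy of each perfect-match combination
            delta = None  # assumption: a perfect match has no delta to size
        cases.append(
            {"match": match, "delta": delta, "lines": lines, "exception": exception}
        )
    return cases


all_cases = enumerate_cases()
# A fixed seed keeps the sampled case set reproducible across runs.
starter_set = random.Random(0).sample(all_cases, 25)
print(len(all_cases), len(starter_set))
```

Each sampled case would then be executed once in the ERP to record the matching outputs as ground truth.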

MRP Scaling

Parameterized synthetic (system-computed ground truth)

BOM depth: 1 → 5 levels
Item count: 5 → 200+
Demand pattern: single / multi-batch / seasonal
Inventory state: ample / partial / full shortage
Ground truth: set up ERP state → run MRP scheduler → planned orders
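To make the knobs concrete, the sketch below explodes demand through a multi-level BOM and nets it against on-hand stock. It is a deliberately simplified stand-in, not the ERP scheduler: it ignores lead times, lot sizing, and time buckets, and real ground truth should still come from the scheduler run. The item names and quantities are invented for illustration.

```python
def plan_orders(bom, on_hand, demand):
    """Return item -> planned quantity after netting stock level by level.

    bom:     parent item -> {component: quantity per parent unit}
    on_hand: item -> available stock
    demand:  item -> required quantity of finished goods
    """
    stock = dict(on_hand)  # copy so the caller's inventory is not mutated
    planned = {}

    def require(item, qty):
        used = min(stock.get(item, 0), qty)
        stock[item] = stock.get(item, 0) - used
        shortfall = qty - used
        if shortfall <= 0:
            return  # stock covers this requirement; nothing to plan
        planned[item] = planned.get(item, 0) + shortfall
        # Only the shortfall drives component requirements one level down.
        for component, per_unit in bom.get(item, {}).items():
            require(component, shortfall * per_unit)

    for item, qty in demand.items():
        require(item, qty)
    return planned


# Two-level BOM with partial inventory (the "partial" inventory state).
bom = {"bike": {"frame": 1, "wheel": 2}, "wheel": {"rim": 1, "spoke": 32}}
on_hand = {"bike": 2, "wheel": 4, "spoke": 100}
planned = plan_orders(bom, on_hand, {"bike": 10})
print(planned)
```

Varying BOM depth, item count, and the on-hand dictionary yields new instances without any hand labeling.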

Shipping Docs Scaling

Template + synthetic with standards

Goods type: general / DG / cold chain / oversized
Mode: ocean FCL/LCL / air / multimodal
Incoterms: FOB / CIF / EXW / DDP
Lane: CN→US / CN→EU / US→EU / SEA→US
Standards: HS code (WCO), Incoterms 2020, IMDG/IMO DG reqs
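"Template + synthetic" can be as simple as a fixed document template filled from a parameter record, with one knob varied at a time. The sketch below uses a made-up packing-list layout; the template text, field names, and the placeholder HS code `000000` are assumptions for illustration, not a real document standard.

```python
from string import Template

# Hypothetical, heavily simplified packing-list template.
PACKING_LIST = Template(
    "PACKING LIST\n"
    "Lane: $lane\n"
    "Mode: $mode\n"
    "Incoterm: $incoterm\n"
    "Goods: $goods_type, HS $hs_code\n"
    "Pieces: $pieces, gross weight: $gross_weight_kg kg\n"
)


def render_case(case):
    """Fill the template; `substitute` raises KeyError if a field is missing."""
    return PACKING_LIST.substitute(case)


base_case = {
    "lane": "CN→US",
    "mode": "ocean FCL",
    "incoterm": "FOB",
    "goods_type": "general",
    "hs_code": "000000",  # placeholder, not a real classification
    "pieces": 48,
    "gross_weight_kg": 1240,
}

# Controlled variation: change only the Incoterm, keep everything else fixed.
docs = [render_case(dict(base_case, incoterm=term)) for term in ("FOB", "CIF", "EXW", "DDP")]
print(docs[1])
```

Because a missing field raises an error at render time, incomplete cases are caught before they enter the dataset.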

Key Scaling Principle

All three workflows can produce ground truth by system computation and rules, not manual labeling:

P2P

ERP matching outputs and record states

MRP

Scheduler output

Docs

Consistency math + compliance rules + system lookup

Plan on 20–30 representative cases per workflow to start; representativeness matters more than sheer volume in v1.

Part 5

Review Agent Design

Three-layer validation architecture for automated evaluation.

Validation Pipeline

Layer 1

Data Completeness

Required fields present; ERP records created successfully.

Layer 2

Rule Correctness

Matching rules, MRP outputs, cross-document consistency checks.

Layer 3

Domain Compliance

Approvals, HS code validity, dangerous goods completeness.
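The three layers can be sketched as an ordered list of checks, each returning a list of error strings. Stopping at the first failing layer is a design assumption of this example, as are the field names and the specific checks, which stand in for the real rules.

```python
def check_completeness(sub):
    """Layer 1: required fields are present and non-empty."""
    required = ("hs_code", "lines", "declared_value")
    return [f"missing field: {name}" for name in required if not sub.get(name)]


def check_rules(sub):
    """Layer 2: declared value equals the sum of the line totals."""
    computed = sum(line["qty"] * line["unit_price"] for line in sub["lines"])
    if abs(computed - sub["declared_value"]) > 0.005:
        return [f"declared_value {sub['declared_value']} != line total {computed}"]
    return []


def check_compliance(sub):
    """Layer 3: dangerous goods must carry a UN number."""
    if sub.get("dangerous_goods") and not sub.get("un_number"):
        return ["dangerous goods declared without UN number"]
    return []


LAYERS = [
    ("data completeness", check_completeness),
    ("rule correctness", check_rules),
    ("domain compliance", check_compliance),
]


def validate(submission, layers=LAYERS):
    """Run layers in order and report the first one that fails."""
    for name, check in layers:
        errors = check(submission)
        if errors:
            return {"passed": False, "failed_layer": name, "errors": errors}
    return {"passed": True, "failed_layer": None, "errors": []}


submission = {
    "hs_code": "000000",  # placeholder value
    "lines": [{"qty": 2, "unit_price": 10.0}, {"qty": 1, "unit_price": 5.5}],
    "declared_value": 25.5,
    "dangerous_goods": True,
}
report = validate(submission)
print(report)
```

Ordering the layers this way means later checks can assume earlier ones passed, so the rule layer never has to guard against missing fields.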

Review Agent Operations by Workflow

P2P Review Agent

  • Query ERP to confirm PO/Receipt/Invoice records exist
  • Read matching status and compute line-level deltas
  • Validate tolerance application and approval routing
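The line-level delta step above can be sketched as a plain three-way comparison between the PO, the receipt, and the invoice. The tolerance defaults, field names, and the rule that any out-of-tolerance line needs approval are assumptions for illustration; a real check would read the tolerances configured in the ERP.

```python
from dataclasses import dataclass


@dataclass
class MatchLine:
    sku: str
    po_qty: float
    po_price: float
    received_qty: float
    billed_qty: float
    billed_price: float


def match_line(line, qty_tol=0.0, price_tol_pct=0.02):
    """Return the list of out-of-tolerance dimensions for one invoice line."""
    issues = []
    # Quantity: the invoice should not bill more or less than was received.
    if abs(line.billed_qty - line.received_qty) > qty_tol:
        issues.append("qty")
    # Price: the billed unit price is compared with the PO unit price.
    if abs(line.billed_price - line.po_price) / line.po_price > price_tol_pct:
        issues.append("price")
    return issues


def needs_approval(lines):
    """Assumed routing rule: any failing line sends the bill to approval."""
    return any(match_line(line) for line in lines)


lines = [
    MatchLine("A", po_qty=10, po_price=5.0, received_qty=10, billed_qty=10, billed_price=5.05),
    MatchLine("B", po_qty=10, po_price=5.0, received_qty=8, billed_qty=10, billed_price=5.0),
]
results = {line.sku: match_line(line) for line in lines}
print(results, needs_approval(lines))
```

Line A is 1% over the PO price and passes under the assumed 2% tolerance, while line B bills two units more than were received.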

MRP Review Agent

  • Run MRP scheduler independently for reference output
  • Compare planned orders item-by-item (qty and date)
  • Verify released MO/PO records exist in system
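The item-by-item comparison can be sketched as a diff between the reference scheduler output and the agent's planned orders. The example assumes one planned order per item, represented as a `(qty, due_date)` tuple; the items and dates are invented.

```python
from datetime import date


def diff_planned_orders(reference, candidate, qty_tol=0.0):
    """Compare item -> (qty, due_date) mappings and list every mismatch."""
    mismatches = []
    for item in sorted(set(reference) | set(candidate)):
        ref, got = reference.get(item), candidate.get(item)
        if got is None:
            mismatches.append((item, "missing"))
        elif ref is None:
            mismatches.append((item, "unexpected"))
        else:
            if abs(ref[0] - got[0]) > qty_tol:
                mismatches.append((item, "qty"))
            if ref[1] != got[1]:
                mismatches.append((item, "date"))
    return mismatches


reference = {
    "wheel": (12, date(2025, 3, 3)),
    "rim": (12, date(2025, 2, 24)),
}
candidate = {
    "wheel": (12, date(2025, 3, 4)),    # right quantity, wrong date
    "spoke": (284, date(2025, 2, 24)),  # not in the reference plan
}
mismatches = diff_planned_orders(reference, candidate)
print(mismatches)
```

An empty list means the agent's plan matches the reference on every item.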

Shipping Docs Review Agent

  • Cross-check consistency (weight, pieces, value)
  • Validate HS codes against reference DB
  • Validate DG fields completeness when applicable
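The consistency check in the first bullet amounts to comparing shared fields across documents. A minimal sketch, assuming each document has already been parsed into a dictionary; the document names, field names, and values are hypothetical.

```python
import math


def cross_check(docs, fields=("gross_weight_kg", "pieces", "declared_value")):
    """Return field -> {document: value} for every field that disagrees."""
    mismatches = {}
    for field_name in fields:
        # Only compare documents that actually carry this field.
        values = {name: doc[field_name] for name, doc in docs.items() if field_name in doc}
        reference = next(iter(values.values()), None)
        if any(
            not math.isclose(value, reference, rel_tol=1e-9, abs_tol=1e-6)
            for value in values.values()
        ):
            mismatches[field_name] = values
    return mismatches


docs = {
    "commercial_invoice": {"gross_weight_kg": 1240.0, "pieces": 48, "declared_value": 18650.0},
    "packing_list": {"gross_weight_kg": 1240.0, "pieces": 50},
    "bill_of_lading": {"gross_weight_kg": 1240.0, "pieces": 48},
}
mismatches = cross_check(docs)
print(mismatches)
```

Here the packing list reports 50 pieces against 48 on the other two documents, so only `pieces` is flagged.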
Part 6

Summary

Why These Three Workflows

To evaluate whether an AI agent can perform real supply chain work, we test the three daily "core loops":

Workflow | P2P | MRP | Shipping Docs
Primary Role | Buyer | Planner | Logistics Coordinator
ERP Modules | Purchase + Accounting | Manufacturing + MRP | Inventory + Shipping
Scoring Determinism | High (rules) | High (deterministic compute) | High (consistency + compliance)
Must Use Software | ✓ ERP full loop | ✓ Scheduler + release | ✓ Master data + doc gen
Scaling Potential | ★★★★ | ★★★★★ | ★★★★

Immediate Next Steps

  • Stand up an Odoo Docker environment with demo data and validate the three workflows end-to-end.
  • Generate 20–30 representative cases per workflow and record ground truth.
  • Implement the Review Agent checks and integrate scoring.

Professional Tools in Supply Chain

Odoo · SAP S/4HANA · Oracle ERP · WMS / TMS · Excel

Contribute to Supply Chain

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.