Media Production
Software-heavy, standards-driven workflows producing verifiable artifacts—ideal for AI agent evaluation through encoded deliverables, metadata reports, and quality measurements.
Post-production work is increasingly defined by strict delivery specifications (codec/container/color/loudness) and tool-mediated pipelines, which makes many tasks objectively scorable and automation-friendly.
Industry Landscape
The global post-production market spans video editing, VFX, audio post, color grading, and animation—driven by streaming competition and strict technical requirements.
Service Type Breakdown
Value Chain Stack
Content Platforms
Streaming competition drives post-production load
Netflix, Disney+, Amazon Prime, Apple TV+, HBO/Max
Production Houses
Direct customers (~46% of post demand)
Major studios, Independent film, Advertising, Game cinematics
Post-Production Services
VFX studios, full-service post, audio specialists
ILM, Wētā FX, DNEG, Framestore, MPC, Technicolor
Tools & Infrastructure
What an agent must operate
FFmpeg, Blender, Nuke, DaVinci Resolve, After Effects
Key Players
Content Platforms (Demand Drivers)
| Platform | Annual Spend | Post Demand |
|---|---|---|
| Netflix | ~$17B | 40+ languages, strict 4K HDR |
| Disney+ / Marvel | ~$33B | VFX-heavy franchises, virtual production |
| Amazon Prime | ~$19B | Large series, $100M+ VFX budgets |
| Apple TV+ | ~$9B | Dolby Vision/Atmos requirements |
| HBO / Max | ~$23B | HBO quality bar, top-tier standards |
VFX Studios (Highest Tech Density)
| Studio | HQ | Strengths |
|---|---|---|
| ILM | San Francisco | VFX pioneer; StageCraft virtual production |
| Wētā FX | Wellington | Digital characters; large-scale simulation |
| DNEG | London | 19 sites, 9,000+ staff; virtual production |
| Framestore | London | Character animation; advertising VFX |
| MPC | London | Creature/crowd simulation |
| Digital Domain | Los Angeles | Digital humans; facial capture |
Tools & Infrastructure
This layer is most directly relevant for AgentHLE: benchmark tasks must require real software operation.
Commercial Tool Ecosystem
| Tool | Vendor | Function | Position |
|---|---|---|---|
| Nuke | Foundry | VFX compositing | De facto high-end standard |
| DaVinci Resolve | Blackmagic | Color/edit/audio/VFX | Color standard; strong value |
| Premiere Pro | Adobe | Editing | Most widely used NLE |
| After Effects | Adobe | Motion graphics | Motion graphics standard |
| Pro Tools | Avid | Audio post | Audio post standard |
| Houdini | SideFX | Procedural VFX/sim | High-end simulation |
| Maya | Autodesk | 3D modeling/animation | Film animation standard |
Free/Open Alternatives (Benchmark-Friendly)
| Tool | Equivalent | API | Headless | Agent Suitability |
|---|---|---|---|---|
| FFmpeg | Codec core | Native CLI | ✅ | ⭐⭐⭐⭐⭐ |
| Blender | Maya + C4D + (some AE) | Full bpy Python | ✅ -b | ⭐⭐⭐⭐⭐ |
| Natron | Nuke | Python + PyPlugs | ✅ | ⭐⭐⭐⭐ |
| Audacity | Basic Pro Tools | mod-script-pipe | ⚠️ GUI | ⭐⭐⭐ |
| GIMP | Photoshop | Script-Fu / Python | ✅ -i | ⭐⭐⭐⭐ |
| ImageMagick | Batch Photoshop | Native CLI | ✅ | ⭐⭐⭐⭐⭐ |
Key observation: FFmpeg and Blender are "super tools" for benchmarks: FFmpeg covers most audio/video transforms with strong determinism, while Blender covers 3D, compositing, and rendering with a full Python API (bpy) and a headless (-b) mode.
AI Adoption (2024–2025)
In Production Today
- Automatic transcription/subtitling (Whisper-class, WER <5%)
- Shot detection and asset management (auto tagging, face recognition)
- Denoise/restore (audio RNNoise; video stabilization)
- Assisted color matching (Resolve "Color Match" style)
- AI-assisted roto/matting (~85% labor reduction)
Broad Adoption
- AI-assisted editing (rough-cut automation)
- AI voice synthesis/dubbing for localization
- Virtual production + real-time rendering (LED/ICVFX)
- AI-driven VFX (de-aging, beauty, background gen)
Frontier
- NeRF / 3D Gaussian splatting (2D→3D reconstruction)
- Video generation models (studio adoption cautious)
- Automated QC pipelines (AI-flagged issues)
Industry Standards
Post-production is unusually standardized—ideal for deterministic evaluation.
Color Management
- ACES — Academy Color Encoding System (film pipeline)
- Rec. 709 / sRGB — SDR broadcast/web
- Rec. 2020 + PQ — HDR10 / HLG delivery
- DCI-P3 — Digital cinema projection
Audio Loudness
- EBU R128 — -23 LUFS (EU broadcast)
- ATSC A/85 — -24 LKFS (US broadcast)
- Streaming — ~-14 LUFS (varies by platform)
- Dolby Atmos — Object-based immersive audio
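The loudness targets above map directly onto FFmpeg's `loudnorm` (normalization) and `ebur128` (measurement) filters. The sketch below normalizes to the ~-14 LUFS streaming target; it assumes `ffmpeg` is on PATH, and the filenames (`input.wav`, `normalized.wav`) are hypothetical. A synthesized sine tone stands in for real program audio, and a single-pass `loudnorm` is shown; broadcast delivery would use the two-pass measured workflow.

```shell
# Stand-in program audio: 3 s sine tone (a real task starts from the mix)
ffmpeg -y -v error -f lavfi -i "sine=frequency=440:duration=3" input.wav

# Normalize to a -14 LUFS streaming target (EBU R128 loudnorm filter)
ffmpeg -y -v error -i input.wav \
  -af "loudnorm=I=-14:TP=-1.0:LRA=11" normalized.wav

# Verify: print the final integrated-loudness reading
ffmpeg -i normalized.wav -af "ebur128" -f null - 2>&1 | grep "I:" | tail -n 1
```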
Delivery Specs
- Netflix — IMF packaging; Dolby Vision; loudness
- YouTube — H.264/H.265; ~-14 LUFS
- DCP (Cinema) — JPEG2000; XYZ color; 24fps
Why Media Fits AgentHLE
Media post-production offers unique advantages for AI agent benchmarking—and some challenges to navigate.
Prioritize technical, objectively scorable workflows (transcode/delivery, color pipeline correctness, compositing with reference outputs) and avoid purely creative judgments. Use an open toolchain (FFmpeg + Blender + Natron) to reduce licensing constraints.
Three Core Workflows
These workflows are defined in AgentHLE style: end-to-end execution on a computer using real tools, producing artifacts that a reviewer can verify.
Workflow Overview
| # | Workflow | What it Represents | Data Scale |
|---|---|---|---|
| 1 | Transcoding & Multi-Platform Delivery | Delivery — Post-Production Engineer | ★★★★★ |
| 2 | Color Grading & Color Pipeline Correctness | Color — Colorist / DI Artist | ★★★★ |
| 3 | VFX Multi-Layer Compositing | VFX — Compositor | ★★★★ |
Transcoding & Multi-Platform Delivery
Given a high-quality master (ProRes/DNxHR), produce deliverables that match strict platform delivery specs, and output machine-verifiable evidence. This is typically the last step before publication and is highly deterministic.
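A minimal sketch of this workflow, assuming `ffmpeg`/`ffprobe` on PATH. A synthesized test source stands in for the ProRes/DNxHR master so the sketch is self-contained, and the H.264/AAC "web spec" and filenames are hypothetical stand-ins for a real platform delivery spec.

```shell
# Stand-in master: 2 s of 1080p25 picture plus a tone track
ffmpeg -y -v error -f lavfi -i "testsrc2=size=1920x1080:rate=25:duration=2" \
       -f lavfi -i "sine=frequency=440:duration=2" \
       -c:v libx264 -pix_fmt yuv420p -c:a aac master.mp4

# Deliverable per a hypothetical web spec: H.264 High profile, 1080p, AAC
ffmpeg -y -v error -i master.mp4 \
       -c:v libx264 -profile:v high -pix_fmt yuv420p -b:v 8M \
       -c:a aac -b:a 192k deliverable.mp4

# Machine-verifiable evidence: stream metadata as JSON
ffprobe -v error -print_format json -show_streams deliverable.mp4 > evidence.json
```

The `evidence.json` report is what a reviewer (or the L1 checker) scores against the spec.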
Color Grading & Color Pipeline Correctness
Convert LOG/RAW footage to target display standards (Rec.709 SDR or PQ HDR) using a correct color pipeline (often ACES), apply primary correction, and optionally match shots. Key distinction from a text-only LLM: executing the correct IDT → ACES → ODT chain requires operating real tools.
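One verifiable slice of this workflow is getting the output's color metadata right. The sketch below tags a Rec.709 deliverable and checks the tags with `ffprobe`, assuming `ffmpeg`/`ffprobe` on PATH; a synthetic source stands in for graded footage, and the grade itself (e.g. `-vf "lut3d=file=show_lut.cube"` with a hypothetical LUT, or an OCIO/ACES transform in Resolve or Natron) is elided.

```shell
# Stand-in graded output, tagged as Rec.709 (the actual grade is elided;
# a real task would apply a LUT or ACES ODT before this encode)
ffmpeg -y -v error -f lavfi -i "testsrc2=size=1280x720:rate=24:duration=1" \
       -c:v libx264 -pix_fmt yuv420p \
       -colorspace bt709 -color_primaries bt709 -color_trc bt709 \
       graded_rec709.mp4

# L1-style check: Rec.709 color metadata must be present in the stream
ffprobe -v error -select_streams v:0 \
        -show_entries stream=color_space,color_primaries,color_transfer \
        -of default=nw=1 graded_rec709.mp4
```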
VFX Multi-Layer Compositing
Combine multiple layers (background plate, CG elements, mattes/roto, FX passes) into a final shot using node-based compositing. Tests multi-file management, correct alpha/premultiplication, blend modes, and verification via reference outputs.
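The core operation can be sketched with FFmpeg's `overlay` filter, assuming `ffmpeg` on PATH; synthetic sources stand in for the background plate and a semi-transparent CG element, and `comp.mp4` is a hypothetical output name. FFmpeg's real `premultiply`/`unpremultiply` filters would handle premultiplied-alpha sources; node-based tools like Natron do the same at scale.

```shell
# Composite a semi-transparent element over a background plate
ffmpeg -y -v error \
  -f lavfi -i "testsrc2=size=640x360:rate=24:duration=1" \
  -f lavfi -i "color=red@0.5:size=320x180:rate=24:duration=1,format=rgba" \
  -filter_complex "[0:v][1:v]overlay=x=40:y=40" \
  -c:v libx264 -pix_fmt yuv420p comp.mp4
```

Scoring then compares `comp.mp4` against a reference render (SSIM/VMAF, per the reviewer architecture below).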
Review Agent Architecture
A three-layer validation system enables automated, reproducible evaluation of agent outputs.
Three-Layer Reviewer Architecture
Fully Automated (No LLM)
- Tools: FFprobe, MediaInfo, JSON schema validators
- Output: Boolean pass/fail (any violation fails)
- Checks: Codec, resolution, fps, bitrate, duration, audio, container
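The L1 layer can be sketched as a small shell check, assuming `ffmpeg`/`ffprobe` on PATH; the spec values (H.264, 1280x720, 25 fps) and `sample.mp4` are hypothetical, and a generated clip stands in for the agent's deliverable.

```shell
# Generate a sample deliverable (stands in for the agent's output)
ffmpeg -y -v error -f lavfi -i "testsrc2=size=1280x720:rate=25:duration=1" \
       -c:v libx264 -pix_fmt yuv420p sample.mp4

# Probe the properties named in the (hypothetical) spec
codec=$(ffprobe -v error -select_streams v:0 -show_entries stream=codec_name   -of csv=p=0 sample.mp4)
width=$(ffprobe -v error -select_streams v:0 -show_entries stream=width        -of csv=p=0 sample.mp4)
fps=$(ffprobe   -v error -select_streams v:0 -show_entries stream=r_frame_rate -of csv=p=0 sample.mp4)

# Boolean pass/fail: any violation fails (no partial credit at L1)
if [ "$codec" = "h264" ] && [ "$width" = "1280" ] && [ "$fps" = "25/1" ]; then
  echo PASS
else
  echo FAIL
fi
```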
Quantitative Scoring
- Tools: FFmpeg libvmaf, SSIM/PSNR filters, ebur128
- Output: Numeric score (0–100) with pass/excellent thresholds
- Metrics: Video quality, audio compliance, color consistency
LLM-Assisted (Supplement)
- Tools: Keyframe extraction, visual comparisons
- Output: Confidence score + explanation
- Uses: Edge quality, grading plausibility, anomaly detection
Workflow-Specific Weights
| Dimension | Delivery | Color | Compositing |
|---|---|---|---|
| L1 Focus | Spec compliance dominates | Color metadata correct | Format compliance |
| L2 Focus | VMAF/SSIM | Color metrics dominate | SSIM/VMAF dominate |
| L3 Focus | Edge cases | Scope/waveform evidence | Edge realism/evidence |
| Core L1 Checks | codec/resolution/fps/bitrate/duration/audio/container | color metadata/bit depth/legal range | res/fps/frame count/alpha |
| Core L2 Metrics | VMAF>85; loudness ±0.5 LU | histogram/ΔE; skin-line ±5° | SSIM>0.90; VMAF>85; edge metrics |
| Core L3 Evidence | first/last frame screenshots | waveform + vectorscope screenshots | node graph + intermediate layer outputs |
Example Verification Commands
```shell
# Inspect container/stream metadata as JSON
ffprobe -v quiet -print_format json -show_format -show_streams output.mp4
# VMAF against the reference master
ffmpeg -i reference.mov -i output.mp4 \
  -lavfi "libvmaf=log_path=vmaf.json:log_fmt=json" -f null -
# EBU R128 loudness measurement
ffmpeg -i output.mp4 -af "ebur128=peak=true" -f null - 2>&1 | grep -E "I:|LRA:|Peak:"
# Signal anomalies: temporal outliers, repeated frames, broadcast-range violations
ffmpeg -i output.mp4 -vf "signalstats=stat=tout+vrep+brng,metadata=mode=print" -f null -
# Scope evidence frames for L3 review
ffmpeg -i output.mp4 -vf "waveform=mode=column:mirror=1:components=7:display=overlay" -frames:v 1 waveform.png
ffmpeg -i output.mp4 -vf "vectorscope=mode=color4" -frames:v 1 vectorscope.png
```
Contribute to Media Production
We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map roles, workflows, and tools in media post-production. Share your perspective on the industry structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.