🎬
Under Review

Media Production

Software-heavy, standard-driven workflows producing verifiable artifacts—ideal for AI agent evaluation through encoded deliverables, metadata reports, and quality measurements.

Post-production work is increasingly defined by strict delivery specifications (codec/container/color/loudness) and tool-mediated pipelines, which makes many tasks objectively scorable and automation-friendly.

Part 1

Industry Landscape

The global post-production market spans video editing, VFX, audio post, color grading, and animation, driven by streaming competition and strict technical requirements.

Post-production market (2024): ~$26B → $74B by 2034 (CAGR ~11%)
VFX market (2023): ~$10.5B → $28B by 2033 (CAGR ~10.7%)
Broader video production (pre → post combined): ~$161.5B
TV & streaming share: 35% (largest application segment)
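
The growth rates quoted above follow the standard compound annual growth rate formula. As a quick sanity check, here is a minimal sketch; the function name `cagr` is illustrative, and the inputs are the rounded figures quoted in this section, so the implied rate can differ slightly from the cited one.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.11 means 11%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Rounded market figures quoted in this section (billions of USD).
post_production = cagr(26.0, 74.0, 2034 - 2024)
vfx = cagr(10.5, 28.0, 2033 - 2023)

print(f"Post-production implied CAGR: {post_production:.1%}")
print(f"VFX implied CAGR: {vfx:.1%}")
```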

Service Type Breakdown

Video Editing: 32–40%
VFX: 25–28%
Audio Post: ~15%
Color Grading / DI: ~12%
Animation: ~8%

Value Chain Stack

Content Platforms

Streaming competition drives post-production load

Netflix, Disney+, Amazon Prime, Apple TV+, HBO/Max

Production Houses

Direct customers (~46% of post demand)

Major studios, Independent film, Advertising, Game cinematics

Post-Production Services

VFX studios, full-service post, audio specialists

ILM, Wētā FX, DNEG, Framestore, MPC, Technicolor

Tools & Infrastructure

What an agent must operate

FFmpeg, Blender, Nuke, DaVinci Resolve, After Effects

Key Players

Content Platforms (Demand Drivers)

Platform | Annual Spend | Post Demand
Netflix | ~$17B | 40+ languages, strict 4K HDR
Disney+ / Marvel | ~$33B | VFX-heavy franchises, virtual production
Amazon Prime | ~$19B | Large series, $100M+ VFX budgets
Apple TV+ | ~$9B | Dolby Vision/Atmos requirements
HBO / Max | ~$23B | HBO quality bar, top-tier standards

VFX Studios (Highest Tech Density)

Studio | HQ | Strengths
ILM | San Francisco | VFX pioneer; StageCraft virtual production
Wētā FX | Wellington | Digital characters; large-scale simulation
DNEG | London | 19 sites, 9,000+ staff; virtual production
Framestore | London | Character animation; advertising VFX
MPC | London | Creature/crowd simulation
Digital Domain | Los Angeles | Digital humans; facial capture

Tools & Infrastructure

This layer is most directly relevant for AgentHLE: benchmark tasks must require real software operation.

Commercial Tool Ecosystem

Tool | Vendor | Function | Position
Nuke | Foundry | VFX compositing | De facto high-end standard
DaVinci Resolve | Blackmagic | Color/edit/audio/VFX | Color standard; strong value
Premiere Pro | Adobe | Editing | Most widely used NLE
After Effects | Adobe | Motion graphics | Motion graphics standard
Pro Tools | Avid | Audio post | Audio post standard
Houdini | SideFX | Procedural VFX/sim | High-end simulation
Maya | Autodesk | 3D modeling/animation | Film animation standard

Free/Open Alternatives (Benchmark-Friendly)

Tool | Equivalent | API | Headless | Agent Suitability
FFmpeg | Codec core | Native CLI | ✅ | ⭐⭐⭐⭐⭐
Blender | Maya + C4D + (some AE) | Full bpy Python | ✅ -b | ⭐⭐⭐⭐⭐
Natron | Nuke | Python + PyPlugs | ✅ | ⭐⭐⭐⭐
Audacity | Basic Pro Tools | mod-script-pipe | ⚠️ GUI | ⭐⭐⭐
GIMP | Photoshop | Script-Fu / Python | ✅ -i | ⭐⭐⭐⭐
ImageMagick | Batch Photoshop | Native CLI | ✅ | ⭐⭐⭐⭐⭐

Key observation: FFmpeg and Blender are "super tools" for benchmarks—FFmpeg covers most audio/video transforms with strong determinism, and Blender covers 3D + compositing + rendering with full automation API and headless mode.
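
To make the headless claim concrete, here is a minimal sketch of how a harness might assemble a background render command for Blender. The helper name, file paths, and frame number are hypothetical. The flags used (`-b` background, `-P` run a Python script, `-o` output path, `-f` render a frame) are standard Blender command-line options, and their order matters because Blender processes arguments left to right.

```python
import shlex
from typing import List, Optional


def blender_render_cmd(blend_file: str, output_prefix: str, frame: int,
                       script: Optional[str] = None) -> List[str]:
    """Build an argument list for a headless single-frame Blender render."""
    cmd = ["blender", "-b", blend_file]   # -b: run without the GUI
    if script is not None:
        cmd += ["-P", script]             # -P: run a Python (bpy) script first
    cmd += ["-o", output_prefix]          # -o must precede -f to take effect
    cmd += ["-f", str(frame)]             # -f: render this frame
    return cmd


# Hypothetical paths; the command is printed rather than executed.
cmd = blender_render_cmd("shot010.blend", "//renders/shot010_", 1, "setup_comp.py")
print(shlex.join(cmd))
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting issues.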

AI Adoption (2024–2025)

In Production Today

  • Automatic transcription/subtitling (Whisper-class, WER <5%)
  • Shot detection and asset management (auto tagging, face recognition)
  • Denoise/restore (audio RNNoise; video stabilization)
  • Assisted color matching (Resolve "Color Match" style)
  • AI-assisted roto/matting (~85% labor reduction)

Broad Adoption

  • AI-assisted editing (rough-cut automation)
  • AI voice synthesis/dubbing for localization
  • Virtual production + real-time rendering (LED/ICVFX)
  • AI-driven VFX (de-aging, beauty, background gen)

Frontier

  • NeRF / 3D Gaussian splatting (2D→3D reconstruction)
  • Video generation models (studio adoption cautious)
  • Automated QC pipelines (AI-flagged issues)

Industry Standards

Post-production is unusually standardized—ideal for deterministic evaluation.

Color Management

  • ACES — Academy Color Encoding System (film pipeline)
  • Rec. 709 / sRGB — SDR broadcast/web
  • Rec. 2020 + PQ or HLG — HDR delivery (HDR10 uses PQ; HLG is a separate transfer function)
  • DCI-P3 — Digital cinema projection

Audio Loudness

  • EBU R128 — -23 LUFS (EU broadcast)
  • ATSC A/85 — -24 LKFS (US broadcast)
  • Streaming — ~-14 LUFS (varies by platform)
  • Dolby Atmos — Object-based immersive audio
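
Because each loudness target is a single number, a compliance check reduces to comparing a measured integrated loudness against it. Below is a minimal sketch with hypothetical names. It assumes a ±0.5 LU tolerance, the figure this document lists for the delivery workflow; real specs vary.

```python
# Integrated loudness targets listed above (LUFS and LKFS are equivalent units).
LOUDNESS_TARGETS = {
    "ebu_r128": -23.0,
    "atsc_a85": -24.0,
    "streaming": -14.0,   # approximate; varies by platform
}


def loudness_report(measured_lufs: float, standard: str,
                    tolerance_lu: float = 0.5) -> dict:
    """Compare a measured integrated loudness with a named target."""
    target = LOUDNESS_TARGETS[standard]
    offset = measured_lufs - target          # positive means too loud
    return {
        "target_lufs": target,
        "offset_lu": round(offset, 2),
        # Static gain that would approximately bring the program to target.
        "gain_to_apply_db": round(-offset, 2),
        "compliant": abs(offset) <= tolerance_lu,
    }


print(loudness_report(-20.3, "ebu_r128"))   # too loud for EBU R128
print(loudness_report(-23.4, "ebu_r128"))   # within tolerance
```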

Delivery Specs

  • Netflix — IMF packaging; Dolby Vision; loudness
  • YouTube — H.264/H.265; ~-14 LUFS
  • DCP (Cinema) — JPEG2000; XYZ color; 24fps

Part 2

Why Media Fits AgentHLE

Media post-production offers unique advantages for AI agent benchmarking—and some challenges to navigate.

Advantages

Software-Native Workflows
Nearly all work happens on computers
Scriptable Toolchain
FFmpeg/Blender/Natron have mature CLI/API surfaces
Objective Metrics
VMAF/SSIM/LUFS enable automated evaluation
Strong Standards
Delivery specs are explicit: codec/bitrate/color/loudness
Scalable Data
Open movies + synthetic test signals available
High AI Relevance
Industry is actively adopting AI tooling

Challenges

Creative Subjectivity
Editing rhythm / "look" can be subjective
Commercial Tool Barriers
Top tools can be expensive (Nuke/Pro Tools)
Copyright Constraints
Raw studio assets are rarely public
Pipeline Complexity
Multi-tool coordination and format conversions

Strategy

Prioritize technical, objectively scorable workflows (transcode/delivery, color pipeline correctness, compositing with reference outputs) and avoid purely creative judgments. Use an open toolchain (FFmpeg + Blender + Natron) to reduce licensing constraints.

Part 3

Three Core Workflows

These workflows are defined in AgentHLE style: end-to-end execution on a computer using real tools, producing artifacts that a reviewer can verify.

Workflow Overview

# | Workflow | What it Represents | Data Scale
1 | Transcoding & Multi-Platform Delivery (Delivery) | Post-Production Engineer | ★★★★★
2 | Color Grading & Color Pipeline Correctness (Color) | Colorist / DI Artist | ★★★★
3 | VFX Multi-Layer Compositing (VFX) | Compositor | ★★★★

Delivery ★★★

Transcoding & Multi-Platform Delivery

Given a high-quality master (ProRes/DNxHR), produce deliverables that match strict platform delivery specs, and output machine-verifiable evidence. This is typically the last step before publication and is highly deterministic.
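
As an illustration of what such a delivery step might look like, the sketch below turns a small spec into an FFmpeg argument list. The spec values, file names, and the `DeliverySpec` type are hypothetical and not taken from any platform's published requirements. The FFmpeg options themselves (`-c:v`, `-vf scale`, `-r`, `-b:v`, `-pix_fmt`, `-c:a`, `-b:a`, `-movflags`) are standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeliverySpec:
    video_codec: str      # FFmpeg encoder name, e.g. "libx264"
    width: int
    height: int
    fps: str              # rational string such as "24000/1001"
    video_bitrate: str    # e.g. "8M"
    audio_codec: str
    audio_bitrate: str


def transcode_cmd(master: str, output: str, spec: DeliverySpec) -> List[str]:
    """Build an FFmpeg command that encodes a master to the given spec."""
    return [
        "ffmpeg", "-y", "-i", master,
        "-c:v", spec.video_codec,
        "-vf", f"scale={spec.width}:{spec.height}",
        "-r", spec.fps,
        "-b:v", spec.video_bitrate,
        "-pix_fmt", "yuv420p",        # widely compatible 8-bit 4:2:0
        "-c:a", spec.audio_codec,
        "-b:a", spec.audio_bitrate,
        "-movflags", "+faststart",    # put the MP4 index at the front of the file
        output,
    ]


# Hypothetical web deliverable; the command is printed, not executed.
web_hd = DeliverySpec("libx264", 1920, 1080, "24000/1001", "8M", "aac", "192k")
print(" ".join(transcode_cmd("master_prores.mov", "deliverable.mp4", web_hd)))
```

Keeping the spec as data means the same structure can drive both the encode and the later verification step.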

Color ★★★

Color Grading & Color Pipeline Correctness

Convert LOG/RAW footage to target display standards (Rec. 709 SDR or PQ HDR) using a correct color pipeline (often ACES), apply primary correction, and optionally match shots. What distinguishes this from a text-only LLM task is that executing the correct IDT → ACES → ODT chain requires operating real tools.

VFX ★★★

VFX Multi-Layer Compositing

Combine multiple layers (background plate, CG elements, mattes/roto, FX passes) into a final shot using node-based compositing. Tests multi-file management, correct alpha/premultiplication, blend modes, and verification via reference outputs.
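
The premultiplication point is easy to get wrong, so a small worked example may help. With premultiplied alpha, the standard "over" operation is `out = fg + (1 - fg_alpha) * bg` per channel. The sketch below applies it to single RGBA pixels in pure Python. The function names are illustrative, and a real compositor would do this across whole linear-light image buffers.

```python
from typing import Tuple

RGBA = Tuple[float, float, float, float]


def premultiply(pixel: RGBA) -> RGBA:
    """Convert straight (unassociated) alpha to premultiplied alpha."""
    r, g, b, a = pixel
    return (r * a, g * a, b * a, a)


def over(fg: RGBA, bg: RGBA) -> RGBA:
    """Composite premultiplied fg over premultiplied bg."""
    k = 1.0 - fg[3]
    return tuple(f + k * b for f, b in zip(fg, bg))


# A 50%-opaque pure red element over an opaque blue background.
fg = premultiply((1.0, 0.0, 0.0, 0.5))
bg = premultiply((0.0, 0.0, 1.0, 1.0))
print(over(fg, bg))   # (0.5, 0.0, 0.5, 1.0)
```

Applying the formula to straight-alpha pixels, or premultiplying twice, is a common source of bright or dark fringes on matte edges.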

Part 4

Review Agent Architecture

A three-layer validation system enables automated, reproducible evaluation of agent outputs.

Three-Layer Reviewer Architecture

Review Agent (orchestrator)
├─ L1: Deterministic rules (pass/fail)
│    FFprobe/MediaInfo, schema validation, exact spec checks
├─ L2: Quantitative metrics (scored)
│    VMAF/SSIM/PSNR, LUFS/true-peak, histogram/ΔE, etc.
└─ L3: Evidence validation (LLM-assisted, optional)
     keyframe comparisons, node-graph screenshots, anomaly detection

L1 — Deterministic Rules

Fully Automated (No LLM)

  • Tools: FFprobe, MediaInfo, JSON schema validators
  • Output: Boolean pass/fail (any violation fails)
  • Checks: Codec, resolution, fps, bitrate, duration, audio, container
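
Here is a minimal sketch of such a rule check, operating on the JSON that `ffprobe -print_format json -show_format -show_streams` emits. The field names (`streams`, `codec_type`, `codec_name`, `width`, `height`, `r_frame_rate`, `format_name`) are ffprobe's. The expected values, the sample report, and the function name are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List


def l1_violations(probe: Dict, expected: Dict,
                  container: str = "mp4") -> List[str]:
    """Return a list of spec violations; an empty list means pass."""
    video = next((s for s in probe.get("streams", [])
                  if s.get("codec_type") == "video"), None)
    if video is None:
        return ["no video stream found"]

    actual = {
        "codec": video.get("codec_name"),
        "width": video.get("width"),
        "height": video.get("height"),
        # ffprobe reports frame rate as a rational string, e.g. "24000/1001".
        "fps": Fraction(video.get("r_frame_rate", "0/1")),
    }
    violations = [f"{key}: expected {want}, got {actual[key]}"
                  for key, want in expected.items() if actual[key] != want]

    # format_name is a comma-separated list of aliases for the container.
    names = probe.get("format", {}).get("format_name", "").split(",")
    if container not in names:
        violations.append(f"container: expected {container}, got {names}")
    return violations


# Hand-written sample resembling ffprobe output (not from a real file).
sample = {
    "streams": [{"codec_type": "video", "codec_name": "h264",
                 "width": 1920, "height": 1080, "r_frame_rate": "25/1"}],
    "format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2"},
}
spec = {"codec": "h264", "width": 1920, "height": 1080,
        "fps": Fraction(24000, 1001)}
print(l1_violations(sample, spec))
```

Comparing frame rates as exact fractions avoids false failures from rounding 23.976 and similar rates to floats.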

L2 — Metrics Calculator

Quantitative Scoring

  • Tools: FFmpeg libvmaf, SSIM/PSNR filters, ebur128
  • Output: Numeric score (0–100) with pass/excellent thresholds
  • Metrics: Video quality, audio compliance, color consistency
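
One way to turn raw metrics into the pass/excellent tiers described above is a simple threshold table. In this sketch the pass thresholds for VMAF (85) and SSIM (0.90) are the ones listed in this document's workflow comparison. The "excellent" thresholds and all names are placeholders chosen for illustration.

```python
# metric -> (pass threshold, excellent threshold); higher is better for both.
THRESHOLDS = {
    "vmaf": (85.0, 95.0),   # 95 is a placeholder
    "ssim": (0.90, 0.98),   # 0.98 is a placeholder
}


def grade(metric: str, value: float) -> str:
    """Classify a measured metric as 'fail', 'pass', or 'excellent'."""
    pass_at, excellent_at = THRESHOLDS[metric]
    if value >= excellent_at:
        return "excellent"
    if value > pass_at:       # the document states the pass bar as "> threshold"
        return "pass"
    return "fail"


measured = {"vmaf": 91.2, "ssim": 0.887}   # made-up measurements
for name, value in measured.items():
    print(f"{name}={value}: {grade(name, value)}")
```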

L3 — Evidence Verifier

LLM-Assisted (Supplement)

  • Tools: Keyframe extraction, visual comparisons
  • Output: Confidence score + explanation
  • Uses: Edge quality, grading plausibility, anomaly detection

Workflow-Specific Weights

Transcoding & Delivery: L1 60% | L2 30% | L3 10%
Color Grading: L1 30% | L2 50% | L3 20%
VFX Compositing: L1 20% | L2 60% | L3 20%

Dimension | Delivery | Color | Compositing
L1 Focus | Spec compliance dominates | Color metadata correct | Format compliance
L2 Focus | VMAF/SSIM | Color metrics dominate | SSIM/VMAF dominate
L3 Focus | Edge cases | Scope/waveform evidence | Edge realism/evidence
Core L1 Checks | codec/resolution/fps/bitrate/duration/audio/container | color metadata/bit depth/legal range | res/fps/frame count/alpha
Core L2 Metrics | VMAF>85; loudness ±0.5 LU | histogram/ΔE; skin-line ±5° | SSIM>0.90; VMAF>85; edge metrics
Core L3 Evidence | first/last frame screenshots | waveform + vectorscope screenshots | node graph + intermediate layer outputs
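
The workflow-specific weights combine per-layer scores into a single number. Here is a minimal sketch, assuming each layer reports a score on a 0-100 scale. Mapping the L1 pass/fail result to 100 or 0 is an assumption, not something this document specifies.

```python
# Percent weights per workflow, as listed above: (L1, L2, L3).
WEIGHTS = {
    "delivery":    (60, 30, 10),
    "color":       (30, 50, 20),
    "compositing": (20, 60, 20),
}


def overall_score(workflow: str, l1_passed: bool,
                  l2_score: float, l3_score: float) -> float:
    """Weighted 0-100 score from the three reviewer layers."""
    w1, w2, w3 = WEIGHTS[workflow]
    l1_score = 100.0 if l1_passed else 0.0   # assumed mapping of pass/fail
    return (w1 * l1_score + w2 * l2_score + w3 * l3_score) / 100


print(overall_score("delivery", True, 90.0, 80.0))      # 95.0
print(overall_score("compositing", False, 90.0, 80.0))  # 70.0
```

A stricter harness could instead treat an L1 failure as an overall failure regardless of the other layers.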

Example Verification Commands

Metadata Extraction
ffprobe -v quiet -print_format json -show_format -show_streams output.mp4
VMAF Quality Score (distorted file first, reference second)
ffmpeg -i output.mp4 -i reference.mov \
  -lavfi "libvmaf=log_path=vmaf.json:log_fmt=json" -f null -
EBU R128 Loudness
ffmpeg -i output.mp4 -af "ebur128=peak=true" -f null - 2>&1 | grep -E "I:|LRA:|Peak:"
Legal Range Detection
ffmpeg -i output.mp4 -vf "signalstats=stat=tout+vrep+brng,metadata=mode=print" -f null -
Waveform Export
ffmpeg -i output.mp4 -vf "waveform=mode=column:mirror=1:components=7:display=overlay" -frames:v 1 waveform.png
Vectorscope Export
ffmpeg -i output.mp4 -vf "vectorscope=mode=color4" -frames:v 1 vectorscope.png
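
To use the loudness command in an automated check, its summary has to be parsed. Here is a minimal sketch, assuming the summary layout printed by recent FFmpeg builds. The sample text is hand-written to resemble that layout, which may differ between versions.

```python
import re
from typing import Dict

# Hand-written sample resembling the tail of `ebur128=peak=true` output.
SAMPLE_SUMMARY = """\
[Parsed_ebur128_0 @ 0x0] Summary:

  Integrated loudness:
    I:         -23.4 LUFS
    Threshold: -33.9 LUFS

  Loudness range:
    LRA:         6.1 LU
    Threshold: -43.8 LUFS
    LRA low:   -27.0 LUFS
    LRA high:  -20.9 LUFS

  True peak:
    Peak:       -1.8 dBFS
"""


def parse_ebur128_summary(log_text: str) -> Dict[str, float]:
    """Extract integrated loudness, loudness range, and true peak."""
    # Per-frame log lines also contain "I:" and "LRA:", so only read
    # what follows the final "Summary:" marker.
    summary = log_text.rsplit("Summary:", 1)[-1]
    patterns = {
        "integrated_lufs": r"^\s*I:\s*(-?\d+(?:\.\d+)?) LUFS",
        "lra_lu": r"^\s*LRA:\s*(-?\d+(?:\.\d+)?) LU\b",
        "true_peak_dbfs": r"^\s*Peak:\s*(-?\d+(?:\.\d+)?) dBFS",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, summary, flags=re.MULTILINE)
        if match:
            result[key] = float(match.group(1))
    return result


print(parse_ebur128_summary(SAMPLE_SUMMARY))
```

Parsing only the final summary is more robust than grepping the full log, where every per-frame line would also match.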

Core Tools & Infrastructure

FFmpeg, Blender, Natron, DaVinci Resolve, Nuke, After Effects, Premiere Pro, Pro Tools, Audacity, OpenColorIO, MediaInfo, FFprobe

Contribute to Media Production

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.