🎬

Under Review

Media Production

Software-heavy, standard-driven workflows producing verifiable artifacts—ideal for AI agent evaluation through encoded deliverables, metadata reports, and quality measurements.


Post-production work is increasingly defined by strict delivery specifications (codec/container/color/loudness) and tool-mediated pipelines, which makes many tasks objectively scorable and automation-friendly.

Part 1

Industry Landscape

The global post-production market spans video editing, VFX, audio post, color grading, and animation, driven by streaming competition and strict technical requirements.

Metric	Value	Note
Post-production market (2024)	~$26B	→ $74B by 2034 (CAGR ~11%)
VFX market (2023)	~$10.5B	→ $28B by 2033 (CAGR ~10.7%)
Broader video production	~$161.5B	Pre → Post combined
TV & streaming share	35%	Largest application segment

Service Type Breakdown

Service Type	Share
Video Editing	32–40%
VFX	25–28%
Audio Post	~15%
Color Grading / DI	~12%
Animation	~8%

Value Chain Stack

Layer	Role	Examples
Content Platforms	Streaming competition drives post-production load	Netflix, Disney+, Amazon Prime, Apple TV+, HBO/Max
Production Houses	Direct customers (~46% of post demand)	Major studios, independent film, advertising, game cinematics
Post-Production Services	VFX studios, full-service post, audio specialists	ILM, Wētā FX, DNEG, Framestore, MPC, Technicolor
Tools & Infrastructure	What an agent must operate	FFmpeg, Blender, Nuke, DaVinci Resolve, After Effects

Key Players

Content Platforms (Demand Drivers)

Platform	Annual Content Spend	Post Demand
Netflix	~$17B	40+ languages, strict 4K HDR
Disney+ / Marvel	~$33B	VFX-heavy franchises, virtual production
Amazon Prime	~$19B	Large series, $100M+ VFX budgets
Apple TV+	~$9B	Dolby Vision/Atmos requirements
HBO / Max	~$23B	HBO quality bar, top-tier standards

VFX Studios (Highest Tech Density)

Studio	HQ	Strengths
ILM	San Francisco	VFX pioneer; StageCraft virtual production
Wētā FX	Wellington	Digital characters; large-scale simulation
DNEG	London	19 sites, 9,000+ staff; virtual production
Framestore	London	Character animation; advertising VFX
MPC	London	Creature/crowd simulation
Digital Domain	Los Angeles	Digital humans; facial capture

Tools & Infrastructure

This layer is most directly relevant for Agents' Last Exam: benchmark tasks must require real software operation.

Commercial Tool Ecosystem

Tool	Vendor	Function	Position
Nuke	Foundry	VFX compositing	De facto high-end standard
DaVinci Resolve	Blackmagic	Color/edit/audio/VFX	Color standard; strong value
Premiere Pro	Adobe	Editing	Most widely used NLE
After Effects	Adobe	Motion graphics	Motion graphics standard
Pro Tools	Avid	Audio post	Audio post standard
Houdini	SideFX	Procedural VFX/sim	High-end simulation
Maya	Autodesk	3D modeling/animation	Film animation standard

Free/Open Alternatives (Benchmark-Friendly)

Tool	Equivalent	API	Headless	Agent Suitability
FFmpeg	Codec core	Native CLI	✅	⭐⭐⭐⭐⭐
Blender	Maya + C4D + (some AE)	Full bpy Python	✅ -b	⭐⭐⭐⭐⭐
Natron	Nuke	Python + PyPlugs	✅	⭐⭐⭐⭐
Audacity	Basic Pro Tools	mod-script-pipe	⚠️ GUI	⭐⭐⭐
GIMP	Photoshop	Script-Fu / Python	✅ -i	⭐⭐⭐⭐
ImageMagick	Batch Photoshop	Native CLI	✅	⭐⭐⭐⭐⭐

Key observation: FFmpeg and Blender are "super tools" for benchmarks. FFmpeg covers most audio/video transforms with strong determinism, and Blender covers 3D, compositing, and rendering with a full automation API and a headless mode.

AI Adoption (2024–2025)

In Production Today

• Automatic transcription/subtitling (Whisper-class, WER <5%)
• Shot detection and asset management (auto tagging, face recognition)
• Denoise/restore (audio RNNoise; video stabilization)
• Assisted color matching (Resolve "Color Match" style)
• AI-assisted roto/matting (~85% labor reduction)

Broad Adoption

• AI-assisted editing (rough-cut automation)
• AI voice synthesis/dubbing for localization
• Virtual production + real-time rendering (LED/ICVFX)
• AI-driven VFX (de-aging, beauty, background gen)

Frontier

• NeRF / 3D Gaussian splatting (2D→3D reconstruction)
• Video generation models (studio adoption cautious)
• Automated QC pipelines (AI-flagged issues)

Industry Standards

Post-production is unusually standardized—ideal for deterministic evaluation.

Color Management

ACES: Academy Color Encoding System (film pipeline)
Rec. 709 / sRGB: SDR broadcast/web
Rec. 2020 + PQ or HLG: HDR delivery (HDR10 uses PQ; HLG is a separate transfer function)
DCI-P3: Digital cinema projection

Audio Loudness

EBU R128: -23 LUFS (EU broadcast)
ATSC A/85: -24 LKFS (US broadcast)
Streaming: about -14 LUFS (varies by platform)
Dolby Atmos: Object-based immersive audio

Delivery Specs

Netflix: IMF packaging; Dolby Vision; loudness spec
YouTube: H.264/H.265; about -14 LUFS
DCP (Cinema): JPEG 2000; XYZ color; 24 fps

Part 2

Why Media Fits Agents' Last Exam

Media post-production offers unique advantages for AI agent benchmarking—and some challenges to navigate.

Advantages

Software-Native Workflows

Nearly all work happens on computers

Scriptable Toolchain

FFmpeg/Blender/Natron have mature CLI/API surfaces

Objective Metrics

VMAF/SSIM/LUFS enable automated evaluation

Strong Standards

Delivery specs are explicit: codec/bitrate/color/loudness

Scalable Data

Open movies + synthetic test signals available

High AI Relevance

Industry is actively adopting AI tooling

Challenges

Creative Subjectivity

Editing rhythm / "look" can be subjective

Commercial Tool Barriers

Top tools can be expensive (Nuke/Pro Tools)

Limited Public Data

Raw studio assets are rarely public

Pipeline Complexity

Multi-tool coordination and format conversions

Strategy

Prioritize technical, objectively scorable workflows (transcode/delivery, color pipeline correctness, compositing with reference outputs) and avoid purely creative judgments. Use an open toolchain (FFmpeg + Blender + Natron) to reduce licensing constraints.

Part 3

Three Core Workflows

These workflows are defined in Agents' Last Exam style: end-to-end execution on a computer using real tools, producing artifacts that a reviewer can verify.

Workflow Overview

#	Workflow	What it Represents	Data Scale
1	Transcoding & Multi-Platform Delivery	Delivery — Post-Production Engineer	★★★★★
2	Color Grading & Color Pipeline Correctness	Color — Colorist / DI Artist	★★★★
3	VFX Multi-Layer Compositing	VFX — Compositor	★★★★

Delivery ★★★

Transcoding & Multi-Platform Delivery

Given a high-quality master (ProRes/DNxHR), produce deliverables that match strict platform delivery specs, and output machine-verifiable evidence. This is typically the last step before publication and is highly deterministic.

Color ★★★

Color Grading & Color Pipeline Correctness

Convert LOG/RAW footage to target display standards (Rec. 709 SDR or PQ HDR) using a correct color pipeline (often ACES), apply primary correction, and optionally match shots. Key distinction from a text-only LLM task: the correct IDT → ACES → ODT chain has to be executed in real tools, not just described.
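For the PQ HDR target mentioned above, the transfer curve itself is a closed-form formula, so a reviewer can sanity-check encoded values without any grading software. The following is a minimal sketch of the PQ encoding curve (SMPTE ST 2084 inverse EOTF); it is only the final encoding step, not the IDT → ACES → ODT chain, and the function name `pq_encode` is an illustrative choice.

```python
# PQ (SMPTE ST 2084) constants, written as their defining rational values.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32


def pq_encode(nits: float) -> float:
    """Map absolute luminance in cd/m^2 (0..10000) to a PQ signal in 0..1."""
    y = min(max(nits, 0.0), 10000.0) / 10000.0  # normalize to the 10,000 cd/m^2 peak
    y_m1 = y ** M1
    return ((C1 + C2 * y_m1) / (1.0 + C3 * y_m1)) ** M2
```

In this sketch, 100 cd/m² (a typical SDR reference white) lands at roughly half of the PQ signal range, which is why SDR-looking content occupies only the lower half of a PQ waveform.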

VFX ★★★

VFX Multi-Layer Compositing

Combine multiple layers (background plate, CG elements, mattes/roto, FX passes) into a final shot using node-based compositing. Tests multi-file management, correct alpha/premultiplication, blend modes, and verification via reference outputs.
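Alpha handling is the part of this workflow that most often goes wrong, so a small worked example may help. The sketch below implements the standard Porter-Duff "over" operation on single premultiplied RGBA pixels in plain Python; the helper names `premultiply` and `over` are illustrative, and a real compositor applies the same arithmetic per pixel across whole images.

```python
from typing import Tuple

RGBA = Tuple[float, float, float, float]


def premultiply(pixel: RGBA) -> RGBA:
    """Convert straight (unassociated) alpha to premultiplied alpha."""
    r, g, b, a = pixel
    return (r * a, g * a, b * a, a)


def over(fg: RGBA, bg: RGBA) -> RGBA:
    """Porter-Duff 'over' for premultiplied pixels: out = fg + bg * (1 - fg_alpha)."""
    inv = 1.0 - fg[3]
    return tuple(f + b * inv for f, b in zip(fg, bg))


# Half-transparent red (straight alpha) over opaque blue.
straight_red = (1.0, 0.0, 0.0, 0.5)
opaque_blue = (0.0, 0.0, 1.0, 1.0)

correct = over(premultiply(straight_red), opaque_blue)  # (0.5, 0.0, 0.5, 1.0)
wrong = over(straight_red, opaque_blue)                 # (1.0, 0.0, 0.5, 1.0), red too bright
```

Feeding a straight-alpha layer into a premultiplied "over" produces the over-bright result shown in `wrong`, which is the kind of error a reference-output comparison would catch.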

Part 4

Review Agent Architecture

A three-layer validation system enables automated, reproducible evaluation of agent outputs.

Three-Layer Reviewer Architecture

Review Agent (orchestrator)
├─ L1: Deterministic rules (pass/fail)
│    FFprobe/MediaInfo, schema validation, exact spec checks
├─ L2: Quantitative metrics (scored)
│    VMAF/SSIM/PSNR, LUFS/true-peak, histogram/ΔE, etc.
└─ L3: Evidence validation (LLM-assisted, optional)
     keyframe comparisons, node-graph screenshots, anomaly detection

L1 — Deterministic Rules

Fully Automated (No LLM)

• Tools: FFprobe, MediaInfo, JSON schema validators
• Output: Boolean pass/fail (any violation fails)
• Checks: Codec, resolution, fps, bitrate, duration, audio, container
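An L1 check of this kind could be sketched as a comparison between FFprobe's JSON output and a delivery spec. The example below is a minimal illustration, not the benchmark's actual checker: `check_delivery`, `SPEC`, and `SAMPLE_PROBE` are hypothetical, and only a few of the listed checks (codec, resolution, fps) are covered. The field names (`streams`, `codec_type`, `codec_name`, `width`, `height`, `r_frame_rate`) follow FFprobe's JSON output.

```python
from fractions import Fraction
from typing import Any, Dict, List

# Hypothetical delivery spec; real specs would also cover bitrate, audio, and container.
SPEC = {"codec_name": "h264", "width": 1920, "height": 1080, "fps": Fraction(24000, 1001)}

# Hypothetical, trimmed-down stand-in for `ffprobe -print_format json -show_streams` output.
SAMPLE_PROBE = {
    "streams": [
        {"codec_type": "video", "codec_name": "h264", "width": 1920,
         "height": 1080, "r_frame_rate": "24000/1001"},
        {"codec_type": "audio", "codec_name": "aac"},
    ]
}


def check_delivery(probe: Dict[str, Any], spec: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the check passes."""
    video = next((s for s in probe.get("streams", [])
                  if s.get("codec_type") == "video"), None)
    if video is None:
        return ["no video stream found"]
    actual = {
        "codec_name": video.get("codec_name"),
        "width": video.get("width"),
        "height": video.get("height"),
        # Compare frame rates as exact fractions so 23.976 is not rounded.
        "fps": Fraction(video.get("r_frame_rate", "0/1")),
    }
    return [f"{key}: expected {expected}, got {actual.get(key)}"
            for key, expected in spec.items() if actual.get(key) != expected]
```

Returning the full list of violations rather than a bare boolean keeps the pass/fail rule (any violation fails) while still giving the reviewer something to report.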

L2 — Metrics Calculator

Quantitative Scoring

• Tools: FFmpeg libvmaf, SSIM/PSNR filters, ebur128
• Output: Numeric score (0–100) with pass/excellent thresholds
• Metrics: Video quality, audio compliance, color consistency
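The pass/excellent thresholds can be read as a simple three-way classification of a 0–100 score. The sketch below assumes strict "greater than" comparisons, matching the "VMAF>85" style used elsewhere on this page; the function name `grade` and the excellent threshold of 95 in the usage line are hypothetical.

```python
def grade(score: float, pass_above: float, excellent_above: float) -> str:
    """Classify a 0-100 metric score as 'fail', 'pass', or 'excellent'."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    if pass_above > excellent_above:
        raise ValueError("pass threshold must not exceed excellent threshold")
    if score > excellent_above:
        return "excellent"
    if score > pass_above:
        return "pass"
    return "fail"


# Example: a VMAF of 90 against a pass threshold of 85 and an assumed excellent threshold of 95.
vmaf_grade = grade(90.0, pass_above=85.0, excellent_above=95.0)
```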

L3 — Evidence Verifier

LLM-Assisted (Supplement)

• Tools: Keyframe extraction, visual comparisons
• Output: Confidence score + explanation
• Uses: Edge quality, grading plausibility, anomaly detection

Workflow-Specific Weights

Workflow	L1	L2	L3
Transcoding & Delivery	60%	30%	10%
Color Grading	30%	50%	20%
VFX Compositing	20%	60%	20%
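One way to combine the three layers is a weighted sum. The sketch below reads the three percentages per workflow as the L1, L2, and L3 weights, in that order. It also assumes that an L1 pass counts as 100 and a fail as 0, and that L2 and L3 are already on a 0–100 scale; both are illustrative assumptions, as are the names `WEIGHTS` and `final_score`.

```python
from typing import Dict, Tuple

# Per-workflow (L1, L2, L3) weights in percent, as listed above.
WEIGHTS: Dict[str, Tuple[int, int, int]] = {
    "delivery": (60, 30, 10),
    "color": (30, 50, 20),
    "compositing": (20, 60, 20),
}


def final_score(workflow: str, l1_passed: bool, l2_score: float, l3_score: float) -> float:
    """Weighted 0-100 score across the three review layers."""
    w1, w2, w3 = WEIGHTS[workflow]
    l1_score = 100.0 if l1_passed else 0.0  # assumption: L1 is all-or-nothing
    return (w1 * l1_score + w2 * l2_score + w3 * l3_score) / 100.0


# Example: a delivery task that passes L1, with L2 = 90 and L3 = 80.
example = final_score("delivery", True, 90.0, 80.0)  # 95.0
```

A stricter design would treat an L1 failure as a hard gate that zeroes the whole score; the weighted form shown here keeps partial credit from L2 and L3.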

Dimension	Delivery	Color	Compositing
L1 Focus	Spec compliance dominates	Color metadata correct	Format compliance
L2 Focus	VMAF/SSIM	Color metrics dominate	SSIM/VMAF dominate
L3 Focus	Edge cases	Scope/waveform evidence	Edge realism/evidence
Core L1 Checks	codec/resolution/fps/bitrate/duration/audio/container	color metadata/bit depth/legal range	res/fps/frame count/alpha
Core L2 Metrics	VMAF>85; loudness ±0.5 LU	histogram/ΔE; skin-line ±5°	SSIM>0.90; VMAF>85; edge metrics
Core L3 Evidence	first/last frame screenshots	waveform + vectorscope screenshots	node graph + intermediate layer outputs

Example Verification Commands

Metadata Extraction

ffprobe -v quiet -print_format json -show_format -show_streams output.mp4

VMAF Quality Score

# libvmaf expects the distorted (encoded) file first and the reference second
ffmpeg -i output.mp4 -i reference.mov \
  -lavfi "[0:v][1:v]libvmaf=log_path=vmaf.json:log_fmt=json" -f null -
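The JSON log written by the command above still has to be reduced to a single score. The sketch below assumes the layout written by recent libvmaf versions, with a top-level `pooled_metrics.vmaf.mean` and a per-frame `frames[].metrics.vmaf`; older builds may differ, so treat the key names as an assumption to verify against a real log. `pooled_vmaf` and `SAMPLE_LOG` are illustrative names.

```python
import json
from statistics import mean

# Hypothetical, heavily trimmed stand-in for a vmaf.json log.
SAMPLE_LOG = json.dumps({
    "frames": [
        {"frameNum": 0, "metrics": {"vmaf": 90.0}},
        {"frameNum": 1, "metrics": {"vmaf": 93.0}},
    ],
    "pooled_metrics": {"vmaf": {"min": 90.0, "max": 93.0, "mean": 91.5}},
})


def pooled_vmaf(log_text: str) -> float:
    """Return the mean VMAF from a libvmaf JSON log."""
    data = json.loads(log_text)
    pooled = data.get("pooled_metrics", {}).get("vmaf", {})
    if "mean" in pooled:
        return float(pooled["mean"])
    # Fall back to averaging per-frame scores when no pooled block is present.
    return mean(float(frame["metrics"]["vmaf"]) for frame in data["frames"])
```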

EBU R128 Loudness

ffmpeg -i output.mp4 -af "ebur128=peak=true" -f null - 2>&1 | grep -E "I:|LRA:|Peak:"
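The `grep` above also matches the per-frame progress lines, so a reviewer script would normally pick out the summary value and compare it with the target. The sketch below parses captured FFmpeg stderr text and applies the ±0.5 LU tolerance from the metrics table. The sample text imitates the ebur128 summary layout and is hypothetical, as are the function names.

```python
import re

# Summary lines are indented and start with "I:"; per-frame lines start with "[Parsed_ebur128...".
SUMMARY_I = re.compile(r"^[ \t]*I:\s+(-?\d+(?:\.\d+)?)\s+LUFS", re.MULTILINE)

# Hypothetical excerpt of FFmpeg stderr for the ebur128 filter.
SAMPLE_STDERR = """\
[Parsed_ebur128_0 @ 0x0] t: 0.1 TARGET:-23 LUFS M:-120.7 S:-120.7 I: -70.0 LUFS LRA: 0.0 LU
[Parsed_ebur128_0 @ 0x0] Summary:

  Integrated loudness:
    I:         -23.2 LUFS
    Threshold: -33.4 LUFS

  True peak:
    Peak:       -1.5 dBFS
"""


def integrated_loudness(stderr_text: str) -> float:
    """Extract the integrated loudness (LUFS) from the ebur128 summary block."""
    matches = SUMMARY_I.findall(stderr_text)
    if not matches:
        raise ValueError("no ebur128 summary found")
    return float(matches[-1])


def within_tolerance(measured: float, target: float, tolerance: float = 0.5) -> bool:
    """True when the measured loudness is within +/- tolerance LU of the target."""
    return abs(measured - target) <= tolerance
```

For an EBU R128 deliverable, `within_tolerance(integrated_loudness(text), -23.0)` would be the pass condition under these assumptions.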

Legal Range Detection

ffmpeg -i output.mp4 -vf "signalstats=stat=tout+vrep+brng,metadata=mode=print" -f null -
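The command above prints per-frame metadata, and the value relevant to legal range is `lavfi.signalstats.BRNG`, the fraction of pixels outside broadcast range. A reviewer could reduce that log to a single worst-case number, as in the sketch below. The sample log lines and the 1% tolerance are hypothetical, and the exact log prefix may vary between FFmpeg builds.

```python
import re

BRNG_LINE = re.compile(r"lavfi\.signalstats\.BRNG=(\d+(?:\.\d+)?)")

# Hypothetical excerpt of the metadata=mode=print output for two frames.
SAMPLE_LOG = """\
[Parsed_metadata_1 @ 0x0] frame:0 pts:0 pts_time:0
[Parsed_metadata_1 @ 0x0] lavfi.signalstats.BRNG=0.000000
[Parsed_metadata_1 @ 0x0] frame:1 pts:1001 pts_time:0.0417
[Parsed_metadata_1 @ 0x0] lavfi.signalstats.BRNG=0.012500
"""


def worst_brng(log_text: str) -> float:
    """Return the largest per-frame out-of-range pixel fraction found in the log."""
    values = [float(v) for v in BRNG_LINE.findall(log_text)]
    if not values:
        raise ValueError("no BRNG values found")
    return max(values)


def is_legal(log_text: str, max_fraction: float = 0.01) -> bool:
    """Pass when no frame exceeds the allowed out-of-range fraction (assumed 1%)."""
    return worst_brng(log_text) <= max_fraction
```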

Waveform Export

ffmpeg -i output.mp4 -vf "waveform=mode=column:mirror=1:components=7:display=overlay" -frames:v 1 waveform.png

Vectorscope Export

ffmpeg -i output.mp4 -vf "vectorscope=mode=color4" -frames:v 1 vectorscope.png

Core Tools & Infrastructure

FFmpeg, Blender, Natron, DaVinci Resolve, Nuke, After Effects, Premiere Pro, Pro Tools, Audacity, OpenColorIO, MediaInfo, FFprobe

Contribute to Media Production

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Submit Landscape Understanding

Help us map roles, workflows, and tools in media post-production. Share your perspective on the industry structure.

Submit a Workflow

Describe a specific professional task with tools, inputs, outputs, and how success is verified.

Our Commitments to Contributors

Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
Partner Review: Industry partners can review and approve task specifications before public release.
Data Control: Contributors can exclude sensitive or proprietary data from submissions.