Game Development
Tool-heavy, fully digital workflows with abundant verifiable artifacts—ideal for AI agent execution benchmarks.
Industry Overview
Game development is one of the most promising domains for AI agent evaluation because it is tool-heavy, fully digital, and produces abundant verifiable artifacts.
Industry Segments
AAA: $50M–$300M+ budgets, 3–7 year cycles, team sizes of 100–1000+
Mobile: largest segment by revenue, F2P + IAP, fast iteration, LiveOps heavy
Indie: team sizes of 1–20, innovation-heavy, early AI adopters
Engines and middleware: Unity, Unreal, Godot, FMOD, Wwise, and more
Team Structure & Roles
| Tier | Team Size | Timeline | Budget | Notes |
|---|---|---|---|---|
| Indie | 1–20 | 6 months – 3 years | $0–$2M | People wear many hats |
| AA | 10–50 | 1–3 years | $1M–$20M | Partial specialization |
| AAA | 100–500+ | 3–7 years | $50M–$300M+ | Deep specialization + outsourcing |
| Mega | 500–1000+ | 5–8 years | $200M+ | Multi-studio global coordination |
Core Role Families
Art
Engineering
Design
QA
Audio
Production
Development Lifecycle
Concept
Concept doc, market analysis
Pre-production
Prototypes, vertical slice, GDD, art style bible
Production
Assets at scale, gameplay systems, levels
Alpha / Beta
Feature complete, content complete, optimization
Launch
Platform cert, marketing, day-1 patch
LiveOps
Seasons/DLC, balancing, bugfixes
For games-as-a-service (GaaS) titles, LiveOps can dominate lifetime revenue and workload.
Why Game Development Fits AgentHLE
Game development offers unique advantages for AI agent benchmarking—combining creative and technical workflows with deterministic verification.
Fully Digital Workflows
Most work is done in software: DCC tools, engines, spreadsheets, build systems
Role Diversity
Covers creative (art), logic (engineering), systems (design), verification (QA), and hybrid (technical art, TA) roles
Accessible Toolchain
Many core tools have free/open variants: Blender, Godot, Krita, FMOD indie tier
Standardized Artifacts
Common formats: FBX/OBJ/GLB, textures, shaders, scenes, scripts, logs
Verifiability
Many outputs can be checked automatically: mesh stats, compilation, tests, pass/fail
Public Data
Large corpora: Objaverse, OpenGameArt, shader repos, open-source projects
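As an illustration of the automatic checks mentioned under Verifiability, the sketch below counts vertices and triangles in a Wavefront OBJ file and compares the result to a triangle budget. The function names (`obj_mesh_stats`, `within_budget`) and the budget value are hypothetical, chosen only for this example; a real harness would more likely read these figures from a DCC tool or engine import log.

```python
def obj_mesh_stats(obj_text: str) -> dict:
    """Count vertices, faces, and triangles in Wavefront OBJ text.

    Hypothetical helper for illustration: only `v` and `f` records are read.
    """
    vertices = faces = triangles = 0
    for line in obj_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices += 1
        elif parts[0] == "f":
            faces += 1
            # A polygon with n corners fan-triangulates into n - 2 triangles.
            triangles += max(len(parts) - 1 - 2, 0)
    return {"vertices": vertices, "faces": faces, "triangles": triangles}


def within_budget(stats: dict, max_triangles: int) -> bool:
    """Pass/fail check against an assumed triangle budget."""
    return stats["triangles"] <= max_triangles


# A single quad: four vertices, one face, two triangles after triangulation.
QUAD = """\
v 0 0 0
v 1 0 0
v 1 1 0
v 0 1 0
f 1 2 3 4
"""

stats = obj_mesh_stats(QUAD)
print(stats, within_budget(stats, max_triangles=2))
```

Counting triangles rather than faces keeps the budget comparable across meshes that mix quads and n-gons.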
Technology Ecosystem
Game Engines
DCC Tools (Digital Content Creation)
Workflow Coverage
Ten candidate workflows covering five major role families: Art, Engineering/TA, Design, QA, and Audio. All can be built around free or accessible tooling.
Coverage by Role Family
| Area | Workflows | Count |
|---|---|---|
| Art | Concept Art, 3D Modeling, Animation, VFX | 4 |
| Engineering / TA | Shader Authoring, Gameplay Programming | 2 |
| Design | Level Blockout, Balance/Economy | 2 |
| QA | Automation Testing | 1 |
| Audio | Audio Integration | 1 |
Example Tasks
Benchmarkable workflows defined in Raw Input → Raw Output form. Each requires real tool execution—not just "describing" or "answering."
★★★ / ★★☆ Core Workflows (7)
3D Character Modeling Pipeline
Create a game-ready 3D character from reference, including UV layout and PBR textures.
Game UI Automation Testing
Write and run automated tests for a game's UI flow using Airtest/Poco.
Balance Tables and Simulation
Design game balance tables and validate them through simulation to meet target metrics.
Level Blockout (Grayboxing)
Create a playable level blockout from a design document, with navigation and collision.
Gameplay Programming
Implement game mechanics from a design spec in a working, runnable scene.
Game Audio Integration
Author audio events in FMOD and integrate them into a game engine.
Rigging and Character Animation
Rig a character mesh and create a set of animations for game use.
Particle/VFX Creation
Create a visual effect using the engine's particle system to match a reference.
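To make the Balance Tables and Simulation workflow above more concrete, here is a minimal sketch of validating a balance table by simulation. Everything in it is an assumption for illustration: the two unit types, their stats, the duel rules, and the 50% ± 5% win-rate target are hypothetical and not taken from any real game or from the benchmark itself.

```python
import random

# Hypothetical balance table; the stats are illustrative only.
BALANCE_TABLE = {
    "knight": {"hp": 120, "damage": 10, "hit_chance": 0.80},
    "archer": {"hp": 80, "damage": 14, "hit_chance": 0.85},
}


def duel(a: dict, b: dict, rng: random.Random) -> bool:
    """Return True if unit `a` wins. Units alternate attacks, `a` first."""
    hp_a, hp_b = a["hp"], b["hp"]
    while True:
        if rng.random() < a["hit_chance"]:
            hp_b -= a["damage"]
            if hp_b <= 0:
                return True
        if rng.random() < b["hit_chance"]:
            hp_a -= b["damage"]
            if hp_a <= 0:
                return False


def win_rate(name_a: str, name_b: str, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate how often `name_a` beats `name_b`, alternating who attacks first."""
    rng = random.Random(seed)  # fixed seed makes the check reproducible
    a, b = BALANCE_TABLE[name_a], BALANCE_TABLE[name_b]
    wins = 0
    for i in range(trials):
        if i % 2 == 0:
            wins += duel(a, b, rng)
        else:
            wins += not duel(b, a, rng)
    return wins / trials


def meets_target(rate: float, target: float = 0.5, tolerance: float = 0.05) -> bool:
    """Pass/fail: is the simulated win rate inside the target band?"""
    return abs(rate - target) <= tolerance


rate = win_rate("knight", "archer")
print(f"knight win rate: {rate:.3f}, within target: {meets_target(rate)}")
```

Seeding the random generator means a reviewer can re-run the simulation and get the same verdict, which is what lets this kind of task be scored deterministically.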
★☆☆ Alternative Workflows (3)
Concept Art Reproduction
Reproduce a reference image to demonstrate technical skill and tool proficiency.
Shader Authoring and Debugging
Write a GLSL shader to match a target visual effect, ensuring it compiles and renders correctly.
Scoring & Review Agent Architecture
A two-layer validation system enables automated, reproducible evaluation.
Validation Pipeline
High Automation (100%)
- File format and integrity checks
- Compile/load success verification
- Numeric constraints (polycount, frames, budgets)
- Structural compliance (hierarchy, nodes, events)
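A file format and integrity check of the kind listed above might start with something as small as validating a binary glTF (GLB) header. The sketch below assumes the artifact is a GLB file and checks only the 12-byte header (magic, container version, declared length); the function name `check_glb_header` is hypothetical, and a fuller validator would also parse the JSON and binary chunks.

```python
import struct

GLB_MAGIC = 0x46546C67  # the ASCII bytes "glTF" read as a little-endian uint32


def check_glb_header(data: bytes) -> list[str]:
    """Return a list of header problems; an empty list means the header is consistent."""
    if len(data) < 12:
        return ["file is shorter than the 12-byte GLB header"]
    magic, version, length = struct.unpack_from("<III", data, 0)
    errors = []
    if magic != GLB_MAGIC:
        errors.append("magic bytes are not 'glTF'")
    if version != 2:
        errors.append(f"unexpected container version {version}")
    if length != len(data):
        errors.append(f"header declares {length} bytes, file has {len(data)}")
    return errors


# Header-only blob for demonstration; a real GLB also carries a JSON chunk.
header_only = struct.pack("<III", GLB_MAGIC, 2, 12)
print(check_glb_header(header_only))       # no header problems
print(check_glb_header(b"not a glb file"))  # reports the bad magic and mismatches
```

Returning a list of problems instead of a single boolean lets a scoring agent log why an artifact failed, not only that it failed.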
High Automation (80–100%)
- Image similarity (SSIM/LPIPS) for renders
- Runtime execution with logs
- Geometry/physics validation
- Simulation and statistical checks
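For the image similarity item, the sketch below computes a simplified SSIM score between two grayscale images given as flat pixel lists. It is a deliberate simplification: standard SSIM averages the statistic over local windows, whereas this version treats the whole image as a single window. The function name `global_ssim` is hypothetical, and a production pipeline would more plausibly rely on an established image library.

```python
def global_ssim(img_a, img_b, max_value=255.0):
    """Single-window SSIM over two equal-length flat grayscale pixel sequences.

    Simplified for illustration: no sliding window and no Gaussian weighting.
    """
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(img_a)
    mean_a = sum(img_a) / n
    mean_b = sum(img_b) / n
    var_a = sum((p - mean_a) ** 2 for p in img_a) / n
    var_b = sum((q - mean_b) ** 2 for q in img_b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(img_a, img_b)) / n
    # Stabilizing constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


reference = [0, 255, 0, 255]
inverted = [255, 0, 255, 0]
print(global_ssim(reference, reference))  # identical images score approximately 1
print(global_ssim(reference, inverted))   # anti-correlated images score below 0
```

A scoring rule would then compare the score to a threshold chosen per task; any specific threshold is a tuning decision rather than something fixed by the metric.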
Automation Level by Workflow
| Automation | Workflows | Why |
|---|---|---|
| 100% | QA Automation, Balance Sim | Deterministic pass/fail, re-run simulation |
| ~95% | 3D Modeling, Gameplay Programming | Mesh checks, compile + functional tests |
| ~90% | Shader, Level Blockout, Rigging | Compile + SSIM, pathing/collision, rig/loop checks |
| ~85% | Concept Art, Audio Integration | SSIM/LPIPS, events + waveform |
| ~75% | VFX/Particles | Performance is objective, visual similarity can be subjective |
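The pathing check listed for Level Blockout can be approximated, under strong assumptions, by a reachability test. The sketch below assumes the level has been exported to a 2D grid of walkable (`.`) and blocked (`#`) cells, which is a hypothetical stand-in for an engine's navigation mesh, and uses breadth-first search to confirm that the goal can be reached from the start.

```python
from collections import deque


def reachable(grid, start, goal):
    """Breadth-first search over a grid; '#' cells are blocked, all others walkable."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


# Hypothetical blockouts: 'S' marks the start cell, 'G' the goal cell.
OPEN_LEVEL = ["S.#",
              "..#",
              "#.G"]
SEALED_LEVEL = ["S#.",
                "##.",
                "..G"]

print(reachable(OPEN_LEVEL, (0, 0), (2, 2)))    # True: a walkable path exists
print(reachable(SEALED_LEVEL, (0, 0), (2, 2)))  # False: the start is walled in
```

A check like this is fully deterministic, which is consistent with the high automation estimate for Level Blockout; the remaining share would cover judgments such as layout quality that a reachability test cannot capture.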
Task Comparison Summary
| Workflow | Software | Representativeness | Automation | Scalability |
|---|---|---|---|---|
| Concept Art Reproduction | Krita | ★☆☆ | ~75% | ★★★★ |
| 3D Character Modeling Pipeline | Blender | ★★☆ | ~95% | ★★★★ |
| Shader Authoring and Debugging | Shadertoy | ★☆☆ | ~90% | ★★★★ |
| Game UI Automation Testing | Airtest | ★★☆ | 100% | ★★★★ |
| Balance Tables and Simulation | Python | ★★☆ | 100% | ★★★★ |
| Level Blockout (Grayboxing) | Godot | ★★★ | ~90% | ★★★★ |
| Gameplay Programming | Godot | ★★★ | ~95% | ★★★★ |
Recommended Tools (All Free/Open-Source)
Contribute to Game Development
We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map roles, workflows, and tools in game development. Share your perspective on the industry structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.