🎮
Under Review

Game Development

Tool-heavy, fully digital workflows with abundant verifiable artifacts—ideal for AI agent execution benchmarks.

Part 1

Industry Overview

Game development is one of the most promising domains for AI agent evaluation because it is tool-heavy, fully digital, and produces abundant verifiable artifacts.

2024 global games revenue: $188B+
Mobile platform share: ~46%
Digital distribution: 95%+

Industry Segments

Console/PC (AAA)

$50M–$300M+ budgets, 3–7 year cycles, teams of 100–1,000+

Mobile

Largest by revenue, F2P + IAP, fast iteration, LiveOps heavy

Indie

Teams of 1–20, innovation-heavy, early AI adopters

Engines/Middleware

Unity, Unreal, Godot, FMOD, Wwise, and more

Team Structure & Roles

Tier | Team Size | Timeline | Budget | Notes
Indie | 1–20 | 6 months – 3 years | $0–$2M | People wear many hats
AA | 10–50 | 1–3 years | $1M–$20M | Partial specialization
AAA | 100–500+ | 3–7 years | $50M–$300M+ | Deep specialization + outsourcing
Mega | 500–1000+ | 5–8 years | $200M+ | Multi-studio global coordination

Core Role Families

Art

Concept Artist, 3D Modeler, Animator, Rigger, VFX Artist, Lighting Artist, UI/UX

Engineering

Engine Programmer, Gameplay Programmer, Technical Artist, Tools Programmer, Network Engineer

Design

Game Designer, Level Designer, Systems/Economy Designer, Narrative Designer

QA

QA Tester, QA Automation/SDET, Performance Tester

Audio

Sound Designer, Audio Programmer, Composer

Production

Producer, Project Manager

Development Lifecycle

Concept

1–3 months | Team: 1–3

Concept doc, market analysis

Pre-production

6–12 months | Team: 10–30

Prototypes, vertical slice, GDD, art style bible

Production

1–3+ years | Team: Full

Assets at scale, gameplay systems, levels

Alpha / Beta

3–6 months | Team: Full + QA

Feature complete, content complete, optimization

Launch

n/a | Team: n/a

Platform cert, marketing, day-1 patch

LiveOps

Multi-year | Team: Live team

Seasons/DLC, balancing, bugfixes

For GaaS titles, LiveOps can dominate lifetime revenue and workload.

Part 2

Why Game Development Fits Agents' Last Exam

Game development offers unique advantages for AI agent benchmarking—combining creative and technical workflows with deterministic verification.

Fully Digital Workflows

Most work is done in software: DCC tools, engines, spreadsheets, build systems

Role Diversity

Covers creative (art), logic (engineering), systems (design), verification (QA), hybrid (TA)

Accessible Toolchain

Many core tools have free/open variants: Blender, Godot, Krita, FMOD indie tier

Standardized Artifacts

Common formats: FBX/OBJ/GLB, textures, shaders, scenes, scripts, logs

Verifiability

Many outputs can be checked automatically: mesh stats, compilation, tests, pass/fail

Public Data

Large corpora: Objaverse, OpenGameArt, shader repos, open-source projects

Technology Ecosystem

Game Engines

Unity: General-purpose, strong on mobile
Unreal Engine 5: High-fidelity AAA, strong rendering
Godot: Open-source, lightweight 2D/3D (MIT)
GameMaker: 2D-focused, freemium

DCC Tools (Digital Content Creation)

3D Modeling: Blender (GPL), Maya, 3ds Max
2D Painting: Krita (GPL), Photoshop
Texturing: Substance Painter, Material Maker
Audio: FMOD Studio, Wwise, Audacity

Part 3

Workflow Coverage

Ten candidate workflows covering five major role families: Art, Engineering/TA, Design, QA, and Audio. All can be built around free or accessible tooling.

Coverage by Role Family

Area | Workflows | Count
Art | Concept Art, 3D Modeling, Animation, VFX | 4
Engineering / TA | Shader Authoring, Gameplay Programming | 2
Design | Level Blockout, Balance/Economy | 2
QA | Automation Testing | 1
Audio | Audio Integration | 1

Part 4

Example Tasks

Benchmarkable workflows defined in Raw Input → Raw Output form. Each requires real tool execution, not just "describing" or "answering."

★★★ / ★★☆ Core Workflows (7)

Art (3D) ★★☆

3D Character Modeling Pipeline

Create a game-ready 3D character from reference, including UV layout and PBR textures.

QA ★★☆

Game UI Automation Testing

Write and run automated tests for a game's UI flow using Airtest/Poco.

Design (Systems) ★★☆

Balance Tables and Simulation

Design game balance tables and validate them through simulation to meet target metrics.
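As a rough illustration of what "validate through simulation" could look like, the sketch below runs a seeded Monte Carlo duel between two units and checks the win rate against a target band. The balance table, the stat names (`hp`, `damage`, `hit_chance`), and the 45–55% band are all hypothetical values chosen for the example, not part of any task specification.

```python
import random

# Hypothetical balance table: unit name -> stats. Values are illustrative only.
BALANCE_TABLE = {
    "knight": {"hp": 100, "damage": 12, "hit_chance": 0.8},
    "archer": {"hp": 100, "damage": 12, "hit_chance": 0.8},
}


def simulate_duel(a, b, rng):
    """Return 1.0 if a wins, 0.0 if b wins, 0.5 on a simultaneous knockout."""
    hp_a, hp_b = a["hp"], b["hp"]
    while hp_a > 0 and hp_b > 0:
        # Both units attack in the same round, so neither gets a first-move edge.
        if rng.random() < a["hit_chance"]:
            hp_b -= a["damage"]
        if rng.random() < b["hit_chance"]:
            hp_a -= b["damage"]
    if hp_a <= 0 and hp_b <= 0:
        return 0.5
    return 1.0 if hp_a > 0 else 0.0


def win_rate(table, name_a, name_b, trials=2000, seed=0):
    """Estimate name_a's win rate over many duels with a fixed seed."""
    rng = random.Random(seed)
    total = sum(
        simulate_duel(table[name_a], table[name_b], rng) for _ in range(trials)
    )
    return total / trials


def meets_target(rate, low=0.45, high=0.55):
    """Check the simulated rate against an assumed target band."""
    return low <= rate <= high
```

Fixing the seed keeps the check reproducible: a reviewer can re-run `win_rate(BALANCE_TABLE, "knight", "archer")` and get the same estimate the submitter saw.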

Design (Levels) ★★★

Level Blockout (Grayboxing)

Create a playable level blockout from a design document, with navigation and collision.
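One way a navigation check for a blockout might work is a reachability test from the spawn point to the goal. The sketch below assumes the level has already been exported to a simple character grid (`#` for solid geometry, anything else walkable); that export format and the `LEVEL` layout are assumptions made for illustration, not an engine feature.

```python
from collections import deque


def is_reachable(grid, start, goal):
    """Breadth-first search over walkable cells; '#' marks solid geometry."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    seen = {start}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        # Explore the four axis-aligned neighbours.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


# Hypothetical 3x4 blockout: S = spawn at (0, 0), G = goal at (2, 3).
LEVEL = [
    "S.#.",
    "..#.",
    "...G",
]

print(is_reachable(LEVEL, (0, 0), (2, 3)))
```

A real check would more likely query the engine's navigation mesh, but the pass/fail logic is the same: the goal must be reachable from the spawn.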

Engineering ★★★

Gameplay Programming

Implement game mechanics from a design spec in a working, runnable scene.

Audio ★★★

Game Audio Integration

Author audio events in FMOD and integrate them into a game engine.

Art (Animation) ★★★

Rigging and Character Animation

Rig a character mesh and create a set of animations for game use.

Art (VFX) ★★☆

Particle/VFX Creation

Create a visual effect using the engine's particle system to match a reference.

★☆☆ Alternative Workflows (3)

Art (2D) ★☆☆

Concept Art Reproduction

An artist reproduces a reference image to validate technical skills and tool proficiency.

Technical Art ★☆☆

Shader Authoring and Debugging

Write a GLSL shader to match a target visual effect, ensuring it compiles and renders correctly.

Part 5

Scoring & Review Agent Architecture

A two-layer validation system enables automated, reproducible evaluation.

Validation Pipeline

Layer 1: Deterministic Rules

High Automation (100%)

  • File format and integrity checks
  • Compile/load success verification
  • Numeric constraints (polycount, frames, budgets)
  • Structural compliance (hierarchy, nodes, events)
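A numeric constraint such as a polycount budget could be checked along these lines. The sketch counts triangles in Wavefront OBJ text by fan-triangulating each face; the function names, the sample mesh, and the budget values are hypothetical, and a production check would probably go through Blender or a mesh library rather than parsing text directly.

```python
def count_triangles(obj_text):
    """Count triangles in Wavefront OBJ text, fan-triangulating n-gons."""
    triangles = 0
    for line in obj_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "f":
            vertex_count = len(parts) - 1
            if vertex_count >= 3:
                # An n-sided face splits into n - 2 triangles.
                triangles += vertex_count - 2
    return triangles


def check_budget(obj_text, max_triangles):
    """Pass only if the mesh is non-empty and within the triangle budget."""
    tris = count_triangles(obj_text)
    return {"triangles": tris, "passed": 0 < tris <= max_triangles}


# Minimal sample mesh: a single quad, which triangulates into two triangles.
QUAD_OBJ = """\
v 0 0 0
v 1 0 0
v 1 1 0
v 0 1 0
f 1 2 3 4
"""

print(check_budget(QUAD_OBJ, max_triangles=10))
```

Because the result depends only on the file contents, the same submission always produces the same verdict.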
Layer 2: Evidence-Based

High Automation (80–100%)

  • Image similarity (SSIM/LPIPS) for renders
  • Runtime execution with logs
  • Geometry/physics validation
  • Simulation and statistical checks

Automation Level by Workflow

Automation | Workflows | Why
100% | QA Automation, Balance Sim | Deterministic pass/fail, re-run simulation
~95% | 3D Modeling, Gameplay Programming | Mesh checks, compile + functional tests
~90% | Shader, Level Blockout, Rigging | Compile + SSIM, pathing/collision, rig/loop checks
~85% | Concept Art, Audio Integration | SSIM/LPIPS, events + waveform
~75% | VFX/Particles | Performance is objective, visual similarity can be subjective

Task Comparison Summary

Workflow | Software | Rep. | Auto | Scale
Concept Art Reproduction | Krita | ★☆☆ | ~75% | ★★★★
3D Character Modeling Pipeline | Blender | ★★☆ | ~95% | ★★★★
Shader Authoring and Debugging | Shadertoy | ★☆☆ | ~90% | ★★★★
Game UI Automation Testing | Airtest | ★★☆ | 100% | ★★★★
Balance Tables and Simulation | Python | ★★☆ | 100% | ★★★★
Level Blockout (Grayboxing) | Godot | ★★★ | ~90% | ★★★★
Gameplay Programming | Godot | ★★★ | ~95% | ★★★★

Recommended Tools (All Free/Open-Source)

Godot, Unity, Blender, Krita, FMOD Studio, Airtest/Poco, Shadertoy, Python, openpyxl

Contribute to Game Development

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.