🎮
Under Review

Game Development

Tool-heavy, fully digital workflows with abundant verifiable artifacts—ideal for AI agent execution benchmarks.

Part 1

Industry Overview

Game development is one of the most promising domains for AI agent evaluation because it is tool-heavy, fully digital, and produces abundant verifiable artifacts.

2024 global games revenue: $188B+
Mobile platform share: ~46%
Digital distribution: 95%+

Industry Segments

Console/PC (AAA)

$50M–$300M+ budgets, 3–7 year cycles, teams of 100–1000+

Mobile

Largest segment by revenue; free-to-play (F2P) with in-app purchases (IAP), fast iteration, LiveOps-heavy

Indie

Teams of 1–20, innovation-heavy, early AI adopters

Engines/Middleware

Unity, Unreal, Godot, FMOD, Wwise, and more

Team Structure & Roles

Tier | Team Size | Timeline | Budget | Notes
Indie | 1–20 | 6 months – 3 years | $0–$2M | People wear many hats
AA | 10–50 | 1–3 years | $1M–$20M | Partial specialization
AAA | 100–500+ | 3–7 years | $50M–$300M+ | Deep specialization + outsourcing
Mega | 500–1000+ | 5–8 years | $200M+ | Multi-studio global coordination

Core Role Families

Art

Concept Artist, 3D Modeler, Animator, Rigger, VFX Artist, Lighting Artist, UI/UX

Engineering

Engine Programmer, Gameplay Programmer, Technical Artist, Tools Programmer, Network Engineer

Design

Game Designer, Level Designer, Systems/Economy Designer, Narrative Designer

QA

QA Tester, QA Automation/SDET, Performance Tester

Audio

Sound Designer, Audio Programmer, Composer

Production

Producer, Project Manager

Development Lifecycle

Concept

1–3 months | Team: 1–3

Concept doc, market analysis

Pre-production

6–12 months | Team: 10–30

Prototypes, vertical slice, game design document (GDD), art style bible

Production

1–3+ years | Team: Full

Assets at scale, gameplay systems, levels

Alpha / Beta

3–6 months | Team: Full + QA

Feature complete, content complete, optimization

Launch

N/A | Team: N/A

Platform cert, marketing, day-1 patch

LiveOps

Multi-year | Team: Live team

Seasons/DLC, balancing, bugfixes

For games-as-a-service (GaaS) titles, LiveOps can dominate lifetime revenue and workload.

Part 2

Why Game Development Fits AgentHLE

Game development offers unique advantages for AI agent benchmarking—combining creative and technical workflows with deterministic verification.

Fully Digital Workflows

Most work is done in software: DCC tools, engines, spreadsheets, build systems

Role Diversity

Covers creative (art), logic (engineering), systems (design), verification (QA), and hybrid (technical art) roles

Accessible Toolchain

Many core tools have free/open variants: Blender, Godot, Krita, FMOD indie tier

Standardized Artifacts

Common formats: FBX/OBJ/GLB, textures, shaders, scenes, scripts, logs

Verifiability

Many outputs can be checked automatically: mesh stats, compilation, tests, pass/fail

Public Data

Large corpora: Objaverse, OpenGameArt, shader repos, open-source projects

Technology Ecosystem

Game Engines

Unity: General-purpose, strong on mobile
Unreal Engine 5: High-fidelity AAA, strong rendering
Godot: Open-source, lightweight 2D/3D (MIT)
GameMaker: 2D-focused, freemium

DCC Tools (Digital Content Creation)

3D Modeling: Blender (GPL), Maya, 3ds Max
2D Painting: Krita (GPL), Photoshop
Texturing: Substance Painter, Material Maker
Audio: FMOD Studio, Wwise, Audacity

Part 3

Workflow Coverage

Ten candidate workflows covering five major role families: Art, Engineering/TA, Design, QA, and Audio. All can be built around free or accessible tooling.

Coverage by Role Family

Area | Workflows | Count
Art | Concept Art, 3D Modeling, Animation, VFX | 4
Engineering / TA | Shader Authoring, Gameplay Programming | 2
Design | Level Blockout, Balance/Economy | 2
QA | Automation Testing | 1
Audio | Audio Integration | 1

Part 4

Example Tasks

Benchmarkable workflows defined in Raw Input → Raw Output form. Each requires real tool execution—not just "describing" or "answering."

★★★ / ★★☆ Core Workflows (8)

Art (3D) ★★☆

3D Character Modeling Pipeline

Create a game-ready 3D character from reference, including UV layout and PBR textures.

QA ★★☆

Game UI Automation Testing

Write and run automated tests for a game's UI flow using Airtest/Poco.

Design (Systems) ★★☆

Balance Tables and Simulation

Design game balance tables and validate them through simulation to meet target metrics.
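
As a rough illustration of what "validate through simulation" could look like, the sketch below runs seeded Monte Carlo duels between two entries of a balance table and checks the resulting win rate against a target band. The table contents, the unit names, the duel rules, and the 45–55% target are all hypothetical placeholders, not part of any actual task specification.

```python
import random

# Hypothetical balance table: units and numbers are illustrative only.
BALANCE_TABLE = {
    "knight": {"hp": 120, "damage": 14, "hit_chance": 0.80},
    "archer": {"hp": 90, "damage": 18, "hit_chance": 0.85},
}


def simulate_duel(table, a, b, rng, max_turns=1000):
    """Play one alternating-turn duel; return the winner's name or None for a draw."""
    hp = {a: table[a]["hp"], b: table[b]["hp"]}
    # A coin flip decides who attacks first so neither side gets a fixed advantage.
    attacker, defender = (a, b) if rng.random() < 0.5 else (b, a)
    for _ in range(max_turns):
        if rng.random() < table[attacker]["hit_chance"]:
            hp[defender] -= table[attacker]["damage"]
            if hp[defender] <= 0:
                return attacker
        attacker, defender = defender, attacker
    return None  # turn cap reached without a knockout


def win_rate(table, a, b, trials=10_000, seed=0):
    """Estimate how often unit `a` beats unit `b`; seeded so runs are reproducible."""
    rng = random.Random(seed)
    wins = sum(simulate_duel(table, a, b, rng) == a for _ in range(trials))
    return wins / trials


def meets_target(rate, low=0.45, high=0.55):
    """Assumed target metric: the matchup should be close to even."""
    return low <= rate <= high


if __name__ == "__main__":
    rate = win_rate(BALANCE_TABLE, "knight", "archer")
    print(f"knight vs archer win rate: {rate:.3f}, on target: {meets_target(rate)}")
```

Because the random seed is fixed, a reviewer can re-run the same simulation and obtain the same win rate, which is what makes this kind of task fully automatable.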

Design (Levels) ★★★

Level Blockout (Grayboxing)

Create a playable level blockout from a design document, with navigation and collision.
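
One way a navigation requirement like this might be checked is a reachability test from the player spawn to the goal. The sketch below assumes the blockout has already been exported to a 2D walkability grid, which is a hypothetical intermediate format; a real check would more likely query the engine's own navigation data. The grid symbols (`S`, `G`, `#`) and the helper names are illustrative.

```python
from collections import deque

# Hypothetical exported blockouts: '#' is solid collision, 'S' spawn, 'G' goal.
OPEN_LEVEL = [
    "S..#",
    ".#.#",
    "...G",
]
BLOCKED_LEVEL = [
    "S.#.",
    "..#.",
    "..#G",
]


def find(grid, symbol):
    """Return the (row, col) of the first cell containing `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"symbol {symbol!r} not found in grid")


def is_reachable(grid, start, goal):
    """Breadth-first search over walkable cells using 4-way movement."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


if __name__ == "__main__":
    for name, level in (("open", OPEN_LEVEL), ("blocked", BLOCKED_LEVEL)):
        ok = is_reachable(level, find(level, "S"), find(level, "G"))
        print(f"{name}: goal reachable = {ok}")
```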

Engineering ★★★

Gameplay Programming

Implement game mechanics from a design spec in a working, runnable scene.

Audio ★★★

Game Audio Integration

Author audio events in FMOD and integrate them into a game engine.

Art (Animation) ★★★

Rigging and Character Animation

Rig a character mesh and create a set of animations for game use.

Art (VFX) ★★☆

Particle/VFX Creation

Create a visual effect using the engine's particle system to match a reference.

★☆☆ Alternative Workflows (2)

Art (2D) ★☆☆

Concept Art Reproduction

Reproduce a reference image in a painting tool to demonstrate technical skill and tool proficiency.

Technical Art ★☆☆

Shader Authoring and Debugging

Write a GLSL shader to match a target visual effect, ensuring it compiles and renders correctly.

Part 5

Scoring & Review Agent Architecture

A two-layer validation system enables automated, reproducible evaluation.

Validation Pipeline

Layer 1: Deterministic Rules

High Automation (100%)

  • File format and integrity checks
  • Compile/load success verification
  • Numeric constraints (polycount, frames, budgets)
  • Structural compliance (hierarchy, nodes, events)
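
A deterministic rule of the kind listed above could be as simple as counting geometry in an exported mesh and comparing it to a budget. The following is a minimal sketch for Wavefront OBJ text, assuming faces are fan-triangulated (an n-sided face counts as n - 2 triangles); the function names, the sample mesh, and the budget values are hypothetical.

```python
# Hypothetical sample: a single quad, which triangulates into two triangles.
QUAD_OBJ = """\
v 0 0 0
v 1 0 0
v 1 1 0
v 0 1 0
f 1 2 3 4
"""


def obj_mesh_stats(obj_text):
    """Count vertices and triangles in OBJ text without loading a 3D library."""
    vertices = 0
    triangles = 0
    for line in obj_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices += 1
        elif parts[0] == "f":
            # An n-sided face contributes n - 2 triangles under fan triangulation.
            triangles += max(len(parts) - 1 - 2, 0)
    return {"vertices": vertices, "triangles": triangles}


def check_mesh(obj_text, max_triangles):
    """Apply pass/fail rules: the mesh must be non-empty and within budget."""
    stats = obj_mesh_stats(obj_text)
    failures = []
    if stats["vertices"] == 0:
        failures.append("no vertices found")
    if stats["triangles"] == 0:
        failures.append("no faces found")
    if stats["triangles"] > max_triangles:
        failures.append(
            f"triangle count {stats['triangles']} exceeds budget {max_triangles}"
        )
    return {"passed": not failures, "failures": failures, **stats}


if __name__ == "__main__":
    print(check_mesh(QUAD_OBJ, max_triangles=2))
    print(check_mesh(QUAD_OBJ, max_triangles=1))
```

Checks like this need no human judgment, so the same submission always produces the same verdict.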

Layer 2: Evidence-Based

High Automation (80–100%)

  • Image similarity (SSIM/LPIPS) for renders
  • Runtime execution with logs
  • Geometry/physics validation
  • Simulation and statistical checks
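
To make the image-similarity item concrete, here is a simplified sketch that scores a render against a reference using the SSIM formula computed once over the whole image. This is a deliberate simplification: standard SSIM averages the same statistic over local sliding windows, and a production pipeline would more likely rely on an established image library. The 0.9 pass threshold and the function names are assumptions for illustration; LPIPS is not shown because it requires a trained network.

```python
def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM for two equally sized grayscale images (lists of rows)."""
    a = [p for row in img_a for p in row]
    b = [p for row in img_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) * (p - mu_a) for p in a) / n
    var_b = sum((q - mu_b) * (q - mu_b) for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


def render_matches(reference, render, threshold=0.9):
    """Assumed acceptance rule: the similarity score must reach the threshold."""
    return global_ssim(reference, render) >= threshold


if __name__ == "__main__":
    reference = [[0, 255], [255, 0]]
    inverted = [[255, 0], [0, 255]]
    print("identical:", render_matches(reference, reference))
    print("inverted:", render_matches(reference, inverted))
```

A score near 1 indicates a close structural match, while an inverted checkerboard like the one in the usage example scores below zero and fails the check.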

Automation Level by Workflow

Automation | Workflows | Why
100% | QA Automation, Balance Sim | Deterministic pass/fail, re-run simulation
~95% | 3D Modeling, Gameplay Programming | Mesh checks, compile + functional tests
~90% | Shader, Level Blockout, Rigging | Compile + SSIM, pathing/collision, rig/loop checks
~85% | Concept Art, Audio Integration | SSIM/LPIPS, events + waveform
~75% | VFX/Particles | Performance is objective, visual similarity can be subjective

Task Comparison Summary

Workflow | Software | Rep. | Automation | Scale
Concept Art Reproduction | Krita | ★☆☆ | ~75% | ★★★★
3D Character Modeling Pipeline | Blender | ★★☆ | ~95% | ★★★★
Shader Authoring and Debugging | Shadertoy | ★☆☆ | ~90% | ★★★★
Game UI Automation Testing | Airtest | ★★☆ | 100% | ★★★★
Balance Tables and Simulation | Python | ★★☆ | 100% | ★★★★
Level Blockout (Grayboxing) | Godot | ★★★ | ~90% | ★★★★
Gameplay Programming | Godot | ★★★ | ~95% | ★★★★

Recommended Tools (All Free/Open-Source)

Godot, Unity, Blender, Krita, FMOD Studio, Airtest/Poco, Shadertoy, Python, openpyxl

Contribute to Game Development

We seek high-level, representative contributions rather than exhaustive documentation. Share your expertise in any of the areas covered above.

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.