Shape the Benchmark

Agents' Last Exam is Defined by Domain Experts

What you get by contributing

Insight into how agents perform in your workflow
Co-authorship on the manuscript
Monetary awards for high-impact contributions

Contribute a Task

The Task We Are Looking For

Industry production tasks executed using professional-grade tools or software, not simplified chat interactions.

We are collecting industry tasks that are:

Complex

Takes experts days, not minutes, and demands substantial domain expertise and effort.

Representative

Workflows used in real industry practice, carried out with the appropriate professional tools.

Verifiable

Outputs that are deterministic or scored against a clear rubric.

We strongly recommend watching the video below before submitting.

Complex

Takes experts days, not minutes

Bad example

Apply a color filter in DaVinci Resolve

Why bad: A single sub-operation that can be completed in fewer than 10 steps

Good example

Composite a cheetah from one video into an Olympic race in another video

End-to-end task: tracking, rotoscoping, compositing, color matching

Representative

Workflows used in real industry

Bad example

Convert 2D blueprint to 3D model in AutoCAD

Why bad: Wrong tool; AutoCAD is primarily for 2D drafting

Good example

Convert 2D blueprint to 3D model in SolidWorks/Rhino

Industry-standard tools designed for this workflow

Verifiable

Deterministic, or scored against an unambiguous rubric

Bad example

Design an RPG game with monsters

Why bad: Infinite valid outputs; hard to compare

Good example

Reproduce a Flash game using the provided assets

Output matches the reference exactly, so the check is deterministic

How to Make Your Output Verifiable

Follow this decision tree to decide what to upload for objective evaluation.

Yes: Upload the output files.

No: Continue to the next question.
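When an output is fully deterministic, checking it can be as simple as comparing the uploaded file with a reference file byte for byte. The following is a minimal sketch under that assumption; the helper names `sha256_of` and `outputs_match` are hypothetical and are not part of any Agents' Last Exam tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(candidate: Path, reference: Path) -> bool:
    """True only if the two files have identical contents."""
    return sha256_of(candidate) == sha256_of(reference)
```

A byte-level match is strict: formats that embed timestamps or other run-specific metadata would fail this check even when the visible result is identical, which is one reason the later questions in the tree exist.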

Examples

• Medical imaging: A task that generates a report identifying the tumor slice(s) in a brain MRI. If the key step is adjusting contrast or viewing layers to localize the tumor, a verifiable output can be a coordinate tuple, which is checked against an allowed region.
• Earth science: A task that retrieves data from a designated database. Because historical data is fixed, the result can be verified with a query such as: “What is the average rainfall from 2010–2013 in Wisconsin?”

Note: Avoid trivially guessable values (e.g., (1, 2, 3)).

Yes: Upload (1) the numeric metric as a separate file and (2) the original output.

No: Continue to the next question.
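The medical-imaging example above reduces to a point-in-region test. As a rough sketch, assume the allowed region is an axis-aligned box in voxel coordinates (x, y, slice) with inclusive bounds; the class name `AllowedRegion` and every number below are hypothetical, and a real task could use any region shape the task author defines.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[int, int, int]


@dataclass(frozen=True)
class AllowedRegion:
    """Axis-aligned box in voxel coordinates; both bounds are inclusive."""
    lower: Point
    upper: Point

    def contains(self, point: Point) -> bool:
        # Reject anything that is not a 3-component coordinate.
        if len(point) != 3:
            return False
        return all(lo <= p <= hi for lo, p, hi in zip(self.lower, point, self.upper))


# Hypothetical reference region; real bounds would come from the task author.
region = AllowedRegion(lower=(118, 94, 41), upper=(141, 120, 47))

print(region.contains((127, 103, 44)))  # True
print(region.contains((1, 2, 3)))       # False
```

Keeping the region small and away from obvious values is what makes a lucky guess unlikely to pass.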

Examples

• Video editing: For a task like “change the watermelon from green to black”, take a screenshot of the output and ask: “Is the watermelon black or green?”

Yes: Upload (1) corresponding question/answer pairs and (2) the original output.

No: Continue to the next question.
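Question/answer pairs can be stored in any structured form. The sketch below assumes a JSON list of objects with `question` and `answer` keys and scores by case-insensitive exact match; the schema and the function `score_answers` are illustrative assumptions, not a required submission format.

```python
import json

# Hypothetical file contents, using the watermelon example from this page.
QA_PAIRS_JSON = """
[
  {"question": "Is the watermelon black or green?", "answer": "black"}
]
"""


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score_answers(qa_pairs, predicted):
    """Return the fraction of questions whose predicted answer matches."""
    if not qa_pairs:
        raise ValueError("at least one question/answer pair is required")
    correct = sum(
        normalize(predicted.get(pair["question"], "")) == normalize(pair["answer"])
        for pair in qa_pairs
    )
    return correct / len(qa_pairs)


qa_pairs = json.loads(QA_PAIRS_JSON)
print(score_answers(qa_pairs, {"Is the watermelon black or green?": "Black"}))  # 1.0
```

Questions with a small, closed set of answers (such as "black or green") keep the scoring unambiguous.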

Last Resort

Upload your project file (the one opened in the professional tool or software) that generates the output. We'll help identify a verifiable subtask.

How to Prepare Your Submission

Every task submission follows a consistent structure. Here are the five components you need to define, followed by example tasks from different industries.

🎯

Task Description

What the agent must accomplish

📥

Input

Files, data, and context provided

🛠️

Software

Professional tools required

📤

Output

Expected deliverables

✅

Evaluation

How success is measured
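One way to keep the five components together while drafting a task is a small record type. The sketch below is only an illustration: the class `TaskSubmission` and its field names are hypothetical, and the field values are copied from the SolidWorks demo task on this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSubmission:
    """The five components of a submission, each as free text."""
    task_description: str
    inputs: str
    software: str
    output: str
    evaluation: str

    def missing_components(self):
        """Names of components that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


demo = TaskSubmission(
    task_description="Construct a 3D industrial part from 2D input specifications using SolidWorks",
    inputs="2D blueprint with dimensions and specifications (PNG)",
    software="SolidWorks",
    output="3D model file (.obj) matching all specified dimensions and features",
    evaluation="Dimensional accuracy within tolerance; all features present; correct topology",
)

print(demo.missing_components())  # []
```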

Demo Tasks

See more on the demo page

Game Development

Flash Game Reproduction (RPGMaker XP)

Reproduce a Flash game in another engine (RPGMaker XP)

Input

Flash game (.swf) file and template project

Output

Playable game as an executable (.exe)

Software

RPGMaker XP

Evaluation

The reproduced game runs, and screenshots of all levels match the reference

Manufacturing

3D Structure Development

Construct a 3D industrial part from 2D input specifications using SolidWorks

Input

2D blueprint with dimensions and specifications (PNG)

Output

3D model file (.obj) matching all specified dimensions and features

Software

SolidWorks

Evaluation

Dimensional accuracy within tolerance; all features present; correct topology
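A "dimensional accuracy within tolerance" check can be sketched as a comparison of measured dimensions against the blueprint values. Everything below is an assumption for illustration: the function name, the dimension names, the values, and the 0.1 mm default tolerance are hypothetical, and extracting dimensions from the .obj file is out of scope here.

```python
def dimension_failures(measured, expected, tol_mm=0.1):
    """Return {name: (measured, expected)} for each dimension out of tolerance.

    A dimension missing from `measured` is reported with a measured value of None.
    """
    failures = {}
    for name, target in expected.items():
        value = measured.get(name)
        if value is None or abs(value - target) > tol_mm:
            failures[name] = (value, target)
    return failures


# Hypothetical blueprint dimensions in millimetres.
expected = {"length": 120.0, "width": 45.0, "bore_diameter": 12.5}
measured = {"length": 120.04, "width": 45.3, "bore_diameter": 12.5}

print(dimension_failures(measured, expected))  # {'width': (45.3, 45.0)}
```

An empty result means every listed dimension passed; feature presence and topology would need separate checks.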

Ready to contribute? Submit your task and help build the benchmark.

Contribute a Task

Evaluate your task idea

Describe the task idea in plain language. We will assess whether it fits the Agents' Last Exam criteria, but this step does not create or save a submission.

Task idea

Plain English is fine. Focus on the goal, required tools, expected outputs, and how the result should be checked.

Have Questions?

See the FAQ and the Documentation for more.