Shape the Benchmark


Agent-HLE is Defined by Domain Experts

What you get by contributing
  • Insight into how agents perform in your workflow
  • Co-authorship on the manuscript
  • Monetary awards for high-impact contributions
Contribute a Task

The Task We Are Looking For

Industry production tasks executed using professional-grade tools or software, not simplified chat interactions.

We are collecting industry tasks that are:

Complex

Takes experts days, not minutes, and demands substantial domain expertise and effort.

Representative

Workflows used in real industry with the right professional tools.

Verifiable

Outputs that are deterministic or scored against a clear rubric.

We strongly recommend watching the video below before submitting.

Complex: takes experts days, not minutes
  • Bad example: Apply a color filter in DaVinci Resolve
    Why bad: Single sub-operation completable in <10 steps
  • Good example: Move cheetah to Olympic race in another video
    End-to-end task: tracking, rotoscoping, compositing, color matching

Representative: workflows used in real industry
  • Bad example: Convert 2D blueprint to 3D model in AutoCAD
    Why bad: Wrong tool; AutoCAD is primarily for 2D drafting
  • Good example: Convert 2D blueprint to 3D model in SolidWorks/Rhino
    Industry-standard tools designed for this workflow

Verifiable: deterministic or scored against an unambiguous rubric
  • Bad example: Design an RPG game with monsters
    Why bad: Infinite valid outputs; hard to compare
  • Good example: Reproduce flash game using provided assets
    Output matches the reference exactly (deterministic)

How to Make Your Output Verifiable

Follow this decision tree to decide what to upload for objective evaluation.

Yes: Upload the output files.
No: Continue to the next question.
Examples
  • Medical imaging: A task that generates a report identifying tumor slice(s) in a brain MRI. If the key step is adjusting contrast / viewing layers to localize the tumor, a verifiable output can be a coordinate tuple; check whether it falls within an allowed region.
  • Earth science: A task that retrieves data from a designated database. Because historical data is fixed, you can verify with a query like: “What is the average rainfall from 2010–2013 in Wisconsin?”
Note: Avoid trivially guessable values (e.g., (1, 2, 3)).
Yes: Upload (1) the numeric metric as a separate file and (2) the original output.
No: Continue to the next question.
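
To illustrate the numeric-metric route, the medical imaging example above could be checked with a few lines of Python. This is only a minimal sketch, not the benchmark's actual grader: the `Region` type, the `in_region` helper, and every bound shown are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned box of accepted (x, y, slice) values; bounds are inclusive."""
    lo: tuple[int, int, int]
    hi: tuple[int, int, int]


def in_region(point: tuple[int, int, int], region: Region) -> bool:
    """Return True when every coordinate lies within the region's bounds."""
    return all(l <= p <= h for p, l, h in zip(point, region.lo, region.hi))


# Hypothetical allowed region around a tumor; the numbers are made up.
allowed = Region(lo=(110, 85, 40), hi=(140, 120, 46))

print(in_region((125, 100, 43), allowed))  # inside the region
print(in_region((1, 2, 3), allowed))       # a trivially guessable tuple falls outside
```

Keeping the allowed region tight and away from round or small values makes a lucky guess unlikely, which is the point of the note above.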
Examples

Video editing: A task like “change the watermelon from green to black”—take a screenshot and ask: “Is the watermelon black or green?”

Yes: Upload (1) corresponding question/answer pairs and (2) the original output.
No: Continue to the next question.
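
Question/answer pairs like the watermelon example could be scored by comparing normalized answers. The sketch below assumes exact matching after lowercasing and whitespace cleanup; `normalize` and `score_answers` are hypothetical helpers, and the real evaluation may use a different matching rule.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def score_answers(reference: dict[str, str], predicted: dict[str, str]) -> float:
    """Return the fraction of reference questions answered correctly."""
    if not reference:
        raise ValueError("reference must contain at least one question")
    correct = sum(
        normalize(predicted.get(question, "")) == normalize(answer)
        for question, answer in reference.items()
    )
    return correct / len(reference)


# Reference pair taken from the video editing example; the prediction is illustrative.
reference = {"Is the watermelon black or green?": "black"}
predicted = {"Is the watermelon black or green?": " Black "}

print(score_answers(reference, predicted))  # 1.0
```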
Last Resort

Upload your project file (the one opened in the professional tool or software) that generates the output. We'll help identify a verifiable subtask.

How to Prepare Your Submission

Every task submission follows a consistent structure. Here are the five components you need to define, followed by example tasks from different industries.

🎯
Task Description
What the agent must accomplish
📥
Input
Files, data, and context provided
🛠️
Software
Professional tools required
📤
Output
Expected deliverables
✅
Evaluation
How success is measured
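
For contributors who prefer to draft a task in a structured form first, the five components above map naturally onto a small record. This is a sketch under assumptions: `TaskSubmission` and its field names are hypothetical and do not describe the actual submission form or any required file format.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskSubmission:
    """Hypothetical record mirroring the five components of a task submission."""
    task_description: str  # what the agent must accomplish
    input_files: str       # files, data, and context provided
    software: str          # professional tools required
    output: str            # expected deliverables
    evaluation: str        # how success is measured

    def missing_components(self) -> list[str]:
        """Return the names of components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Draft based on the Manufacturing example, with the evaluation left blank.
draft = TaskSubmission(
    task_description="Construct a 3D industrial part from 2D input specifications",
    input_files="2D blueprint with dimensions and specifications (PNG)",
    software="SolidWorks",
    output="3D model file (.obj)",
    evaluation="",
)

print(draft.missing_components())  # ['evaluation']
```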
Game Development
Flash Game Reproduction (RPGMaker XP)

Reproduce a Flash game using another engine (RPGMaker XP)

Input

Flash game (.swf) file and template project

Output

Playable game as an .exe file

Software
RPGMaker XP
Evaluation

Reproduced game; screenshots of all levels match the reference

Manufacturing
3D Structure Development

Construct a 3D industrial part from 2D input specifications using SolidWorks

Input

2D blueprint with dimensions and specifications (PNG)

Output

3D model file (.obj) matching all specified dimensions and features

Software
SolidWorks
Evaluation

Dimensional accuracy within tolerance; all features present; correct topology
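
The dimensional part of this evaluation could be sketched as a comparison of measured values against the blueprint specification. The example assumes the dimensions have already been extracted from the model into a dictionary, which is the harder step and is not shown; `check_dimensions`, the dimension names, and the tolerance are all hypothetical. Feature presence is approximated by a missing key, and topology is not checked here.

```python
import math


def check_dimensions(
    spec: dict[str, float], measured: dict[str, float], tolerance: float
) -> list[str]:
    """Return names of dimensions that are missing or outside the absolute tolerance."""
    failures = []
    for name, expected in spec.items():
        actual = measured.get(name)
        # rel_tol=0.0 makes this a purely absolute comparison.
        if actual is None or not math.isclose(
            actual, expected, rel_tol=0.0, abs_tol=tolerance
        ):
            failures.append(name)
    return failures


# Illustrative values in millimetres; not taken from any real blueprint.
spec = {"length_mm": 50.0, "width_mm": 20.0, "hole_diameter_mm": 6.0}
measured = {"length_mm": 50.03, "width_mm": 20.4}

print(check_dimensions(spec, measured, tolerance=0.1))
# ['width_mm', 'hole_diameter_mm']
```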

Ready to contribute? Submit your task and help build the benchmark.

Evaluate your task idea

Describe the task idea in plain language. We will assess whether it fits Agent-HLE's criteria, but this step does not create or save a submission.

Plain English is fine. Focus on the goal, required tools, expected outputs, and how the result should be checked.

Have Questions?

See more in the FAQ and the Documentation.