Agents' Last Exam

Challenge and measure AI agents on economically valuable, real-world tasks.

Agents' Last Exam aims to build the largest and broadest agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. Led by Berkeley RDI together with 300+ industry experts, it now spans all 55 targeted sub-industries, covering most major fields of professional work performed on a computer. More than 1,500 tasks have been collected toward a 5,000-task target. The benchmark is designed to keep scores objective, comparable, and meaningful across domains.

Motion & VFX

3D Modeling

Game Development

Mold Flow Analysis

Architectural Modeling

Brain Imaging

What makes Agents' Last Exam different

Broadest Coverage
Verifiable Outcomes
Long-Horizon
Economically Valuable

55 Sub-Industries Covered

1.5K+ Tasks Collected

300+ Experts

Co-led by

Berkeley RDI × RDI Foundation

Contributors & Partners from

Academic Institutions

MIT
Harvard
Stanford
UC Berkeley
Oxford
CMU
Caltech
ETH Zurich
Yale
Columbia
UPenn
Cornell
Brown
Johns Hopkins
NIH
UCLA
UCSF
NYU
University of Michigan
University of Washington
Georgia Tech
USC
UIUC
Washington University in St. Louis
University of Melbourne
UC San Diego
UC Santa Barbara
UC Irvine
University of Wisconsin-Madison
Emory
UNC Chapel Hill
McGill
University of Waterloo
Boston University
University of Helsinki
Monash
University of Colorado
UC Santa Cruz
UC Riverside
Northeastern
Syracuse
Lehigh
UT Southwestern
Texas A&M

Industries

Goldman Sachs
JPMorgan
Morgan Stanley
PIMCO
Meta
Amazon
Adobe
Oracle
Hippocratic AI
HubSpot
Brix
Photon Fund
Snorkel AI
Unipat AI
Tianqiao and Chrissy Chen Institute (TCCI)

Advisory Committee

Why Contribute: Help Set the Standard for Agent Evaluation in Your Industry

Shape evaluation standards, publish research, and earn recognition.

Insight into Agents in Industry

See exactly how AI agents handle real workflows in industry, and where they fall short.

Co-authorship on Manuscript

Qualifying contributors receive co-authorship credit on the research publication.

Monetary Awards

High-impact contributions are recognized with monetary awards from our $100K+ funding pool.

Choose How You'd Like to Contribute

For Domain Experts

Contribute domain expertise and real workflow data; no coding required.

For Researchers & Engineers

Turn real workflows into challenging, reproducible agent benchmarks: setup, execution, and evaluation.

FAQ

Common questions about software access, authorship, venues, and timeline.

FAQ page →

Contact

For inquiries, reach out to the team directly.

rdi_research@berkeley.edu

Stay Updated

Subscribe for announcements, benchmark releases, and updates.

Join mailing list →