Join Our Team
Call for Engineering Collaborators
Shape the future of AI evaluation. Core contributors earn co-authorship.
Partner with us to build the next generation of AI agent benchmarks. You will design tasks that test frontier models in real-world software environments and help define the standards that will guide the field.
Your Role & Impact
(Actively Onboarding) Instead of just writing code, you will own a benchmark task end-to-end.
• Task Ownership: Design the core task logic (main.py) to probe the true limits of state-of-the-art AI (see the illustrative sketch below).
• Expert Evaluation: Define rigorous success criteria, task configurations, and reference artifacts.
• Pipeline Implementation: Write the command sequences that interact reliably with remote environments.
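To make that end-to-end ownership concrete, here is a minimal, purely illustrative sketch of what a task's main.py could look like. Every name in it (TASK_CONFIG, check_success, the artifact paths) is a hypothetical placeholder, not the project's actual interface.

```python
# Hypothetical sketch of a benchmark task's main.py.
# All names and fields are illustrative assumptions, not the real task API.
from pathlib import Path

# Task configuration: target environment, setup steps, and the reference artifact
# the agent's output is compared against.
TASK_CONFIG = {
    "environment": "windows",                     # remote environment the agent runs in
    "setup_commands": ["mkdir C:\\workspace"],    # commands run before the agent starts
    "reference_artifact": "expected_report.txt",  # ground-truth output
}

def check_success(output_dir: str) -> bool:
    """Return True if the agent produced an artifact matching the reference."""
    produced = Path(output_dir) / "report.txt"
    if not produced.exists():
        return False
    expected = Path(TASK_CONFIG["reference_artifact"]).read_text().strip()
    return produced.read_text().strip() == expected

if __name__ == "__main__":
    # Local smoke test against a sample output directory.
    print("success:", check_success("sample_output"))
```

The general shape is the point: a task pairs a configuration, a success check, and reference artifacts, and the contributor owns all three.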
Our Commitment to You
A frictionless, "no dependency hell" research experience.
• Zero Infrastructure Hassle: Fully managed remote Windows/Linux environments (no personal cloud spend required).
• Streamlined Onboarding: Clear baseline docs, templates, and a quickstart workflow.
• Collaborative Feedback: Constructive, research-driven reviews to bring your work to publishable quality.
Let's Talk
If you would like to explore whether this is a fit, reach out at sunyiyou@berkeley.edu or hanxinyang@berkeley.edu.