Join Our Team

Call for Engineering Collaborators

Shape the future of AI evaluation. Core contributors earn co-authorship.

Partner with us to build the next generation of AI agent benchmarks. You will design tasks that test frontier models in real-world software environments and help define the standards that will guide the field.

Your Role & Impact

(Actively Onboarding)

Instead of just writing code, you will own a benchmark task end-to-end.

  • Task Ownership: Design the core logic (main.py) that tests the true limits of state-of-the-art AI (see the sketch after this list).
  • Expert Evaluation: Define rigorous success criteria, task configurations, and reference artifacts.
  • Pipeline Implementation: Author command sequences for reliable interaction within remote environments.
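
To make these responsibilities concrete, here is a minimal, purely illustrative sketch of what a task's main.py might contain. The setup commands, file paths, and grade function are hypothetical placeholders for this example, not the project's actual template; the real structure is covered in the onboarding docs.

```python
# Hypothetical sketch only: names and structure are illustrative, not the project's template.
import subprocess
from pathlib import Path

# Command sequence the remote environment runs to prepare the task (hypothetical).
SETUP_COMMANDS = [
    "git clone https://example.com/fixture-repo.git workspace",
    "pip install -r workspace/requirements.txt",
]

def run_setup() -> None:
    """Execute the setup command sequence inside the remote environment."""
    for cmd in SETUP_COMMANDS:
        subprocess.run(cmd, shell=True, check=True)

def grade(workspace: Path) -> bool:
    """Success criterion: compare the agent's output to a reference artifact."""
    produced = workspace / "output" / "report.json"
    reference = Path("reference_artifacts") / "report.json"
    return produced.exists() and produced.read_text() == reference.read_text()

if __name__ == "__main__":
    run_setup()
    print("PASS" if grade(Path("workspace")) else "FAIL")
```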

Our Commitment to You

A frictionless, "no dependency hell" research experience.

  • Zero Infrastructure Hassle: Fully managed remote Windows/Linux environments (no personal cloud spend required).
  • Streamlined Onboarding: Clear baseline docs, templates, and a quickstart workflow.
  • Collaborative Feedback: Constructive, research-driven reviews to elevate your work to publishable quality.

Let's Talk

If you would like to explore fit, reach out to sunyiyou@berkeley.edu or hanxinyang@berkeley.edu.