Epoch AI

Researcher, Evaluations

Posted 4 Hours Ago
Remote
Hiring Remotely in USA
115K-200K Annually
Mid level
Design and maintain a realistic benchmark suite of open-ended office tasks, create grading rubrics, run and analyze frontier AI models, produce public-facing evaluations and visualizations, and improve/automate evaluation workflows.
The summary above was generated by AI

Epoch AI is looking for a researcher to evaluate frontier AI models on hard-to-grade tasks drawn from real-world scenarios.

About the role

We’re seeking a Researcher to lead a new effort evaluating how well frontier models perform on the kinds of open-ended tasks that make up real office work. You will curate a suite of realistic tasks to serve as a benchmark, design the grading rubrics for AI performance, and run newly-released models through the suite, assessing their performance both quantitatively and qualitatively.

The focus is on how models handle messy, real-world work rather than on scientific knowledge or programming ability. The role makes heavy use of AI tools, but strong software engineering experience is not required. Comfort setting up AI-assisted automated workflows is a plus.

If this role sounds interesting, we are also looking for researchers on multiple other teams. 

Applications are rolling

Key Responsibilities

  • Create and curate an evaluation suite. Find real-world tasks that serve as challenging tests for practical AI capabilities, and update the tasks over time as AI capabilities evolve. Devise rubrics for evaluating AI performance.
  • Evaluate AI systems. Regularly evaluate new, notable AI models and products on the task suite. Update tasks and rubrics to reflect the changing landscape of AI capabilities.
  • Communicate your research. Create public-facing reports, blog posts, and data visualizations with your observations. Ensure the evaluations feed into our other research topics and help keep our team informed. 
  • Conduct data analysis. Analyze evaluation results and compare models across tasks. 
  • Improve the process. You might automate parts of the workflow, and build out parts of the evaluation into standalone benchmarks. 

What we are looking for

  • Analytical thinking. You conduct experiments with rigor and care, making sure that findings are well-supported by evidence.
  • Grounded, skeptical mentality. You form your own well-reasoned view of what an AI system can do, distinguishing practical capabilities from hype.
  • Comfort with AI agents and tools. You have experience working with AI agents in the course of your own work, and are comfortable delegating tasks.
  • Familiarity with AI benchmarks and evaluations. You follow AI capabilities at least casually and have opinions on what benchmarks do and don’t tell us.
  • Research and data-analysis experience, including enough comfort with light coding to analyze your own results.
  • Strong written communication skills. You can convey nuanced observations clearly and precisely.

Nice to have

  • Experience testing frontier models and writing assessments of their capabilities
  • Coding skills, including Python proficiency
  • If you don’t tick all these boxes but think you would be a great fit, please consider applying anyway!

Compensation & Benefits

  • Annual salary between $115,000 – $200,000 USD, depending on location and experience. 
  • Salaries are not restricted to USD, and contracts and payments are usually in local currencies. Conversions are based on one-year average exchange rates.
  • Fully remote environment, including flexible work hours. 
  • Competitive global benefits program, including comprehensive health insurance (with supplemental benefits specific to your country, as available and mandated by local law), life insurance, and a pension plan, if applicable in your country.
  • Generous paid time off (PTO) with no specific annual limit and 30 days per year protected, unlimited personal and sick leave, and 4 months of paid parental leave for permanent staff with at least 12 months of tenure (prorated if less than 12 months).
  • A flexible and generous expense policy for you to spend on equipment and a large range of productivity tools or learning/development opportunities, including unlimited spending on AI tools, subject to regulations and manager approval. 
  • Paid work trips, including 3 staff retreats per year and relevant conferences.
  • Access to our very well-equipped offices in Berkeley, California, including paid meals, snacks, gym, and more. All staff, independently of where they are based, have access to the office for at least 20 days each year.

Additional Information

While we welcome applicants from all time zones, we prefer candidates whose working hours overlap with the range between UTC–8 (Pacific Time) and UTC (Greenwich Mean Time), as most of our staff work in these time zones. We also prefer candidates who can travel: we hold three retreats per year, during which we record podcast episodes and work on other communication efforts.

Please submit all of your application materials in English and note that we require professional level English proficiency.

Epoch is committed to building an inclusive, equitable, and supportive community for you to thrive and do your best work. We’re committed to finding the best people for our team, so please don’t hesitate to apply for a role regardless of your age, gender identity/expression, political identity, personal preferences, physical abilities, veteran status, neurodiversity or any other background. Please email [email protected] if you have any questions about this role, accessibility requests, or if you want to request an extension to the application deadline. However, we will not review applications submitted to this email address; please submit your application through the link on this page. 

About Epoch AI

Epoch AI is a research institute that investigates trends in machine learning and the economic consequences of AI. Our mission is to develop a comprehensive, publicly accessible knowledge base on AI that informs policymakers, industry leaders, and society at large.

We strive for both rigor and accessibility in our work, as exemplified by some of our most successful projects, including our database of AI models and our AI trends dashboard. Our body of research includes our work on compute trends (IJCNN 2022), data scarcity (ICML 2024), and algorithmic progress (NeurIPS 2024). You can read more about our work and mission on our website and in this Time profile.
