We're looking for Research Engineers to build the evaluations that tell us, and the world, what Claude can actually do. Your work will turn ambiguous notions of "intelligence" into clear, defensible metrics that researchers, leadership, and the public can rely on.

You'll design and implement evaluations across the full spectrum of Claude's capabilities and personality, and build the infrastructure that runs them reliably at scale. You'll partner closely with researchers throughout the lifecycle of a new capability, from defining what to measure, to running the eval against live training checkpoints, to interpreting the results.

Key responsibilities include designing and running new evaluations of Claude's capabilities, building and hardening the distributed eval execution platform, owning dashboards researchers and leadership use to monitor model health, debugging anomalous eval results, improving tooling and workflows, partnering with research teams, running experiments, and communicating evaluations and results to internal stakeholders and external audiences.

Minimum qualifications include strong Python programming skills, experience building or operating distributed systems, data pipelines, or other infrastructure that needs to be reliable at scale, clear written and verbal communication, comfort operating in an on-call or production-support capacity, and care about the societal impacts of your work.

Preferred qualifications include hands-on experience using large language models, background in data visualization, experience developing robust evaluation metrics for language models, experience with observability, monitoring, or experiment-tracking systems, background in statistics and experimental design, experience running or supporting ML training infrastructure, and a bias toward picking up slack and operating flexibly across team boundaries.

The annual compensation range for this role is $320,000-$485,000 USD.

Minimum education is a Bachelor's degree or an equivalent combination of education, training, and/or experience. Required field of study is a field relevant to the role as demonstrated through coursework, training, or professional experience. Minimum years of experience required will correlate with the internal job level requirements for the position.

Location-based hybrid policy: currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship: we do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links; visit anthropic.com/careers directly for confirmed position openings.

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Pr


Research Engineer, Model Evaluations at Anthropic

