This role is focused on reverse-engineering how trained models work, because we believe that a mechanistic understanding is the most robust way to make advanced systems safe.
What you'll do
The Interpretability team at Anthropic is working to reverse-engineer how trained models work. We're looking for researchers and engineers to join our efforts.
- Implement and analyze research experiments, both quickly in toy scenarios and at scale in large models
- Set up and optimize research workflows to run efficiently and reliably at large scale
- Build tools and abstractions to support a rapid pace of research experimentation
- Develop and improve tools and infrastructure to support other teams in using Interpretability's work to improve model safety
What you need
- Have 5-10+ years of experience building software
- Are highly proficient in at least one programming language (e.g., Python, Rust, Go, Java) and are productive in Python
- Have some experience contributing to empirical AI research projects
- Have a strong ability to prioritize and direct effort toward the most impactful work, and are comfortable operating with ambiguity and questioning assumptions