Opening. This role is a pivotal part of xAI's mission to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.
What you'll do
We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.
- Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.
- Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers.
What you need
- Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.
- Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.