Full-Time

Member of Technical Staff – Infrastructure Reliability at xAI

Company xAI
Location Palo Alto, CA
Salary $180,000 - $400,000 USD
How You'll Work onsite
Level staff
Sector Technology
Posted 5 days ago

Job Description

About the Role

We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI's pace, we need engineers with battle-tested experience keeping massive distributed infrastructure, both on-prem and cloud-based, up and running 24/7.

You will own the availability, performance, and evolution of xAI's core compute, storage, and networking infrastructure. This is not an ops-only role: strong coding is a hard requirement. You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization.

Responsibilities

  • Define and execute the technical strategy for infrastructure reliability and scalability
  • Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy
  • Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes
  • Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)
  • Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust

Basic Qualifications

  • 5+ years shipping production software and/or operating distributed infrastructure at scale
  • Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming
  • Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++

Preferred Skills and Experience

  • Significant contributions to large-scale GPU clusters or AI/ML infrastructure
  • Experience with on-call rotations and incident response in high-stakes environments

Compensation and Benefits

$180,000 – $400,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.


