Full-Time

Senior Datacenter Technical Program Manager, At-Scale AI Clusters at NVIDIA

Company NVIDIA
Location Santa Clara
How You'll Work onsite
Level senior
Sector Technology
Posted 0 days ago

Job Description

We are looking for a highly motivated Technical Program Manager (TPM) to join our Applied Systems Engineering Team to drive datacenter integration for the next generation of NVIDIA AI supercomputing systems.

This TPM will play a crucial role throughout the lifecycle of the latest AI systems at scale, from datacenter design and requirements definition, through systems integration of AI clusters into the datacenter environment, to support for these systems as they enter production.

The successful candidate will collaborate with outstanding engineers and architects to build and deploy large-scale GPU computing systems based on NVIDIA's reference supercomputing architectures.

Key responsibilities include:

  • Collaborating with engineering leaders across multiple hardware and software teams to build AI supercomputers for NVIDIA engineers and develop reference architectures to advise customers and partners.
  • Leading the integration of new AI clusters with datacenter facilities with demanding requirements on power, cooling, and instrumentation.
  • Coordinating design and fit-out of new datacenter builds, working with both internal engineering teams and external contractors.
  • Owning and producing detailed documentation for the end-to-end process for datacenter fit-out and integration.
  • Communicating internally with engineering leadership to prioritize and address key issues essential to the success of our largest customers.

We are looking for a TPM with a strong background in high-performance computing systems and GPU clusters deployed in on-premises datacenters.

  • BS in Applied Science or Engineering (or equivalent experience)
  • 8+ years of overall experience
  • Experience with high-performance computing systems and GPU clusters deployed in on-premises datacenters
  • A passion for understanding challenging technical problems and driving the process of finding a solution
  • Strong teamwork and interpersonal skills, to facilitate building a collaborative workflow for coordination between many teams
  • Understanding of datacenter design, including familiarity with power and cooling technologies
  • Expertise in system monitoring and instrumentation of large clusters, using technologies such as Prometheus, Grafana, Splunk, Modbus, and BACnet
  • Experience working with the engineering or academic research community supporting high-performance computing or deep learning

You will also be eligible for equity and benefits.
