We are looking for a highly motivated Technical Program Manager (TPM) to join our Applied Systems Engineering Team to drive datacenter integration for the next generation of NVIDIA AI supercomputing systems.
This TPM will play a crucial role throughout the lifecycle of the latest AI systems at scale, from datacenter design and requirements definition, through systems integration of AI clusters into the datacenter environment, to support for these systems as they enter production.
The successful candidate will collaborate with outstanding engineers and architects to build and deploy large-scale GPU computing systems based on NVIDIA's reference supercomputing architectures.
Key responsibilities include:
- Collaborating with engineering leaders across multiple hardware and software teams to build AI supercomputers for NVIDIA engineers and develop reference architectures to advise customers and partners.
- Leading the integration of new AI clusters with datacenter facilities with demanding requirements on power, cooling, and instrumentation.
- Coordinating design and fit-out of new datacenter builds, working with both internal engineering teams and external contractors.
- Owning and producing detailed documentation for the end-to-end process for datacenter fit-out and integration.
- Communicating internally with engineering leadership to prioritize and address key issues essential to the success of our largest customers.
We are looking for a TPM with a strong background in high-performance computing systems and GPU clusters deployed in on-premises datacenters. Qualifications include:
- BS in Applied Science or Engineering (or equivalent experience)
- 8+ years of overall experience
- Experience with high-performance computing systems and GPU clusters deployed in on-premises datacenters
- A passion for understanding challenging technical problems and driving the process of finding a solution
- Strong teamwork and interpersonal skills for building collaborative workflows that coordinate many teams
- Understanding of datacenter design, including familiarity with power and cooling technologies
- Expertise in system monitoring and instrumentation of large clusters, using technologies such as Prometheus, Grafana, Splunk, Modbus, and BACnet
- Experience working with the engineering or academic research community supporting high-performance computing or deep learning
You will also be eligible for equity and benefits.