Full-Time

Senior System Software Engineer – GPU Performance at NVIDIA

Company NVIDIA
Location Santa Clara
Salary Competitive salary
Posted 0 days ago

Job Description

We are looking for a motivated performance engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected by high-speed interconnects (e.g., NVLink, PCIe) within a node and by high-speed networking (e.g., InfiniBand, Ethernet) across nodes.

What you'll do

  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters
  • Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack
  • Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available
  • Triage and root-cause performance issues reported by our customers
  • Collect large volumes of performance data; build tools and infrastructure to visualize and analyze it
  • Collaborate with a very dynamic team across multiple time zones

What you need

  • M.S. (or equivalent experience) or PhD in Computer Science or a related field, with relevant performance engineering and HPC experience
  • 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
  • Experience conducting performance benchmarking and triage on large-scale HPC clusters
  • Good understanding of computer system architecture, HW-SW interactions, and operating system principles (i.e., systems software fundamentals)
  • Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required
  • Ability to debug performance issues across the entire HW/SW stack. Proficiency in a scripting language, preferably Python
  • Familiarity with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible, Docker)
  • Adaptability and passion for learning new areas and tools. Flexibility to work and communicate effectively across different teams and time zones

