Join our AI Networking co-design and benchmark R&D team as a senior software engineer. You will be responsible for building and productizing machine learning tools that use ML-based combinatorial optimization and design space exploration (DSE) techniques. These tools will be employed to optimize AI workloads across large GPU and CPU clusters, ensuring the most efficient and productive utilization of system resources at data center scale.

You will work on distributed Deep Learning, particularly within LLM training and inference stacks. A strong passion for collective communication and networking is desirable. You will interact with diverse hardware and platforms, such as Host Channel Adapters (HCAs), Switches, CPUs, GPUs, and complete Systems.

The role requires engagement across multiple software layers, including LLM applications, machine learning frameworks, and communication and computing libraries. You will develop tools and methodologies using Machine Learning (ML) for comprehensive performance analysis and optimization, potentially incorporating learning-based agentic techniques.

This position offers a distinct opportunity to make significant contributions to the core infrastructure powering the next generation of large-scale AI systems.

Key Responsibilities:

Design and implement resource allocation and combinatorial optimization techniques to optimize LLMs at data center scale.
Research, develop, and deploy AI/ML techniques to optimize large-scale Deep Learning (LLM) training and inference on NVIDIA supercomputers and distributed systems.
Build and productionize ML-based tools for performance prediction and optimization, with a strong emphasis on networking aspects.
Develop and deploy a scalable, reliable data curation pipeline capable of handling complex data types, such as time series and PyTorch model graphs, to effectively support the training of high-performance Machine Learning models.
Collaborate across hardware and software teams to deliver valuable performance analysis insights.
Lead performance test planning, establish performance targets for new technologies and solutions, and drive efforts to achieve those performance goals.

Requirements:

Master's degree in Computer Science, Software Engineering, or equivalent experience.
Experience applying machine learning techniques to computer architecture and system optimization problems.
Hands-on experience developing and deploying various learning algorithms to tackle optimization challenges within computer architecture, system design, or networking domains.
Proficiency in building and using ML models with leading frameworks such as PyTorch, TensorFlow, or JAX.
Proven ability to apply GNN- or transformer-based optimization to PyTorch model graphs and Kineto execution traces.
Expertise combining knowledge of NVIDIA GPUs, the CUDA library, and deep learning frameworks (TensorFlow/PyTorch) with networking concepts, including collective communication libraries (like NCCL) and protocols (such as RoCE and RDMA).
Strong programming capabilities in Python, Bash, and C++.
A collaborative teammate with effective communication and interpersonal abilities.

Nice to Have:

In-depth knowledge and experience with machine learning/reinforcement learning and frameworks.
Comprehensive understanding of computer architecture, system architecture, and networking.
Extensive experience in applying machine learning techniques such as GNNs or related graph-based models.
Knowledge of the PyTorch, CUDA, and NCCL libraries.
Proven software engineering/development skills.

Software Engineer, AI Networking- New College Grad 2026 at NVIDIA
