We're seeking an operationally-focused System Software Engineer to ensure the stability, reliability, and flawless execution of all NVIDIA Deep Learning Institute (DLI) training events. You will also oversee the broader day-to-day operational health of the entire learning platform. Your operational acumen will be instrumental in powering our latest educational experiences focused on safe, trustworthy, and ethical AI, ensuring a seamless experience for instructors and learners.
Join a close-knit team where your contributions truly matter. As a core member of our learning systems platform team, you'll collaborate with creative educators to ensure our hands-on training sets the standard for user experience. You'll play a crucial role in making our purpose-built Learning Management System (LMS) platform a delightful and efficient tool that empowers both learners and instructors.
What you'll be doing:
- Develop comprehensive operational plans and de-risking strategies to ensure flawless execution of technical training events.
- Provide expert, hands-on technical leadership during live training events, managing deployments and rapidly resolving emergent issues for an optimal user experience.
- Oversee the stability, scalability, performance, and reliability of the DLI learning platform by implementing SRE principles and leading incident response.
- Lead cross-functional coordination, establish and enforce operational best practices, and drive continuous improvement initiatives to enhance platform services.
What we need to see:
- Bachelor's degree in Computer Science or a related technical field (or equivalent experience), and more than 5 years of DevOps experience optimizing, deploying, and running containerized applications (Docker, Kubernetes) across AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.
- Proficient in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.
- Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.
- Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of diagnosing and resolving complex technical challenges under pressure.
- Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.
Ways to stand out from the crowd:
- Proven experience designing and implementing event-driven architectures using pub/sub patterns with platforms like AWS SNS/SQS, Google Pub/Sub, or Azure Service Bus.
- Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as Retrieval Augmented Generation (RAG) and vector databases.
- Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT) for model development, serving, and optimization. Production experience with NVIDIA NIM is a strong plus.
- Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and managed software development environments, applying SRE principles to automate operations, enhance reliability, and improve performance.
- Familiarity with Python-based Learning Management Systems (LMS) such as Open edX.