Summary
Microsoft AI are looking for a talented Member of Technical Staff, High Performance Computing Engineer at their London office. This role sits at the heart of building and scaling the infrastructure that trains their frontier models and powers the next evolution of their personal AI, Copilot. You'll work directly with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.
About the Role
As a Member of Technical Staff, High Performance Computing Engineer, you will design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings. You will own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale. You will serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.
Accountabilities
- Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.
- Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.
- Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.
The Candidate We're Looking For
Experience:
- 4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters.
- 4+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, and Ray).
- 4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP.
Technical skills:
- Experience with LLM training clusters.
- Experience working with AI platforms, frameworks, and APIs.
- Experience with machine learning frameworks, including using, deploying, and scaling large language models, either personally or professionally.
Personal attributes:
- Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.
- Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.
Benefits
- Competitive salary and benefits package.
- Opportunity to work with a leading technology company and contribute to its mission.
- Collaborative and dynamic work environment.
- Professional development opportunities.