Full-Time

Member of Technical Staff, High Performance Computing Engineer at Microsoft AI

Company Microsoft AI
Location London
Salary Competitive salary
Posted 0 days ago

Job Description

Summary

Microsoft AI are looking for a talented Member of Technical Staff, High Performance Computing Engineer at their London office. This role sits at the heart of building and scaling the infrastructure that trains their frontier models and powers the next evolution of their personal AI, Copilot. You'll work directly with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.

About the Role

As a Member of Technical Staff, High Performance Computing Engineer, you will design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings. You will own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale. You will serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.

Accountabilities

  • Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.
  • Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.
  • Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.

The Candidate We're Looking For

Experience:

  • 4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters.
  • 4+ years of experience working with high-scale training clusters (e.g., NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray).
  • 4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP.

Technical skills:

  • Experience with LLM training clusters.
  • Experience working with AI platforms, frameworks, and APIs.
  • Experience using machine learning frameworks, including using, deploying, and scaling large language models, either personally or professionally.

Personal attributes:

  • Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.
  • Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.

Benefits

  • Competitive salary and benefits package.
  • Opportunity to work with a leading technology company and contribute to Microsoft AI's mission.
  • Collaborative and dynamic work environment.
  • Professional development opportunities.
