Full-Time

Member of Technical Staff, Hardware Health at Microsoft AI

Company Microsoft AI
Location Redmond
Salary Competitive salary
Posted 0 days ago

Job Description

Summary

Microsoft AI is looking for a talented Member of Technical Staff, Hardware Health, to ensure that its AI clusters deliver sustained reliability, performance, and availability across exascale-class deployments.

About the Role

We work closely with research, hardware, datacenter, and platform engineering teams to develop predictive health models, failure detection frameworks, and autonomous remediation systems that keep our AI clusters operating at frontier scale. Our team is responsible for Copilot, Bing, Edge, and generative AI research.

Accountabilities

  • Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
  • Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
  • Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
  • Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
  • Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
  • Drive automation in health management so that manual intervention is needed only for the top 5% of anomalies.
  • Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.

The Candidate we're looking for

Experience:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Technical skills:

  • Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
  • Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
  • Proficiency in hardware telemetry, diagnostics, or failure analysis tools.

Personal attributes:

  • Strong analytical and problem-solving skills.
  • Excellent communication and collaboration skills.

Benefits

  • Competitive salary.
  • Comprehensive benefits package.
  • Opportunities for professional growth and development.
  • Collaborative and dynamic work environment.
