Summary
Microsoft AI are looking for a talented Member of Technical Staff, Hardware Health, to ensure these systems deliver sustained reliability, performance, and availability across exascale-class deployments.
About the Role
We work closely with research, hardware, datacenter, and platform engineering teams to develop predictive health models, failure detection frameworks, and autonomous remediation systems that keep our AI clusters operating at frontier scale. Our team is responsible for Copilot, Bing, Edge, and generative AI research. Join us and help shape the future of personal computing.
Accountabilities
- Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
- Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
The Candidate we're looking for
Experience:
- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Technical skills:
- Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
- Experience with exascale-class systems or cloud-scale AI clusters.
Personal attributes:
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration skills.
Benefits
- Competitive salary range: $139,900 – $274,800 per year.
- Comprehensive benefits package, including health insurance, retirement plan, and paid time off.
- Opportunities for professional growth and development.
- Collaborative and dynamic work environment.