We are seeking a highly skilled Member of Technical Staff to help manage and improve reliability across a multi-data-center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.
What you'll do
This position is pivotal in bridging software engineering principles with physical data center realities. By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation.
- Design, develop, and deploy scalable code and services to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.
- Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers.
What you need
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).
- 5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, or a related field.