You will be working as a Senior Systems Engineer in our Artificial Intelligence Operations team. We build AI platforms for operating AI factories, with the goal of making a lasting impact on the resilient operation of AI clusters.
What you'll be doing:
You will gather and understand internal and external customer requirements to improve AI cluster resiliency, and design AIOps-based solutions that address these needs. You will develop automated workflows for issue detection and root cause analysis, and collaborate closely with operators to debug sophisticated, full-stack AI cluster problems. You'll also deliver compelling technical presentations, lead hands-on demos and training, handle evaluation deployments (POC/POV), and ensure smooth, reliable installations by staying engaged and supportive throughout the customer journey.
Requirements:
- Bachelor of Science or equivalent experience
- 8+ years of networking experience in enterprise or service provider environments, with strong hands-on expertise in routing and switching
- Proficient in scripting and automation using Python or similar languages, with strong Linux expertise
- Proven experience working directly with customers to resolve issues and ensure success in Systems Engineer or SRE roles
- Exceptional oral, written, and presentation skills for clearly communicating complex technical topics
- Demonstrated ability to collaborate effectively across teams, partnering with operations, engineering, and product development
Nice to have:
- Experience with data center infrastructure and cloud architectures
- Background in network performance monitoring or observability
- Previous experience working at a technology start-up
With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world's most desirable employers. You will also be eligible for equity. Applications for this job will be accepted at least until March 13, 2026.