We are seeking an IC Agentic Engineering Manager to lead the development and application of agent-based systems for infrastructure delivery and operations within Stargate.
This is a player-coach role: you will contribute directly to system design and implementation while leading a small team. You will focus on applying agentic systems to infrastructure workflows such as deployment orchestration, system bring-up, issue triage, debugging, and capacity management.
This role is not focused on building general-purpose agent platforms. Instead, it is centered on applying agentic systems to solve concrete infrastructure problems, working closely with hardware, networking, and cluster teams.
Key Responsibilities
- Design and build agent-based systems to support infrastructure deployment and operations
- Identify high-impact opportunities to apply agents across workflows such as:
  - cluster bring-up and deployment readiness
  - incident triage and root cause analysis
  - system validation and health monitoring
  - capacity management and operational decision-making
- Lead a small team while contributing directly as an IC across system design, development, and integration
- Partner with infrastructure, hardware, and networking teams to integrate agentic systems into production workflows
- Develop systems that leverage telemetry, logs, and system signals to enable closed-loop automation
- Define evaluation frameworks to measure system effectiveness, reliability, and operational impact
- Drive iteration from prototype to production, ensuring robustness and scalability
Qualifications
- Strong software engineering background in distributed systems, infrastructure, or platform engineering
- Experience building production automation systems or data-driven operational tooling
- Experience applying AI, ML, or agent-based approaches to real-world systems or workflows
- Ability to operate as a hands-on IC while leading a small team
- Experience working cross-functionally with infrastructure, hardware, or systems teams
- Strong problem-solving skills in complex, ambiguous environments
Preferred Skills
- Experience with LLM-based systems, agents, or autonomous workflows
- Background in infrastructure operations, SRE, or large-scale system deployment
- Experience working on cluster bring-up, debugging, or data center infrastructure systems
- Familiarity with telemetry, monitoring systems, and observability pipelines
- Experience building internal tools or platforms for engineering productivity and operations