We are looking for a Senior or Staff Infrastructure Engineer to act as a primary technical lead, engineering the 'paved road' for our knowledge retrieval and inference engines. You will define the deployment standards for Agentic workflows at scale, bridging the gap between complex AI orchestration and world-class infrastructure so that our platform remains the most reliable destination for enterprise agents.
As a Senior or Staff Infrastructure Engineer, you will:
Architect multi-cloud systems and abstractions that allow the SGP platform to run on top of existing cloud providers.
Use our own data and AI platform to analyze build and test logs and metrics to identify areas for improvement.
Define the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable Agentic workflows for enterprise customers.
Enhance engineering and infrastructure efficiency, reliability, accuracy, and response times, including CI/CD processes, test frameworks, data quality assurance, end-to-end reconciliation, and anomaly detection.
Collaborate with platform and product teams to develop and implement innovative infrastructure that scales to meet evolving needs.
Design and champion highly scalable, reliable, and low-latency infrastructure and frameworks for building, orchestrating, and evaluating multi-agent systems at enterprise scale.
Lead the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies.
Own the development and maintenance of our best-in-class Agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health and enable rapid incident response.
Drive developer productivity by building automated tooling and championing Infrastructure-as-Code (IaC) paradigms throughout the engineering organization to streamline workflows and operations.
We are looking for someone with proven experience in a senior role and 5+ years of full-time software engineering experience. You should have a deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm Charts), container orchestration (e.g., Kubernetes), and observability platforms (e.g., Datadog, Prometheus, Grafana).
You should also have extensive experience with at least one major cloud provider (AWS, Azure, or GCP) and strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups.
Bonus points for hands-on experience with, and a passion for, Agents, LLMs, vector databases, and other emerging AI technologies.
This role may be eligible for additional benefits such as a commuter stipend.