Senior Site Reliability Engineer
Apply at source. Synthesia handles the application directly; Houtini doesn't take a fee from candidates or companies. We curate which companies appear; the listings come from yubhub.
What the team is looking for.
Synthesia is the world's leading AI video platform for business, used by over 90% of the Fortune 100.
As AI continues to shape the way we live and work, Synthesia develops products to enhance visual communication and enterprise skill development, helping people work better and stay at the center of successful organisations.
Following our recent Series E funding round, where we raised $200 million, our valuation stands at $4 billion. Our total funding exceeds $530 million from premier investors including Accel, NVentures (Nvidia's VC arm), Kleiner Perkins, GV, and Evantic Capital, alongside the founders and operators of Stripe, Datadog, Miro, and Webflow.
About the team
Cloud Infrastructure owns the platform every Synthesia product runs on: AWS, Kubernetes, MongoDB, Temporal, our observability stack, and the vendor and cost relationships underneath them. We're a small, high-leverage team scaling toward a domain-ownership model: small groups that both _build_ and _operate_ the systems they're accountable for.
The role
We're hiring a dedicated SRE to take real ownership of operational excellence across Cloud Infrastructure. Today, too much critical operational knowledge (vendor relationships, cost management, and incident response) lives with one or two people. Your mission is to take genuine ownership of those domains, make them resilient to any single person, and raise the bar on how reliably we run. This is not simply a ticket-queue or keep-the-lights-on role. You'll own domains end to end: understand them deeply, operate them well, and build the automation and tooling that make them boring. We deliberately pair operational and engineering work so the role grows rather than narrows.
What you'll own
- Incident management & operational excellence: take custody of the incident process, covering on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve.
- Automation & reliability engineering: automate low-frequency, high-consequence operations (the certificate-renewal class of problem: rare, easy to forget, outage-causing when missed), not just the high-frequency toil. You decide what to automate based on risk and blast radius, not just time saved.
- A platform domain: over time, deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it.
- Vendor & third-party management: own key external relationships and integrations (e.g. LLM API providers, third-party services), today managed manually and informally. Bring structure, automation, and bus-factor resilience.
- FinOps: own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing.
What success looks like (first 12 months)
- Critical operational knowledge is documented and shared, with no single point of failure for vendor, cost, or incident response.
- Measurable reliability gains: fewer SEV1–SEV3 incidents per quarter, faster customer-impact resolution, and a much higher share of incidents caught by monitoring before customers feel them.
- High-risk manual processes are automated and self-documenting.
What we're looking for
- Strong production operations experience on AWS and Kubernetes; comfortable with MongoDB and scripting/automation in Python.
- An operations-and-reliability mindset (you take pride in systems that run quietly), _paired with_ the instinct to engineer the problem away rather than absorb it manually.
- Sound judgement on incidents and risk; calm and clear under pressure.
- Influences through relationships and evidence, not escalation; comfortable owning a domain and partnering across teams.
- Bonus: vendor/cost management exposure, Temporal, observability tooling.