Graduate Site Reliability Engineer
Critical Cloud delivers cloud operations across AWS and Azure. Our SRE team runs production environments for a portfolio of tech-led customers: monitoring, incident response, infrastructure automation, and continuous improvement. Everyone on the team touches real infrastructure for real customers.
This is the entry point into the SRE team. You'll work directly alongside senior engineers, learning how we operate production environments, instrument systems with Datadog, and respond to incidents. From the start you'll contribute to real work (monitoring customer environments, writing runbooks, supporting infrastructure changes), with progressively more ownership as your confidence and knowledge grow.
We're looking for a graduate with the core fundamentals, a genuine curiosity about how systems work under pressure, and the communication skills to operate in a customer-facing environment. You don't need production experience. You need to be the kind of engineer who asks why, reads the docs all the way through, and takes reliability seriously.
- Monitor customer AWS and Azure environments using Datadog, learning to triage alerts, separate signal from noise, and escalate with context
- Support incident response workflows alongside senior engineers, contributing to postmortem documentation and remediation tracking
- Assist with Datadog onboarding and instrumentation for new customers: agents, integrations, dashboards, monitors, and log pipelines
- Support infrastructure-as-code work (Terraform) for provisioning and configuration changes across customer accounts, under senior review
- Write and maintain runbooks and operational documentation: clear, accurate, and usable by anyone on the team at 3am
- Participate in proactive reliability reviews: alert tuning, capacity checks, dependency mapping, with guidance from senior engineers
- Contribute to internal tooling and AI-assisted automation initiatives as part of the wider engineering team
- Communicate directly with customers on day-to-day operational queries with a professional, calm, and clear style
We're a small team. The path from graduate to senior is real, and faster than at most places.
Site Reliability Engineer
Site Reliability Engineer or Platform Eng
Lead Engineer
- A degree in Computer Science, Software Engineering, or a related technical discipline, or equivalent demonstrable self-taught fundamentals
- Solid Linux fundamentals: CLI navigation, file systems, processes, networking basics
- Comfort with scripting in Bash, Python, or similar; you've automated something, even if small
- Understanding of core observability concepts: what metrics, logs, and traces are and what they tell you
- Awareness of cloud fundamentals: you know what EC2, S3, VPCs, and load balancers do, even without production experience
- Clear written and verbal communication; you'll be in customer-facing situations from early on
- Right to work in the UK without sponsorship
- Any hands-on Datadog experience: a trial, personal project, or university lab
- Terraform or any infrastructure-as-code exposure
- Docker or Kubernetes; even containerising a personal project counts
- A cloud certification (AWS Cloud Practitioner, Azure Fundamentals, or equivalent)
- Experience in a customer-facing environment, even outside tech
- Any personal projects involving monitoring, automation, or infrastructure
- 25 days' holiday plus bank holidays, and a paid day off for your birthday, taken in the month it falls
- Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total
- Enhanced maternity pay: 26 weeks at your full basic salary
- Enhanced paternity pay: 2 weeks at your full basic salary
- Datadog, AWS, and Azure certifications paid for by the company: contractual, not discretionary
- AI tooling certifications also funded; staying current is part of the role
- Flexible working requests from your first day of employment: a statutory right, supported in full
- Company-provided laptop and peripherals, set up before you start
- Workplace pension, auto-enrolled
We're a small, senior-heavy team and we hire graduates who want to operate at the level of someone two or three years ahead of where they are today. You'll be trusted early, expected to ask good questions, and supported by engineers who've done this before. The best fit is someone who's genuinely curious about how production systems break, who reads documentation properly, and who communicates proactively when something's unclear.
You don't need to know everything. You do need to be the kind of engineer who figures things out and who cares enough about reliability to want to prevent the same problem twice. We operate to ISO 27001, which means documentation, change control, and process discipline are part of the job. That's not bureaucracy. It's how you build things that stay working.
Four principles that show up in how we operate real infrastructure for real customers.
The engineers who progress fastest here ask "why" about every system they touch. Why is this alert configured this way? Why is this runbook written like this? Curiosity about the infrastructure you're operating is how you grow from observer to owner.
When you're assigned a task or an investigation, you see it through. You don't get stuck and go quiet; you ask, escalate, and update. We don't hand things off and hope. We take problems to resolution and prevent them next time.
We run multiple customer environments simultaneously. Everything you build or document has to be operable by anyone on the team: consistent, clear, maintainable. Build for the engineer picking it up without context.
Customers trust us with what's mission-critical. Every stable environment, every clean runbook, every resolved issue is how we earn and keep that trust. Consistency is the only currency that matters here.
Join the pipeline
We're not actively hiring right now, but we keep applications on file. Tell us about something you've built or operated: what broke, what you learned, and what you'd do differently. Personal projects count. No templates.