Critical Cloud · Careers

Graduate Site Reliability Engineer

Pipeline open · Graduate Programme · UK Remote / Cardiff · Full-Time

Salary: £25–30k
Location: UK Remote
Stack: Datadog-native

Why Now

We're growing the customer base and the SRE team is expanding. This is a real engineering role from day one, not a graduate scheme with a two-year rotation. You'll work alongside senior engineers on real production environments, learn Datadog, AWS, and Azure in depth, and build the kind of hands-on operational experience that most graduates don't get until their third year elsewhere. SREs join a shared on-call rota, typically one week in five or six, reducing as the team grows. On-call weeks are paid at £500, adding roughly £5–6k a year on top of salary.

About Us

We are the world's first "Powered by Datadog" accredited MSP, a Datadog-native cloud managed service provider built for European tech-led SMBs. Our founders have scaled and exited multiple technology businesses. We operate lean, move fast, and take observability seriously.

Critical Cloud delivers cloud operations across AWS and Azure. Our SRE team runs production environments for a portfolio of tech-led customers: monitoring, incident response, infrastructure automation, and continuous improvement. Everyone on the team touches real infrastructure for real customers.

The Role

This is the entry point into the SRE team. You'll work directly alongside senior engineers, learning how we operate production environments, instrument systems with Datadog, and respond to incidents. From the start you'll contribute to real work (monitoring customer environments, writing runbooks, supporting infrastructure changes), taking on progressively more ownership as your confidence and knowledge grow.

We're looking for a graduate with the core fundamentals, a genuine curiosity about how systems work under pressure, and the communication skills to operate in a customer-facing environment. You don't need production experience. You need to be the kind of engineer who asks why, reads the docs all the way through, and takes reliability seriously.

What You'll Do

Monitor customer AWS and Azure environments using Datadog, learning to triage alerts, separate signal from noise, and escalate with context
Support incident response workflows alongside senior engineers, contributing to postmortem documentation and remediation tracking
Assist with Datadog onboarding and instrumentation for new customers: agents, integrations, dashboards, monitors, and log pipelines
Support infrastructure-as-code work (Terraform) for provisioning and configuration changes across customer accounts, under senior review
Write and maintain runbooks and operational documentation: clear, accurate, and usable by anyone on the team at 3am
Participate in proactive reliability reviews: alert tuning, capacity checks, dependency mapping, with guidance from senior engineers
Contribute to internal tooling and AI-assisted automation initiatives as part of the wider engineering team
Communicate directly with customers on day-to-day operational queries with a professional, calm, and clear style

Tech Stack

Datadog: Core observability platform
AWS: Primary cloud, multi-account
Azure: Secondary cloud workloads
Terraform: Infrastructure as code
Kubernetes: Container orchestration
GitHub Actions: CI/CD pipelines
PagerDuty: Incident management
Python / Bash: Automation & tooling

Career Path

We're a small team. The path from graduate to senior is real and faster than most places.

Start: Graduate Site Reliability Engineer
Year 1–2: Junior Site Reliability Engineer
Year 2–3: Senior SRE or Platform Eng
Year 3+: Staff SRE or Lead Engineer

Requirements

Must Have

A degree in Computer Science, Software Engineering, or a related technical discipline, or equivalent demonstrable self-taught fundamentals
Solid Linux fundamentals: CLI navigation, file systems, processes, networking basics
Comfort with scripting in Bash, Python, or similar: you've automated something, even if small
Understanding of core observability concepts: what metrics, logs, and traces are and what they tell you
Awareness of cloud fundamentals: you know what EC2, S3, VPCs, and load balancers do, even without production experience
Clear written and verbal communication: you'll be in customer-facing situations from early on
Right to work in the UK without sponsorship

Nice to Have

Any hands-on Datadog experience: trial, personal project, or university lab
Terraform or any infrastructure-as-code exposure
Docker or Kubernetes: even containerising a personal project counts
A cloud certification (AWS Cloud Practitioner, Azure Fundamentals, or equivalent)
Experience in a customer-facing environment, even outside tech
Any personal projects involving monitoring, automation, or infrastructure

How We Work

Four principles that show up in how we operate real infrastructure for real customers.

Stay Curious

The engineers who progress fastest here ask "why" about every system they touch. Why is this alert configured this way? Why is this runbook written like this? Curiosity about the infrastructure you're operating is how you grow from observer to owner.

Own the Problem

When you're assigned a task or an investigation, you see it through. You don't get stuck and go quiet; you ask, escalate, and update. We don't hand things off and hope. We take problems to resolution and prevent them next time.

Operate at Scale

We run multiple customer environments simultaneously. Everything you build or document has to be operable by anyone on the team: consistent, clear, maintainable. Build for the engineer picking it up without context.

Earn Trust by Delivering

Customers trust us with what's mission-critical. Every stable environment, every clean runbook, every resolved issue is how we earn and keep that trust. Consistency is the only currency that matters here.

Compensation & Benefits

£25–30k: Base salary, DOE
Remote-first: UK-based, async-friendly
Certs funded: Datadog, AWS, Azure & AI (contractual)
On-call paid: ~£5–6k/yr on top of salary

On-call allowance (in addition to base salary): SREs join a shared rota, typically one week in five or six, reducing as the team grows. Paid £500 per on-call week, which works out at roughly £5–6k a year on top of salary, varying with the rota size.

25 days' holiday plus bank holidays, and a paid day off to take in your birthday month
Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total
Enhanced maternity pay: 26 weeks at your full basic salary
Enhanced paternity pay: 2 weeks at your full basic salary
Datadog, AWS, and Azure certifications paid for by the company (contractual, not discretionary)
AI tooling certifications also funded; staying current is part of the role
Flexible working requests from your first day of employment: a statutory right, supported in full
Company-provided laptop and peripherals, set up before you start
Workplace pension, auto-enrolled

Who Thrives Here

We're a small, senior-heavy team and we hire graduates who want to operate at the level of someone two or three years ahead of where they are today. You'll be trusted early, expected to ask good questions, and supported by engineers who've done this before. The best fit is someone who's genuinely curious about how production systems break, who reads documentation properly, and who communicates proactively when something's unclear.

You don't need to know everything. You do need to be the kind of engineer who figures things out and who cares enough about reliability to want to prevent the same problem twice. We operate to ISO 27001, which means documentation, change control, and process discipline are part of the job. That's not bureaucracy. It's how you build things that stay working.

Join the pipeline

We're not actively hiring right now, but we keep applications on file. Tell us about something you've built or operated: what broke, what you learned, and what you'd do differently. Personal projects count. No templates.

Pipeline open · Cover letter required · Direct to founders

careers@criticalcloud.ai
