Critical Cloud · Careers

Junior Site Reliability Engineer

Pipeline open · UK Remote / Cardiff · Full-Time · Entry–Mid Level

Salary: £35–40k
Location: UK Remote
Stack: Datadog-native

Why Now

We're growing the customer base, building AI tooling to scale operations, and expanding across Europe. The SRE team is small and senior-heavy, so you get real infrastructure ownership from your first week, not ticket triaging. SREs join a shared on-call rota, typically one week in five or six, reducing as the team grows. Each on-call week is paid at £500, adding roughly £5–6k a year on top of salary.

About Us

We are the world's first "Powered by Datadog" accredited MSP, a Datadog-native cloud managed service provider built for European tech-led SMBs. Our founders have scaled and exited multiple technology businesses. We operate lean, move fast, and take observability seriously.

Critical Cloud delivers cloud operations across AWS and Azure through three commercial motions: Adopt, Optimise, and Manage. We're building the dominant Datadog-native managed service brand in Europe, and we want engineers who want to grow with us.

The Role

This is a ground-floor SRE role inside a fast-moving cloud MSP. You'll work directly with our senior engineers and founders, supporting real production environments for a portfolio of tech-led customers. Expect genuine exposure to Datadog, AWS/Azure, incident response, and infrastructure automation from day one, not ticket triaging.

We're looking for someone early in their career who has the fundamentals, the curiosity to go deep, and the communication skills to work directly with customers. You don't need to know everything. You need to be the kind of engineer who figures things out.

What You'll Do

Monitor, triage, and respond to alerts across customer AWS and Azure environments using Datadog as the primary observability platform
Participate in on-call rotations and support incident management workflows, including contributing to postmortem documentation
Assist with Datadog onboarding and instrumentation for new customers: infrastructure, APM, log management, dashboards, and SLOs
Support infrastructure-as-code work (Terraform) for provisioning, configuration, and change management across customer accounts
Write and maintain runbooks, escalation guides, and operational documentation to ISO 27001 standards
Collaborate with senior engineers on proactive reliability improvements: capacity reviews, alert tuning, dependency mapping
Contribute to the development of Critical Cloud's internal tooling and AI-assisted automation initiatives
Engage directly with customers on day-to-day operational queries with a clear, professional communication style

Tech Stack

Datadog: Core observability platform
AWS: Primary cloud, multi-account
Azure: Secondary cloud workloads
Terraform: Infrastructure as code
Kubernetes: Container orchestration
GitHub Actions: CI/CD pipelines
PagerDuty: Incident management
Python / Bash: Automation & tooling

Career Path

We're a small team. Progression is real and fast: you won't be waiting on a committee to notice you.

Start: Junior Site Reliability Engineer
Year 1–2: Site Reliability Engineer I
Year 2–3: Senior SRE or Platform Engineer
Year 3+: Staff SRE or Lead Engineer

Requirements

Must Have

Solid Linux fundamentals: CLI, networking, process management
Working knowledge of at least one cloud platform (AWS or Azure)
Comfort with scripting: Bash, Python, or similar
Understanding of core observability concepts: metrics, logs, traces
Clear written and verbal communication: you'll work with customers
Right to work in the UK without sponsorship

Nice to Have

Hands-on Datadog experience (any tier)
Terraform or other IaC tooling
Kubernetes or containerised workload exposure
Experience in an MSP or multi-customer environment
Familiarity with ISO 27001 or similar compliance frameworks
Any cloud certification (AWS, Azure, or Datadog)

How We Work

Four principles that show up in how we operate real infrastructure for real customers.

Own the Problem

On-call means on-call. When an alert fires on your rotation, you take it through to resolution and the postmortem. We don't hand things off and hope. We solve the problem and prevent it next time.

Stay Curious

The engineers who progress fastest here ask "why" about every system they touch. Why is this alert configured this way? Why is this runbook written like this? Curiosity about the infrastructure you're operating is how you grow from operator to engineer.

Operate at Scale

We run multiple customer environments simultaneously. Everything you build has to be operable by anyone on the team: documented, consistent, and maintainable. Build for the on-call engineer picking it up at 3am without context.

Earn Trust by Delivering

Customers trust us with what's mission-critical. Every stable environment, every resolved incident, every clean service review is how we earn that trust again. Consistency is the only currency that matters here.

Compensation & Benefits

£35–40k: Base salary, DOE
Remote-first: UK-based, async-friendly
Certs funded: Datadog, AWS, Azure & AI, contractual
On-call paid: ~£5–6k/yr on top of salary

On-call allowance (in addition to base salary): SREs join a shared rota, typically one week in five or six, reducing as the team grows. Paid £500 per on-call week, which works out at roughly £5–6k a year on top of salary, varying with the rota size.

25 days' holiday plus bank holidays, plus a paid day off in your birthday month, taken in the month it falls
Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total
Enhanced maternity pay: 26 weeks at your full basic salary
Enhanced paternity pay: 2 weeks at your full basic salary
Datadog, AWS, and Azure certifications paid for by the company: you need these certs to do the job, so funding them is a contractual obligation, not a discretionary budget
AI tooling certifications also funded: we're building AI-augmented operations, so staying current is part of the role
Flexible working requests from your first day of employment: a statutory right, supported in full
Company-provided laptop and peripherals, set up before you start
Workplace pension, auto-enrolled

Who Thrives Here

We're a small, senior-heavy team. You won't be managed closely. You'll be trusted and expected to own your work. The best fit is someone who treats production environments with respect, communicates proactively when something's wrong, and genuinely wants to understand the systems they're operating, not just keep the lights on.

We operate to ISO 27001 and take our IMS seriously. That means documentation, change control, and process discipline matter. If that sounds like constraint, this probably isn't the role. If it sounds like craft, read on.

Join the pipeline

We're not actively hiring right now, but we keep applications on file. The cover letter matters most: tell us why Critical Cloud, what draws you to reliability engineering, and what you've built or operated. No templates.

Pipeline open · Cover letter required · Direct to founders

careers@criticalcloud.ai
