
Site Reliability Engineer (SRE)

Operate and improve AWS/Azure platforms under Critical Support, using Datadog as the operational foundation for unified observability, incident response, and continuous improvement.

Datadog (essential), AWS, Azure, Incident response, SLOs, Terraform


Location

Cardiff / London / Dublin (hybrid)

Employment

Full-time

Level

Experienced (SRE / Platform / DevOps)

Team

Cloud Operations (Datadog-powered CMSP)

Role overview

This role is for someone who enjoys operating real production systems and making them better every week — not just keeping the lights on.

You’ll work across AWS and Azure environments for tech-led customers. Datadog is the backbone: metrics, logs, traces, security signals, cloud cost insights, and alerting all live in one place, and both we and the customer operate from that shared view.

You’ll be part of an on-call rotation and you’ll also deliver improvement engineering — automation, guardrails, and tuning — so platforms become more reliable, secure, and cost-controlled over time.

What you’ll do

The day-to-day responsibilities of the role.

Own incident response from detection to resolution: triage, escalation, comms, and follow-up.
Build, tune, and maintain Datadog monitors, dashboards, SLOs, log pipelines, APM traces, and alert routing.
Reduce noise and improve signal quality: tag strategy, service catalog alignment, alert thresholds, and runbook links.
Deliver monthly improvement work: reliability hardening, cost optimisation, security guardrails, and automation.
Write clear runbooks and playbooks; improve operational readiness (testing, game-days, drills).
Collaborate with customer engineers and leadership to align priorities and explain trade-offs clearly.
Contribute to platform standards (IaC patterns, monitoring baselines, incident process).

What you’ll bring

Datadog skills are essential for this role.

Must-have

Hands-on Datadog experience in production (monitors, dashboards, logs/APM, alerting workflows).
Experience operating cloud platforms (AWS and/or Azure) with strong fundamentals in networking, Linux, and IAM.
Comfort with incident management and post-incident review (blameless RCA, action tracking).
Infrastructure-as-code experience (Terraform preferred) and comfort with scripting/automation.
Clear written communication — you can explain incidents and changes without jargon.

Nice-to-have

Experience operating both AWS and Azure in production.
Kubernetes or container platform experience.
FinOps / cloud cost tooling experience (Datadog Cloud Cost Management, AWS Cost Explorer, Azure Cost Management).
Security tooling experience (SIEM signals, CSPM concepts, Datadog Security Monitoring).
Datadog certifications or proven contributions to observability standards.

What success looks like (first 90 days)

You can confidently navigate our Datadog baselines and understand how we structure services, tags, monitors, and on-call.
You’ve taken ownership of incidents end-to-end and contributed to clear, useful RCA outputs.
You’ve shipped at least one meaningful improvement (automation, alert tuning, reliability hardening, or cost optimisation) with measurable impact.
Customers trust your comms and your judgement during real operational events.

How we hire

A simple process designed to respect your time.

Intro call

Alignment on the role, expectations, and what you’re looking for.

Technical conversation

Real scenarios: Datadog signals, incidents, systems, trade-offs.

Practical exercise

A realistic task. No long take-home marathons.

Meet the team

Working style fit, then we move quickly.

Equal opportunities

Critical Cloud is an equal opportunity employer. We value diverse perspectives and are committed to creating an inclusive environment for everyone.

Apply for this role

Email us your CV and a short note on why this role fits you. We’ll get back to you as soon as we can.
