Junior Site Reliability Engineer
Critical Cloud delivers cloud operations across AWS and Azure through three commercial motions: Adopt, Optimise, and Manage. We're building the dominant Datadog-native managed service brand in Europe, and we're looking for engineers who want to grow with us.
This is a ground-floor SRE role inside a fast-moving cloud MSP. You'll work directly with our senior engineers and founders, supporting real production environments for a portfolio of tech-led customers. Expect genuine exposure to Datadog, AWS/Azure, incident response, and infrastructure automation from day one, rather than ticket triage.
We're looking for someone early in their career who has the fundamentals, the curiosity to go deep, and the communication skills to work directly with customers. You don't need to know everything. You need to be the kind of engineer who figures things out.
What you'll do
- Monitor, triage, and respond to alerts across customer AWS and Azure environments using Datadog as the primary observability platform
- Participate in on-call rotations and support incident management workflows, including contributing to postmortem documentation
- Assist with Datadog onboarding and instrumentation for new customers: infrastructure, APM, log management, dashboards, and SLOs
- Support infrastructure-as-code work (Terraform) for provisioning, configuration, and change management across customer accounts
- Write and maintain runbooks, escalation guides, and operational documentation in line with ISO 27001 requirements
- Collaborate with senior engineers on proactive reliability improvements: capacity reviews, alert tuning, dependency mapping
- Contribute to the development of Critical Cloud's internal tooling and AI-assisted automation initiatives
- Handle customers' day-to-day operational queries directly, communicating clearly and professionally
What you'll need
- Solid Linux fundamentals: CLI, networking, process management
- Working knowledge of at least one cloud platform (AWS or Azure)
- Comfort with scripting: Bash, Python, or similar
- Understanding of core observability concepts: metrics, logs, traces
- Clear written and verbal communication: you'll work with customers
- Right to work in the UK without sponsorship
Nice to have
- Hands-on Datadog experience (any tier)
- Terraform or other IaC tooling
- Exposure to Kubernetes or other containerised workloads
- Experience in an MSP or multi-customer environment
- Familiarity with ISO 27001 or similar compliance frameworks
- Any cloud certification (AWS, Azure, or Datadog)
We're a small, senior-heavy team. You won't be managed closely. You'll be trusted and expected to own your work. The best fit is someone who treats production environments with respect, communicates proactively when something's wrong, and genuinely wants to understand the systems they're operating, not just keep the lights on.
We operate to ISO 27001 and take our IMS seriously. That means documentation, change control, and process discipline matter. If that sounds like constraint, this probably isn't the role. If it sounds like craft, read on.
Ready to apply?
Send a cover letter and your CV. The cover letter matters most: tell us why Critical Cloud, what draws you to reliability engineering, and what you've built or operated. No templates.