Graduate Engineer: AI Tooling & Site Reliability

~50% AI Tooling + ~50% Site Reliability
Salary: £25–30k
Location: UK Remote
Level: Graduate
Hiring Now | Graduate Programme | UK Remote / Cardiff | Full-Time
About the Role
This isn't a rotation programme. From week one, you'll contribute to both tracks: shipping AI tooling that helps us run cloud operations better, and operating real production infrastructure for real customers. Two disciplines, one engineer, no silos.

Critical Cloud is the world's first "Powered by Datadog" certified partner, a Datadog-native cloud MSP built for European tech-led SMBs. We're building an internal AI platform (the Critical Cloud Platform) to automate and augment how we operate customer environments. This role sits at the centre of that programme.

Half your time will be engineering AI-assisted tooling: LLM integrations, agents, and automation workflows that reduce toil and improve our operational quality. The other half will be hands-on SRE work: monitoring, incident support, infrastructure-as-code, and customer-facing operations. Each half makes you better at the other.

What You'll Do
AI Tooling Track
  • Build and iterate on AI-assisted automation workflows using LLM APIs (Claude, OpenAI) integrated with cloud and observability tooling
  • Develop tooling for automated infrastructure discovery, customer onboarding, and operational runbook generation
  • Contribute to the Critical Cloud Platform: our internal AI governance framework and agent operating model
  • Design and implement MCP (Model Context Protocol) integrations connecting AI agents to Datadog, AWS, and Azure APIs
  • Write evaluation harnesses and regression tests to keep AI tool output reliable and auditable
  • Document AI system behaviour against our constitutional operating framework and ISO 27001 controls
Site Reliability Track
  • Monitor and triage alerts across customer AWS and Azure environments using Datadog as the primary observability platform
  • Participate in on-call rotations and incident response, contributing to postmortems and remediation work
  • Support Datadog onboarding for new customers: instrumentation, dashboards, monitors, and SLO configuration
  • Write and maintain Terraform modules for infrastructure provisioning and change management
  • Produce and maintain operational runbooks, escalation guides, and change records to ISO 27001 standards
  • Contribute SRE context back into AI tooling: you'll know what's worth automating because you've done it manually
Requirements
Must Have
  • A degree in Computer Science, Software Engineering, or a related technical field (2:1 or above)
  • Solid Python: comfortable writing scripts, working with APIs, and handling structured data
  • Familiarity with cloud fundamentals (AWS or Azure), ideally through coursework, personal projects, or a placement
  • Experience consuming REST APIs or LLM APIs, whether through a project, dissertation, or side work
  • Linux command-line confidence: networking basics, process management, file systems
  • Clear written communication: you'll be writing docs and talking to customers
Nice to Have
  • Hands-on LLM work: prompt engineering, tool use, agent frameworks, or evaluation pipelines
  • Terraform or any IaC tooling (even tutorials count)
  • Datadog experience, even a free tier account you've played with
  • Kubernetes or containerised workload exposure
  • Any cloud or AI certification (AWS, Azure, Google, or Datadog)
  • A GitHub profile with something worth showing us
Tech Stack
AI & Automation
Claude / Anthropic API: Primary LLM platform
MCP (Model Context Protocol): Agent–tool integration
Python: Tooling & automation
Observability & Cloud
Datadog: Core observability platform
AWS: Primary cloud, multi-account
Azure: Secondary cloud workloads
Terraform: Infrastructure as code
GitHub Actions: CI/CD pipelines
Kubernetes: Container orchestration
PagerDuty: Incident management
Career Path

We're a small team. Progression is real and fast, not managed by a committee.

Start: Graduate Engineer (AI & SRE)
Year 1–2: Engineer I (AI Platform / SRE)
Year 2–3: Engineer II (Specialise or Broaden)
Year 3+: Senior / Lead (Platform or SRE)
Compensation & Benefits
£25–30k: Base salary DOE
Remote-first: UK-based, async-friendly
Certs funded: AWS, Datadog, AI tooling
Who Thrives Here

You'll thrive here if you'd rather not choose between writing code and running infrastructure. You're curious about both and understand that each informs the other. You'll build AI tooling that automates real operational problems precisely because you've experienced those problems hands-on in the SRE track.

We operate to ISO 27001. Everything we build, including AI systems, has to be explainable, auditable, and consistent with our governance framework. If you care about building AI tools that are reliable, not just impressive demos, you'll fit right in.

This is an early career role, but we don't run it like one. You'll have genuine ownership, direct access to founders, and the chance to shape a platform that will define how Critical Cloud operates at scale.

Sound like you?

Send a cover letter and your CV. The cover letter matters most: tell us what draws you to both AI tooling and reliability engineering, and share what you've built, whether a project, a repo, a dissertation, or anything real.

Hiring now | Cover letter required | Direct to founders
careers@criticalcloud.ai