24×7 Cloud Managed Service, Powered by Datadog

Critical Support
24×7 incident management + improvement engineering.

Critical Support is our flagship managed service for tech-led SMBs on AWS and Azure. We own incident response 24×7 and deliver monthly improvement engineering across six pillars, so your platform becomes more reliable, secure, and cost-controlled over time, not just maintained.

  • 15 min: SEV-1 & SEV-2 response time
  • 24×7×365: Always-on coverage
  • AWS + Azure: Clouds supported
  • Powered by Datadog: World's first accredited MSP
Two services in one
Incident Management
24×7 detection → triage → response → escalation → recovery & review. 15-min response time for SEV-1 & SEV-2.
Improvement Engineering
16–56 hrs/month across reliability, security, cost, performance, automation & governance. Monthly reporting.
Three plans: Core · Standard · Advanced, each priced on scope. Talk to us for pricing.
Improvement Engineering

Six improvement pillars, delivered every month

Critical Support isn't just incident cover. Every month our engineers work through an agreed improvement backlog across six pillars, so the platform keeps improving rather than merely being maintained.

01

Reliability & Resilience

Failover design, redundancy improvements, early issue detection, and SLO/SLA management to reduce the frequency and impact of incidents.

02

Security & Compliance

Access control reviews, vulnerability management, threat detection operationalisation, and alignment to ISO 27001 and Cyber Essentials Plus.

03

Cost Optimisation & FinOps

Rightsizing, waste elimination, reserved instance and savings plan recommendations, and cost attribution to give teams financial ownership.

04

Performance & Scalability

Latency diagnosis, scaling policy improvements, database query optimisation, and capacity planning ahead of growth or traffic events.

05

Automation & Efficiency

Runbooks as code, IaC improvements, auto-remediation, and reducing the manual operational burden so engineers focus on what matters.

06

Governance & Observability

Tagging standards, Datadog dashboard quality, alerting hygiene, reporting cadence, and governance guardrails that scale with your platform.

Incident Management

Five-stage incident lifecycle

Every incident follows the same structured process, from first signal to blameless postmortem.

Stage 01

Monitoring

Datadog telemetry, alerting, and noise reduction keep signal quality high. Synthetic monitors and anomaly detection catch issues before customers report them.

Stage 02

Triage

Severity classification (SEV-1 to SEV-4), blast-radius assessment, and ownership assignment. Bits AI SRE assists our engineers; humans confirm before acting.

Stage 03

Response

Runbooks, safe workarounds, and rollback procedures executed by on-call engineers. Customer notified within the contracted response window.

Stage 04

Escalation

On-call routing, cloud-provider escalation, vendor coordination, and customer communication, all tracked in Datadog Incident Management.

Stage 05

Recovery & Review

Fix or rollback with validation, blameless RCA, and improvement actions fed back into the monthly engineering backlog. 60-minute recovery is a target for SEV-1.
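
For teams that want to mirror the five stages in their own tooling, here is a minimal sketch of the lifecycle as an ordered sequence. The `Stage` enum and the helper names are illustrative assumptions, not part of Datadog Incident Management or of our internal systems; the sketch only encodes the order described above.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The five lifecycle stages, in the order listed above (hypothetical names)."""
    MONITORING = 1
    TRIAGE = 2
    RESPONSE = 3
    ESCALATION = 4
    RECOVERY_AND_REVIEW = 5


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the final stage."""
    if current is Stage.RECOVERY_AND_REVIEW:
        return None
    return Stage(current.value + 1)


def walk_lifecycle() -> list:
    """Walk from first signal to review and return the stage names in order."""
    names = []
    stage: Optional[Stage] = Stage.MONITORING
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

Calling `walk_lifecycle()` returns the five stage names from `MONITORING` through `RECOVERY_AND_REVIEW`. A strictly linear order is a simplification chosen for clarity.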

Plans

Three plans: Core, Standard, Advanced

All plans include 24×7 SEV-1 and SEV-2 incident management with a 15-minute response time and a 60-minute recovery target. Plans differ by platform complexity and monthly improvement engineering hours. Talk to us for pricing.

| Feature | Core | Standard | Advanced |
|---|---|---|---|
| Coverage | 24×7 SEV-1 & SEV-2 | 24×7 SEV-1 & SEV-2 | 24×7 SEV-1 & SEV-2 |
| Response time | 15 min | 15 min | 15 min |
| Recovery target (SEV-1) | 60 min target | 60 min target | 60 min target |
| Improvement hours/month | 16 hrs | 32 hrs | 56 hrs |
| Cloud scope | Single cloud, 1 landing zone (hub + 1–2 spokes) | Single cloud, multiple landing zones / accounts | AWS and/or Azure, 5+ landing zones / hybrid |
| Improvement pillars covered | Reliability, security & cost | All six pillars | All six pillars, cross-cloud |
| Governance cadence | Monthly reporting | Fortnightly reporting | Weekly review + quarterly strategy |
| Runbooks & RCA | Core runbooks, standard reviews | Advanced playbooks, full RCA + automation | Custom cross-cloud workflows, postmortems |

Recovery time is a target, not a contractual guarantee. Response time (15 min for SEV-1 & SEV-2) is the contractual commitment.

SLAs & response

What we commit to, precisely

Response time is a firm contractual commitment. Recovery time is a target, because recovery depends on the nature of the incident, not just our speed.

  • 15-minute response time for SEV-1 and SEV-2 incidents. This is the contractual commitment, measured as time to first engineer contact, 24×7×365.
  • 60-minute recovery target for SEV-1. This is a target, not a guarantee: we work as fast as technically possible, but complex incidents take longer by nature.
  • SEV-1: complete outage or material risk to business. SEV-2: significant degradation or partial outage. SEV-3/4: limited impact, handled in-hours.
  • Blameless postmortem for all SEV-1 incidents. Findings feed the improvement backlog.
  • Incidents covered: service outages, performance degradation, security alerts (operational triage and containment, not SOC/MDR/forensics), integration/API failures, cloud provider incidents affecting your environment.
  • Customer retains: IAM and admin control, access approvals, and all business and release decisions. Material changes need customer sign-off.
  • AI is advisory: Bits AI SRE and Watchdog assist diagnosis; humans approve all production, security, and cost changes.
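
To make the difference between the response commitment and the recovery target concrete, here is a rough sketch of how the two timers could be checked for a single incident. The `Incident` class, its field names, and the choice to start both clocks at detection time are assumptions made for illustration; the contract defines the actual measurement points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESPONSE_COMMITMENT = timedelta(minutes=15)  # contractual, SEV-1 and SEV-2
RECOVERY_TARGET = timedelta(minutes=60)      # a target only, SEV-1


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: int                      # 1 to 4, as in SEV-1 to SEV-4
    detected_at: datetime              # assumed start of both clocks
    first_contact_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None


def response_deadline(incident: Incident) -> Optional[datetime]:
    """Deadline for first engineer contact; None for SEV-3/4 (handled in-hours)."""
    if incident.severity in (1, 2):
        return incident.detected_at + RESPONSE_COMMITMENT
    return None


def response_met(incident: Incident) -> Optional[bool]:
    """True or False once contact is recorded; None if no commitment applies yet."""
    deadline = response_deadline(incident)
    if deadline is None or incident.first_contact_at is None:
        return None
    return incident.first_contact_at <= deadline


def recovery_target_met(incident: Incident) -> Optional[bool]:
    """Compare SEV-1 recovery time with the 60-minute target; None otherwise."""
    if incident.severity != 1 or incident.recovered_at is None:
        return None
    return incident.recovered_at - incident.detected_at <= RECOVERY_TARGET
```

A SEV-1 incident with first contact at 12 minutes and recovery at 75 minutes would meet the response commitment while missing the recovery target. That is exactly the distinction the two figures are meant to express.
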
What a good cloud partner looks like

Five principles we hold ourselves to

Transparency

You keep access to your Datadog environment, your data, and your dashboards at all times. Nothing is hidden in a proprietary layer.

Ownership

When an incident fires, we own it to resolution, not to the first opportunity to hand it back. Accountability is the baseline.

Collaboration

Shared backlog, shared visibility. You see what we're working on and why. Service reviews are conversations, not status reports.

Integration

Improvement work is tied to reliability, security, and cost outcomes, not abstract platform activity. Everything maps to a business metric.

Enablement

Runbooks, standards, and knowledge stay in your environment after every engagement. You should be less dependent on us over time, not more.

Shared responsibility

You own your apps and decisions. We operate and improve. Hyperscalers provide the platform.

| Area | Customer | Critical Cloud | Cloud provider |
|---|---|---|---|
| Application code & data | Owns and controls | Supports, does not access data | N/A |
| Infrastructure provisioning (Terraform/IaC) | Approves changes | Implements and improves | N/A |
| Monitoring & observability (Datadog) | Has full access always | Builds, manages, optimises | N/A |
| Security, compliance & access control | Owns decisions & approvals | Operates controls, improves posture | Platform primitives |
| Incident management | Informed, approves resolution | Detects, triages, responds, recovers | Provider incident support |
| Global infrastructure & physical security | N/A | N/A | Owns and guarantees |

Getting started

Three pathways into Critical Support

Whichever path you take, the outcome is the same: 24×7 reliability, observability, and continuous improvement from day one.

Build & Operate

New platform or product

Design and provision using Terraform and best-practice landing zones, implement Datadog monitoring foundations, then transition directly into 24×7 Critical Support at go-live.

Migrate & Operate

Move from on-prem or another cloud

Plan and execute migration with minimal disruption, align to landing zone standards and Datadog instrumentation, then activate 24×7 incident management and improvement engineering immediately post-migration.

MSP Transfer

Take over from an existing provider

Review configuration, access, and governance for full transparency. Establish runbooks and Datadog observability baselines. Move onto the Critical Support model with a service review in week one.

How we're different

Critical Support vs hyperscaler support plans

AWS Business/Enterprise and Azure Unified/Developer support answer questions. Critical Support owns the environment.

| Capability | Hyperscaler support | Critical Support |
|---|---|---|
| Incident response | Advisory guidance; you action | We own response and recovery |
| Proactive engineering | Not included | 16–56 hrs/month across six pillars |
| Observability platform | Native CloudWatch / Azure Monitor only | Datadog across the full stack (infra, APM, logs, security, cost) |
| Who does the work | You, with vendor advice | Our SRE team, with your oversight |
| Runbooks & automation | You build and maintain | We build, own, and improve |
| Blameless postmortems | Not standard | Included for all SEV-1 incidents |
| Cloud scope | Single provider | AWS and Azure in one service |

Powered by Datadog

Datadog is the operational backbone, not a monitoring add-on

Every Critical Support customer has direct access to their own Datadog environment (infrastructure, APM, logs, traces, security signals, cloud cost, and LLM monitoring), all configured to their AWS and/or Azure architecture. You keep full visibility. We operate it.

Traditional MSPs rely on proprietary monitoring that limits customer insight. We use Datadog, the same platform our engineers use, in your account, visible to your team at all times.

Critical Cloud is the world's first Powered by Datadog accredited MSP.

Delivered work

Case studies

OPX, Azure + Critical Support

Full-stack observability via Datadog across OPX's Azure environment, combined with monthly improvement cycles. Incident noise reduced by more than 60%, with faster root-cause analysis through unified dashboards and alert tuning.


FAW / Hopp Studio, AWS + Critical Support

24×7 incident response plus proactive improvement for coaching systems and public websites. Tighter Datadog monitoring, quicker recovery, and improved resilience during high-traffic events.

Service family

Need more flexibility? The full cloud service family.

Critical Support is the flagship. If you need lighter cover or incident-response-only, we have options.

Incident response only

Critical Response

Detection, response, and recovery, without proactive engineering. Plans: Daytime, Evenings & Weekends, 24×7. For teams that want cover without the full managed service commitment.

Start-ups & single-cloud

Critical Support Lite

Right-sized incident cover plus a smaller improvement engineering allocation. Plans: Monitor + Fix, Engineer Assist, Partner Plus. Designed to grow into Critical Support.

Platform-specific

Critical Support by platform

The same 24×7 service written through an AWS or Azure lens, with platform-native tooling, architecture patterns, and SEO context for each hyperscaler.


FAQ

What is Critical Support?

Critical Support is Critical Cloud's flagship managed service: 24×7 incident management combined with monthly improvement engineering across six pillars (reliability, security, cost, performance, automation, governance) for AWS and Azure environments, with Datadog as the operational foundation.

What clouds do you support?

AWS and Azure. We do not currently support GCP.

What is your response time commitment?

15 minutes for SEV-1 and SEV-2 incidents; this is the contractual response time. The 60-minute recovery figure for SEV-1 is a target; recovery time depends on the nature of the incident.

What is the difference between Critical Support and Critical Support Lite?

Critical Support is the full flagship: 24×7 coverage, 15-minute SEV-1/SEV-2 response, and 16–56 hours of improvement engineering per month. Critical Support Lite is designed for start-ups and smaller single-cloud environments, with lighter coverage windows and fewer improvement hours but the same SRE-driven, Datadog-native model. Lite customers can step up to Critical Support as they grow.

Do you use AI in your operations?

Yes. Datadog's Bits AI SRE and Watchdog assist our engineers in triaging alerts and surfacing likely root causes. AI is advisory: humans approve all production, security, and cost changes.

Can I keep access to my Datadog environment?

Always. Every Critical Support customer retains direct, full-fidelity access to their own Datadog environment. Nothing is hidden in a proprietary layer. This is one of the five partner principles we hold ourselves to.

Ready to move on from reactive firefighting?

Tell us about your platform. We'll recommend the right plan and show you how Critical Support would work for your environment.
