Cloud Incident Response, Powered by Datadog

Critical Response
Rapid detection. Clear escalation. Fast recovery.

Critical Response is incident-response-only cloud cover for AWS and Azure. We detect, triage, respond, and recover, within the coverage window you choose. No proactive engineering. Just reliable, SRE-driven incident management when you need it.

SEV-1 response time: 15 min
Coverage options: Daytime · E&W · 24×7
Clouds covered: AWS + Azure
Powered by Datadog: full observability, always
Who it's for
  • Teams that need after-hours or 24×7 cover but already have in-house day-to-day ops capability
  • Businesses that want to supplement their in-house on-call without replacing it
  • Scale-ups that need weekend and overnight cover as they grow but aren't ready for a full managed service
Want proactive improvement too? See Critical Support or Critical Support Lite.
How it works

Five-stage incident lifecycle

Every incident follows the same structured process, from first signal in Datadog to blameless postmortem.

Stage 01

Monitoring

Datadog telemetry, synthetic monitors, and alert rules watch your environment continuously. Bits AI SRE helps surface signal from noise.

Stage 02

Triage

On-call engineer classifies severity (SEV-1–4), assesses blast radius, and confirms ownership. Customer notified immediately for SEV-1.

Stage 03

Response

Runbooks executed, safe workarounds applied, rollback procedures followed as appropriate. All actions documented in real time in Datadog Incident Management.

Stage 04

Escalation

On-call routing, cloud-provider escalation, vendor coordination, and stakeholder communications, managed by our engineers so yours can focus on the fix.

Stage 05

Recovery & Review

Validated recovery, blameless RCA, and a written summary. Recovery time is a target, not a guarantee: 60–120 min for SEV-1, depending on plan.

Incidents covered

What Critical Response handles

  • Service outages: complete or partial failures affecting end users or dependent services
  • Performance degradation: sustained latency spikes, elevated error rates, or throughput collapse
  • Security alerts: operational triage and containment only (not SOC/MDR/forensics)
  • Integration and API failures: broken upstream or downstream dependencies causing customer impact
  • Cloud provider incidents: AWS/Azure provider events that affect your environment, with response and workaround coordination
Plans

Three plans: Daytime, Evenings & Weekends, 24×7

Choose the coverage window that fills your gap. All plans use the same 5-stage lifecycle and Datadog-native tooling. Response times are contractual. Recovery times are targets. Talk to us for pricing.

Feature | Daytime | Evenings & Weekends | 24×7
Coverage hours | 09:00–17:00 Mon–Fri | 17:00–09:00 Mon–Fri + 24×7 weekends | Full 24×7×365
Severity covered | SEV-1 | SEV-1 & SEV-2 | SEV-1 & SEV-2
Response time | 15 min (SEV-1) | SEV-1: 15 min · SEV-2: 30 min | SEV-1: 15 min · SEV-2: 15 min
Recovery target (SEV-1) | 120-min target | 90-min target | 60-min target
Incident management time/month | 4 hrs | 4 hrs | 8 hrs
Monitored services | Up to 10 | Up to 10 | Up to 20
External endpoints monitored | 1 @ 5-min interval | 1 @ 2-min interval | 5 @ 1-min interval
Dashboards | Standard | Standard | Standard + 1 custom
Out-of-hours callouts | None | 2 per month | 4 per month
Runbooks & reports | Standard + monthly summary | Standard + monthly summary | Customised + detailed RCA trends

Recovery times are targets, not contractual guarantees. Response times (time to first engineer contact) are the contractual commitment.
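As a rough illustration of how the plan table translates into a coverage check, the sketch below encodes the three plans as data and computes the contracted first-contact deadline for an incident. It is not Critical Cloud's tooling. The names `Plan`, `PLANS`, `in_coverage_window`, and `response_deadline` are hypothetical, and it assumes local-time timestamps, no public-holiday handling, half-open windows (09:00 inclusive, 17:00 exclusive), and that Evenings & Weekends covers every hour outside the Daytime window.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass(frozen=True)
class Plan:
    name: str
    response_minutes: dict  # severity level -> contracted response time (minutes)
    sev1_recovery_target_minutes: int  # a target, not a guarantee


# Values copied from the plan table; the dictionary keys are illustrative.
PLANS = {
    "daytime": Plan("Daytime", {1: 15}, 120),
    "evenings_weekends": Plan("Evenings & Weekends", {1: 15, 2: 30}, 90),
    "24x7": Plan("24×7", {1: 15, 2: 15}, 60),
}


def in_business_hours(when: datetime) -> bool:
    """True for Mon-Fri, 09:00 <= t < 17:00 (boundary handling is an assumption)."""
    return when.weekday() < 5 and time(9, 0) <= when.time() < time(17, 0)


def in_coverage_window(plan_key: str, when: datetime) -> bool:
    """Whether `when` falls inside the plan's coverage hours."""
    if plan_key == "24x7":
        return True
    if plan_key == "daytime":
        return in_business_hours(when)
    if plan_key == "evenings_weekends":
        # Assumption: this plan is the exact complement of the Daytime window.
        return not in_business_hours(when)
    raise KeyError(plan_key)


def response_deadline(plan_key: str, severity: int, detected_at: datetime) -> Optional[datetime]:
    """Deadline for first engineer contact, or None if the incident is not covered."""
    minutes = PLANS[plan_key].response_minutes.get(severity)
    if minutes is None or not in_coverage_window(plan_key, detected_at):
        return None
    return detected_at + timedelta(minutes=minutes)
```

For example, under these assumptions a SEV-2 detected at 03:00 on a Saturday gets a 03:30 first-contact deadline on Evenings & Weekends and no deadline on Daytime, which covers neither that severity nor that hour.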

Severity model

Four severity levels: SEV-1 to SEV-4

Classification happens at triage. SEV-1, and SEV-2 where your plan covers it, trigger immediate response within contracted hours.

SEV-1 · Critical

Complete outage or material risk

Total service unavailability, data loss risk, or severe breach of contractual obligations. Immediate response, with a 15-min response time on all plans.

SEV-2 · High

Significant degradation or partial outage

Major feature failure, severe performance degradation, or partial loss of service affecting a significant number of users. Covered on E&W and 24×7 plans.

SEV-3 · Moderate

Limited impact, workaround available

Non-critical issues with a viable workaround. Handled during business hours. Not covered under Critical Response out-of-hours plans.

SEV-4 · Low

Informational or minor

Minor issues, informational alerts, or configuration questions. Handled in-hours. Not covered under Critical Response out-of-hours plans.
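The severity definitions above can be approximated as a small decision function. This is only a sketch of how triage signals might map to the four levels, not the classification logic the on-call engineers use. The `Signal` fields, the `SIGNIFICANT_USER_FRACTION` threshold, and the rule that a user-facing issue with no workaround escalates to SEV-2 are all assumptions made for illustration; the page itself defines the levels qualitatively.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: complete outage or material risk
    SEV2 = 2  # High: significant degradation or partial outage
    SEV3 = 3  # Moderate: limited impact, workaround available
    SEV4 = 4  # Low: informational or minor


@dataclass
class Signal:
    """Hypothetical triage inputs; none of these fields come from the source page."""
    total_outage: bool = False
    data_loss_risk: bool = False
    major_feature_down: bool = False
    affected_user_fraction: float = 0.0
    user_impact: bool = False
    workaround_available: bool = False


# Hypothetical cut-off for "a significant number of users".
SIGNIFICANT_USER_FRACTION = 0.10


def classify(signal: Signal) -> Severity:
    """Map triage signals to a severity level, checking the most severe case first."""
    if signal.total_outage or signal.data_loss_risk:
        return Severity.SEV1
    if signal.major_feature_down or signal.affected_user_fraction >= SIGNIFICANT_USER_FRACTION:
        return Severity.SEV2
    if signal.user_impact:
        # Assumption: without a viable workaround, limited impact is escalated to SEV-2.
        return Severity.SEV3 if signal.workaround_available else Severity.SEV2
    return Severity.SEV4
```

Checking the most severe conditions first means an incident that matches several definitions always lands on the highest applicable level, which is the conservative choice for a triage step.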

Shared responsibility

Clear ownership during incidents

Critical Cloud owns: detection, classification, response execution, escalation to vendors, stakeholder communication, and recovery validation. All documented in Datadog Incident Management.

  • We notify you at SEV-1 detection and at key recovery milestones.
  • Material changes (infrastructure, configuration) need your approval.

You own: application code, business continuity decisions, customer communications, and access approvals.

  • You retain full IAM and admin control at all times.
  • You have full, real-time access to your Datadog environment throughout any incident.
Want proactive improvement too?

Critical Support: incident management plus monthly engineering

Critical Response covers you when things break. Critical Support also improves things so they break less often, with monthly improvement engineering across six pillars: reliability, security, cost, performance, automation, and governance. If you're spending engineering time on reactive firefighting, Critical Support is built to change that.


FAQ

What is Critical Response?

Critical Response is an incident-response-only service for AWS and Azure: detection, triage, response, escalation, and recovery within the coverage window you choose (Daytime, Evenings and Weekends, or 24×7). There is no proactive improvement engineering; for that, see Critical Support or Critical Support Lite.

Does Critical Response include proactive engineering?

No. Critical Response is incident management only. Each plan includes a small allocation of incident management time for runbook maintenance and operational overhead, but no improvement engineering backlog. For proactive improvement, see Critical Support or Critical Support Lite.

What does "recovery target" mean?

Recovery targets (e.g. the 60-minute target for SEV-1 on 24×7) are working objectives we aim to meet. They reflect our operational capability and historic performance, but are not contractual guarantees, because complex incidents take longer by their nature. Response time (time to first engineer contact) is the contractual commitment.

What happens after an incident?

For all SEV-1 incidents: a blameless postmortem and written RCA summary, shared with you within the agreed timeframe. Findings can be fed into an improvement backlog if you're on Critical Support or Lite.

Need reliable cover for the hours that matter?

Tell us about your AWS or Azure environment and we'll recommend the right coverage window.
