Critical Response
Rapid detection. Clear escalation. Fast recovery.
Critical Response is incident-response-only cloud cover for AWS and Azure. We detect, triage, respond, and recover, within the coverage window you choose. No proactive engineering. Just reliable, SRE-driven incident management when you need it.
- Teams that need after-hours or 24×7 cover but already have in-house day-to-day ops capability
- Businesses that want to supplement their in-house on-call without replacing it
- Scale-ups that need weekend and overnight cover as they grow but aren't ready for a full managed service
Five-stage incident lifecycle
Every incident follows the same structured process, from first signal in Datadog to blameless postmortem.
Monitoring
Datadog telemetry, synthetic monitors, and alert rules watch your environment continuously. Bits AI SRE helps surface signal from noise.
Triage
On-call engineer classifies severity (SEV-1–4), assesses blast radius, and confirms ownership. Customer notified immediately for SEV-1.
Response
Runbooks executed, safe workarounds applied, rollback procedures followed as appropriate. All actions documented in real time in Datadog Incident Management.
Escalation
On-call routing, cloud-provider escalation, vendor coordination, and stakeholder communications, managed by our engineers so yours can focus on the fix.
Recovery & Review
Validated recovery, blameless RCA, and a written summary. Recovery time is a target, not a guarantee: 60–120 minutes for SEV-1, depending on plan.
What Critical Response handles
- Service outages: complete or partial failures affecting end users or dependent services
- Performance degradation: sustained latency spikes, elevated error rates, or throughput collapse
- Security alerts: operational triage and containment only (not SOC/MDR/forensics)
- Integration and API failures: broken upstream or downstream dependencies causing customer impact
- Cloud provider incidents: AWS/Azure provider events that affect your environment, with response and workaround coordination
Three plans: Daytime, Evenings & Weekends, 24×7
Choose the coverage window that fills your gap. All plans use the same 5-stage lifecycle and Datadog-native tooling. Response times are contractual. Recovery times are targets. Talk to us for pricing.
| Feature | Daytime | Evenings & Weekends | 24×7 |
|---|---|---|---|
| Coverage hours | 09:00–17:00 Mon–Fri | 17:00–09:00 Mon–Fri + 24×7 weekends | Full 24×7×365 |
| Severity covered | SEV-1 | SEV-1 & SEV-2 | SEV-1 & SEV-2 |
| Response time | SEV-1: 15 min | SEV-1: 15 min · SEV-2: 30 min | SEV-1: 15 min · SEV-2: 15 min |
| Recovery target (SEV-1) | 120-min target | 90-min target | 60-min target |
| Incident management time/month | 4 hrs | 4 hrs | 8 hrs |
| Monitored services | Up to 10 | Up to 10 | Up to 20 |
| External endpoints monitored | 1 @ 5-min interval | 1 @ 2-min interval | 5 @ 1-min interval |
| Dashboards | Standard | Standard | Standard + 1 custom |
| Out-of-hours callouts | None | 2 per month | 4 per month |
| Runbooks & reports | Standard + monthly summary | Standard + monthly summary | Customised + detailed RCA trends |
Recovery times are targets, not contractual guarantees. Response times (time to first engineer contact) are the contractual commitment.
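As a rough illustration of how the three coverage windows fit together, the sketch below checks whether a given moment falls inside a plan's window. It is a minimal sketch, not Critical Cloud tooling. The plan keys, the function names, the treatment of Evenings & Weekends as everything outside Daytime hours, the handling of the 09:00 and 17:00 boundaries, and the use of naive local time with no public-holiday calendar are all assumptions.

```python
from datetime import datetime, time

# Daytime window from the plan table: 09:00-17:00, Monday to Friday.
DAY_START = time(9, 0)
DAY_END = time(17, 0)


def in_business_hours(moment: datetime) -> bool:
    """True if moment is Mon-Fri and 09:00 <= time < 17:00 (boundary choice is an assumption)."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and DAY_START <= moment.time() < DAY_END


def in_coverage(plan: str, moment: datetime) -> bool:
    """Check whether moment falls inside the coverage window of a plan (hypothetical plan keys)."""
    if plan == "24x7":
        return True
    if plan == "daytime":
        return in_business_hours(moment)
    if plan == "evenings_weekends":
        # Assumption: this plan covers exactly the hours that Daytime does not.
        return not in_business_hours(moment)
    raise ValueError(f"unknown plan: {plan}")
```

Under these assumptions, Daytime and Evenings & Weekends never overlap, and together they span the full week.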
Four severity levels: SEV-1 to SEV-4
Classification happens at triage. SEV-1 incidents, and SEV-2 incidents on plans that cover them, trigger immediate response within contracted hours.
SEV-1: Complete outage or material risk
Total service unavailability, data loss risk, or severe breach of contractual obligations. Immediate response, with a 15-minute response time on all plans.
SEV-2: Significant degradation or partial outage
Major feature failure, severe performance degradation, or partial loss of service affecting a significant number of users. Covered on E&W and 24×7 plans.
SEV-3: Limited impact, workaround available
Non-critical issues with a viable workaround. Handled during business hours. Not covered under Critical Response out-of-hours plans.
SEV-4: Informational or minor
Minor issues, informational alerts, or configuration questions. Handled in-hours. Not covered under Critical Response out-of-hours plans.
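The severity and plan rules can be summarised as a simple lookup. The sketch below is illustrative only: the response times come from the plan table, while the dictionary layout, the plan keys, and the `response_deadline` function are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Contractual response times in minutes, copied from the plan table.
# A severity missing from a plan means that plan does not cover it.
RESPONSE_MINUTES = {
    "daytime": {"SEV-1": 15},
    "evenings_weekends": {"SEV-1": 15, "SEV-2": 30},
    "24x7": {"SEV-1": 15, "SEV-2": 15},
}


def response_deadline(plan: str, severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time for first engineer contact, or None if the plan does not cover the severity."""
    minutes = RESPONSE_MINUTES[plan].get(severity)
    if minutes is None:
        return None
    return detected_at + timedelta(minutes=minutes)
```

For example, a SEV-2 detected at 22:00 on the Evenings & Weekends plan would have a first-contact deadline of 22:30, while the same incident on the Daytime plan returns `None` because SEV-2 is not covered there.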
Clear ownership during incidents
Critical Cloud owns: detection, classification, response execution, escalation to vendors, stakeholder communication, and recovery validation. All documented in Datadog Incident Management.
- We notify you at SEV-1 detection and at key recovery milestones.
- Material changes (infrastructure, configuration) need your approval.
You own: application code, business continuity decisions, customer communications, and access approvals.
- You retain full IAM and admin control at all times.
- You have full, real-time access to your Datadog environment throughout any incident.
Critical Support: incident management plus monthly engineering
Critical Response covers you when things break. Critical Support also improves things so they break less often, with monthly improvement engineering across six pillars: reliability, security, cost, performance, automation, and governance. If you're spending engineering time on reactive firefighting, Critical Support is built to change that.
FAQ
What is Critical Response?
Critical Response is an incident-response-only service for AWS and Azure: detection, triage, response, escalation, and recovery within the coverage window you choose (Daytime, Evenings and Weekends, or 24×7). There is no proactive improvement engineering; for that, see Critical Support or Critical Support Lite.
Does Critical Response include proactive engineering?
No. Critical Response is incident management only. Each plan includes a small allocation of incident management time for runbook maintenance and operational overhead, but no improvement engineering backlog. For proactive improvement, see Critical Support or Critical Support Lite.
What does "recovery target" mean?
Recovery targets (e.g. the 60-minute target for SEV-1 on 24×7) are working objectives we aim to meet. They reflect our operational capability and historical performance, but they are not contractual guarantees: complex incidents take longer by their nature. Response time (time to first engineer contact) is the contractual commitment.
What happens after an incident?
For all SEV-1 incidents, you receive a blameless postmortem and a written RCA summary, shared with you within the agreed timeframe. Findings can be fed into an improvement backlog if you're on Critical Support or Critical Support Lite.
Need reliable cover for the hours that matter?
Tell us about your AWS or Azure environment and we'll recommend the right coverage window.