The five-stage incident lifecycle
Good incident response is not a single action. It is a structured sequence of stages, each with a clear owner, a defined output, and a handoff into the next. Understanding the stages helps you evaluate whether a provider's process is real or just a sales claim.
Stage 01: Monitoring
Continuous telemetry from the environment - metrics, logs, traces, synthetic checks - processed through alerting rules and anomaly detection. The goal is to catch issues before customers report them. This stage is not passive: alert quality requires ongoing tuning. Noisy alerts lead to alert fatigue, which leads to missed incidents. Good monitoring means the right alerts fire for the right reasons, with enough context for the on-call engineer to start acting without first digging for basic facts.
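One way to make "alert quality" measurable is to track how often each alerting rule fires against how often it actually required action. The sketch below is a minimal illustration, not a feature of any monitoring product: the `AlertRecord` type, the `actionable` flag (assumed to be set during alert review) and both thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRecord:
    rule: str         # name of the alerting rule that fired (hypothetical)
    actionable: bool  # True if the on-call engineer had to act on it


def noisy_rules(alerts, min_fired=5, max_actionable_ratio=0.5):
    """Return rules that fire often but rarely require action."""
    fired = Counter(a.rule for a in alerts)
    acted = Counter(a.rule for a in alerts if a.actionable)
    return sorted(
        rule
        for rule, count in fired.items()
        if count >= min_fired and acted[rule] / count < max_actionable_ratio
    )


# Illustrative data: one rule is mostly noise, the other is always actionable.
alerts = (
    [AlertRecord("disk-usage-warning", False)] * 9
    + [AlertRecord("disk-usage-warning", True)]
    + [AlertRecord("checkout-error-rate", True)] * 6
)
print(noisy_rules(alerts))  # ['disk-usage-warning']
```

Rules that show up in this list are candidates for tuning: a tighter threshold, a longer evaluation window, or removal.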
Stage 02: Triage
When an alert fires, the first job is classification. What is the severity? What is the blast radius? Who owns this? The on-call engineer (assisted by Datadog's Bits AI SRE in our case) assesses the alert, assigns a SEV level, and determines whether to escalate or handle within the standard response workflow. This stage happens fast - for a SEV-1, triage should complete in minutes, not tens of minutes.
Stage 03: Response
The on-call engineer executes the response: runbook steps, safe workarounds, rollback procedures where appropriate. The customer is notified with a clear status update. Every action taken is documented in real time. The response stage ends when the immediate impact is mitigated or contained - not necessarily when the root cause is fully understood or the permanent fix is in place.
Stage 04: Escalation
Complex incidents require escalation: additional on-call engineers, cloud provider support channels, third-party vendors, or customer stakeholders. The escalation path should be documented in advance, not improvised in the moment. For SEV-1 incidents, the escalation path should include named individuals and defined communication cadences - not just "we will call you."
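A documented escalation path can be as simple as a table of who is contacted, when, and how, plus a customer update cadence. The following is a sketch under stated assumptions: the names, timings and channels are placeholders, not a real rota or a recommended schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was declared
    contact: str        # a named individual, not a team alias
    channel: str        # how they are reached


# Hypothetical SEV-1 path; every name and timing here is a placeholder.
SEV1_PATH = [
    EscalationStep(0, "Alex Example (primary on-call)", "page"),
    EscalationStep(15, "Sam Sample (secondary on-call)", "page"),
    EscalationStep(30, "Jo Placeholder (customer incident contact)", "phone"),
]


def contacts_due(path, elapsed_minutes):
    """Everyone who should have been contacted by this point in the incident."""
    return [s.contact for s in path if s.after_minutes <= elapsed_minutes]


def update_times(cadence_minutes, elapsed_minutes):
    """Minutes since declaration at which customer updates were due."""
    return list(range(cadence_minutes, elapsed_minutes + 1, cadence_minutes))


print(contacts_due(SEV1_PATH, 20))
print(update_times(30, 95))  # [30, 60, 90]
```

The point of writing it down this way is that anyone can check, during or after an incident, whether the people and updates that were due actually happened.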
Stage 05: Recovery and review
Once service is restored and validated, the incident enters review. A blameless postmortem identifies contributing factors, not culprits. The output is a set of improvement actions - runbook updates, monitoring changes, architectural recommendations - that feed back into the improvement backlog. Incidents without postmortems are far more likely to recur.
The SEV model: why severity classification matters
Not all incidents are equal, and treating them as if they are leads to either overreaction (everything is an emergency, on-call burns out) or underreaction (genuine outages are handled with insufficient urgency). A severity model is the mechanism that keeps the response in proportion to the impact.
The model we use:
- SEV-1: Critical. Complete outage, data loss risk, or material business impact. Immediate response, full escalation, continuous customer updates until resolved.
- SEV-2: High. Significant degradation or partial outage affecting a meaningful number of users. Active response within contracted window, regular customer updates.
- SEV-3: Moderate. Limited impact, workaround available, affecting a subset of users or non-critical functionality. Handled in business hours.
- SEV-4: Low. Informational, minor issues, configuration questions. Addressed in normal service review cycles.
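The definitions above can be turned into an explicit classification rule so that triage does not depend on who is on call. This is a minimal sketch: the `Impact` fields and the 20% "meaningful number of users" threshold are assumptions made for illustration, and a real model would be agreed per environment.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sev(IntEnum):
    SEV1 = 1  # critical
    SEV2 = 2  # high
    SEV3 = 3  # moderate
    SEV4 = 4  # low


@dataclass(frozen=True)
class Impact:
    complete_outage: bool = False
    data_loss_risk: bool = False
    material_business_impact: bool = False
    user_facing: bool = False         # any user-visible degradation
    affected_user_share: float = 0.0  # 0.0 to 1.0


def classify(impact, significant_share=0.2):
    """Map an impact assessment to a SEV level (threshold is an assumption)."""
    if impact.complete_outage or impact.data_loss_risk or impact.material_business_impact:
        return Sev.SEV1
    if impact.user_facing and impact.affected_user_share >= significant_share:
        return Sev.SEV2
    if impact.user_facing:
        return Sev.SEV3
    return Sev.SEV4


print(classify(Impact(user_facing=True, affected_user_share=0.4)).name)  # SEV2
```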
The SEV model matters in two specific ways when evaluating a provider. First, which severity levels do they cover out of hours? A provider that only covers SEV-1 out of hours is making a different commitment than one covering SEV-1 and SEV-2. For most production environments, SEV-2 incidents can cause real customer impact, and waiting until business hours to address them is not acceptable. Second, what are the response time commitments per severity? These should be specific and contractual, not approximate.
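To see what the out-of-hours difference means in practice, it helps to model two providers' commitments and ask what happens to a SEV-2 at 03:00 on a Saturday. The sketch below uses illustrative response times and assumed business hours (Monday to Friday, 09:00 to 17:30); none of the numbers come from a real contract.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class SevCommitment:
    response_minutes: Optional[int]  # contractual response time; None = none stated
    out_of_hours: bool               # True if covered outside business hours


# Illustrative numbers only, not taken from any real contract.
SEV1_ONLY = {
    "SEV-1": SevCommitment(15, True),
    "SEV-2": SevCommitment(60, False),
}
SEV1_AND_SEV2 = {
    "SEV-1": SevCommitment(15, True),
    "SEV-2": SevCommitment(30, True),
}


def in_business_hours(when, start=time(9, 0), end=time(17, 30)):
    """Assumed business hours: Monday to Friday, 09:00 to 17:30 local time."""
    return when.weekday() < 5 and start <= when.time() < end


def response_clock_running(policy, sev, when):
    """True if the contract obliges the provider to respond at this moment."""
    commitment = policy[sev]
    return commitment.out_of_hours or in_business_hours(when)


saturday_3am = datetime(2024, 1, 6, 3, 0)  # a Saturday
print(response_clock_running(SEV1_ONLY, "SEV-2", saturday_3am))      # False
print(response_clock_running(SEV1_AND_SEV2, "SEV-2", saturday_3am))  # True
```

Under the first policy, the same SEV-2 waits until Monday morning before any response commitment applies.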
Response time vs recovery time: the distinction that matters most
This is the area where provider claims are most frequently misleading.
Response time is the time from alert to first engineer contact - the time to acknowledge and begin investigation. This is a contractual commitment that a provider can reliably make, because it is entirely within their control. A 15-minute response time means an engineer has acknowledged the alert and begun triage within 15 minutes.
Recovery time is the time from incident detection to full service restoration. This is not reliably within the provider's control - it depends on the nature of the incident, the underlying infrastructure, whether the cloud provider itself is experiencing an issue, whether a rollback is required, and dozens of other variables. Any provider claiming a guaranteed recovery time for a generic managed service should be questioned carefully.
The right framing is: response time is a commitment; recovery time is a target. A serious provider will express a recovery target (for example, 60 minutes for a SEV-1 on a full 24/7 plan) and will work to meet it, while being honest that complex incidents can take longer. A provider that guarantees a recovery time for every SEV-1 incident is either operating a very narrow scope or overstating what it can deliver.
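The two measurements are easy to compute separately from incident timestamps, which is a useful check on any provider report that blends them. A minimal sketch follows, assuming the alert time and the detection time are the same moment; the field names are hypothetical, and the 15-minute and 60-minute figures are the examples used above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimes:
    detected_at: datetime      # alert fired (assumed equal to detection)
    acknowledged_at: datetime  # engineer acknowledged and began triage
    restored_at: datetime      # service fully restored


def response_time(t):
    return t.acknowledged_at - t.detected_at


def recovery_time(t):
    return t.restored_at - t.detected_at


def evaluate(t, response_commitment, recovery_target):
    """Check the contractual commitment and the recovery target separately."""
    return {
        "response_commitment_met": response_time(t) <= response_commitment,
        "recovery_target_met": recovery_time(t) <= recovery_target,
    }


incident = IncidentTimes(
    detected_at=datetime(2024, 1, 6, 2, 0),
    acknowledged_at=datetime(2024, 1, 6, 2, 11),
    restored_at=datetime(2024, 1, 6, 3, 25),
)
print(evaluate(incident, timedelta(minutes=15), timedelta(minutes=60)))
```

In this illustrative incident the response commitment is met (11 minutes) while the recovery target is missed (85 minutes), which is exactly the case a single blended "resolution time" figure would hide.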
What to demand from a provider
These are the specific commitments and behaviours that separate structured incident response from informal support. Ask for each of these explicitly, and ask for the contractual document rather than the sales deck.
Contractual response times by severity, 24/7
Not "around 15 minutes" or "typically within the hour." A contractual commitment, expressed as a number, for SEV-1 and SEV-2, for all coverage hours. If the contract says business hours only, that is not 24/7 incident response regardless of what the marketing says.
A named severity model with clear definitions
Ask the provider to define their SEV levels in writing. If they cannot, they do not have a consistent severity model, which means the response to any given incident depends on whoever is on call that night rather than a defined process.
Blameless postmortems for SEV-1 incidents
A written postmortem - timeline, contributing factors, actions taken, improvement actions - delivered after every SEV-1. This is a minimum. Without it, you have no audit trail for regulatory purposes and no mechanism to improve the environment based on what incidents reveal.
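The four sections named above can be treated as a checklist when a postmortem arrives. The sketch below is an assumption about how one might represent that, not a standard template; the `Postmortem` type and the incident identifier are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    incident_id: str
    timeline: list = field(default_factory=list)
    contributing_factors: list = field(default_factory=list)
    actions_taken: list = field(default_factory=list)
    improvement_actions: list = field(default_factory=list)


def missing_sections(pm):
    """Names of required sections that are still empty."""
    return [
        f.name
        for f in fields(pm)
        if f.name != "incident_id" and not getattr(pm, f.name)
    ]


# Illustrative draft: everything filled in except the improvement actions.
draft = Postmortem(
    incident_id="INC-EXAMPLE-1",
    timeline=["02:00 alert fired", "02:11 acknowledged", "03:25 restored"],
    contributing_factors=["connection pool limit too low for peak load"],
    actions_taken=["raised pool limit", "restarted affected service"],
)
print(missing_sections(draft))  # ['improvement_actions']
```

A postmortem with no improvement actions documents what happened but gives nothing back to the backlog.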
Your access to the monitoring environment
You should be able to see the same Datadog dashboards and alert history that your provider is looking at. Not a portal version, not a summarised view - the same environment. If a provider is not willing to give you direct access to the observability layer, ask why. The answer will be instructive.
Runbooks: the test of operational maturity
Ask to see a sample runbook. Not a list of runbook titles - an actual runbook. How specific is it? Does it have clear decision points, specific commands, and expected outcomes? A vague runbook ("investigate the issue and escalate if needed") is effectively no runbook. A real runbook reflects accumulated operational knowledge about your specific environment.
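The difference between a real runbook and a vague one can be made concrete: a usable step names a command, an expected outcome, and what to do if the outcome differs. The following sketch checks for exactly that. The `RunbookStep` structure and the example steps are hypothetical, chosen only to illustrate the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookStep:
    title: str
    command: str = ""        # exact command to run
    expected: str = ""       # what a healthy result looks like
    if_unexpected: str = ""  # decision point: what to do otherwise


def vague_steps(steps):
    """Titles of steps missing a command, an expected outcome, or a decision."""
    return [
        s.title
        for s in steps
        if not (s.command and s.expected and s.if_unexpected)
    ]


# Hypothetical example for a host running nginx under systemd.
specific = [
    RunbookStep(
        "Check web server state",
        command="systemctl is-active nginx",
        expected="prints 'active'",
        if_unexpected="continue to the restart step",
    ),
    RunbookStep(
        "Restart web server",
        command="sudo systemctl restart nginx",
        expected="exit code 0, then is-active prints 'active'",
        if_unexpected="escalate to the secondary on-call",
    ),
]
vague = [RunbookStep("Investigate the issue and escalate if needed")]

print(vague_steps(specific))  # []
print(vague_steps(vague))     # ['Investigate the issue and escalate if needed']
```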
The relationship between incident response and improvement engineering
Incident response is the reactive half of a good managed service. The proactive half is improvement engineering: the monthly work that makes incidents less frequent and less severe over time.
Without improvement engineering, a managed service is a maintenance arrangement - you are paying someone to respond to the same problems indefinitely. With it, each incident feeds back into a backlog of improvements: alert tuning, architecture changes, runbook updates, cost optimisations. The environment gets better month by month, and the incident rate over time reflects that.
For businesses evaluating whether they need incident-response-only coverage or a fuller managed service, the question is how much operational debt the environment carries. A new, well-architected environment with minimal technical debt may do fine with incident-response-only coverage in the short term. An older or more complex environment typically needs the improvement engineering cadence to keep reactive work from spiralling.
The test: ask the provider how many SEV-1 incidents the same environment had in months one, six, and twelve of their engagement. If the number is not declining, the improvement engineering is not working or is not happening. A well-operated managed service should produce a measurable reduction in incident frequency over time.
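The test can be stated as a simple check on the three numbers the provider gives you. This is a sketch with one assumed definition of "declining": the count never rises between checkpoints and ends lower than it started. The example counts are illustrative, not real data.

```python
def is_declining(counts):
    """True if counts never rise between checkpoints and end below the start."""
    never_rises = all(
        later <= earlier for earlier, later in zip(counts, counts[1:])
    )
    return never_rises and counts[-1] < counts[0]


# SEV-1 counts at months one, six, and twelve (illustrative numbers).
print(is_declining([6, 3, 1]))  # True
print(is_declining([4, 4, 5]))  # False
print(is_declining([2, 2, 2]))  # False
```

Three checkpoints are a small sample, so a flat or noisy result is a prompt for a conversation about the improvement backlog rather than proof on its own.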