Critical Support is our flagship managed service for tech-led SMBs on AWS and Azure. We own incident response 24×7 and deliver monthly improvement engineering across six pillars, so your platform becomes more reliable, secure, and cost-controlled over time, not just maintained.
Critical Support is more than incident cover. Every month our engineers work through an agreed improvement backlog across six pillars, so the platform improves rather than simply being maintained.
Reliability & Resilience
Failover design, redundancy improvements, early issue detection, and SLO/SLA management to reduce the frequency and impact of incidents.
Security & Compliance
Access control reviews, vulnerability management, threat detection operationalisation, and alignment to ISO 27001 and Cyber Essentials Plus.
Cost Optimisation & FinOps
Rightsizing, waste elimination, reserved instance and savings plan recommendations, and cost attribution to give teams financial ownership.
Performance & Scalability
Latency diagnosis, scaling policy improvements, database query optimisation, and capacity planning ahead of growth or traffic events.
Automation & Efficiency
Runbooks as code, IaC improvements, auto-remediation, and reducing the manual operational burden so engineers focus on what matters.
Governance & Observability
Tagging standards, Datadog dashboard quality, alerting hygiene, reporting cadence, and governance guardrails that scale with your platform.
Every incident follows the same structured process, from first signal to blameless postmortem.
Monitoring
Datadog telemetry, alerting, and noise reduction keep signal quality high. Synthetic monitors and anomaly detection catch issues before customers report them.
Triage
Severity classification (SEV-1–4), blast-radius assessment, and ownership assignment. Bits AI SRE assists our engineers; humans confirm before any action is taken.
Response
Runbooks, safe workarounds, and rollback procedures executed by on-call engineers. Customer notified within the contracted response window.
Escalation
On-call routing, cloud-provider escalation, vendor coordination, and customer communication, all tracked in Datadog Incident Management.
Recovery & Review
Fix or rollback with validation, blameless RCA, and improvement actions fed back into the monthly engineering backlog. The 60-minute recovery figure for SEV-1 is a target, not a guarantee.
All plans include 24×7 SEV-1 and SEV-2 incident management with a 15-minute response time and a 60-minute recovery target for SEV-1. Plans differ by platform complexity and monthly improvement engineering hours. Talk to us for pricing.
| Feature | Core | Standard | Advanced |
|---|---|---|---|
| Coverage | 24×7 SEV-1 & SEV-2 | 24×7 SEV-1 & SEV-2 | 24×7 SEV-1 & SEV-2 |
| Response time | 15 min | 15 min | 15 min |
| Recovery target (SEV-1) | 60 min target | 60 min target | 60 min target |
| Improvement hours/month | 16 hrs | 32 hrs | 56 hrs |
| Cloud scope | Single cloud, 1 landing zone (hub + 1–2 spokes) | Single cloud, multiple landing zones / accounts | AWS and/or Azure, 5+ landing zones / hybrid |
| Improvement pillars covered | Reliability, security & cost | All six pillars | All six pillars, cross-cloud |
| Governance cadence | Monthly reporting | Fortnightly reporting | Weekly review + quarterly strategy |
| Runbooks & RCA | Core runbooks, standard reviews | Advanced playbooks, full RCA + automation | Custom cross-cloud workflows, postmortems |
Response time (15 minutes for SEV-1 and SEV-2) is a firm contractual commitment. Recovery time is a target, not a contractual guarantee, because recovery depends on the nature of the incident, not just our speed.
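As a rough illustration of how the two figures differ, the sketch below derives a response deadline and a recovery target from the time an incident is raised. It is a minimal example, not a description of Critical Cloud's tooling: the `IncidentClock` class and its method names are hypothetical, and only the 15-minute and 60-minute figures come from the plan table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Figures from the plan table: 15-minute response for SEV-1 and SEV-2,
# 60-minute recovery target for SEV-1 only. No figures are stated for SEV-3/4.
RESPONSE_COMMITMENT = {1: timedelta(minutes=15), 2: timedelta(minutes=15)}
RECOVERY_TARGET = {1: timedelta(minutes=60)}


@dataclass
class IncidentClock:
    """Hypothetical helper that derives deadlines from the time an incident is raised."""

    severity: int        # 1 to 4, matching SEV-1 to SEV-4
    raised_at: datetime

    def response_due(self) -> Optional[datetime]:
        # Contractual commitment; None where no figure is stated.
        window = RESPONSE_COMMITMENT.get(self.severity)
        return None if window is None else self.raised_at + window

    def recovery_target(self) -> Optional[datetime]:
        # A target only, never a guarantee; None where no figure is stated.
        window = RECOVERY_TARGET.get(self.severity)
        return None if window is None else self.raised_at + window

    def response_breached(self, responded_at: datetime) -> bool:
        # True only when a commitment exists and the response came after it.
        due = self.response_due()
        return due is not None and responded_at > due


raised = datetime(2025, 1, 1, 2, 0)
sev1 = IncidentClock(severity=1, raised_at=raised)
print(sev1.response_due())     # 2025-01-01 02:15:00
print(sev1.recovery_target())  # 2025-01-01 03:00:00
```

Keeping the commitment and the target in separate tables mirrors the distinction above: only the response window can be breached, while the recovery figure is tracked as a goal.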
Transparency
You keep access to your Datadog environment, your data, and your dashboards at all times. Nothing is hidden in a proprietary layer.
Ownership
When an incident fires, we own it to resolution, not to the first opportunity to hand it back. Accountability is the baseline.
Collaboration
Shared backlog, shared visibility. You see what we're working on and why. Service reviews are conversations, not status reports.
Integration
Improvement work is tied to reliability, security, and cost outcomes, not abstract platform activity. Everything maps to a business metric.
Enablement
Runbooks, standards, and knowledge stay in your environment after every engagement. You should be less dependent on us over time, not more.
| Area | Customer | Critical Cloud | Cloud provider |
|---|---|---|---|
| Application code & data | Owns and controls | Supports, does not access data | N/A |
| Infrastructure provisioning (Terraform/IaC) | Approves changes | Implements and improves | N/A |
| Monitoring & observability (Datadog) | Has full access always | Builds, manages, optimises | N/A |
| Security, compliance & access control | Owns decisions & approvals | Operates controls, improves posture | Platform primitives |
| Incident management | Informed, approves resolution | Detects, triages, responds, recovers | Provider incident support |
| Global infrastructure & physical security | N/A | N/A | Owns and guarantees |
Whichever path you take, the outcome is the same: 24×7 reliability, observability, and continuous improvement from day one.
Design and provision using Terraform and best-practice landing zones, implement Datadog monitoring foundations, then transition directly into 24×7 Critical Support at go-live.
Plan and execute migration with minimal disruption, align to landing zone standards and Datadog instrumentation, then activate 24×7 incident management and improvement engineering immediately post-migration.
Review configuration, access, and governance for full transparency. Establish runbooks and Datadog observability baselines. Move onto the Critical Support model with a service review in week one.
AWS Business/Enterprise and Azure Unified/Developer support plans answer questions. Critical Support owns the environment.
| Capability | Hyperscaler support | Critical Support |
|---|---|---|
| Incident response | Advisory guidance; you action | We own response and recovery |
| Proactive engineering | Not included | 16–56 hrs/month across six pillars |
| Observability platform | Native CloudWatch / Azure Monitor only | Datadog across the full stack (infra, APM, logs, security, cost) |
| Who does the work | You, with vendor advice | Our SRE team, with your oversight |
| Runbooks & automation | You build and maintain | We build, own, and improve |
| Blameless postmortems | Not standard | Included for all SEV-1 incidents |
| Cloud scope | Single provider | AWS and Azure in one service |
Every Critical Support customer has direct access to their own Datadog environment: infrastructure, APM, logs, traces, security signals, cloud cost, and LLM monitoring, all configured to their AWS and/or Azure architecture. You keep full visibility. We operate it.
Traditional MSPs rely on proprietary monitoring that limits customer insight. We run Datadog in your account: the same platform our engineers work in, visible to your team at all times.
Critical Cloud is the world's first Powered by Datadog accredited MSP.
OPX, Azure + Critical Support
Full-stack observability via Datadog across OPX's Azure environment, combined with monthly improvement cycles. Incident noise reduced by more than 60%, with faster root-cause analysis through unified dashboards and alert tuning.
FAW / Hopp Studio, AWS + Critical Support
24×7 incident response plus proactive improvement for coaching systems and public websites. Tighter Datadog monitoring, quicker recovery, and improved resilience during high-traffic events.
Critical Support is the flagship. If you need lighter cover or incident-response-only, we have options.
Critical Response
Detection, response, and recovery, without proactive engineering. Plans: Daytime, Evenings & Weekends, 24×7. For teams that want cover without the full managed service commitment.
Critical Support Lite
Right-sized incident cover plus a smaller improvement engineering allocation. Plans: Monitor + Fix, Engineer Assist, Partner Plus. Designed to grow into Critical Support.
Critical Support is Critical Cloud's flagship managed service: 24×7 incident management combined with monthly improvement engineering across six pillars (reliability, security, cost, performance, automation, governance) for AWS and Azure environments, with Datadog as the operational foundation.
Critical Support covers AWS and Azure. We do not currently support GCP.
The response time is 15 minutes for SEV-1 and SEV-2 incidents; this is the contractual commitment. The 60-minute recovery figure for SEV-1 is a target, because recovery time depends on the nature of the incident.
Critical Support is the full flagship: 24×7 coverage, 15-minute SEV-1/SEV-2 response, and 16–56 hours of improvement engineering per month. Critical Support Lite is designed for start-ups and smaller single-cloud environments, with lighter coverage windows and fewer improvement hours but the same SRE-driven, Datadog-native model. Lite customers can step up to Critical Support as they grow.
Yes, we use AI in an advisory role. Datadog's Bits AI SRE and Watchdog assist our engineers in triaging alerts and surfacing likely root causes, and humans approve all production, security, and cost changes.
You always keep access. Every Critical Support customer retains direct, full-fidelity access to their own Datadog environment. Nothing is hidden in a proprietary layer. This is one of the five partner principles we hold ourselves to.
Tell us about your platform. We'll recommend the right plan and show you how Critical Support would work for your environment.