When the Pager Goes Off, Who Answers? Building a Safety Net
When your systems fail at 2 AM, who’s ready to respond? For many UK small and medium-sized businesses (SMBs), the lack of a proper on-call system can lead to costly downtime, customer dissatisfaction, and team burnout. The solution isn’t about hiring a massive team - it’s about defining roles, creating clear processes, and using the right tools to ensure incidents are handled efficiently. Here’s what you need to know:
- Small Teams, Big Challenges: SMBs face the same risks as large enterprises but often lack the resources for a 24/7 response team.
- Key Issues: Understaffing, alert fatigue, and unstructured escalation processes are common problems.
- Smart Solutions: Use virtual response teams, fair on-call rotations, and well-documented workflows to manage incidents effectively.
- Essential Tools: Platforms like PagerDuty, Opsgenie, and Zenduty can help automate alerts and streamline escalation paths.
- Legal Compliance: Ensure on-call policies follow UK employment laws, including rest periods and fair pay.
- External Support: Outsourcing incident response can provide 24/7 coverage and specialist expertise without the cost of building an internal team.
The goal is simple: set up a system that’s efficient, fair, and scalable, so your team can respond swiftly when it matters most. Let’s dive into the details.
Setting Up Clear On-Call Roles and Duties
Sorting out on-call responsibilities can mean the difference between a smooth resolution and a chaotic midnight scramble. For smaller teams, having a structure that's both fair and workable is crucial.
How to Assign and Document On-Call Tasks
Small teams often face unique challenges when it comes to 24/7 coverage. To make it work, tailor your on-call strategy to fit your team's dynamics. Start by understanding your team's preferences and constraints. Some developers thrive at night and are fine handling late alerts, while others are at their best during regular working hours.
Engage your team to document their availability, preferences, and any personal commitments that might affect their capacity to respond. For example, a developer with family obligations may struggle with a 3 AM database issue but could easily handle daytime application problems.
Create a rotation system that's fair and flexible, taking into account UK bank holidays, annual leave, and unexpected absences. For larger teams, a weekly rotation might work well, but smaller teams may benefit from longer shifts to avoid burnout. Allow for shift swaps to handle unforeseen circumstances smoothly.
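As a rough illustration of a fair, flexible rotation, the sketch below gives each week to whichever available engineer has covered the fewest shifts so far, and supports simple swaps. The names, the weekly shift length, and the `unavailable` mapping (which could hold annual leave or bank-holiday weeks) are all assumptions made for the example, not features of any particular scheduling tool.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks, unavailable=None):
    """Assign one engineer per week, favouring whoever has had the fewest shifts.

    `unavailable` maps an engineer's name to the set of week-start dates
    they cannot cover (annual leave, bank-holiday weeks, and so on).
    """
    unavailable = unavailable or {}
    shifts = {name: 0 for name in engineers}
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        free = [e for e in engineers if week_start not in unavailable.get(e, set())]
        if not free:
            # Nobody is free: record the gap so it can be covered manually.
            schedule.append((week_start, None))
            continue
        # min() returns the first minimum, so ties follow the order of `engineers`.
        chosen = min(free, key=lambda e: shifts[e])
        shifts[chosen] += 1
        schedule.append((week_start, chosen))
    return schedule

def swap_shifts(schedule, week_a, week_b):
    """Return a new schedule with the engineers for two weeks exchanged."""
    by_week = dict(schedule)
    by_week[week_a], by_week[week_b] = by_week[week_b], by_week[week_a]
    return sorted(by_week.items(), key=lambda item: item[0])

if __name__ == "__main__":
    team = ["asha", "ben", "carys"]  # hypothetical team members
    plan = build_rotation(team, date(2025, 8, 4), 4, {"ben": {date(2025, 8, 11)}})
    for week_start, engineer in plan:
        print(week_start.isoformat(), engineer)
```

Counting shifts rather than cycling through a fixed order means someone who misses a week because of leave picks up the next free one, which keeps the load even over time.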
Keep everything documented and accessible. Use a mobile-friendly shared platform to record team availability, system status, and ongoing issues. Include detailed handover notes with information on recent deployments, ongoing incidents, and scheduled maintenance. This way, the on-call engineer won't be left scrambling through various channels at 2 AM.
If your team operates across different time zones, consider a "follow the sun" model, where each location covers alerts during its own working day. Even within a single time zone you can split the day between locations. For instance, a team member in Edinburgh could handle morning alerts, while someone in Lisbon takes over for evening incidents. This approach helps distribute the workload more sustainably.
Finally, ensure escalation paths are clearly defined to prevent any incident from stalling.
Building Your Escalation Chain: Primary, Secondary, and Backup
A well-structured escalation chain ensures incidents are handled efficiently, even if the first responder is unavailable or lacks the expertise needed.
Set up three levels of escalation: a primary responder, a secondary backup, and a final escalation point. The primary responder handles initial triage and common issues. If they’re unavailable or the issue exceeds their expertise, the secondary backup steps in. For critical problems or complex system failures, the final escalation - usually a senior developer or technical lead - takes over.
Tailor escalation paths to the type of incident. A hierarchical approach works well for general issues, escalating from less experienced to more senior team members. For specialised problems, such as database errors, functional escalation directs the issue to the expert in that area, regardless of their seniority. Automatic escalation can also be set up to trigger if response times exceed predefined limits.
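The difference between hierarchical and functional escalation can be sketched in a few lines. The `Responder` type, the level numbers, and the `specialisms` field below are hypothetical and exist only to illustrate the idea; real on-call platforms model this in their own way.

```python
from dataclasses import dataclass

@dataclass
class Responder:
    name: str
    level: int  # 1 = primary, 2 = secondary, 3 = final escalation
    specialisms: frozenset = frozenset()
    available: bool = True

def next_responder(chain, category=None, after_level=0):
    """Pick who should be paged next, or None if nobody is left."""
    candidates = [r for r in chain if r.available]
    # Functional escalation: a matching specialist is chosen regardless of seniority.
    if category is not None:
        for responder in candidates:
            if category in responder.specialisms:
                return responder
    # Hierarchical escalation: the lowest level above the one already tried.
    higher = [r for r in candidates if r.level > after_level]
    return min(higher, key=lambda r: r.level, default=None)

if __name__ == "__main__":
    chain = [
        Responder("primary", 1),
        Responder("secondary", 2, frozenset({"database"})),
        Responder("tech-lead", 3),
    ]
    print(next_responder(chain, category="database").name)
```

Because unavailable responders are filtered out first, the same function also covers the case where the primary is on leave and the chain has to start one level up.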
"The best SLAs I've seen specify things like 'We will acknowledge an incident within 30 minutes of being notified and provide an update with the severity and estimated recovery time within 30 minutes of acknowledgment. Until the incident is resolved, we will continue to provide updates every 30 minutes for SEV-1 incidents and 60 minutes for SEV-2 and below incidents.'" - beyondholdem
Plan for holidays and absences by creating backup escalation chains. These should activate when key personnel are unavailable, and the details should be documented so everyone knows when and how they apply. For example, a developer on a two-week holiday shouldn’t leave a gap in your incident response plan.
Keep contact details up to date and accessible. Make sure your on-call team has quick access to mobile numbers, personal emails, and other communication methods. A shared contact management system that updates automatically when details change can simplify this process.
UK Employment Law and On-Call Work
If your team operates on-call schedules in the UK, it's essential to understand the legal requirements around working time, pay, and employee rights. Ignoring these can lead to legal trouble and damage team morale.
Under UK law, the Working Time Regulations 1998 apply to on-call work. If employees are required to stay on-site or be immediately available, this time typically counts as working hours. However, if they just need to be reachable and can respond within a reasonable timeframe, the standby time usually doesn’t count towards the 48-hour weekly limit, although any time spent actually responding to an incident does.
Ensure your on-call system complies with these regulations and provide fair compensation or time off in lieu (TOIL). Many small businesses offer TOIL for handling weekend incidents, while others provide on-call allowances based on the expected workload. Keep in mind minimum wage requirements - if someone spends significant time responding to incidents, their total pay must not fall below the statutory minimum wage.
Rest periods also matter. Employees are entitled to at least 11 consecutive hours of rest within any 24-hour period and one full day of uninterrupted rest each week. Plan your on-call rotations to respect these rules, especially after major incidents that extend working hours.
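When planning rotations, it can help to check whether a night-time call-out has eaten into someone's daily rest. The sketch below finds the longest uninterrupted gap between work periods in a 24-hour window and compares it with the 11-hour figure mentioned above. It is a rough planning aid under simplified assumptions (the function names and the way work periods are recorded are invented for the example), not a statement of how the regulations treat every case.

```python
from datetime import datetime, timedelta

MIN_DAILY_REST = timedelta(hours=11)

def longest_rest_gap(work_periods, window_start, window_end):
    """Longest uninterrupted gap between work periods inside a window."""
    gaps = []
    cursor = window_start
    for start, end in sorted(work_periods):
        # Clip each period to the window being checked.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue  # period falls entirely outside the window
        if start > cursor:
            gaps.append(start - cursor)
        cursor = max(cursor, end)
    if window_end > cursor:
        gaps.append(window_end - cursor)
    return max(gaps, default=timedelta(0))

def has_daily_rest(work_periods, window_start):
    """True if the 24 hours from window_start contain an 11-hour rest gap."""
    window_end = window_start + timedelta(hours=24)
    return longest_rest_gap(work_periods, window_start, window_end) >= MIN_DAILY_REST
```

If the check fails after a major incident, the simplest response is usually a later start the next day or time off in lieu.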
Document your on-call policies clearly, either in employment contracts or separate agreements. Include details about response times, compensation, and what constitutes reasonable availability. Transparency protects both your team and your business.
"Our communication infrastructure is the backbone of our operations. It ensures we can respond swiftly to challenges and collaborate effectively across all areas of responsibility." - Robbie Warwick, Head of the Voice and Video Products, Home Office
Voluntary on-call arrangements are generally less restrictive than mandatory ones. If team members can opt out of on-call duties without affecting their employment, you have more flexibility in structuring your system. However, make sure you still have enough coverage even if some team members choose not to participate.
Building Escalation Workflows and Communication Rules
Clear escalation workflows and communication rules are the backbone of effective incident management. When something goes wrong - whether it’s 3 PM or 3 AM - every team member should know exactly what to do and who to contact.
Setting Up Clear Escalation Paths
An effective escalation process answers two key questions: when to escalate and who to escalate to. The goal is to define triggers that remove any guesswork during an incident.
Start by categorising incidents based on their severity. For example:
- Severity 1: Full platform outage affecting all customers.
- Severity 2: A feature outage impacting a specific group of users.
- Severity 3: Performance issues noticeable but not critical.
A simple matrix mapping common scenarios to these severity levels can make decision-making faster and more straightforward during emergencies.
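Such a matrix can be as small as a lookup table. In the sketch below the scenario keys are made up for illustration, and treating unknown scenarios as Severity 2 is an assumption you may want to change.

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1  # full platform outage affecting all customers
    SEV2 = 2  # feature outage impacting a specific group of users
    SEV3 = 3  # noticeable but non-critical performance issues

# Hypothetical scenario names mapped to the severity levels above.
SEVERITY_MATRIX = {
    "platform_outage": Severity.SEV1,
    "feature_outage": Severity.SEV2,
    "performance_degradation": Severity.SEV3,
}

def classify(scenario: str) -> Severity:
    """Look up a scenario; unknown ones default to SEV2 so a person reviews them."""
    return SEVERITY_MATRIX.get(scenario, Severity.SEV2)
```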
Time-sensitive escalation is crucial to avoid bottlenecks. For instance, if the primary responder hasn’t acknowledged an issue within 15 minutes, escalate to the secondary contact. For a Severity 1 incident, escalate to the technical lead or CTO if no updates are provided within 30 minutes. It’s important to set realistic timeframes based on your team’s actual capabilities, not overly optimistic goals. Research shows that in nearly 19% of cases, data exfiltration occurs within the first hour of a breach.
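The timing rules in this example (15 minutes to acknowledge, 30 minutes between updates on a Severity 1 incident) could be expressed roughly as follows. The function name, its arguments, and the target labels are assumptions for the sketch, and the timeouts should be replaced with figures your team can actually meet.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=15)
SEV1_UPDATE_TIMEOUT = timedelta(minutes=30)

def escalation_target(severity, raised_at, acknowledged_at, last_update_at, now):
    """Return who to escalate to right now, or None if no escalation is due."""
    if acknowledged_at is None:
        # Nobody has picked the incident up yet.
        if now - raised_at >= ACK_TIMEOUT:
            return "secondary"
        return None
    if severity == 1:
        # For Severity 1, silence since the last update (or the ack) triggers escalation.
        last_activity = last_update_at or acknowledged_at
        if now - last_activity >= SEV1_UPDATE_TIMEOUT:
            return "technical_lead"
    return None
```

A scheduler or the on-call platform itself would call a check like this periodically for each open incident.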
Specialised issues should bypass the usual chain of command and go directly to the right expert. For example, database issues should go to the database specialist, while payment system problems might need the attention of someone familiar with your Stripe integration. This avoids wasting time as the issue bounces between team members.
Define decision-making authority at every level. For example, the primary responder might handle routine fixes like restarting services or tweaking configurations. The secondary contact could manage rollbacks, while the technical lead or CTO should handle critical decisions, such as taking systems offline or communicating with major clients during prolonged outages.
Ownership of key decisions is vital. Who decides to wake the entire team for a weekend incident? Who authorises emergency spending on cloud resources? Who handles customer communication during a data breach? These questions must be answered in advance to prevent confusion during a crisis.
Finally, maintain detailed contact information as part of your communication protocol. These escalation paths naturally tie into structured response procedures, often outlined in runbooks.
Creating Runbooks and Response Checklists
Runbooks turn chaotic incident responses into a structured process, which is especially helpful for smaller teams where knowledge might be concentrated among a few individuals.
Each runbook should follow a consistent format:
- Trigger conditions: Symptoms indicating the problem.
- Immediate actions: Steps to contain the issue.
- Diagnostic steps: Methods to identify the root cause.
- Resolution procedures: Steps to fix the issue.
- Verification steps: Actions to confirm the fix worked.
- Escalation criteria: What to do if the issue persists.
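One way to keep runbooks consistent is to store them in a fixed structure and flag any section left empty. The `Runbook` class below is a hypothetical sketch of that format, not a schema used by any specific tool.

```python
from dataclasses import dataclass, field

# The six sections every runbook is expected to fill in.
SECTION_NAMES = (
    "trigger_conditions",
    "immediate_actions",
    "diagnostic_steps",
    "resolution_procedures",
    "verification_steps",
    "escalation_criteria",
)

@dataclass
class Runbook:
    title: str
    owner: str  # person or team responsible for keeping it current
    trigger_conditions: list = field(default_factory=list)
    immediate_actions: list = field(default_factory=list)
    diagnostic_steps: list = field(default_factory=list)
    resolution_procedures: list = field(default_factory=list)
    verification_steps: list = field(default_factory=list)
    escalation_criteria: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that still have no content."""
        return [name for name in SECTION_NAMES if not getattr(self, name)]
```

Running `missing_sections()` across all runbooks during a review is a cheap way to spot the ones that were started but never finished.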
Focus on scenarios your team encounters often or those that carry significant risks. For example, a small business might need runbooks for application crashes, database connection errors, payment failures, or third-party outages.
For more complex situations, decision trees can guide responders through conditional steps. For instance, if your application is slow, the next steps might depend on whether CPU usage is high, database queries are lagging, or external APIs are timing out. A simple flowchart can help responders troubleshoot effectively without needing deep technical expertise.
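The slow-application example could be written as a tiny decision tree. The inputs, the 85% CPU threshold, and the wording of each recommendation are placeholders chosen for illustration; the order of the checks is also an assumption.

```python
def triage_slow_app(cpu_percent, slow_query_count, api_timeout_count,
                    cpu_threshold=85.0):
    """Suggest a next step for a slow application based on three basic signals."""
    if cpu_percent >= cpu_threshold:
        return "cpu: look for runaway processes or add application capacity"
    if slow_query_count > 0:
        return "database: inspect long-running queries and recent schema changes"
    if api_timeout_count > 0:
        return "external: check third-party status pages and timeout settings"
    return "unknown: no obvious cause, escalate to the secondary contact"
```

Each branch would normally point at the relevant runbook rather than a one-line hint.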
Be specific in your instructions. Instead of vague advice like "check the database", include exact commands. For example, for PostgreSQL, use `SELECT * FROM pg_stat_activity WHERE state = 'active';`, or for MySQL, try `SHOW PROCESSLIST;`. Include notes on what the output should look like and how to interpret it.
Here’s an example of a web server restart runbook for a UK-based team:
- Pre-checks and immediate actions: Log into the monitoring dashboard to confirm the issue, check system load with `top`, and review recent error logs using `tail -n 100 /var/log/apache2/error.log`. These steps help avoid unnecessary restarts and provide context.
- Communication steps: Notify the team on Slack: "Web server issue – investigating restart." If customers are affected, update the status page with a brief message like, "We’re investigating reports of slow response times."
- Backup and restart sequence: Back up the current configuration: `sudo cp /etc/apache2/apache2.conf /etc/apache2/apache2.conf.bak_$(date +%F)`. Stop the service with `sudo systemctl stop apache2`, wait a few seconds, then restart it with `sudo systemctl start apache2`. Verify the status with `sudo systemctl status apache2` and check access logs to confirm normal operation.
- Troubleshooting guidance: If the server fails to restart, check for port conflicts using `sudo lsof -i :80` or review error logs for specific failure messages. Add notes on common issues your team has encountered.
Regularly test your runbooks through practice drills or live incidents. Assign owners to keep each runbook updated as systems evolve - outdated runbooks can create more problems than they solve.
Picking the Right Communication Tools
Once your escalation and runbook protocols are in place, choosing the right communication tools ensures everything runs smoothly. The tools you select should focus on reliability, accessibility, and urgency.
- Real-time coordination: Platforms like Slack or Microsoft Teams work well for live updates. Set up dedicated incident channels, use threads to keep discussions organised, and pin key updates to avoid losing them in the noise.
- Phone calls: For urgent escalations or when systems are down, phone calls are essential. Maintain an updated phone tree with mobile numbers and use conference calling for team coordination.
- Email: Useful for formal updates and audit trails. Send summaries, post-mortem reports, and customer notifications via email. Use clear subject lines like "RESOLVED: Payment system outage – 29/07/2025" to make them easy to find later.
- Status pages: Keep customers informed and reduce support queries by updating a status page. Whether you use a tool like Atlassian Statuspage or a simple website update, aim to post an initial update within minutes of identifying a customer-impacting issue.
- SMS alerts: For critical incidents, SMS can cut through notification fatigue. Use text messages for Severity 1 issues or when other methods fail. Many on-call tools can automate SMS alerts based on escalation rules.
Consider team schedules and time zones. For example, someone might miss a Slack notification outside working hours but will answer their phone. Weekend incidents may also require a different approach compared to weekday events.
Document your communication expectations clearly. Define target response times for each channel - such as 5 minutes for SMS, 15 minutes for Slack, and 30 minutes for email - and ensure these align with your team’s capabilities.
Finally, prepare templates for common scenarios. Pre-written updates, customer emails, and escalation messages can save time and reduce stress during high-pressure situations. Integrate your tools to automate workflows, like configuring monitoring systems to create Slack messages or update status pages, but always keep manual overrides available for unique situations.
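Templates do not need special tooling. The sketch below uses Python's standard `string.Template` for chat and status-page messages and a small helper for the email subject format shown earlier. The template names and fields are invented for the example.

```python
from datetime import date
from string import Template

# Hypothetical message templates keyed by scenario.
TEMPLATES = {
    "slack_investigating": Template("$system issue – investigating $action."),
    "status_page_initial": Template("We are investigating reports of $symptom."),
}

def render(name, **fields):
    """Fill in a template; a missing field raises KeyError instead of sending a half-empty message."""
    return TEMPLATES[name].substitute(**fields)

def email_subject(state, summary, on):
    """Build a subject line with the state in capitals and a UK-style date."""
    return f"{state.upper()}: {summary} – {on.strftime('%d/%m/%Y')}"

if __name__ == "__main__":
    print(render("slack_investigating", system="Web server", action="restart"))
    print(email_subject("resolved", "Payment system outage", date(2025, 7, 29)))
```

Using `substitute` rather than `safe_substitute` is deliberate here: failing loudly on a missing field is safer than posting a message with a raw placeholder in it.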
Choosing and Setting Up On-Call Tools
The right on-call tool can turn chaotic incident responses into a streamlined, dependable process. With your escalation workflows and communication rules in place, it’s time to pick technology that simplifies and strengthens your incident management.
On-Call Tool Comparison for UK SMBs
When assessing on-call tools, focus on features that align with your team’s specific needs instead of being distracted by unnecessary extras. For small and medium-sized businesses (SMBs) in the UK, basic plans on the lower-cost platforms typically cost £3–8 per user per month, while premium platforms and advanced features can push prices to £15–25. Prioritise features like on-call scheduling, reliable alerts, incident management, ease of integration, SLA tracking, and reporting to match your requirements.
PagerDuty is a leading option with over 700 integrations, making it ideal for teams using a wide variety of tools. Pricing starts at £15–20 per user per month for basic plans, with enterprise features available at a higher cost. It offers robust escalation rules and mobile apps, though its interface can feel overwhelming for smaller teams that don’t need all the bells and whistles.
Opsgenie, part of the Atlassian family, offers a simpler experience with over 200 integrations and seamless compatibility with Jira and Confluence. For teams already using Atlassian tools, Opsgenie is a natural fit. Its pricing is competitive, starting at £3–8 per user per month, making it an attractive option for SMBs on a budget.
Zenduty has gained popularity among smaller businesses, earning a 4.6 out of 5 rating on G2. It’s praised for delivering meaningful alerts quickly without unnecessary complexity. As Atmesh Mishra from Chalo notes:
"Zenduty has been great so far in terms of delivering meaningful alerts to the right person quickly, which has also improved our uptime significantly."
Felipe Urbina, CTO at Simpliroute, highlights another strength:
"Easy to configure, intuitive platform that triggers alerts from our monitoring tools such as Datadog, AWS CloudWatch, GCP, etc, and helps us respond to incidents faster."
Splunk On-Call (formerly VictorOps) provides solid incident management features at a starting price of around £4 per user per month. It’s a good fit for teams already invested in the Splunk ecosystem, though it may lack the depth of features offered by dedicated incident management platforms.
When calculating total costs, consider your team size, the volume of incidents you expect, and the integrations you’ll need. A unified platform is especially valuable during those late-night outages when quick, clear responses are critical.
| Tool | Starting Price (per user/month) | Key Strengths | Best For |
|---|---|---|---|
| PagerDuty | £15-20 | 700+ integrations, complex workflows | Teams with diverse toolsets |
| Opsgenie | £3-8 | Atlassian integration, user-friendly | Jira/Confluence users |
| Zenduty | £5-12 | Simple setup, fast alerts | SMBs wanting quick deployment |
| Splunk On-Call | £4-6 | Timeline-based incident view | Splunk ecosystem users |
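To turn the per-user figures in the table into a budget line, multiply by team size and by twelve. The sketch below does only that; it uses the starting price ranges quoted above and ignores add-ons, annual discounts, and currency changes, so treat the output as a rough range.

```python
# Starting price ranges in GBP per user per month, taken from the table above.
PRICE_RANGES_GBP = {
    "PagerDuty": (15, 20),
    "Opsgenie": (3, 8),
    "Zenduty": (5, 12),
    "Splunk On-Call": (4, 6),
}

def annual_cost_range(tool, users):
    """Return the (low, high) yearly licence cost in GBP for a team of `users`."""
    low, high = PRICE_RANGES_GBP[tool]
    return low * users * 12, high * users * 12

if __name__ == "__main__":
    for tool in PRICE_RANGES_GBP:
        low, high = annual_cost_range(tool, users=6)
        print(f"{tool}: £{low}-£{high} per year for 6 users")
```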
Once you’ve chosen a tool, the next step is to integrate it seamlessly with your monitoring and ticketing systems.
Connecting Monitoring and Ticketing Systems
After selecting your on-call tool, proper integration ensures smooth, real-time incident management. Connecting your monitoring and ticketing systems allows incident details to flow automatically, reinforcing escalation paths and ensuring the right responders are alerted promptly.
Monitoring system integration is the backbone of automated incident response. Tools like Datadog, AWS CloudWatch, and Google Cloud Monitoring can create incidents in your on-call platform when specific thresholds or anomalies are detected. Configure these alerts to include details like server names, error rates, and affected services, so responders have the context they need immediately.
Use targeted alert routing to ensure the right person gets notified. For example, database-related alerts should go directly to your database specialist, while payment system issues might follow a different escalation path. This approach avoids unnecessary disruptions and ensures the problem is handled by the right expert.
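In its simplest form, targeted routing is a keyword match with a default. The route table, target names, and the `service` field on the alert are assumptions for this sketch; most on-call tools offer their own routing rules that do the same job.

```python
# Hypothetical routing table: (keyword in the service name, who gets the alert).
ROUTES = [
    ("database", "database_specialist"),
    ("payment", "payments_owner"),
]
DEFAULT_ROUTE = "primary_on_call"

def route_alert(alert):
    """Choose a target from the alert's service name, falling back to the primary."""
    service = alert.get("service", "").lower()
    for keyword, target in ROUTES:
        if keyword in service:
            return target
    return DEFAULT_ROUTE
```

The explicit default matters: an alert that matches no rule should still reach a person.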
Ticketing system integration helps create an audit trail for post-incident analysis. When tools like PagerDuty or Opsgenie receive an alert, they can automatically generate tickets in systems like Jira, Zendesk, or ServiceNow. These tickets document timelines, actions, and resolutions, providing valuable data for future reviews.
Both PagerDuty and Opsgenie support API and email integrations, allowing monitoring tools to send structured emails that create incidents automatically or use webhooks to trigger API calls when specific conditions are met.
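For the webhook route, the monitoring tool ultimately sends a small JSON document. The sketch below builds one modelled on the PagerDuty Events API v2 event format as generally documented (a `routing_key`, an `event_action`, and a `payload` with `summary`, `source`, and `severity`). Check the current PagerDuty documentation before relying on these field names. The helper itself and the example values are hypothetical, and nothing is sent over the network.

```python
import json

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}

def build_trigger_event(routing_key, summary, source, severity,
                        details=None, dedup_key=None):
    """Assemble the body of a 'trigger' event with context for the responder."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,  # identifies the integration to notify
        "event_action": "trigger",
        "payload": {
            "summary": summary,      # short description shown to the responder
            "source": source,        # for example the affected server name
            "severity": severity,
            "custom_details": details or {},
        },
    }
    if dedup_key:
        # Repeated events with the same key are grouped into one incident.
        event["dedup_key"] = dedup_key
    return event

if __name__ == "__main__":
    body = build_trigger_event(
        "EXAMPLE_ROUTING_KEY",       # placeholder, not a real key
        "High error rate on checkout",
        "web-01",
        "critical",
        {"error_rate": "7%", "service": "checkout"},
    )
    print(json.dumps(body, indent=2))
```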
Automation rules can reduce manual effort significantly. Set up rules to assign incidents based on time of day, escalate unresolved alerts after a set period, or update status pages automatically for certain incident types. However, always maintain manual override options for unique situations that fall outside standard patterns.
For example, Specsavers uses PagerDuty Runbook Automation to reduce training time by 75% and streamline incident resolution. Simon Hamilton-Peach, Platform Engineer at Specsavers, explains:
"Automation helps us deal with our technical debt. It bridges between legacy systems and helps integrate new systems with old systems."
Regular testing of integrations is crucial. Schedule monthly tests to trigger alerts from your monitoring systems and verify that they flow correctly through your on-call tool to the right responders. Document any issues and adjust configurations as needed.
As your integrations grow more complex, configuration management becomes essential. Keep documentation updated to reflect which alerts trigger specific escalation paths. Review these mappings quarterly to ensure they align with your team’s current structure and systems.
Finally, be mindful of notification fatigue. Too many low-priority alerts can desensitise your team to critical emergencies. Use filtering and suppression features to ensure that only actionable alerts reach responders, while dashboards and reports provide visibility into overall system health without overwhelming the team.
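Filtering and suppression can be illustrated with a short function that drops low-priority alerts and repeats of the same check within a time window. The alert fields (`priority`, `service`, `check`, `at`), the priority scale, and the 10-minute window are assumptions for the example; on-call platforms generally provide equivalent built-in features.

```python
from datetime import datetime, timedelta

def filter_alerts(alerts, min_priority=2, window=timedelta(minutes=10)):
    """Keep actionable alerts only.

    Priority 1 is the most urgent. Alerts with a priority number above
    `min_priority` are dropped, as are repeats of the same service/check
    that arrive within `window` of the last one that was let through.
    """
    last_sent = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["at"]):
        if alert["priority"] > min_priority:
            continue  # low priority: leave it for dashboards and reports
        key = (alert["service"], alert["check"])
        previous = last_sent.get(key)
        if previous is not None and alert["at"] - previous < window:
            continue  # duplicate of an alert the responder already received
        last_sent[key] = alert["at"]
        kept.append(alert)
    return kept
```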
Using External Support for Incident Response
Even the most capable in-house teams can benefit from bringing in external experts during incidents. For UK small and medium-sized businesses (SMBs) managing both development and operations, external partners provide specialised skills and round-the-clock support - without the cost of building a full internal operations team. This approach acts as an extra layer of protection, complementing the internal processes outlined earlier.
How External Partners Can Help
External incident response providers bring expertise that smaller teams often lack. With 43% of data breaches affecting small and medium-sized enterprises and 39% of UK businesses reporting cybersecurity incidents in the past year, having expert backup is no longer a luxury - it’s a necessity.
One major advantage is their ability to provide 24/7 coverage. These providers maintain on-call teams across different time zones, ensuring that experienced professionals are available even during off-hours or when your team is unavailable.
For critical situations, incident response retainers grant access to skilled incident commanders. These professionals lead the response to complex outages, managing the crisis while your team focuses on resolving technical issues. Their structured approach ensures that no critical steps are missed during high-stress moments.
Beyond immediate response, external partners can help with cost optimisation. They identify unused resources and streamline cloud infrastructure, which can significantly reduce monthly expenses. Additionally, they simplify compliance and security hardening by implementing frameworks like ISO 27001 and SOC 2 and by providing audit-ready documentation.
Good partners don’t just fix problems - they leave your team stronger. By documenting solutions, creating runbooks, and offering training, they equip your team to handle similar incidents independently in the future.
In-House vs External On-Call: Costs and Benefits
Comparing in-house and external on-call teams reveals clear differences in cost and scalability.
Building an in-house team for 24/7 coverage is expensive. Beyond salaries, there are costs for recruitment, training, and additional compensation for on-call duties. On the other hand, external support offers a more predictable and scalable pricing model. Basic monitoring and alerting services start at around £400 per month, while comprehensive 24/7 response services typically range from £800 to £1,200 per month, with premium options rarely exceeding £2,000.
For companies without dedicated operations teams, external providers are invaluable. They can quickly scale up during major incidents, mobilising larger, specialised teams - something internal teams may struggle with, especially when multiple systems are affected.
External support also reduces risk. Unlike internal teams, which can become a single point of failure, external providers offer redundant expertise and well-documented processes. This ensures continuity even if key internal staff are unavailable.
The National Crime Agency has highlighted that SMEs are increasingly targeted by cybercriminals because they’re "less likely to have the weight of law enforcement and the intelligence community descend on them". This reality makes it essential for businesses to have strong incident response capabilities, whether through internal resources or external partners.
Many SMBs adopt a hybrid approach: they maintain basic internal monitoring and first-line response capabilities while relying on external partners for complex issues, compliance needs, and after-hours support. This strategy allows teams to focus on product development, secure in the knowledge that expert help is readily available when needed.
Conclusion: Building Your On-Call Safety Net
Creating a dependable on-call system doesn’t mean you need a huge team or a massive budget. What matters most is nailing the basics: clear roles, well-documented processes, the right tools, and expert support when it’s needed most.
Start by defining responsibilities and mapping out escalation paths. This ensures that even as your team evolves, there’s consistency in how incidents are handled.
Choose tools that are affordable yet effective. Look for options that simplify scheduling and automate escalations - this reduces manual errors, especially during high-pressure situations.
Fairly distributing on-call duties is crucial to avoid burnout. Automated scheduling tools can help, and allowing team members to swap shifts ensures flexibility for personal needs. Regularly reviewing workloads also helps prevent any one person from being overwhelmed.
Make sure your on-call policies comply with UK employment laws. Transparent guidelines and fair compensation not only keep you compliant but also boost team morale.
Don’t forget the value of external support. It provides an extra layer of reliability and access to specialised expertise when your team needs backup.
Track key metrics like response times, escalation rates, and workloads. Reviewing these regularly helps you spot potential issues before they become problems.
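These metrics are simple to compute from an incident log. The sketch below assumes each incident is a dictionary with `raised_at`, `acknowledged_at`, `escalated`, and `responder` fields. Those names are invented here, so adapt them to whatever your on-call tool exports.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

def on_call_metrics(incidents):
    """Summarise acknowledgement time, escalation rate, and workload per responder."""
    if not incidents:
        return None  # nothing to report for this period
    minutes_to_ack = [
        (i["acknowledged_at"] - i["raised_at"]).total_seconds() / 60
        for i in incidents
    ]
    escalated = sum(1 for i in incidents if i["escalated"])
    return {
        "mean_minutes_to_ack": mean(minutes_to_ack),
        "escalation_rate": escalated / len(incidents),
        "incidents_per_responder": dict(Counter(i["responder"] for i in incidents)),
    }
```

Reviewing the per-responder counts alongside the rotation is a quick way to see whether the workload is as evenly spread as the schedule suggests.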
This straightforward approach is ideal for small and medium-sized businesses that need effective incident response without unnecessary complexity. Start simple, document thoroughly, and refine your system based on real-world experience. By combining well-defined roles, compliant practices, and flexible tools, you’ll have the peace of mind that someone capable will always be ready to respond when the call comes in.
FAQs
How can SMBs in the UK create fair on-call schedules while staying compliant with employment laws?
Small and medium-sized businesses (SMBs) in the UK can create fair on-call schedules by adhering to employment laws and ensuring workers' rights are respected. This means sticking to the 48-hour weekly working limit (averaged over 17 weeks, unless the employee has opted out), allowing employees at least 11 uninterrupted hours of rest between shifts, and providing fair compensation, such as a retainer fee combined with overtime pay for any call-outs.
If employees are required to stay on-site or be immediately available, or when they are actually performing tasks during on-call hours, this time should be classified as working time. It’s also worth considering flexible working requests and designing rotations that evenly distribute workloads, helping to avoid burnout. A clear and transparent on-call policy not only safeguards employees but also supports your business by fostering a fair and balanced work environment.
What are the main advantages of outsourcing incident response for small businesses?
Outsourcing incident response offers small businesses a range of advantages. It helps cut down on operational disruptions and lowers the chances of financial losses or reputational harm during incidents. By bringing in external specialists, businesses can rely on experienced professionals to handle situations swiftly and efficiently, leading to quicker recovery.
Another benefit is staying compliant with regulations, which is crucial for companies managing sensitive data. This approach also lets smaller teams concentrate on their primary responsibilities without being overwhelmed by the demands of incident management, all while maintaining strong operational stability.
How can small teams organise on-call duties and escalation paths to avoid burnout and quickly resolve incidents?
Small teams can handle on-call responsibilities more efficiently by setting up well-defined roles and responsibilities and creating fair rotation schedules to share the workload evenly. Clear escalation workflows are key - these should specify who to reach out to and at what stage, making sure incidents are handled quickly and without unnecessary delays.
To avoid burnout, foster open communication among team members and frequently review both schedules and processes to ensure they are equitable. As your team grows or changes, adjust these plans to keep them practical and aligned with your needs.