Monitoring Isn’t a Tool Problem - It’s a Signal Problem
Your monitoring system isn’t broken because of bad tools - it’s because you’re overwhelmed by noise. Here’s why most teams struggle with cloud monitoring and how to fix it:
- 73% of SMBs experienced a data breach last year, despite using monitoring tools. The issue? Too many irrelevant alerts hide critical issues.
- Teams face 4,484 alerts daily, with 90% being false positives, leading to alert fatigue and missed warnings.
- Collecting more data doesn’t help - most monitoring systems only tell you what happened, not why it happened.
The Fix:
- Focus on key signals like latency, traffic, errors, and saturation (Google’s SRE golden signals).
- Use dynamic thresholds to reduce false alarms and prioritise actionable alerts.
- Leverage tools like OpenTelemetry to filter out noise and focus on meaningful data.
By shifting focus from tools to signals, small teams can avoid chaos, improve response times, and concentrate on what matters most - user experience and system reliability.
Why Tools Don't Fix Monitoring Problems
Monitoring effectively isn’t just about having the latest tools; it’s about cutting through the noise to focus on what truly matters. The monitoring industry often sells the idea that the right platform will solve all your issues. But here’s the hard truth: even the most advanced tools can make things worse if the root problem - signal overload - isn’t addressed first.
Teams regularly invest thousands of pounds in monitoring platforms, hoping for instant clarity. Instead, they often end up with cluttered dashboards and an avalanche of alerts, leaving them more confused than before.
How Alert Fatigue Undermines Teams
Alert fatigue is more than an annoyance - it’s a serious risk to your business. When teams face a constant stream of alerts, critical warnings can easily get lost in the noise.
Consider this: IT teams handle an average of 4,484 alerts daily. Even more concerning, up to 90% of security alerts are false positives. This relentless flood of notifications can cause teams to miss the ones that actually matter.
Captain Chesley "Sully" Sullenberger, famed for successfully landing a plane on the Hudson River, highlighted the importance of prioritising warnings:
"The warnings in cockpits now are prioritised so you don't get alarm fatigue...We work very hard to avoid false positives because false positives are one of the worst things you could do to any warning system. It just makes people tune them out."
If the aviation industry goes to such lengths to avoid false positives, shouldn’t cloud monitoring adopt the same level of precision?
Monte Carlo's research adds another layer to this. They found that alert fatigue kicks in when a notification channel receives more than 30 to 50 alerts per week. Beyond this point, attention to alerts drops by 30% with each repeated reminder of the same issue. The problem worsens when alerts lack context. For example, a CPU usage spike might seem alarming - unless it coincides with predictable peak activity.
On top of alert overload, there’s a widespread but flawed belief that collecting more data will improve monitoring.
Why More Data Doesn’t Equal Better Monitoring
The idea that more metrics lead to better observability is misleading. A 2023 survey revealed that 63% of organisations deal with over 1,000 cloud infrastructure alerts daily, with 22% facing more than 10,000 alerts every day. Alarmingly, security analysts spend a third of their time investigating false alarms or low-priority threats.
For smaller teams, this deluge of data presents two major challenges. First, it’s costly. Every metric you track, store, and process adds to your expenses. For growing SaaS companies or digital agencies, these costs can spiral out of control without delivering real value. Second, the sheer volume of data makes identifying genuine issues feel like searching for a needle in a haystack.
Monitoring tools come with another limitation: they rely on predefined metrics and logs. This means they can only alert you to problems you’ve anticipated. In practice, this makes monitoring reactive. It tells you what happened but not why it happened - or how to stop it from happening again. As one expert explains:
"Monitoring only tells you what is happening. It does not explain why something is happening or provide visibility into the deeper layers of your cloud architecture to pinpoint root causes."
For teams lacking dedicated operations staff, this is a major hurdle. When an alert goes off, someone has to manually sift through data from siloed systems to figure out the root cause. This process is not only time-consuming but also inefficient.
The solution isn’t to collect less data - it’s to be smarter about which signals you prioritise and how you configure your alerts. In the next section, we’ll look at how to identify the signals that truly matter for your business.
Finding the Right Signals: What Actually Matters
To monitor effectively, you need to focus on the metrics that genuinely reflect what your users experience. For smaller teams, this means concentrating on the data that directly affects user experience, so limited resources go towards what truly matters.
The 4 SRE Golden Signals Explained
Google's Site Reliability Engineering (SRE) team developed four key signals - latency, traffic, errors, and saturation - to offer a clear picture of service performance and resource health. Each is described below, with a short instrumentation sketch after the list.
- Latency is the time your system takes to respond to requests. It’s crucial to separate the latency of successful requests from failed ones. For instance, a rise in error latency might point to database connection issues, while slower successful responses could indicate broader performance problems. Setting a baseline latency, like 200 ms for API calls, helps you monitor deviations.
- Traffic measures the demand on your system, such as HTTP requests per second for a web service or database transactions per second. Recognising traffic patterns ensures you can distinguish between normal load changes and actual issues.
- Errors cover all failed requests. Not all errors are equal - some might be minor, while others could disrupt core functionality. Understanding the impact of different error types is essential.
- Saturation shows how "full" your system is by tracking the usage of its most limited resources. Systems often degrade before reaching full capacity, so setting realistic thresholds - such as 70% CPU usage - can help you avoid major issues.
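To make the four signals concrete, here is a minimal instrumentation sketch using the Python prometheus_client library. The metric names, labels, port and the sampled saturation value are illustrative assumptions, not a prescribed standard.

```python
# Minimal golden-signals sketch with prometheus_client.
# Metric names, labels and the port are illustrative assumptions.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic: requests received", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Latency per request", ["route", "outcome"])
SATURATION = Gauge("cpu_utilisation_ratio", "Saturation: usage of the most constrained resource")

def handle_request(route, do_work):
    """Wrap a handler so it emits traffic, error and latency signals."""
    start = time.perf_counter()
    try:
        result = do_work()
        status, outcome = "200", "success"
        return result
    except Exception:
        status, outcome = "500", "error"
        raise
    finally:
        REQUESTS.labels(route=route, status=status).inc()   # traffic, and errors via the status label
        LATENCY.labels(route=route, outcome=outcome).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)   # exposes /metrics for scraping
    SATURATION.set(0.42)      # in practice, sample this from the host or runtime on a schedule
```

Recording successful and failed latency under separate outcome labels mirrors the point above about keeping the two apart, and the saturation gauge is what a 70% alert threshold would be evaluated against.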
"Fundamentally, it's what happens when you ask a software engineer to design an operations function."
– Ben Treynor, Google's VP of Engineering
These golden signals complement other frameworks like the RED (Rate, Errors, Duration) and USE (Utilisation, Saturation, Errors) methods. While RED focuses on service-level metrics and USE targets infrastructure health, the golden signals cover both perspectives in a single set.
| Method | Signal types collected | Excludes |
| --- | --- | --- |
| Golden Signals | Latency, traffic, errors, saturation | None |
| RED Method | Rate (traffic), errors, duration (latency) | Saturation |
| USE Method | Utilisation (traffic), saturation, errors | Latency (partly measured through utilisation) |
By building on these signals, small teams can customise their monitoring to fit their unique business needs.
Key Metrics for Small Teams
For smaller teams, choosing the right metrics is critical. While the golden signals provide a strong foundation, they need to be adapted to your specific business and user behaviour.
- EdTech companies often face seasonal traffic shifts. For instance, September might bring a surge in users as students return to school, while summer usage could drop. Tracking metrics like time-on-task and user error rates can help distinguish between normal usage patterns and actual performance issues.
- Digital agencies managing client campaigns deal with sudden traffic spikes during campaign launches. Even a slight delay - like a 500 ms lag in ad delivery - can harm campaign results and client satisfaction.
- SaaS companies should monitor metrics that affect user satisfaction and retention. Key indicators include API response times, feature availability, and data processing delays. Downtime can be incredibly costly, with estimates reaching around £4,200 per minute. Prioritising these metrics ensures you address issues that directly impact users and revenue.
Setting performance baselines is just as important. A baseline allows you to spot deviations from normal behaviour, making it easier to identify real problems. Using percentiles instead of averages for alerts can provide a clearer picture of user experience and minimise false alarms.
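As a sketch of the percentile point, the snippet below derives p95/p99 baselines from a week of latency samples and a simple alert level; the 1.5x tolerance factor and the sample values are assumptions you would tune for your own service.

```python
# Sketch: percentile-based latency baseline instead of an average.
# The tolerance multiplier and sample data are illustrative assumptions.
from statistics import quantiles

def latency_baseline(samples_ms, tolerance=1.5):
    """Return p95/p99 of historical samples plus a simple alert level."""
    cuts = quantiles(samples_ms, n=100)        # 99 cut points: cuts[94] ~ p95, cuts[98] ~ p99
    p95, p99 = cuts[94], cuts[98]
    return {"p95_ms": p95, "p99_ms": p99, "alert_above_ms": p95 * tolerance}

history_ms = [120, 130, 145, 150, 160, 180, 210, 250, 300, 900]   # last week's samples
print(latency_baseline(history_ms))   # alert only when the current p95 exceeds alert_above_ms
```

With more history (ideally hundreds of samples per window), the percentiles stabilise and stop being dominated by single outliers, which is exactly why they give a clearer picture of user experience than averages.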
The aim isn’t to track everything - it’s to track what matters most. As Arfan Sharif from CrowdStrike explains:
"Today's operations teams that handle hundreds of cloud services across multiple providers often feel like they are drowning with too much information. When the signal-to-noise ratio is high enough, real warning signs can go unnoticed."
– Arfan Sharif, Product Marketing Lead for the Observability portfolio at CrowdStrike
Next, we'll dive into how to filter these signals to streamline your monitoring process further.
How to Filter Signals and Reduce Noise
After identifying the signals worth monitoring, the next step is refining your alerts to focus on these key signals. For small teams, this isn't just a matter of efficiency - it's essential for survival. When resources are stretched thin, dealing with false alarms wastes precious time and undermines trust in your monitoring system.
The solution? Move beyond static thresholds and basic alerting rules. Instead, aim for intelligent filtering that adapts to your environment. This means using dynamic thresholds, leveraging modern observability tools, and building alerting systems that interpret context rather than just reacting to raw data.
Setting Up Dynamic Thresholds and Smart Alerts
Static thresholds might seem like a straightforward solution, but they often fall short in dynamic cloud environments. For instance, setting a CPU usage alert at 80% may sound reasonable, but it doesn't account for natural fluctuations in your application's behaviour, such as regular traffic spikes.
Dynamic thresholds address this problem by adapting to real-time data patterns. They learn what's typical for your environment and only trigger alerts for significant deviations. For example, instead of flagging every predictable morning traffic spike, dynamic thresholds recognise these patterns as normal and alert you only when something unusual occurs, like an unexpected surge in latency.
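A dynamic threshold does not have to mean machine learning; a rolling, time-of-day-aware baseline already captures the idea. The sketch below is a minimal version using only the standard library, and the window size, minimum history and three-sigma band are assumptions to tune rather than recommendations.

```python
# Sketch: a rolling, time-of-day-aware dynamic threshold (standard library only).
# Window length, minimum history and the 3-sigma band are illustrative assumptions.
from collections import defaultdict, deque
from statistics import mean, pstdev

class DynamicThreshold:
    def __init__(self, window=28):
        # keep a separate rolling window per hour of day, so a predictable
        # 09:00 traffic spike builds its own baseline instead of paging anyone
        self.history = defaultdict(lambda: deque(maxlen=window))

    def is_anomaly(self, hour, value, sigmas=3.0):
        window = self.history[hour]
        anomaly = False
        if len(window) >= 8:                         # wait for some history before judging
            mu, sd = mean(window), pstdev(window)
            anomaly = value > mu + sigmas * max(sd, 1e-9)
        window.append(value)
        return anomaly

detector = DynamicThreshold()
# feed it (hour, metric_value) pairs; it flags values far outside what is
# normal for that hour, not the daily peak itself
```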
Smart monitoring systems take this a step further. They consolidate related alerts into a single, meaningful notification, saving you from drowning in a sea of separate warnings. These systems also prioritise alerts based on their impact. For example, a slight increase in API response time during peak hours might be more critical than a larger issue during quieter periods.
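Consolidation itself needs no special platform features; grouping raw alerts by service and time window and sending one summary per group is enough to start. In the sketch below, the five-minute window and the alert dictionary fields are assumptions for illustration.

```python
# Sketch: consolidate related alerts into one notification per service and window.
# The 5-minute window and the alert fields are illustrative assumptions.
from collections import defaultdict

def consolidate(alerts, window_seconds=300):
    """Group alerts by (service, time bucket) and summarise each group once."""
    groups = defaultdict(list)
    for alert in alerts:
        bucket = int(alert["timestamp"] // window_seconds)
        groups[(alert["service"], bucket)].append(alert)

    notifications = []
    for (service, _bucket), members in groups.items():
        worst = max(members, key=lambda a: a["severity"])
        notifications.append({
            "service": service,
            "severity": worst["severity"],          # surface the group's worst symptom
            "summary": f"{len(members)} related alerts for {service}",
            "symptoms": sorted({a["name"] for a in members}),
        })
    return notifications
```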
For teams with limited resources, here are some practical ways to optimise alerts:
- Adjust threshold sensitivity based on the importance of each metric.
- Add context to alerts with data like infrastructure topology or historical trends.
- Automate routine troubleshooting steps for common issues.
- Establish clear communication channels to streamline incident responses.
Using OpenTelemetry to Filter Signals
The goal is to focus on meaningful data, and OpenTelemetry (OTel) has become a go-to tool for this. Its vendor-neutral framework works with various observability backends, making it ideal for organisations in the UK that need to balance comprehensive monitoring with data sovereignty and GDPR compliance. OTel allows you to retain sensitive data within your infrastructure while still achieving in-depth observability.
The OpenTelemetry Collector is key to managing and filtering signals. Instead of sending every piece of telemetry data to your backend, the collector processes and filters it first, reducing noise and saving costs. For example, it can sample error traces, filter out routine health checks, and enrich metrics with deployment details - helping you link performance changes to recent updates.
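The Collector's pipelines are configured in YAML rather than code, but the same "filter before you pay to store it" idea can be sketched at the SDK level. The example below, assuming the current opentelemetry-sdk Python API for custom samplers, drops routine health-check spans before they are ever exported; the route names and the delegate sampler are assumptions.

```python
# Sketch: drop noisy health-check spans at the SDK, assuming the
# opentelemetry-sdk Python sampler API. Route names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, Decision, Sampler, SamplingResult

class DropHealthChecks(Sampler):
    """Delegate to another sampler, but always drop known health-check spans."""

    def __init__(self, delegate=ALWAYS_ON, noisy_names=("GET /healthz", "GET /readyz")):
        self._delegate = delegate
        self._noisy = set(noisy_names)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        if name in self._noisy:
            return SamplingResult(Decision.DROP)        # never export routine probes
        return self._delegate.should_sample(parent_context, trace_id, name,
                                            kind, attributes, links, trace_state)

    def get_description(self):
        return "DropHealthChecks"

trace.set_tracer_provider(TracerProvider(sampler=DropHealthChecks()))
```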
Tail-based sampling is particularly useful for cost control. Unlike head-based sampling, which makes decisions at the start of a trace, tail-based sampling waits until the trace is complete. This way, you can choose to keep traces that show errors or unexpected latency, ensuring you retain the most relevant data.
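In practice tail-based sampling is configured as a Collector policy rather than written by hand, but the decision logic is worth seeing in miniature: once a trace is complete, keep it if anything errored or the whole trace was slow, otherwise keep only a small random share. The 500 ms budget and 5% keep rate below are assumptions.

```python
# Conceptual sketch of a tail-based sampling decision (normally expressed as
# Collector policy configuration, not application code). Thresholds are assumptions.
import random

def keep_trace(spans, latency_budget_ms=500, baseline_rate=0.05):
    """Decide after the trace completes, so errors and slow requests are never dropped."""
    has_error = any(span.get("status") == "ERROR" for span in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or duration_ms > latency_budget_ms:
        return True                               # always retain interesting traces
    return random.random() < baseline_rate        # keep a small share of healthy traffic
```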
Security is another critical aspect of OTel filtering. Use strong transport security to protect communication between your applications and the collector. Run the collector with minimal privileges, keep it updated with the latest patches, and follow GDPR guidelines to ensure both security and compliance.
To maintain consistency across your telemetry data, use semantic conventions to standardise attribute names and structures. Before deployment, validate your YAML configuration and set up precise pipeline rules, using environment variables for sensitive data. Monitor the collector’s resource usage and scale horizontally if needed. Testing in a controlled environment - using tools like logging exporters and health check extensions - ensures your filtering rules work as intended without accidentally discarding critical information.
Case Study: How a UK SaaS Startup Fixed Alert Chaos
The Problems They Faced
TechFlow, a London-based SaaS company that offers project management tools for creative agencies, found itself drowning in a sea of alerts by early 2023. Their monitoring system was bombarding the team with notifications for every minor CPU spike, fleeting network issue, and routine maintenance task. This avalanche of alerts is a familiar headache for many UK small and medium-sized businesses (SMBs). Industry reports suggest that organisations often deal with thousands of alerts daily, many of which are simply ignored.
The real trouble began when a critical production issue - a database connection leak - caused significant customer disruption. The alert for this major problem got buried under the flood of less urgent notifications. The delay in addressing the issue led to frustrated customers and raised serious questions about the company's operational dependability. It became clear that the problem wasn’t the monitoring system itself but the unchecked chaos of irrelevant signals.
What They Did to Fix It
Instead of jumping to adopt yet another tool, TechFlow’s CTO, Sarah Mitchell, took a different approach: improving the quality of the alerts themselves. The team began by auditing their notifications, categorising them into groups such as false positives, non-actionable alerts, duplicates, and misconfigured thresholds. With this groundwork in place, they implemented dynamic thresholding for key metrics.
The first step was to replace static alerts for CPU and memory usage with dynamic thresholds. This allowed notifications to trigger only when anomalies persisted over time. Using Prometheus, they ensured that alerts would only fire for sustained problems, cutting down on unnecessary noise.
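In Prometheus this persistence requirement is expressed with a `for:` duration on the alerting rule; as a language-neutral sketch of the same idea, the helper below only fires once a condition has held for a full window. The five-minute hold time is an illustrative assumption.

```python
# Sketch: fire only when a condition persists, mirroring a Prometheus `for:` duration.
# The five-minute hold time is an illustrative assumption.
import time

class SustainedAlert:
    def __init__(self, hold_seconds=300):
        self.hold_seconds = hold_seconds
        self.breach_started = None

    def evaluate(self, condition_met, now=None):
        """Return True only once the condition has held for the full window."""
        now = time.time() if now is None else now
        if not condition_met:
            self.breach_started = None            # one healthy sample resets the clock
            return False
        if self.breach_started is None:
            self.breach_started = now             # pending, not yet firing
        return now - self.breach_started >= self.hold_seconds
```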
Next, the team grouped related alerts into single, contextual notifications. For instance, instead of being flooded with multiple alerts for different symptoms of an issue with their payment processing system, they now received one consolidated notification. They also introduced role-based alert routing, ensuring that alerts were directed to the right people. Backend issues went to backend engineers, UI problems were sent to the frontend team, and major production errors still reached the entire team.
To further reduce noise, scheduled maintenance windows were configured with proper alert silencing. Routine tasks that previously triggered false alarms were now muted. A simple yet powerful rule was introduced: every alert had to be actionable and require human intervention. Automated health checks and routine system behaviours were removed from the alerting pipeline altogether.
These changes streamlined their alerting process, paving the way for noticeable operational improvements.
The Results
The results were immediate and impactful. TechFlow significantly reduced alert noise and improved the relevance of notifications. The proportion of actionable alerts skyrocketed, and genuine issues were identified and addressed much faster. Incident response times improved, and the engineering team was able to shift their focus from managing false positives to developing new features and optimising infrastructure.
Additionally, the leaner alerting system made it easier to scale their infrastructure efficiently while enhancing overall system reliability.
"Every page should be actionable."
By concentrating only on actionable alerts, TechFlow’s team transitioned from constantly reacting to problems to managing their systems proactively. Reflecting on the transformation, Sarah Mitchell explained, "We realised we weren't monitoring our system - we were monitoring our monitoring system. Once we focused on signals that actually mattered to our customers, everything became clearer."
TechFlow’s journey highlights an essential truth: effective monitoring isn’t about collecting endless data or using the fanciest tools. It’s about identifying the signals that matter most and filtering out the noise. This shift can transform a company’s operations from reactive chaos to proactive control.
Conclusion: Focus on Signals, Not Just Tools
For SMB teams, the key to effective monitoring lies in identifying meaningful signals rather than chasing the "perfect" monitoring tool. Real-world examples show that focusing on actionable data beats collecting mountains of information every time.
Take cloud infrastructure, for instance. Companies using it report 35% fewer unplanned outages. One organisation, by concentrating on golden signals in its GKE environment, managed to save over €260,000 annually. These examples highlight how prioritising key metrics can lead to tangible results.
"If there was one takeaway from this engagement that Generali took, it was the importance of proactive financial management and optimisation."
- Mohamed Talhaoui, Project Lead
For smaller teams with tighter budgets, this approach is even more critical. With the cloud-native application market projected to grow from around £4.7 billion in 2023 to approximately £13.6 billion by 2028, focusing on the right signals can help SMBs stay competitive without being overwhelmed.
Start by setting clear performance baselines for essential metrics like CPU usage, memory consumption, error rates, and response times. These baselines act as benchmarks to distinguish normal operations from real issues. Pair them with automated alerts for deviations, but make sure every alert is actionable - no one needs unnecessary notifications clogging their workflow.
Cost management is another vital signal. The most successful SMBs treat cost as a core metric, monitoring unit costs alongside traditional indicators to catch inefficiencies before they spiral out of control. This enables you to scale effectively while keeping a close eye on your budget.
Monitoring should grow with your business. Start with the four SRE golden signals - latency, traffic, errors, and saturation - and expand as your team and needs evolve. The ultimate aim? Actionable insights that drive decisions, not an overwhelming flood of data.
Your monitoring strategy should align with your business goals, not burden your team. Focus on what truly matters, cut through the noise, and use these signals to maintain clarity and enable proactive problem-solving.
FAQs
How can small teams cut through alert noise and focus on what really matters?
To manage alert noise and keep attention on what truly matters, start by setting clear thresholds for alerts. This minimises unnecessary noise and ensures only the most meaningful signals reach your team. It's also essential to regularly review and adjust alert configurations to stay in tune with your team's priorities and any changes in your systems.
Another helpful tactic is using automated grouping to consolidate similar notifications, reducing the chance of being overwhelmed. Pair this with severity-based filtering to ensure that critical issues are addressed promptly. Finally, fostering a culture of collaboration and feedback among team members can refine your monitoring practices further. This approach keeps alerts actionable and effective, without unnecessarily interrupting workflows.
What is the difference between static and dynamic thresholds in cloud monitoring, and why are dynamic thresholds often preferred?
Static vs Dynamic Thresholds
Static thresholds are fixed values you set to trigger alerts when a metric goes beyond a specific limit. For instance, you might configure an alert for CPU usage when it exceeds 80%. While this approach can work in some scenarios, it often leads to unnecessary alerts during normal metric fluctuations. Worse, it might fail to catch unusual behaviour that doesn’t breach the predefined limit.
Dynamic thresholds take a more adaptive approach by analysing historical data alongside real-time metrics. This allows them to differentiate between normal variations and genuine anomalies. The result? Fewer false alarms and more accurate alerts. These thresholds are especially helpful in cloud environments, where workloads and performance metrics can vary significantly, ensuring teams focus on actual issues instead of chasing noise.
In environments that are constantly changing - like cloud-native systems - dynamic thresholds tend to outperform static ones. They provide the flexibility and precision needed to keep up with fluctuating demands and detect anomalies effectively.
Why is it better to focus on key signals rather than collecting large amounts of data in cloud monitoring?
Focusing on the right signals in cloud monitoring allows teams to pinpoint and resolve critical issues swiftly without getting lost in a sea of irrelevant data. By zeroing in on key metrics - like latency, traffic, errors, and resource usage - you cut through the noise, making it easier to identify trends and address incidents effectively.
This targeted approach not only enhances system performance and improves user experience but also ensures that monitoring efforts align with broader business goals. Prioritising meaningful data enables teams to use resources wisely, avoid unnecessary complications, and maintain a clear focus on what truly matters in their cloud environments.