How We Utilised Azure To Eliminate Alert Fatigue and Prevent False Alarms
Here at xTEN, we triage incidents into the following categories:
When xTEN receive a P1 alert, the team that supports that customer downs tools and joins an emergency call to resolve the issue. Every P1 is treated the same way: it results in an incident report, followed by a meeting to prevent the P1 from occurring again.
If your team follow a similar process, you may encounter an issue we had: many P1s we received were not genuine emergencies. Alerts were coming through, but often because of planned work, such as server updates, rather than a serious incident that needed our attention. This presented us with a challenge: how can we differentiate between a genuine P1 and a false one?
If you are an in-house team, it may be easier to know instinctively the difference between the two. Hopefully, you know what work your colleagues are doing on which servers. However, as an external team, we rely on excellent customer communication to help us figure this out. We aim to work as though we are our client’s in-house team as much as possible, fitting into existing workflows wherever we can. With this in mind, we came up with a solution that would be transparent to our clients and keep them involved in the process: a client API.
Our new client API explained
We wanted our customers to be part of our monitoring, not excluded from it, so our client API began with the ability to pause and resume monitoring, either instantly or scheduled via our change log functionality. There are several features available when the client outlines a change log request:
- Clients can set a provisional flag which determines how the changelog affects alerting. Non-provisional changes will pause monitoring during the time window for the servers specified, whereas provisional changes maintain monitoring and simply inform the team that a change may be taking place if alerts are raised.
- Servers can be defined as affected or associated. Affected servers will pause monitoring for non-provisional changes. Associated servers are not directly affected by the change but may experience repercussions, so monitoring continues but with associated change information provided within alerts.
- Clients can either set a single date and time window for a change or implement a schedule for recurring changes.
- Clients can set a suppress cluster flag, which allows them to ensure all servers in a cluster with affected or associated servers are treated in the same manner.
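To make the rules in the list above concrete, here is a minimal sketch in Python of how an alert could be checked against a single change. It is an illustration only, not our production code: the `Change` fields, the `decide` function and the three outcome names are hypothetical, recurring schedules are left out for brevity, and giving "affected" precedence when a cluster contains both kinds of server is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Change:
    """Hypothetical shape of one changelog entry."""
    start: datetime
    end: datetime
    provisional: bool = False
    affected: set = field(default_factory=set)
    associated: set = field(default_factory=set)
    suppress_cluster: bool = False


def decide(change, server, when, clusters=None):
    """Return 'suppress', 'annotate' or 'alert' for an alert on `server`.

    `clusters` maps a cluster name to the set of servers in it (assumed input).
    """
    if not (change.start <= when <= change.end):
        return "alert"  # outside the change window: alert as normal

    affected = set(change.affected)
    associated = set(change.associated)

    if change.suppress_cluster:
        # Treat every server in a cluster like its listed member.
        # Assumption: "affected" wins if a cluster contains both kinds.
        for members in (clusters or {}).values():
            if members & affected:
                affected |= members
            elif members & associated:
                associated |= members

    if server in affected:
        # Non-provisional changes pause monitoring; provisional ones only inform.
        return "annotate" if change.provisional else "suppress"
    if server in associated:
        return "annotate"  # monitoring continues, with change details attached
    return "alert"
```

Here "annotate" stands for the case where monitoring continues and the change information is attached to any alert that fires.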
We supply our customers with full documentation, complete with scripts. Custom error messages help clients refine their requests, so adding changes is simple and informative, and clients can build their own automated process for adding changes if they wish.
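As a rough idea of what request validation with custom error messages might look like, the sketch below checks a change request and returns every problem it finds in plain language. The field names (`start`, `end`, `affected_servers`, `provisional`) and the wording of the messages are assumptions made for this example, not the actual API contract.

```python
from datetime import datetime


def validate_change(payload):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    window = {}
    for key in ("start", "end"):
        raw = payload.get(key)
        try:
            window[key] = datetime.fromisoformat(raw)
        except (TypeError, ValueError):
            errors.append(f"'{key}' must be an ISO 8601 date-time, got {raw!r}.")

    if len(window) == 2:
        try:
            if window["end"] <= window["start"]:
                errors.append("'end' must be later than 'start'.")
        except TypeError:
            # Raised when one value carries a UTC offset and the other does not.
            errors.append("'start' and 'end' must both include a UTC offset, or both omit it.")

    servers = payload.get("affected_servers")
    if not isinstance(servers, list) or not servers:
        errors.append("'affected_servers' must be a non-empty list of server names.")
    if not isinstance(payload.get("provisional", False), bool):
        errors.append("'provisional' must be true or false.")
    return errors
```

Returning all errors at once, rather than stopping at the first, lets a client fix a request in a single pass.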
Making the most of Microsoft Azure
Our API follows a microservices architecture, using serverless resources within Microsoft Azure. We know we will want to continually expand the functionality of the API, so we opted for a combination of Azure API Management & Azure Functions, with an Azure SQL Database for storage.
Azure API Management ensures that as the backend functionality grows and develops, the client endpoint stays the same. This also allows us to continue to support older API versions if required. Each area of functionality within the API is, and will continue to be, governed by a corresponding Azure Function. This gives us the benefits of modularisation: decoupling different areas of functionality lets us improve resilience, scale only where necessary and isolate issues faster.
API Management also allows us to define and restrict access through subscription keys. If we only want to expose certain parts of the API for different tiers of service, we can easily configure the keys to do this within the Azure Portal.
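From the client side, a subscription key is normally sent as a request header. The sketch below builds (but does not send) such a request using only the Python standard library. `Ocp-Apim-Subscription-Key` is the default header name Azure API Management reads the key from; the base URL, the `/changelog` path and the payload are placeholders invented for this example.

```python
import json
import urllib.request


def build_change_request(base_url, subscription_key, change):
    """Build (but do not send) a POST request carrying an APIM subscription key."""
    return urllib.request.Request(
        url=f"{base_url}/changelog",  # hypothetical path
        data=json.dumps(change).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Default header name Azure API Management reads the key from.
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
    )


req = build_change_request(
    "https://example.invalid/api",  # placeholder host
    "demo-key",
    {"provisional": False},
)
```

Passing `req` to `urllib.request.urlopen` would send it; a missing or invalid key is rejected by API Management before the request reaches any backend function.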
After the client request has passed all validation checks, the entry is added to the client changelog table within the Azure SQL Database. The client receives a response with a 200 HTTP status code and a JSON body outlining the details of the changelog entry, and we receive a notification in our internal client Slack channel detailing the change and who added it.
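The flow after validation can be sketched roughly as follows. This is a simplified stand-in rather than our actual function code: an in-memory SQLite database takes the place of the Azure SQL Database, the table and column names are invented, and the Slack notification is reduced to building the message text rather than posting it.

```python
import json
import sqlite3


def record_change(conn, change, added_by):
    """Store a validated change, then build the API response and Slack text."""
    cur = conn.execute(
        "INSERT INTO client_changelog (description, start_time, end_time, added_by) "
        "VALUES (?, ?, ?, ?)",
        (change["description"], change["start"], change["end"], added_by),
    )
    conn.commit()
    # Echo the stored entry back to the client, including its new id.
    body = {"id": cur.lastrowid, "added_by": added_by, **change}
    slack_text = f"Change #{cur.lastrowid} added by {added_by}: {change['description']}"
    return 200, json.dumps(body), slack_text


conn = sqlite3.connect(":memory:")  # stand-in for the Azure SQL Database
conn.execute(
    "CREATE TABLE client_changelog ("
    "id INTEGER PRIMARY KEY, description TEXT, "
    "start_time TEXT, end_time TEXT, added_by TEXT)"
)
status, body, slack_text = record_change(
    conn,
    {
        "description": "Monthly patching",
        "start": "2024-01-01T22:00:00",
        "end": "2024-01-01T23:00:00",
    },
    added_by="client-ops",
)
```

Using parameterised queries, as above, keeps client-supplied text out of the SQL statement itself.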
The client API is already making an impact: false alarms have decreased, client communication has improved and we have greater visibility of recent changes in our clients' estates.
Now we have established our microservices solution, we can continually expand the functionality to improve our process further. We’re starting to roll out an xTEN Server Agent, which will integrate with our API and current monitoring solutions to provide custom insights into each client’s estate. This combined with the foundations of our xTEN Client portal means there’s going to be a lot of exciting updates ahead!
This solution was imperative for us in avoiding unnecessary alerts to the team, so it’s certainly worth considering your own options for reducing the noise and protecting yourself and your team from alert fatigue.