At approximately 9:30 am BST on 9th June, we began receiving reports of dashboards being slow to load, alongside degraded performance on softphone applications and wallboards. We also observed that certain backend processes were delayed. Desk phones and live calls were not affected at any point during the incident.
Upon investigation, our engineering team identified that our primary database was operating at maximum capacity. This was not caused by a sudden traffic spike, but rather by a gradual increase in the volume of frequently accessed "hot data" over time. Once that hot data grew beyond what the server's memory could hold, the database had to fall back on resource-intensive swapping to disk, which created a system-wide slowdown.
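The report does not name the database engine, so the following Python sketch only illustrates the general mechanism: while the working set of hot data fits in memory, reads stay fast, but once it outgrows the available cache, recurring misses fall back to far slower storage. The cache size, read cost and record counts are invented for illustration, not measurements from the incident.

```python
import random
import time
from functools import lru_cache

DISK_READ_SECONDS = 0.001  # hypothetical cost of a read that misses memory

@lru_cache(maxsize=500)    # stands in for the memory available for hot data
def read_record(record_id: int) -> str:
    # A cache miss falls back to a (much slower) read from disk.
    time.sleep(DISK_READ_SECONDS)
    return f"record-{record_id}"

def timed_reads(hot_set_size: int, requests: int = 1000) -> float:
    """Warm the cache, then time a burst of reads drawn from the hot set."""
    read_record.cache_clear()
    for record_id in range(hot_set_size):
        read_record(record_id)
    start = time.perf_counter()
    for _ in range(requests):
        read_record(random.randrange(hot_set_size))
    return time.perf_counter() - start

# While the hot set fits in memory, reads stay fast; once it outgrows the
# cache, recurring misses make the same burst of reads dramatically slower.
print(f"hot set fits in memory:   {timed_reads(hot_set_size=400):.3f}s")
print(f"hot set exceeds memory:   {timed_reads(hot_set_size=2500):.3f}s")
```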
The direct impact on customers was a significant slowdown across the platform. To protect the core service, we temporarily paused several non-essential backend processes, such as advanced wallboards.
While we have extensive monitoring in place, once the threshold was crossed the backlog of delayed processes built up quickly, creating a significant impact before our automated systems could fully mitigate it.
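As an illustration of the kind of earlier warning this points to, here is a minimal host-level check that alerts when memory or swap usage crosses a threshold. It assumes the `psutil` library (`pip install psutil`); the report does not say what monitoring stack the team actually uses, and the thresholds are hypothetical.

```python
import time

import psutil  # assumed dependency; not named in the report

MEMORY_ALERT_PERCENT = 85  # hypothetical thresholds chosen for illustration
SWAP_ALERT_PERCENT = 5

def memory_pressure_warnings() -> list[str]:
    """Return warnings when RAM or swap usage crosses the alert thresholds."""
    warnings = []
    memory = psutil.virtual_memory()
    swap = psutil.swap_memory()
    if memory.percent >= MEMORY_ALERT_PERCENT:
        warnings.append(f"RAM usage at {memory.percent:.0f}% (threshold {MEMORY_ALERT_PERCENT}%)")
    if swap.percent >= SWAP_ALERT_PERCENT:
        warnings.append(f"swap in use at {swap.percent:.0f}% -- hot data may no longer fit in memory")
    return warnings

if __name__ == "__main__":
    while True:
        for warning in memory_pressure_warnings():
            print(f"ALERT: {warning}")  # in practice this would page the on-call engineer
        time.sleep(60)
```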
Resolution
Our Engineering and DevOps teams implemented two significant updates to resolve the issue and permanently improve the platform's overall performance.
First, we deployed a software fix that overhauled our API infrastructure, optimising queries and reducing the number sent to the database. This was completed by 5:00 pm on 9th June, and normal service resumed after a brief period while the backlog cleared. Second, we performed a zero-downtime infrastructure upgrade overnight on 10th June, significantly increasing the database server's memory and processing power.
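The report only says queries were optimised and reduced; one common way to cut database load from dashboards and wallboards is a short-lived cache in front of hot read paths, sketched below. Everything here is illustrative: `ttl_cache`, `load_wallboard_stats` and the five-second TTL are invented names and values, not details of the actual fix.

```python
import time
from typing import Any, Callable

def ttl_cache(ttl_seconds: float) -> Callable[[Callable[[], Any]], Callable[[], Any]]:
    """Serve repeated calls from memory for ttl_seconds instead of re-querying."""
    def decorator(load: Callable[[], Any]) -> Callable[[], Any]:
        state: dict[str, Any] = {}
        def wrapper() -> Any:
            now = time.monotonic()
            if "value" not in state or now - state["fetched_at"] > ttl_seconds:
                state["value"] = load()       # only now does a real query run
                state["fetched_at"] = now
            return state["value"]
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=5.0)
def load_wallboard_stats() -> dict:
    # Hypothetical stand-in for the expensive database query behind a wallboard.
    print("querying database...")
    return {"calls_waiting": 3, "agents_available": 12}

# Ten dashboard refreshes inside the TTL window trigger only one database query.
for _ in range(10):
    load_wallboard_stats()
```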
Next Steps
We are confident these changes provide a permanent and significant improvement to platform resilience. To ensure this issue does not reoccur, we are taking the following steps:
We thank you for your patience and understanding as we worked to resolve this and apologise for the disruption caused.