Delays and disrupted load times on dashboards and softphone applications

Incident Report for Nebula Limited

Postmortem

What Happened

At approximately 9:30 am BST on 9th June, we began receiving reports of dashboards being slow to load, alongside degraded performance on softphone applications and wallboards. We also observed that certain backend processes were delayed. Desk phones and live calls were not affected at any point during the incident.

Upon investigation, our engineering team identified that our primary database was operating at maximum capacity. This was not caused by a sudden traffic spike, but by a gradual increase over time in the volume of frequently accessed "hot data". Once the hot data outgrew the available capacity, the database had to swap data in and out, and this resource-intensive swapping slowed the entire system.
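As a rough illustration only (not a description of the actual system), a toy model shows why gradual growth can produce an abrupt slowdown. It assumes accesses are spread evenly across the hot data, and the function name, the 100 GB memory size and the latency figures (0.1 ms for memory, 10 ms for disk) are all hypothetical values chosen for the example.

```python
def average_access_ms(hot_gb, memory_gb, mem_ms=0.1, disk_ms=10.0):
    """Estimate mean access latency for a given hot-data size.

    Toy model: accesses are uniform over the hot data, so the share
    served from memory is memory_gb / hot_gb, capped at 100%.
    All figures are illustrative assumptions, not measurements.
    """
    if hot_gb <= 0:
        raise ValueError("hot_gb must be positive")
    hit_ratio = min(1.0, memory_gb / hot_gb)
    # Weighted average of fast (memory) and slow (disk) accesses.
    return hit_ratio * mem_ms + (1.0 - hit_ratio) * disk_ms


if __name__ == "__main__":
    # Hypothetical 100 GB of memory; hot data grows past it.
    for hot_gb in (80, 95, 100, 105, 110, 120):
        latency = average_access_ms(hot_gb, memory_gb=100)
        print(f"{hot_gb:>4} GB hot data -> {latency:.2f} ms average access")
```

In this model, latency stays flat while the hot data fits in memory, then rises roughly tenfold when the hot data grows from 100 GB to 110 GB.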

Impact

The direct impact on customers was a significant slowdown across the platform. To protect the core service, we temporarily paused several non-essential backend processes, such as advanced wallboards.

Although we have extensive monitoring in place, once the threshold was crossed the volume of delayed processes caused an immediate and significant impact before automated systems could fully mitigate it.

Resolution

Our Engineering and DevOps teams implemented two significant updates to resolve the issue and permanently improve the platform’s overall performance.

First, we deployed a software fix that overhauled our API infrastructure, optimising and reducing the queries sent to the database. This was completed by 5:00 pm on 9th June, and normal service resumed after a brief period of clearing the backlog. Second, we performed a zero-downtime infrastructure upgrade overnight on 10th June, significantly increasing server memory and processing power.

Next Steps

We are confident these changes provide a permanent and significant improvement to platform resilience. To ensure this issue does not reoccur, we are taking the following steps:

  1. Enhanced Monitoring: We have implemented more specific monitoring, based on the characteristics of this incident, to help us identify and prevent similar issues far earlier.
  2. Proactive Optimisation: Our engineering team is conducting a comprehensive review of our codebase to identify and improve other potentially inefficient processes.
  3. Capacity Planning: We are refining our forecasting to ensure our infrastructure always scales well ahead of demand.
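The monitoring and capacity planning steps above can be sketched as a simple trend check. This is a minimal sketch under stated assumptions, not the forecasting actually in use: it fits a straight line to daily hot-data measurements and estimates how many days remain before a limit is reached. The function names, the sample figures and the 30-day lead time are hypothetical, and real growth is rarely perfectly linear.

```python
def days_until_limit(daily_usage_gb, limit_gb):
    """Estimate days until usage reaches limit_gb using a least-squares line.

    daily_usage_gb: one measurement per day, oldest first.
    Returns None if usage is flat or falling, 0.0 if already at the limit.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    slope = sxy / sxx  # fitted growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    current = intercept + slope * (n - 1)  # fitted value for the latest day
    return max(0.0, (limit_gb - current) / slope)


def should_alert(daily_usage_gb, limit_gb, lead_days=30):
    """Return True if the limit is forecast to be reached within lead_days."""
    days = days_until_limit(daily_usage_gb, limit_gb)
    return days is not None and days <= lead_days


if __name__ == "__main__":
    samples = [70, 71, 72, 73, 74]  # hypothetical daily hot-data sizes in GB
    print(days_until_limit(samples, limit_gb=100))
    print(should_alert(samples, limit_gb=100))
```

Alerting on the forecast rather than on current usage gives a warning while there is still headroom, which is the point of scaling ahead of demand.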

Thank you for your patience and understanding while we worked to resolve this. We apologise for the disruption caused.

Posted Jun 16, 2025 - 15:16 BST

Resolved

This incident has been resolved.
Posted Jun 12, 2025 - 10:18 BST

Monitoring

We've implemented a fix and are seeing services restoring to normal. We will continue to monitor this incident over the next 24 hours and provide further updates if necessary.
Posted Jun 09, 2025 - 18:04 BST

Update

Our team are in the process of diverting the affected traffic to additional resources, and will be monitoring for further improvements.
Posted Jun 09, 2025 - 16:02 BST

Update

We are still investigating this issue and will provide a further update within 1 hour.
Posted Jun 09, 2025 - 15:03 BST

Update

Our engineers are still investigating the issue, and will provide a further update within 1 hour.
Posted Jun 09, 2025 - 13:40 BST

Update

We are continuing to work on a fix for this issue.
Posted Jun 09, 2025 - 12:33 BST

Update

Our team continue to investigate the issue. We are hearing reports that users are being logged out of the softphone applications and then experiencing issues logging back in. We are treating this as an urgent priority and will provide a further update in the next hour.
Posted Jun 09, 2025 - 11:48 BST

Identified

Our engineering team are investigating reports of slow load and login times on our online dashboards and desktop softphone applications. Calls and physical handsets are not currently affected. Customers may also experience delays in webhooks and voicemail notifications coming through. We will provide another update within 1 hour.
Posted Jun 09, 2025 - 10:55 BST
This incident affected: Core Network, Dashboard, Mobile Applications, and Desktop Applications.