Degraded Performance Across Calls, Apps and Dashboards

Incident Report for Nebula Limited

Postmortem

What Happened

At approximately 11:20 am BST on 9th July, we identified a sudden and unprecedented surge in traffic through our network, which resulted in a build-up of messages entering the platform. This caused a rapid slowdown in the elements of the platform that rely on this messaging infrastructure: namely call authentication, softphone app status and chat functions, and dashboard navigation.

Impact

While we have rigorous monitoring and load balancing protections in place, the rate at which this surge occurred meant that our usual redundancy measures were unable to redirect all of the offending traffic. This ultimately led to a degradation of service across a significant number of accounts.

Our team worked throughout the day both to mitigate the impact on our end users and to investigate and identify the root cause. We systematically suspended various elements of our network while we investigated, restarting affected servers at each stage, which cleared the backlog instantly; however, shortly after each restart the traffic returned to unprecedented levels via different channels. Unfortunately, this produced a number of ‘false positives’ on the root cause, leading our team to prematurely confirm a resolution via our status page.

We ultimately identified the root cause as a specific customer’s misconfiguration of an ancillary service to our main product line. Unlike the protections within our own software, those around a complex integration with this third-party service rely in part on safeguards implemented by the other party; in this case, we found there was insufficient rate limiting on their side, which allowed an inflated volume of traffic into the Nebula platform. Because this traffic arrived from various origins, it was harder for our team to quickly identify the root cause.
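For readers who want a concrete picture of the class of safeguard involved, the sketch below shows a minimal token-bucket rate limiter in Python. It is illustrative only; the class name, limits, and handler are hypothetical and do not reflect Nebula’s actual implementation or the third party’s API.

```python
import threading
import time

class TokenBucket:
    """Minimal token-bucket rate limiter for inbound integration traffic."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens refilled per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if one message may enter, False if it should be rejected."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Hypothetical per-integration limit: 500 messages/sec with a burst of 1,000.
limiter = TokenBucket(rate_per_sec=500, burst=1000)

def handle_inbound_message(message):
    if not limiter.allow():
        # Reject (or divert to a dead-letter channel) rather than letting a
        # backlog build up inside the core platform.
        raise RuntimeError("rate limit exceeded for this integration")
    # ...normal processing...
```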

Resolution

After identifying the origin of the issue, we immediately suspended all services surrounding the account and integration. Once incoming traffic returned to normal levels, the backlog cleared and normal service resumed within minutes.

We then worked with the service provider to establish the cause on their end, and implemented further checks within our own platform to prevent a recurrence.
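As an illustration of the kind of check described above (a sketch under our own assumptions; the thresholds, names, and behaviour here are hypothetical, not Nebula’s), a per-integration circuit breaker can trip automatically when inbound traffic exceeds a limit over a sliding window, rather than relying on manual suspension:

```python
import time
from collections import deque

class IntegrationCircuitBreaker:
    """Automatically suspends an integration whose inbound traffic
    exceeds a threshold over a sliding time window."""

    def __init__(self, max_messages: int, window_seconds: float):
        self.max_messages = max_messages
        self.window = window_seconds
        self.timestamps = deque()
        self.suspended = False

    def record_message(self) -> bool:
        """Record one inbound message; return False if the integration
        is (now) suspended and the message should be dropped."""
        if self.suspended:
            return False
        now = time.monotonic()
        self.timestamps.append(now)
        # Discard timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_messages:
            # Trip the breaker; an operator would review before re-enabling.
            self.suspended = True
            return False
        return True
```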

Next Steps

We have since begun a comprehensive review of all aspects of the connecting infrastructure, and will be implementing further improvements to allow us to better monitor and identify any similar issues in future. This includes a broader review of our complete technology stack to ensure we are sufficiently protected against other cases of a similar nature.
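One monitoring improvement this suggests (again a hedged sketch; the signal and thresholds below are our own illustration, not Nebula’s published tooling) is alerting on message-backlog depth and growth rate, since the earliest symptom of this incident was a build-up of queued messages rather than outright errors:

```python
# Illustrative alerting rule: flag a runaway surge when any component's
# backlog is deep, or growing faster than it drains. Thresholds are made up.
BACKLOG_DEPTH_LIMIT = 10_000      # messages currently queued
BACKLOG_GROWTH_LIMIT = 500.0      # net messages/second, sustained

def should_page(depth: int, growth_per_sec: float) -> bool:
    """Return True if the backlog looks like a runaway surge rather than
    ordinary bursty load, so on-call staff are paged before service degrades."""
    return depth > BACKLOG_DEPTH_LIMIT or growth_per_sec > BACKLOG_GROWTH_LIMIT
```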

We would like to sincerely apologise for the disruption caused, and thank you for your patience and understanding.

Posted Jul 11, 2025 - 10:23 BST

Resolved

This is confirmed as resolved, and a Post Mortem will be provided as soon as possible via this status page.
Posted Jul 10, 2025 - 09:03 BST

Monitoring

A fix has been implemented and we are monitoring the situation.
Posted Jul 09, 2025 - 14:17 BST

Identified

We are seeing degraded performance in some areas and are working on a fix.
Posted Jul 09, 2025 - 13:56 BST

Monitoring

Normal service has resumed again and our team continue to closely monitor the situation.
Posted Jul 09, 2025 - 13:07 BST

Identified

We're investigating further reports of similar issues and will provide a further update within 30 minutes.
Posted Jul 09, 2025 - 12:45 BST

Monitoring

Service has been restored in most areas and continues to improve in the remaining edge cases. Our team are still monitoring the incident and will mark this as resolved after a sustained period of stability. More information will be provided later via a Post Mortem on this status page.
Posted Jul 09, 2025 - 11:51 BST

Identified

The affected traffic is being rerouted and we're beginning to see performance in those areas improve. We are continuing to monitor and will provide a further update in the next 30 minutes.
Posted Jul 09, 2025 - 11:40 BST

Investigating

Our NOC are investigating reports of degraded performance across various elements of our platform, including calls, softphone apps and loading speeds for our online dashboards. We will provide a further update in 30 minutes.
Posted Jul 09, 2025 - 11:24 BST
This incident affected: Core Network, Dashboard, Mobile Applications, and Desktop Applications.