At approximately 11:20 am BST on 9th July, we identified a sudden and unprecedented surge in traffic through our network, which resulted in a build-up of messages entering the platform. This backlog caused a rapid slowdown of the parts of the platform that depend on message processing, namely call authentication, softphone app status and chat functions, and dashboard navigation.
While we have rigorous monitoring and load-balancing protections in place, the speed at which the surge developed meant that our usual redundancy measures could not redirect all of the offending traffic. This ultimately led to a degradation of service across a significant number of accounts.
Our team worked throughout the day both to mitigate the impact on our end users and to investigate and identify the root cause. We systematically suspended various elements of our network while we investigated, restarting the affected servers at each step, which cleared the backlog instantly. However, shortly after each restart the traffic returned to unprecedented levels via different channels. Unfortunately, this led to a number of ‘false positives’ on the root cause, and our team incorrectly and prematurely confirmed a resolution via our status page.
We ultimately identified the root cause as a specific customer’s misconfiguration of an ancillary service to our main product line. Unlike our own software, the protections around a complex integration with this third-party service rely in part on safeguards implemented by the other party; in this case, we found there was insufficient rate limiting on their side, which allowed an inflated volume of traffic into the Nebula platform. Because this traffic arrived from various origins, it was harder for our team to quickly isolate the source.
Resolution
After identifying the origin of the issue, we immediately suspended all services surrounding the account and integration. Once incoming traffic returned to normal levels, the backlog cleared and normal service resumed within minutes.
We then worked with the service provider to establish the cause on their end, and implemented further checks within our own platform to prevent a recurrence.
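To illustrate the kind of platform-side safeguard described above (this is a minimal sketch, not our actual implementation; the integration name, limits and function names are hypothetical), a per-integration token-bucket rate limiter admits traffic only as fast as the platform is prepared to accept it, and rejects the excess rather than queueing it:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a simple token bucket: tokens refill at `rate` per second
// up to `capacity`, and each admitted message consumes one token.
type bucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64 // tokens added per second
	last     time.Time
}

func newBucket(ratePerSec, capacity float64) *bucket {
	return &bucket{tokens: capacity, capacity: capacity, rate: ratePerSec, last: time.Now()}
}

// allow reports whether one more message from this integration may enter
// the platform right now; excess traffic is rejected instead of queued.
func (b *bucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Hypothetical limit: 5 messages per second with a burst of 10
	// for a third-party integration identified here as "crm-sync".
	limits := map[string]*bucket{
		"crm-sync": newBucket(5, 10),
	}

	admitted, rejected := 0, 0
	for i := 0; i < 100; i++ { // simulate a burst of 100 inbound messages
		if limits["crm-sync"].allow() {
			admitted++
		} else {
			rejected++
		}
	}
	fmt.Printf("admitted=%d rejected=%d\n", admitted, rejected)
}
```

With a check of this kind applied per integration, a misconfigured third-party service can only affect its own traffic rather than the platform as a whole.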
Next Steps
We have since begun a comprehensive review of all aspects of the connecting infrastructure, and will be implementing further improvements so that we can monitor for and identify similar issues more quickly in future. This work also extends to a wider review of our complete technology stack to ensure we are sufficiently protected against other cases of a similar nature.
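As a simple illustration of the monitoring improvements described above (a sketch only; the metric source, thresholds and function names are hypothetical rather than our production configuration), one such check watches the message backlog and raises an alert when it exceeds a limit or grows across consecutive checks:

```go
package main

import (
	"fmt"
	"time"
)

var start = time.Now()

// queueDepth would normally query the message broker or metrics system;
// here it is a stub returning a simulated, steadily growing backlog.
func queueDepth() int {
	return int(time.Since(start).Seconds()) * 500
}

// watchBacklog polls the backlog every interval and raises an alert when
// the depth exceeds maxDepth or grows for growthLimit consecutive polls.
func watchBacklog(interval time.Duration, maxDepth, growthLimit int, alert func(string)) {
	prev := queueDepth()
	growing := 0
	for range time.Tick(interval) {
		depth := queueDepth()
		switch {
		case depth > maxDepth:
			alert(fmt.Sprintf("backlog depth %d exceeds limit %d", depth, maxDepth))
		case depth > prev:
			growing++
			if growing >= growthLimit {
				alert(fmt.Sprintf("backlog has grown for %d consecutive checks (now %d)", growing, depth))
			}
		default:
			growing = 0
		}
		prev = depth
	}
}

func main() {
	// Hypothetical thresholds: alert above 2,000 queued messages,
	// or after 3 consecutive checks in which the backlog only grows.
	watchBacklog(1*time.Second, 2000, 3, func(msg string) {
		fmt.Println("ALERT:", msg)
	})
}
```

An alert on backlog growth, rather than on traffic volume alone, surfaces this class of issue before downstream functions such as call authentication begin to slow down.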
We would like to sincerely apologise for the disruption caused, and thank you for your patience and understanding.