Incident Analysis
At 10:46 AM BST, we identified issues with outbound call authentication and navigation within online dashboards. Approximately 10 minutes later, the issue expanded to affect some inbound calls and the ability to log in to softphone applications.
Our Engineering Team quickly identified the cause as a ‘lock’ within an integral database that handles our API layer, triggered by an inefficient query. A database lock prevents new traffic and data from being written, which caused a rapid backlog of processes such as placing and taking calls, presence, and application authentication. Inbound calls were affected later as a result of the delays this backlog caused across the overall network.
At approximately 11:05 AM BST, our team restarted the affected database and traffic resumed within minutes. However, due to the volume of queued messages, it took an additional 25 minutes for them all to be accepted into the database and cleared. As a result, service was gradually restored across all affected areas over this period.
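To illustrate why clearing the queue took longer than the restart itself, the following is a minimal sketch of a simple backlog model, not a description of our actual systems. It assumes messages arrive at a steady rate, nothing is written while the database is locked, and the restarted database accepts messages at a fixed, higher rate. The function name and both rates are hypothetical and chosen only for illustration. The 19-minute stall is the approximate gap between 10:46 AM and 11:05 AM and may not match the true duration of the lock.

```python
def drain_minutes(stall_minutes: float, arrival_rate: float, service_rate: float) -> float:
    """Estimate the minutes needed to clear a backlog once writes resume.

    Rates are in messages per minute. New messages keep arriving while the
    backlog is being cleared, so only the surplus capacity reduces the queue.
    """
    if service_rate <= arrival_rate:
        raise ValueError("the backlog never clears unless service_rate > arrival_rate")
    backlog = arrival_rate * stall_minutes          # messages queued during the stall
    return backlog / (service_rate - arrival_rate)  # backlog divided by net drain rate


# Hypothetical rates: 1,000 messages/minute arriving, 1,760 messages/minute accepted.
print(drain_minutes(stall_minutes=19, arrival_rate=1000, service_rate=1760))  # 25.0
```

Under these assumptions, the recovery time depends only on the length of the stall and the ratio between the two rates, which is why a stall of under 20 minutes can still take a similar amount of time to clear.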
Next Steps
Our team has since addressed this particular query, and all other queries that had been flagged have been stopped until they are optimised in the same way. They will remain stopped until the fixes are implemented, which we expect within the next 5 working days.