Major Incident - Funnelback SaaS UK
Incident Report for Squiz
Postmortem

Summary

During routine monitoring, Squiz identified operational issues with multiple Funnelback servers, leading to search function disruptions for several customers.

Customer impact

A subset of UK customers may have experienced slow search results and HTTP 500 errors when using the search function.

Issue and Resolution

Squiz engineers were alerted to errors and timeouts originating from our Funnelback services.

Subsequent investigation revealed that the search session functionality within Funnelback was causing slow or erroneous requests, leading to a build-up of requests within the search query processing pipeline. Requests using the search session feature were subject to slow response times or termination. This in turn degraded overall performance and caused searches to time out.

In response, we isolated the specific searches and their connection behaviour to our session feature and, where needed, temporarily disabled the feature to allow the query processing pipeline to recover. As a precautionary measure, we also increased the resources allocated to the query processing pipeline.
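The general idea of pausing an optional feature so the rest of the pipeline can recover can be sketched as follows. This is only an illustration under stated assumptions, not Squiz's or Funnelback's implementation: `SessionFeatureGate`, `handle_search`, the latency threshold, and the window size are all hypothetical names and values chosen for the example.

```python
from collections import deque


class SessionFeatureGate:
    """Hypothetical switch that pauses an optional feature when it gets slow."""

    def __init__(self, threshold_s: float, window: int = 5) -> None:
        self.threshold_s = threshold_s          # assumed latency budget, in seconds
        self.samples = deque(maxlen=window)     # most recent feature latencies
        self.enabled = True

    def record(self, latency_s: float) -> None:
        """Record one latency sample; disable the feature if the window mean is too high."""
        self.samples.append(latency_s)
        window_full = len(self.samples) == self.samples.maxlen
        mean = sum(self.samples) / len(self.samples)
        if window_full and mean > self.threshold_s:
            self.enabled = False

    def resume(self) -> None:
        """Re-enable the feature manually once the pipeline has recovered."""
        self.samples.clear()
        self.enabled = True


def handle_search(query: str, gate: SessionFeatureGate) -> dict:
    """Serve a search, attaching session data only while the gate is open."""
    result = {"query": query, "results": []}    # placeholder for a real search call
    result["session"] = "attached" if gate.enabled else "skipped"
    return result
```

In this sketch, searches continue to be served while the gate is closed; only the optional session step is skipped, and an operator calls `resume()` once response times are back to normal.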

As part of our standard process, we initiated a period of heightened monitoring, leading to resolution on April 25th at 13:00 BST.
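The resolution time is given in BST, while the status updates in this report are stamped in AEST. A small conversion sketch, assuming fixed offsets of UTC+1 for BST and UTC+10 for AEST (the helper name `bst_to_aest` is illustrative only):

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for this date: BST = UTC+1, AEST = UTC+10.
BST = timezone(timedelta(hours=1), "BST")
AEST = timezone(timedelta(hours=10), "AEST")


def bst_to_aest(moment: datetime) -> datetime:
    """Convert a timezone-aware BST datetime to AEST."""
    return moment.astimezone(AEST)


resolved_bst = datetime(2024, 4, 25, 13, 0, tzinfo=BST)
resolved_aest = bst_to_aest(resolved_bst)
print(resolved_aest.strftime("%H:%M %Z"))
```

Under these assumptions, 13:00 BST corresponds to 22:00 AEST on the same date.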

Mitigation

In light of this incident, Squiz support staff conducted a thorough review of our UK Funnelback systems to preempt future disruptions. As part of this work, memory resources were expanded. In addition, we have improved our processes so that similar incidents can be resolved more quickly in the future.

Posted May 01, 2024 - 03:08 AEST

Resolved
We are pleased to confirm that the previously reported issue affecting the performance of our Funnelback system has been successfully resolved.
Our team closely monitored the situation and was able to apply a fix, which led to significant improvements in performance.
We will continue to keep a watchful eye on the system to ensure optimal performance and stability. We appreciate your patience and understanding during this time and apologise for any inconvenience caused.

A post mortem will be made available on https://status.squiz.cloud/ in the coming days.
Posted Apr 25, 2024 - 22:39 AEST
Update
We are now seeing systems recovering and are continuing to monitor in case of further issues.

A post mortem for this incident will also be made available on https://status.squiz.cloud/ in the coming days.
Posted Apr 25, 2024 - 22:18 AEST
Update
We are now seeing an improvement in performance and are continuing to investigate the root cause of this issue.
Posted Apr 25, 2024 - 22:09 AEST
Update
Our engineers continue to address the issue and apply fixes where possible to improve service performance.
Posted Apr 25, 2024 - 21:55 AEST
Update
We are continuing to work on an effective fix for this issue.
Posted Apr 25, 2024 - 21:34 AEST
Update
The problem is receiving our full attention as our engineers work to improve service performance through applied fixes.
Posted Apr 25, 2024 - 21:13 AEST
Update
Our engineering team is actively engaged in resolving the issue and applying fixes to boost service performance.
Posted Apr 25, 2024 - 20:48 AEST
Update
The problem is receiving our full attention as our engineers continue to work on improving service performance through applied fixes.
Posted Apr 25, 2024 - 20:28 AEST
Update
We are continuing to work on an effective fix for this issue.
Posted Apr 25, 2024 - 20:04 AEST
Update
Our engineers continue to address the issue and apply fixes where possible to improve service performance.
Posted Apr 25, 2024 - 19:42 AEST
Update
Our fix resulted in initial performance improvements; however, we are still seeing degraded performance.
Our teams continue to work on a full resolution.
Posted Apr 25, 2024 - 19:20 AEST
Monitoring
Our engineers have identified the likely cause of the incident and have implemented an appropriate fix.
We are now testing and monitoring search performance, and have started to see signs of recovery.
Posted Apr 25, 2024 - 19:08 AEST
Update
We are currently experiencing a degradation impacting Funnelback and continue to investigate the situation.
We have declared a major incident and have multiple teams engaged in resolution.
Posted Apr 25, 2024 - 18:45 AEST
Update
We are continuing to investigate this issue.
Posted Apr 25, 2024 - 18:11 AEST
Investigating
Squiz monitoring has detected a degradation of service in our Funnelback pods.

Squiz is working hard to investigate the root cause of the issue and will provide further updates via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted Apr 25, 2024 - 17:59 AEST
This incident affected: Squiz SaaS Hosted Instances and Squiz Funnelback Hosted Instances.