Major Incident - Funnelback SaaS UK
Incident Report for Squiz
Postmortem

Summary

During routine monitoring, Squiz identified operational issues with multiple Funnelback servers, leading to search function disruptions and latency for several customers.

Customer impact

A subset of UK customers may have experienced delayed search results and encountered 500 errors when using the Funnelback search function.

Issue and Resolution

Squiz engineers were alerted to errors and timeouts originating from our Squiz-hosted Funnelback services.

Investigation by Squiz engineers revealed that certain filter configurations were causing background optimisation calculations to take an excessive amount of time to compute.
These calculations held a large share of compute resources for extended periods, reducing the resources available to main search queries.

In response, we reduced the time limits and the amount of computing power these calculations are allowed to use, so that they no longer compete with search queries for resources.
These restrictions apply only to optional background calculations, and the limits put in place will not disrupt search traffic.
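
The general idea of capping optional background work can be sketched as follows. This is a minimal illustration in Python, not Funnelback's implementation: the names `run_optional`, `search`, `BACKGROUND_WORKERS` and `BACKGROUND_TIMEOUT_S`, and the limit values themselves, are assumptions chosen for the example. Optional calculations run in a small dedicated pool, and the caller stops waiting for them once a time budget is exceeded.

```python
import concurrent.futures as cf
import time

# Hypothetical limits; the report does not state the actual values used.
BACKGROUND_WORKERS = 2        # cap on threads available to optional work
BACKGROUND_TIMEOUT_S = 0.1    # time budget the caller will wait per calculation

# A small, separate pool so optional work cannot occupy the capacity
# that serves main search queries.
_background_pool = cf.ThreadPoolExecutor(max_workers=BACKGROUND_WORKERS)


def run_optional(calculation, *args, timeout=BACKGROUND_TIMEOUT_S):
    """Run an optional calculation; return None if it exceeds the time budget."""
    future = _background_pool.submit(calculation, *args)
    try:
        return future.result(timeout=timeout)
    except cf.TimeoutError:
        # cancel() only stops work that is still queued; a calculation that
        # is already running continues, but only inside the capped pool.
        future.cancel()
        return None


def search(query, optimise):
    """Return main results, plus optional hints only if they arrive in time."""
    results = [f"result for {query}"]      # stand-in for the main search
    hints = run_optional(optimise, query)  # optional background calculation
    return {"results": results, "hints": hints}
```

In this sketch a slow `optimise` function leaves `hints` as `None` while `results` are still returned, which mirrors the point that the restrictions affect only optional calculations.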

As part of our standard process, we initiated a period of heightened monitoring, and the incident was resolved on May 8th at 11:03 BST.

Mitigation

We have added new monitoring checks to flag excess computation delay, as well as utility scripts to help us debug slow performance in the future. Our Product team is investigating ways to speed up filter background calculations and so improve overall query performance.
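
As an illustration only, a check that flags excess computation delay might compare a high percentile of recent calculation durations against a threshold. The function name `excess_delay`, the 95th percentile, and the 2-second threshold below are assumptions for this sketch and do not describe the checks Squiz actually deployed.

```python
def excess_delay(durations_s, threshold_s=2.0, percentile=95):
    """Return True if the given percentile of durations exceeds the threshold.

    durations_s: recent calculation durations in seconds.
    Uses the nearest-rank method with integer arithmetic.
    """
    if not durations_s:
        return False  # nothing measured, nothing to flag
    ordered = sorted(durations_s)
    # Nearest rank: ceil(percentile / 100 * n), converted to a 0-based index.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1] > threshold_s
```

A percentile is used rather than the mean so that a small number of very slow calculations is not hidden by many fast ones.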

Posted May 10, 2024 - 21:08 AEST

Resolved
We can confirm that the issues affecting Squiz-hosted Funnelback services are now resolved.

Our team monitored the situation carefully and applied a fix for the issue. System performance has returned to normal since the fix was applied.

We will continue to monitor the system to ensure optimal performance and stability. We appreciate your patience and understanding during this time and apologise for any inconvenience caused.

A postmortem with further information will be made available on https://status.squiz.cloud/ in the coming days.
Posted May 08, 2024 - 21:12 AEST
Update
Performance remains stable and we are continuing to monitor for any outstanding issues at this time.

A further update will be provided in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 20:51 AEST
Update
We are still observing stable performance and are continuing to monitor for any outstanding issues at this time.

A further update will be provided in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 20:33 AEST
Monitoring
We have now deployed a fix for the issue and are seeing an improvement in performance.
We are continuing to monitor for any outstanding issues at this time.

A further update will be provided in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 20:14 AEST
Identified
We have now identified the cause of this issue and are deploying a change to rectify it.

A further update will be provided in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 20:09 AEST
Update
We are continuing to investigate this issue.
Posted May 08, 2024 - 20:00 AEST
Update
Our engineers continue to investigate the issue.
We will provide a further update in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 19:25 AEST
Update
Our engineers are currently investigating the issue. We will provide a further update in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 19:02 AEST
Update
We are continuing to investigate this issue.
Posted May 08, 2024 - 18:44 AEST
Investigating
Squiz monitoring has detected a degradation of service with Squiz-hosted Funnelback.

We are working hard to investigate the root cause of the issue and will provide further updates via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted May 08, 2024 - 18:38 AEST
This incident affected: Squiz SaaS Hosted Instances and Squiz Funnelback Hosted Instances.