On 1 August 2024, Squiz received client reports that DXP-hosted Funnelback search services had sporadically begun returning 504 errors.
Shortly afterwards, Squiz engineers identified an increase in traffic to our DXP Funnelback infrastructure that was causing search timeouts. After initially investigating the possibility of malicious activity, the team determined that this traffic was legitimate and scaled our infrastructure to meet the new demand.
Timeline (all times 01/08/2024, BST):
08:49
Squiz receives reports from UK clients of degradation of DXP Funnelback services. The Squiz Technical Support team begins investigating.
08:50 - 09:48
Squiz Technical Support Engineers and Developers work to identify the cause of the issues.
09:48
More clients report issues with the service and Squiz declares a Major Incident.
10:01
The Squiz Cloud Engineering team identifies that the Funnelback Query Processors are under heavy load. Work begins to identify the cause of the load.
10:11
Squiz identifies that the increase in traffic is legitimate. The decision is made to scale the infrastructure to meet demand.
10:27
The first of the two *Query Processors is taken offline to double its capacity.
10:42
The first uplifted Query Processor is back online, serving traffic and marked as Healthy. The second Query Processor is taken offline to double its capacity.
10:43
Squiz Engineers begin to see recovery of client services. Monitoring phase begins.
10:48
The second uplifted Query Processor comes online, serving traffic and marked as Healthy.
11:10
Squiz monitoring sees all systems as healthy and officially resolves the incident.
*Query Processor: Server responsible for serving Funnelback search requests.
Customer Impact
Shortly before and during the incident, clients saw an increase in 504 errors. The impact of these errors varied depending on each client's usage of Squiz Funnelback services, and included outages of systems that rely on Funnelback for login procedures and failures to return customer search results.
Root Cause
The root cause was identified as the Query Processors being under-resourced for the volume of traffic they were receiving; this was resolved by right-sizing their capacity. As additional mitigation, capacity was increased further to provide extra headroom while we investigated the source of the traffic.
We have since identified the source of the traffic and taken steps to reduce malicious and low-value bot traffic, which has further increased headroom on the servers. Instead of scaling resources back, we have left them scaled up to ensure there are no issues during the Clearing period.
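For illustration only, the sketch below shows one generic way that low-value bot traffic can be flagged from request user-agent strings; the bot patterns, request records, and the is_low_value_bot helper are hypothetical examples and do not describe the filtering actually applied in production.

```python
# Hypothetical illustration: flagging low-value bot traffic by user-agent.
# The bot patterns and request records below are examples only and do not
# describe the production filtering that was applied.

LOW_VALUE_BOT_PATTERNS = (
    "ahrefsbot",        # example SEO crawler
    "mj12bot",
    "semrushbot",
    "python-requests",  # generic scripted clients
)

def is_low_value_bot(user_agent: str) -> bool:
    """Return True if the user-agent matches a known low-value bot pattern."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in LOW_VALUE_BOT_PATTERNS)

# Example: filter a batch of (client_ip, user_agent) request records.
requests_seen = [
    ("203.0.113.10", "Mozilla/5.0 (compatible; AhrefsBot/7.0)"),
    ("198.51.100.7", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
]

legitimate = [r for r in requests_seen if not is_low_value_bot(r[1])]
print(f"{len(requests_seen) - len(legitimate)} low-value bot request(s) filtered")
```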
Increased Capacity
During the incident, the capacity of the Query Processors was doubled in order to stabilise the system. Post-incident, the servers have been uplifted further and now have eight times their original capacity.
This action constitutes our main mitigation strategy and follow-up plan, providing ample headroom for query processing.
Increased Monitoring
Squiz is enhancing its monitoring of the DXP Funnelback Service. This includes a comprehensive review of our server-load monitoring so that engineers are proactively alerted to high-load situations well before they escalate into critical issues.
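As a rough illustration of the kind of threshold-based check this sort of monitoring relies on, the sketch below compares a host's load average against its CPU count; the thresholds, the check_query_processor_load helper, and the alerting hook are illustrative assumptions, not our production monitoring configuration.

```python
# Illustrative load-check sketch only; thresholds and the alert hook are
# assumptions, not the production monitoring configuration.
import os

# Alert when the 5-minute load average exceeds this fraction of the CPU
# count, giving engineers warning before the server saturates.
WARNING_RATIO = 0.7
CRITICAL_RATIO = 0.9

def check_query_processor_load() -> str:
    """Classify current load on this host as ok / warning / critical."""
    _, load_5min, _ = os.getloadavg()
    cpu_count = os.cpu_count() or 1
    ratio = load_5min / cpu_count

    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARNING_RATIO:
        return "warning"
    return "ok"

if __name__ == "__main__":
    status = check_query_processor_load()
    print(f"query processor load status: {status}")
    # In a real setup this result would be pushed to the alerting system
    # (for example, paging engineers on "critical") rather than printed.
```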
Improved identification of unoptimised code
Our Cloud Engineering team is improving our ability to pinpoint collections that are not operating optimally. By identifying these under-performing collections, we can optimise them to improve overall system performance.
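As a general illustration of this approach, the sketch below aggregates per-collection response times from a hypothetical query log and surfaces the slowest collections as optimisation candidates; the log format, threshold, and collection names are assumptions, not Funnelback's actual log schema.

```python
# Illustrative sketch: surfacing slow collections from a hypothetical query
# log of (collection, response_ms) records. The log format, threshold, and
# collection names are assumptions, not Funnelback's actual log schema.
from collections import defaultdict

SLOW_AVERAGE_MS = 500  # illustrative threshold for "under-performing"

query_log = [
    ("news-search", 120),
    ("staff-directory", 950),
    ("news-search", 180),
    ("staff-directory", 1100),
    ("course-finder", 300),
]

# Group response times by collection.
timings: dict[str, list[int]] = defaultdict(list)
for collection, response_ms in query_log:
    timings[collection].append(response_ms)

# Report collections whose average response time exceeds the threshold,
# slowest first, as candidates for optimisation.
averages = {c: sum(times) / len(times) for c, times in timings.items()}
for collection, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    if avg > SLOW_AVERAGE_MS:
        print(f"{collection}: average {avg:.0f} ms over {len(timings[collection])} queries")
```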