On 1 August 2024, Squiz received client reports that DXP-hosted Funnelback search services had sporadically begun returning 504 errors.
Shortly afterwards, Squiz engineers identified an increase in traffic to our DXP Funnelback infrastructure that was causing search timeouts. After initially investigating the possibility of malicious activity, the team determined that this traffic was legitimate and scaled our infrastructure to meet the new demand.
Timeline (all times 01/08/2024, BST):
08:49
Squiz receives reports from UK clients of degradation of DXP Funnelback services. The Squiz Technical Support team begins investigating.
08:50 - 09:48
Squiz Technical Support Engineers and Developers work to identify the cause of the issues.
09:48
More clients report issues with the service and Squiz declares a Major Incident.
10:01
The Squiz Cloud Engineering team identifies that the Funnelback Query Processors are under heavy load. Work begins to identify the cause of the load.
10:11
Squiz identifies that the increase in traffic is legitimate. The decision is made to scale the infrastructure to meet demand.
10:27
The first of the two *Query Processors is taken offline to double its capacity.
10:42
The first uplifted Query Processor is back online, serving traffic and marked as Healthy. The second Query Processor is taken offline to double its capacity.
10:43
Squiz Engineers begin to see recovery of client services. Monitoring phase begins.
10:48
The second uplifted Query Processor comes online, serving traffic and marked as Healthy.
11:10
Squiz monitoring sees all systems as healthy and officially resolves the incident.
*Query Processor: Server responsible for serving Funnelback search requests.
Customer Impact
Shortly before and during the incident, clients saw an increase in 504 errors. The impact of these errors varied depending on each client's usage of Squiz Funnelback services, and included outages of systems that rely on Funnelback for login procedures and failures to return customer search results.
Root Cause
The root cause was identified as the Query Processors being under-resourced for the volume of traffic they were receiving; this was resolved by right-sizing their capacity. As additional mitigation, capacity was increased further to provide extra headroom while we investigated the source of the traffic.
We have since identified the source of the traffic and taken steps to reduce malicious and low-value bot traffic, which has further increased headroom on the servers. Instead of scaling resources back, we have left them scaled up to ensure there are no issues during the Clearing period.
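For illustration only, the sketch below shows one generic way that low-value bot traffic can be flagged from request user-agent strings; the bot patterns, request records, and the is_low_value_bot helper are hypothetical examples and do not describe the filtering actually applied in production.

```python
# Hypothetical illustration: flagging low-value bot traffic by user-agent.
# The bot patterns and request records below are examples only and do not
# describe the production filtering that was applied.

LOW_VALUE_BOT_PATTERNS = (
    "ahrefsbot",        # example SEO crawler
    "mj12bot",
    "semrushbot",
    "python-requests",  # generic scripted clients
)

def is_low_value_bot(user_agent: str) -> bool:
    """Return True if the user-agent matches a known low-value bot pattern."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in LOW_VALUE_BOT_PATTERNS)

# Example: filter a batch of (client_ip, user_agent) request records.
requests_seen = [
    ("203.0.113.10", "Mozilla/5.0 (compatible; AhrefsBot/7.0)"),
    ("198.51.100.7", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
]

legitimate = [r for r in requests_seen if not is_low_value_bot(r[1])]
print(f"{len(requests_seen) - len(legitimate)} low-value bot request(s) filtered")
```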
Increased Capacity
During the incident, the capacity of the Query Processors was doubled in order to stabilise the system. Post-incident, the servers have been uplifted further and now have eight times their original capacity.
This action constitutes our main mitigation strategy and follow-up plan, providing ample headroom for query processing.
Increased Monitoring
Squiz is enhancing its monitoring of the DXP Funnelback Service. This includes a comprehensive review of our server-load monitoring so that engineers are proactively alerted to high-load situations well before they escalate into critical issues.
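As a rough illustration of the kind of threshold-based check this sort of monitoring relies on, the sketch below compares a host's load average against its CPU count; the thresholds, the check_query_processor_load helper, and the alerting hook are illustrative assumptions, not our production monitoring configuration.

```python
# Illustrative load-check sketch only; thresholds and the alert hook are
# assumptions, not the production monitoring configuration.
import os

# Alert when the 5-minute load average exceeds this fraction of the CPU
# count, giving engineers warning before the server saturates.
WARNING_RATIO = 0.7
CRITICAL_RATIO = 0.9

def check_query_processor_load() -> str:
    """Classify current load on this host as ok / warning / critical."""
    _, load_5min, _ = os.getloadavg()
    cpu_count = os.cpu_count() or 1
    ratio = load_5min / cpu_count

    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARNING_RATIO:
        return "warning"
    return "ok"

if __name__ == "__main__":
    status = check_query_processor_load()
    print(f"query processor load status: {status}")
    # In a real setup this result would be pushed to the alerting system
    # (for example, paging engineers on "critical") rather than printed.
```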
Improved identification of unoptimised code
Our Cloud Engineering team is improving our ability to pinpoint collections that are not operating optimally. By identifying these under-performing collections, we can optimise them to improve overall system performance.
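As a general illustration of this approach, the sketch below aggregates per-collection response times from a hypothetical query log and surfaces the slowest collections as optimisation candidates; the log format, threshold, and collection names are assumptions, not Funnelback's actual log schema.

```python
# Illustrative sketch: surfacing slow collections from a hypothetical query
# log of (collection, response_ms) records. The log format, threshold, and
# collection names are assumptions, not Funnelback's actual log schema.
from collections import defaultdict

SLOW_AVERAGE_MS = 500  # illustrative threshold for "under-performing"

query_log = [
    ("news-search", 120),
    ("staff-directory", 950),
    ("news-search", 180),
    ("staff-directory", 1100),
    ("course-finder", 300),
]

# Group response times by collection.
timings: dict[str, list[int]] = defaultdict(list)
for collection, response_ms in query_log:
    timings[collection].append(response_ms)

# Report collections whose average response time exceeds the threshold,
# slowest first, as candidates for optimisation.
averages = {c: sum(times) / len(times) for c, times in timings.items()}
for collection, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    if avg > SLOW_AVERAGE_MS:
        print(f"{collection}: average {avg:.0f} ms over {len(timings[collection])} queries")
```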