Service Degradation 'All services behind Cloudflare'

Incident Report for Squiz

Postmortem

Summary

On June 12, 2025, Cloudflare experienced a significant service outage that affected a wide range of Cloudflare services. The affected services included Workers KV, which Squiz uses to serve frontend content and backend administration UIs.

Squiz’s Support teams were alerted to this issue by our internal monitoring. Some customers may have experienced 503 errors throughout the incident.

Customer Impact

  • Incident Duration: 12 Jun 2025, 18:10 to 13 Jun 2025, 03:01 (UTC)
  • Impact: Some customers experienced 503 errors when viewing pages and administration interfaces.

    • Some customer systems and site pages were inaccessible for part or all of the impact window.
    • Cloudflare marked the incident as resolved at 20:28 UTC on 12 June. However, some services, including Workers KV, did not fully recover immediately; full resolution came at 03:01 UTC on 13 June.

Cloudflare's status page:
https://www.cloudflarestatus.com/incidents/25r9t0vz99rp

Root Cause

The outage was caused by a failure in the underlying storage infrastructure used by Cloudflare's Workers KV service, which is a critical dependency for many of Squiz's Cloudflare configurations and DXP services. Cloudflare have stated that part of this infrastructure is backed by a third-party cloud provider, which experienced an outage on June 12 that directly affected the availability of the KV service.

Resolution Actions

  • Identification:

Squiz Support teams received a number of alerts indicating service disruption, which led to quick identification of the underlying cause.

  • Cloudflare Actions:

Detailed in Cloudflare's postmortem: https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/

  • Squiz Actions:

Squiz Support worked closely with Cloudflare and took swift manual action to clear Workers KV errors and restore services while full remediation was under way.

Follow-up Actions

Cloudflare Actions:

Squiz have engaged with Cloudflare to improve the resiliency of the services that depend on Workers KV and its storage infrastructure. Cloudflare are already undertaking the actions below, as described in their postmortem:

  • (Actively in-flight): Bringing forward our work to improve the redundancy within Workers KV’s storage infrastructure, removing the dependency on any single provider. During the incident window we began work to cut over and backfill critical KV namespaces to our own infrastructure, in the event the incident continued.
  • (Actively in-flight): Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third party dependencies.
  • (Actively in-flight): Implementing tooling that allows us to progressively re-enable namespaces during storage infrastructure incidents. This will allow us to ensure that key dependencies, including Access and WARP, are able to come up without risking a denial-of-service against our own infrastructure as caches are repopulated.

Squiz Actions:

  • (In progress): Change the Cloudflare layer of the DXP to remove dependencies on Cloudflare Workers KV where possible, and provide safe fallbacks when the service is unavailable.
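As a rough illustration of what a "safe fallback" can look like, the sketch below wraps a key-value read so that an outage of the store degrades to a last-known-good copy, or to a static default, rather than surfacing a 503. It is written in Python as a language-neutral sketch under stated assumptions and is not Squiz's implementation: `read_with_fallback`, `KVUnavailable` and the in-memory `cache` dict are hypothetical names, and a real Cloudflare Worker would read through its KV namespace binding and keep its fallback copy somewhere more durable than process memory.

```python
from typing import Callable, Dict, Optional, Tuple


class KVUnavailable(Exception):
    """Hypothetical error a store client raises when the KV service is down."""


def read_with_fallback(
    key: str,
    primary_get: Callable[[str], Optional[str]],
    cache: Dict[str, str],
    default: str,
) -> Tuple[str, str]:
    """Return (value, source) for `key` without raising on a store outage.

    `source` is one of "primary", "cache" or "default".
    """
    try:
        value = primary_get(key)
    except (KVUnavailable, TimeoutError, ConnectionError):
        # Store is unreachable: serve the last known good copy if one exists.
        if key in cache:
            return cache[key], "cache"
        return default, "default"

    if value is None:
        # The store answered and the key genuinely does not exist.
        return default, "default"

    # Healthy read: refresh the local copy for use during a future outage.
    cache[key] = value
    return value, "primary"


if __name__ == "__main__":
    cache: Dict[str, str] = {}
    store = {"site-config": "v2"}

    def healthy_get(k: str) -> Optional[str]:
        return store.get(k)

    def failing_get(k: str) -> Optional[str]:
        raise KVUnavailable("storage backend unavailable")

    print(read_with_fallback("site-config", healthy_get, cache, "v0"))  # ('v2', 'primary')
    print(read_with_fallback("site-config", failing_get, cache, "v0"))  # ('v2', 'cache')
    print(read_with_fallback("other-key", failing_get, cache, "v0"))    # ('v0', 'default')
```

The sketch deliberately separates "the store is unreachable" from "the key does not exist": a missing key is an authoritative answer, whereas an outage should not make previously served content disappear.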
Posted Jun 17, 2025 - 23:26 AEST

Resolved

The lingering issues have now started to resolve, and Squiz considers this incident resolved.
Please get in touch with us at My.Squiz.Net if you see any further issues so that we can investigate separately.
Posted Jun 13, 2025 - 13:01 AEST

Update

Our engineers have made further progress in fixing the lingering issues.
They are also continuing to monitor for new issues, and we will post here when there are further updates.
Posted Jun 13, 2025 - 12:03 AEST

Update

We are continuing to monitor for ongoing issues caused by this outage. Our engineers have resolved most of the issues found so far and are continuing to monitor for any new ones.

We shall post a further update once we have more information.
Posted Jun 13, 2025 - 11:16 AEST

Update

We are making progress in fixing the lingering issues from this incident, and our engineers are continuing to test sites and verify their performance.
We will post another update as we make further progress.
Posted Jun 13, 2025 - 10:13 AEST

Update

Our engineers are continuing to fix lingering issues caused by this incident and are investigating for any further issues.
More information will be posted as we progress.
Posted Jun 13, 2025 - 09:43 AEST

Update

We are continuing to monitor all sites affected by the incident for any lingering effects, and our engineers are thoroughly investigating and fixing any that we find.

We shall post further updates here if there are any changes.
Posted Jun 13, 2025 - 09:03 AEST

Monitoring

At this time services have been restored across the platform and Cloudflare has confirmed the same from their end. Squiz teams will continue to monitor the situation for any potential follow up issues and provide updates here as needed.
Posted Jun 13, 2025 - 07:22 AEST

Update

We have seen further improvements in site status across all affected platforms, and our teams are continuing to track down any remaining issues to ensure full resolution before we move to Monitoring status. Further updates will follow as they become available.
Posted Jun 13, 2025 - 07:15 AEST

Update

We are beginning to see partial recovery in some, but not yet all, affected systems. Squiz teams are continuing to monitor the situation at Cloudflare and test affected sites.
Posted Jun 13, 2025 - 06:19 AEST

Update

We are continuing to monitor the situation with our services which have been affected by an outage at Cloudflare. We will provide additional updates as they become available from the Cloudflare team.
Posted Jun 13, 2025 - 05:10 AEST

Identified

We have identified an issue with Cloudflare that is affecting the DXP.
Posted Jun 13, 2025 - 04:39 AEST

Update

We are continuing to investigate this issue.
Posted Jun 13, 2025 - 04:29 AEST

Investigating

We are experiencing service degradation in the DXP environment.
We are currently investigating.
Posted Jun 13, 2025 - 04:29 AEST
This incident affected: Squiz Cloud Hosted Instances, Squiz SaaS Hosted Instances, and Squiz Funnelback Hosted Instances.