Major Incident - SaaS Customers - 21 June 2022
Incident Report for Squiz
Postmortem

Executive Summary

On the 21st of June 2022 at ~06:33 GMT, Squiz monitoring systems detected a degradation of service affecting customers hosted on our SaaS platform. Users may have received an error page from the Cloudflare Content Delivery Network (CDN) indicating that the origin DNS could not be resolved. This occurred for all requests.

Investigations performed by our Platform team indicated an issue with a third-party vendor service, which resulted in SaaS platform services becoming unavailable. The third party followed their own major incident process, identified the root cause as an inadvertent side effect of a network change, and rolled back the change, restoring service at 07:20 GMT. The third party resolved the incident externally, without action by Squiz.

Customer impact

For the duration of the incident, users may have received an error page (Error 1016) from the Cloudflare Content Delivery Network (CDN) with an “error 500 status code” message, indicating that the origin DNS could not be resolved. This occurred for all requests.
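
For readers who want to confirm this symptom independently, the following is a minimal sketch of a check that reports whether an origin hostname resolves in DNS. It is not Squiz's monitoring. The function name `origin_resolves` and the hostname `origin.example.invalid` are hypothetical placeholders, and the check only tests name resolution, not whether the origin is serving traffic.

```python
import socket


def origin_resolves(hostname: str, port: int = 443) -> bool:
    """Return True if `hostname` resolves to at least one address.

    Hypothetical helper for illustration only; it performs a name
    lookup and makes no connection to the origin itself.
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Placeholder hostname (assumption); substitute the real origin hostname.
    print(origin_resolves("origin.example.invalid"))
```

A result of False for the real origin hostname would be consistent with the origin DNS error described above, although a local resolver problem can produce the same result.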

Root cause

Our third party vendor made a network change that caused services to become unavailable between 06:33 GMT and 07:20 GMT.
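
The postmortem gives the outage window in GMT, while the status updates are stamped in AEST. As a rough sketch, the window can be converted and its length computed as follows. This assumes AEST is a fixed UTC+10 offset (no daylight saving applies in June) and treats the approximate 06:33 start time as exact.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
# Assumption: AEST modelled as a fixed UTC+10 offset.
AEST = timezone(timedelta(hours=10), "AEST")

# Outage window as stated in the postmortem (start time is approximate).
start = datetime(2022, 6, 21, 6, 33, tzinfo=GMT)
end = datetime(2022, 6, 21, 7, 20, tzinfo=GMT)

duration_minutes = int((end - start).total_seconds() // 60)
start_aest = start.astimezone(AEST).strftime("%H:%M %Z")
end_aest = end.astimezone(AEST).strftime("%H:%M %Z")

print(duration_minutes)      # 47
print(start_aest, end_aest)  # 16:33 AEST 17:20 AEST
```

On these figures the outage lasted about 47 minutes, from 16:33 to 17:20 AEST.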

Mitigation and Follow-up actions

In response to this incident, the third-party network provider will run additional checks on their router configuration to ensure that network traffic flows correctly through their infrastructure.

From our end, the Squiz Platform team is on standby in case further validation is required.

Posted Jun 22, 2022 - 16:13 AEST

Resolved
Squiz has been advised by our 3rd party service provider that a fix has been deployed for the current Major Incident. This has restored service for the subset of affected customers hosted on our SaaS platform. We apologise for this degradation of service and thank you for your patience while we worked on the resolution.

A postmortem will be provided within 72 hours via https://status.squiz.cloud once we have worked with our service provider.
Posted Jun 21, 2022 - 17:26 AEST
Investigating
Our support teams remain engaged with our upstream provider in resolving the issue.

A further update will be provided via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted Jun 21, 2022 - 17:15 AEST
Identified
Squiz teams have identified a third-party service as the root cause of the current degradation-of-service Major Incident. This incident is affecting a subset of Squiz customers. We are working with the service provider to resolve the issue.

We will provide a further update in ~15 minutes, or sooner if the incident is resolved.
Posted Jun 21, 2022 - 17:06 AEST
Investigating
Squiz monitoring has detected a degradation of service incident that is affecting customers hosted on our SaaS platform. Multiple Squiz teams are currently investigating.

A further update will be provided via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted Jun 21, 2022 - 16:53 AEST
This incident affected: Squiz Cloud Hosted Instances.