Major Incident - Melbourne Data Centre - 06/08/2022

Incident Report for Squiz

Postmortem

Executive Summary

On the 6th of August 2022 at 16:32 AEST, Squiz internal monitoring detected a degradation of services for a few customers hosted in our Melbourne Data Centre.

The Squiz Data Centre team was engaged and identified that a few of the Production and Disaster Recovery (DR) VMs had stopped operating. The team restarted the impacted VMs, and service was fully recovered at 17:45 AEST.


Customer impact

Between 16:32 AEST and 17:45 AEST on the 6th of August 2022, some customers hosted in the Melbourne Data Centre may have experienced a degradation of service because their VMs were unreachable.

Root cause

As part of a physical server relocation activity in the Melbourne Data Centre, a server was shut down at ~12:28 AEST on the 6th of August 2022 and physically moved to its new position. At ~15:12 AEST this server was reconnected to the network and powered up. No issues were observed and internal monitoring was clear.

At 16:30 AEST the main dashboard hosted in the Melbourne Data Centre, which displays VM information, started showing errors. In response, our Data Centre team restarted a process running on the compute node. At 16:32 AEST this restart caused a few VMs to enter a “paused” state. The impacted VMs were rebooted one by one, resulting in partial recovery at 17:15 AEST. The last VM was successfully rebooted at 17:45 AEST, resulting in complete recovery.

Mitigation and Follow-up actions

In response to this incident, the Squiz Data Centre team is undertaking the following actions:

Assess the best approach for conducting maintenance on the engine server in question without impacting compute nodes and VMs, and document that approach.
Enhance internal upgrade documentation to add mandatory checks for when a dashboard is offline or shows errors.
Enhance internal monitoring and alert thresholds for paused VMs.
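
As an illustration of the last action only, a paused-VM threshold check could look something like the sketch below. This is not Squiz's monitoring implementation: the VmStatus record, the state strings, the VM names and the two-minute threshold are all hypothetical, and a real check would read VM state from whatever the virtualisation platform exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VmStatus:
    """Hypothetical record of a VM's current state (not a real platform API)."""
    name: str
    state: str        # assumed values: "running", "paused"
    since: datetime   # when the VM entered this state


def paused_too_long(vms, now, threshold=timedelta(minutes=2)):
    """Return the names of VMs that have been paused for at least `threshold`."""
    return [
        vm.name
        for vm in vms
        if vm.state == "paused" and now - vm.since >= threshold
    ]


# Made-up sample data; the names and timestamps are for illustration only.
now = datetime(2022, 8, 6, 16, 35)
vms = [
    VmStatus("vm-a", "running", datetime(2022, 8, 6, 15, 12)),
    VmStatus("vm-b", "paused", datetime(2022, 8, 6, 16, 32)),
    VmStatus("vm-c", "paused", datetime(2022, 8, 6, 16, 34)),
]
alerts = paused_too_long(vms, now)
print(alerts)
```

In this sample only vm-b is flagged, because vm-c has been paused for less than the assumed threshold. A short grace period of this kind would avoid alerting on VMs that are paused briefly and deliberately.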

Posted Aug 09, 2022 - 16:10 AEST

Resolved

Services have been restored for our customers hosted in the Melbourne Data Centre. We apologise for this degradation of service and thank you for your patience while we worked on the resolution.

A postmortem will be provided via https://status.squiz.cloud.

Posted Aug 06, 2022 - 17:49 AEST

Identified

The Squiz Data Centre team has identified the cause and is performing recovery actions.

Posted Aug 06, 2022 - 17:20 AEST

Investigating

Squiz monitoring has detected a degradation of service incident that is affecting customers hosted in our Melbourne Data Centre. Multiple Squiz teams are currently investigating.

A further update will be provided in ~15 minutes.

Posted Aug 06, 2022 - 17:07 AEST

This incident affected: Squiz Cloud Hosted Instances.