The Azure Outage of 2017.03.31

Japan East, again... From roughly 23:00 on March 31 to 2:00 on April 1 (Japan time), there was an outage in the Japan East region. (The history can be viewed here.)

For now, the status messages are quoted below. (They have not been added to the history yet.)

Cooling Event | Japan East

Last updated 44 minutes ago

Starting at 13:50 UTC on 31 Mar 2017, a subset of customers in Japan East may experience difficulties connecting to their resources hosted in this region. Engineers have identified the underlying cause as loss of cooling which caused some resources to undergo an automated shutdown to avoid overheating and ensure data integrity & resilience. Engineers have recovered the cooling units and are working on recovering the affected resources. Engineers will then validate control plane and data plane availability for all affected services. Some customers may see signs of recovery. The next update will be provided in 60 minutes or as events warrant.

Multiple services | Japan East

Last updated 41 minutes ago

Starting at 13:50 UTC on 31 Mar 2017, a subset of customers with resources which leverage Storage in Japan East may experience latency or connection issues. Impacted services include App Service\WebApps, Virtual Machines, Azure SQL DB, Azure Cache, Service Bus, Cloud Services, Stream Analytics, Event Hubs, Backup, DocumentDB, Storsimple, Site Recovery, Key Vault, Data Factory, Azure Container Service, HDInsight, Media Services, API Management, Logic Apps, Redis Cache, Azure IoT Hub, Azure Monitor, and Azure Automation. Engineers are aware of this issue and are actively investigating. The next update will be provided in 60 minutes, or as events warrant.

This is still in progress, so the wording may change, but it appears that part of the cooling system in the Japan East region failed, and some resources were automatically shut down to avoid overheating and to ensure data integrity and resilience.
The cooling units have been recovered and engineers are working on recovering the affected resources. As of now (just before 4 AM), judging from the reactions around me, most things seem to have recovered. That is all the information available at the moment.

(Added 2017.04.03)

Adding this now that the RCA and Japanese-language information have been published.

3/31

RCA – Cooling Event – Japan East

Summary of impact: Between 13:28 UTC and 22:16 UTC on March 31, 2017, a subset of customers in the Japan East region may have experienced unavailability of Virtual Machines (VMs), VM reboots, degraded performance, or connectivity failures when accessing those resources and/or service resources dependent upon the Storage service in this region.

Customer impact: Customers who have resources and/or impacted services in this region may have experienced unavailability of those resources for the impacted time frame noted above. Services impacted include Storage and Virtual Machines. Services with dependencies on Storage: API Management, App Service \ Web Apps, Automation, Backup, Cloud Services, Access Control Service, Azure Data Factory / Data Movement, DocumentDB, Event Hubs, HDInsight, IoT Hub, Key Vault, Logic Apps, Media Services, Azure Monitor, Redis Cache, RemoteApp, Service Bus, Site Recovery, SQL Database, StorSimple, Stream Analytics, Azure Machine Learning (ML), and Azure Notification Hub.

Workaround: Multiple Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident. For further information around Managed Disks, please visit the following sites. For a Managed Disks overview: https://docs.microsoft.com/en-us/azure/storage/storage-managed-disks-overview. For how to migrate to Managed Disks: https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-migrate-to-managed-disks.

Customers using Azure Redis Cache: although caches are region-sensitive for latency and throughput, pointing applications to Redis Cache in another region could have provided business continuity.

SQL Database customers who had SQL Database configured with active geo-replication could have reduced downtime by performing a failover to the geo-secondary. This would have caused a loss of less than 5 seconds of transactions. Another workaround is to perform a geo-restore, with a loss of less than 5 minutes of transactions. Please visit https://azure.microsoft.com/en-us/documentation/articles/sql-database-business-continuity/ for more information on these capabilities.

During this incident, the Japan West region remained fully available. Customer applications with geo-redundancy (for example, using Traffic Manager to direct requests to a healthy region) would have been able to continue without impact or with minimized impact. For further information, please visit http://aka.ms/mspnp for Best Practices for Cloud Applications and Design Patterns, and https://docs.microsoft.com/en-us/azure/traffic-manager/traffic-manager-overview for Traffic Manager.
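The geo-redundancy idea in the workaround above can also be applied at the application level. Below is a minimal sketch (not part of the quoted RCA; the endpoint URLs are hypothetical placeholders): try the primary Japan East deployment first and fall back to a replica in Japan West if the request fails, which is roughly what Traffic Manager automates at the DNS level.

```python
# Minimal sketch of client-side regional failover, assuming two hypothetical
# deployments of the same application: one in Japan East (primary) and one in
# Japan West (secondary). The URLs below are placeholders, not real endpoints.
import requests

PRIMARY = "https://myapp-japaneast.example.com/api/health"    # hypothetical
SECONDARY = "https://myapp-japanwest.example.com/api/health"  # hypothetical

def fetch_with_failover(timeout=3.0):
    """Try the primary region first; on a connection error, timeout, or
    HTTP error, retry the same request against the secondary region."""
    last_error = None
    for url in (PRIMARY, SECONDARY):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as err:
            last_error = err  # this region is unhealthy; try the next one
    raise RuntimeError("both regions unavailable") from last_error

if __name__ == "__main__":
    print(fetch_with_failover().status_code)
```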

Root cause and mitigation: Initial investigation revealed that one RUPS system (a rotary uninterruptible power supply system) failed in a manner that caused the power distribution feeding all of the air handler units (AHUs) in the Japan East datacenter to fail. With the air handlers down, temperature continued to rise throughout the entire datacenter. The Japan East region is managed by a 3rd party vendor who owns three dedicated security spaces at the location, and the vendor reported to Microsoft that all of the spaces were impacted. The cooling system is designed for N+1 redundancy (also called parallel redundancy) and the power distribution design was running at N+2. Microsoft and the 3rd party vendor continue investigating the root cause of why the faulty RUPS system affected all power supply to the AHUs; this is currently in progress.

As a part of standard monitoring, Azure Engineers received alerts for availability drops in this region. Engineers identified the underlying cause as a failure within the power distribution system that was running at N+2. One RUPS (rotary uninterruptible power supply) in the N+2 parallel line-up failed, resulting in an inability to supply power to the cooling system in this datacenter. As a consequence of the cooling system going down, some resources were automatically shut down to avoid overheating and ensure data integrity and resilience.

At 14:12 UTC, facility teams (the 3rd party vendor) and Microsoft's site services personnel were onsite and restarted the cooling system air handlers, using outside airflow to force-cool the datacenter. At the same time, multiple Microsoft service teams prepared to bring systems back online in a controlled process, to keep automated processes from destabilizing neighboring devices. At 16:08 UTC, temperature readings were back within operational ranges and power-up processes began using safe power recovery procedures. A thorough health check was completed after the RUPS system and cooling system were restored; any suspect or failed components were replaced and isolated. Suspected and failed components are being sent for analysis.

At 16:53 UTC, Engineers confirmed that approximately 95% of all switches/network devices had been restored successfully. Power-up processes began on impacted scale units that host Software Load Balancing (SLB) services and the control plane. At 17:16 UTC, the majority of the core infrastructure was brought online, and Networking Engineers began restoring Software Load Balancing (SLB) services in a controlled process to help programming establish a quorum promptly. Once SLB was up and running, Engineers confirmed at 18:51 UTC that the majority of services had recovered automatically and successfully. Residual impact to Virtual Machines was found; Engineers investigated and continued to recover the impacted Virtual Machines to bring them online. In parallel, Engineers notified customers who had experienced residual impact to Virtual Machines about the recovery. At 22:16 UTC, Engineers confirmed that Storage and all storage-dependent services had recovered successfully.

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
1. RUPS system units are being sent off for analysis. Root cause analysis continues with site operations, facility engineers, and equipment manufacturers to further mitigate the risk of recurrence.
2. Review the Azure services that were impacted by this incident to help them tolerate this sort of incident and continue serving with minimal disruption, by maintaining service resources across multiple scale units or implementing a geo-strategy.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://survey.microsoft.com/351091

A Japanese translation has been published by the support team.

To summarize roughly: the cooling system stopped receiving power, which triggered an automatic shutdown (to prevent damage from overheating). Even though the cooling system itself is redundant at N+1 and its power supply is redundant at N+2 via RUPS (rotary UPS), a failure of the RUPS system cut power to all of the air handler units, and that was the cause. Why the RUPS failed is still under investigation.
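As a rough illustration of what N+1 and N+2 mean here (the unit counts below are invented for illustration, not the datacenter's actual equipment figures): an N+k design provides N units' worth of required capacity plus k spares, so losing a single unit should still leave full capacity. That is exactly why one RUPS fault cutting power to every air handler points at the power distribution path rather than a lack of spare units.

```python
# Illustrative only: unit counts are invented to show the N+1 / N+2 concept,
# not the Japan East datacenter's actual equipment inventory.
def meets_demand(required_units, spares, failed_units):
    """An N+k design has N required units plus k spares; it still meets
    demand as long as no more than k units have failed."""
    return (required_units + spares - failed_units) >= required_units

# Cooling at N+1: one failed unit is tolerated, two are not.
print(meets_demand(required_units=4, spares=1, failed_units=1))  # True
print(meets_demand(required_units=4, spares=1, failed_units=2))  # False

# Power distribution at N+2: a single RUPS failure alone should not have
# removed power from all of the air handler units.
print(meets_demand(required_units=4, spares=2, failed_units=1))  # True
```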

Overall, I hope to see various improvements.
