So, roughly between 21:00 and 23:00 (JST) on March 8, an outage occurred in Azure's Japan East region. (The history is available here.)
The following is a quote as of 2:00 on March 9. A root cause analysis (RCA), i.e. the detailed incident report, is expected to be published within 72 hours.
3/8
SQL Database – Japan East
Summary of impact: Between 12:42 and 14:38 UTC on 08 Mar 2017, a subset of customers using SQL Database in Japan East may have experienced difficulties connecting to resources hosted in this region which related to a Storage incident that happened earlier today. Preliminary root cause: A storage backend system entered an unhealthy state and was unable to self-recover. This caused some storage requests to fail during this period which had a knock-on impact on SQL DB. Mitigation: Engineers redirected Storage requests to another backend node which mitigated the Storage issue. Engineers also updated configuration settings to prevent future occurrences. Once the impact to Storage was mitigated, SQL DB returned to a healthy state.
3/8
Storage – Japan East
Summary of impact: Between 12:42 and 14:38 UTC on 08 Mar 2017, a subset of customers using Storage in Japan East may have experienced difficulties connecting to resources hosted in this region. Services that leverage Storage in this region also experienced impact including: App Service \ Web Apps, Site Recovery, Virtual Machines, Redis Cache, Data Movement, StorSimple, Logic Apps, Media Services, Key Vault, HDInsight, SQL Database, Automation, Stream Analytics, Backup, IoT Hub and Cloud Services.
Preliminary root cause: A storage backend system entered an unhealthy state and was unable to self-recover. This caused some storage requests to fail during this period which had a knock-on impact on other services.
Mitigation: Engineers redirected Storage requests to another backend node which mitigated the Storage issue. Engineers also updated configuration settings to prevent future occurrences. Once the impact to Storage was mitigated, all other services returned to a healthy state. Any customers experiencing residual impact from this issue will receive direct messaging via their Management Portal.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A detailed post incident report will be published within 72 hours.
Symptom-wise, a subset of customers using Storage in the Japan East region ran into problems connecting to it.
On top of that, services built on Storage in the region, including App Service / Web Apps, Site Recovery, Virtual Machines, Redis Cache, StorSimple, Media Services, Key Vault, HDInsight, SQL Database, Automation, Stream Analytics, Backup, IoT Hub, and Cloud Services, among others, became unreachable across a wide range.
It seems a Storage backend system entered an unhealthy state and could not self-recover, so Storage requests failed during the affected period and the impact spread to the dependent services.
As mitigation, engineers redirected Storage requests to another backend node, which mitigated the Storage issue, and then updated configuration settings to prevent a recurrence.
Once Storage requests were being processed normally again, the dependent services recovered as well.
Well, the crucial details are still unclear, but that is all the information available at this point.
(Added 2017.03.10)
The RCA has been published, so I'm adding it below.
3/8
RCA – Storage – Japan East
Summary of impact: Between 12:40 and 14:38 UTC on 08 Mar 2017, a subset of customers using Storage in Japan East may have experienced difficulties connecting to resources hosted in this region. Azure services built on our Storage service in this region also experienced impact including: App Service \ Web Apps, Site Recovery, Virtual Machines, Redis Cache, Data Movement, StorSimple, Logic Apps, Media Services, Key Vault, HDInsight, SQL Database, Automation, Stream Analytics, Backup, IoT Hub, and Cloud Services. The issue was detected by our monitoring and alerting systems that check the continuous health of the Storage service. The alerting triggered our engineering response and recovery actions were taken which allowed the Stream Manager process in the Storage service to begin processing requests and recover the service health. All Azure services built on our Storage service also recovered once the Storage service was recovered.
Workaround: SQL database customers who had SQL Database configured with active geo-replication could have reduced downtime by performing failover to geo-secondary. This would have caused a loss of less than 5 seconds of transactions. All customers could perform a geo-restore, with loss of less than 5 minutes of transactions. Please visit https://azure.microsoft.com/en-us/documentation/articles/sql-database-business-continuity for more information on these capabilities.
Root cause and mitigation: On a Storage scale unit in Japan East, the Stream Manager that is the backend component that manages data placement in the Storage service entered a rare unhealthy state, which caused a failure in processing requests. This resulted in requests to Storage service failing for the above period of time. The Stream Manager has protections to help it self-recover from such states (including auto-failover), however, a bug caused the automatic self-healing to be unsuccessful.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes, to help ensure such incidents do not occur in the future. In this case, it includes (but is not limited to):
– The bugfix for the self-healing mechanism will be rolled out as a hotfix across Storage scale units.
– Implement secondary service healing mechanism, designed to auto-recover from unhealthy state, as well as additional monitoring for this failure scenario.
Root cause: On a Storage scale unit in Japan East, the Stream Manager, the backend component that manages data placement in the Storage service, entered a rare unhealthy state and failed to process requests. As a result, requests to the Storage service started to fail.
The Stream Manager has self-healing protections, including automatic failover, but a bug caused the self-healing itself to fail.
Next steps: To prevent a similar incident in the future, the following actions will be taken (though not limited to these):
– Roll out the bugfix for the self-healing mechanism as a hotfix across Storage scale units
– Implement a secondary service-healing mechanism designed to auto-recover from the unhealthy state, along with additional monitoring for this failure scenario
So, what could you have done to limit the impact? As the Workaround above says, for SQL Database, having active geo-replication configured and failing over to the geo-secondary would have kept the downtime to a minimum (a rough sketch of triggering such a failover follows below).
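As a minimal sketch of that failover, assuming pyodbc, the Microsoft ODBC Driver for SQL Server, and placeholder server, database, and credential values: an active geo-replication failover is triggered with a single ALTER DATABASE statement run in the master database of the secondary server.

```python
# A minimal sketch of promoting a geo-secondary, assuming pyodbc and the
# Microsoft ODBC Driver for SQL Server. Server, database and credentials
# below are placeholders, not values from this incident.
import pyodbc

SECONDARY_SERVER = "myserver-secondary.database.windows.net"  # placeholder
DATABASE_TO_PROMOTE = "mydb"                                   # placeholder

# Connect to the *secondary* server's master database; the failover is
# initiated from the side you want to become the new primary.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={SECONDARY_SERVER};DATABASE=master;"
    "UID=<admin-login>;PWD=<password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)

# Planned failover (no data loss). When the primary region is down, use
# FORCE_FAILOVER_ALLOW_DATA_LOSS instead and accept losing the last few
# seconds of transactions, as described in the Workaround above.
conn.execute(f"ALTER DATABASE [{DATABASE_TO_PROMOTE}] FAILOVER;")
```

Note that a planned FAILOVER needs the primary to still be reachable; in an outage like this one, FORCE_FAILOVER_ALLOW_DATA_LOSS is the variant that matches the "less than 5 seconds of transactions" figure in the Workaround.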
What about VMs?
The answer seems to be Managed Disks. Why Managed Disks? Because with Managed Disks (plus an availability set), the storage backing the VMs is also placed in line with the availability set's fault domains, so a failure in a single storage scale unit should not have taken down all of the VMs (see the sketch after the quoted passage below).
Better reliability for Availability Sets
Managed Disks provides better reliability for Availability Sets by ensuring that the disks of VMs in an Availability Set are sufficiently isolated from each other to avoid single points of failure. It does this by automatically placing the disks in different storage scale units (stamps). If a stamp fails due to hardware or software failure, only the VM instances with disks on those stamps fail. For example, let’s say you have an application running on five VMs, and the VMs are in an Availability Set. The disks for those VMs won’t all be stored in the same stamp, so if one stamp goes down, the other instances of the application continue to run.
https://docs.microsoft.com/en-us/azure/storage/storage-managed-disks-overview
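As a minimal sketch of opting into that placement, assuming azure-mgmt-compute and azure-identity with placeholder resource names: the availability set itself has to be created as a managed ("Aligned") one before managed-disk VMs are placed into it.

```python
# A minimal sketch, assuming azure-mgmt-compute and azure-identity.
# Resource group, availability set name and subscription ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

avset = compute.availability_sets.create_or_update(
    "my-rg",        # placeholder resource group
    "my-avset",     # placeholder availability set name
    {
        "location": "japaneast",
        "platform_fault_domain_count": 2,   # up to the region's maximum
        "platform_update_domain_count": 5,
        # "Aligned" marks the availability set as managed, so the managed
        # disks of its VMs are spread across different storage stamps.
        "sku": {"name": "Aligned"},
    },
)
print(avset.sku.name, avset.platform_fault_domain_count)
```

VMs created with managed disks and attached to this availability set then get their disks spread across separate storage scale units, which is exactly the isolation described in the quoted passage.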
For more on Managed Disks, see also:
https://www.slideshare.net/ToruMakabe/3-azure-managed-disk
You should move to the newer, better mechanisms sooner rather than later.