2020.09.29 Azure Active Directory Outage

Looks like there is an outage affecting Azure Active Directory.

Authentication errors across multiple Microsoft or Azure services – Mitigated (Tracking ID SM79-F88)

Summary of Impact: Between approximately 21:25 UTC on Sep 28 2020 and 00:23 UTC on Sep 29 2020, a subset of customers in the Azure Public and Azure Government clouds may have encountered errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals. Targeted communications will be sent to customers for any residual downstream service impact.
Preliminary Root Cause: A recent configuration change impacted a backend storage layer, which caused latency to authentication requests.
Mitigation: The configuration was rolled back to mitigate the issue.
Next Steps: Services that still experience residual impact will receive separate portal communications. A full Post Incident Report (PIR) will be published within the next 72 hours.

So says the notice.
Sign-in for Microsoft 365 (Office 365) services is affected as well. Apparently a recent configuration change to a backend storage layer caused latency on authentication requests.

2020.10.01 17:20 – The RCA has been published.

RCA – Authentication errors across multiple Microsoft services and Azure Active Directory integrated applications (Tracking ID SM79-F88)

Summary of Impact: Between approximately 21:25 UTC on September 28, 2020 and 00:23 UTC on September 29, 2020, customers may have encountered errors performing authentication operations for all Microsoft and third-party applications and services that depend on Azure Active Directory (Azure AD) for authentication. Applications using Azure AD B2C for authentication were also impacted.
Users who were not already authenticated to cloud services using Azure AD were more likely to experience issues and may have seen multiple authentication request failures corresponding to the average availability numbers shown below. These have been aggregated across different customers and workloads.

  • Europe: 81% success rate for the duration of the incident.
  • Americas: 17% success rate for the duration of the incident, improving to 37% just before mitigation.
  • Asia: 72% success rate in the first 120 minutes of the incident. As business-hours peak traffic started, availability dropped to 32% at its lowest.
  • Australia: 37% success rate for the duration of the incident.

Service was restored to normal operational availability for the majority of customers by 00:23 UTC on September 29, 2020, however, we observed infrequent authentication request failures which may have impacted customers until 02:25 UTC.
Users who had authenticated prior to the impact start time were less likely to experience issues depending on the applications or services they were accessing.
Resilience measures in place protected Managed Identities services for Virtual Machines, Virtual Machine Scale Sets, and Azure Kubernetes Services with an average availability of 99.8% throughout the duration of the incident.

Root Cause: On September 28 at 21:25 UTC, a service update targeting an internal validation test ring was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process.

Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries. Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.

In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade.

Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue.

Mitigation: Our monitoring detected the service degradation within minutes of initial impact, and we engaged immediately to initiate troubleshooting. The following mitigation activities were undertaken:

  • The impact started at 21:25 UTC, and within 5 minutes our monitoring detected an unhealthy condition and engineering was immediately engaged.
  • Over the next 30 minutes, in concurrency with troubleshooting the issue, a series of steps were undertaken to attempt to minimize customer impact and expedite mitigation. This included proactively scaling out some of the Azure AD services to handle anticipated load once a mitigation would have been applied and failing over certain workloads to a backup Azure AD Authentication system.
  • At 22:02 UTC, we established the root cause, began remediation, and initiated our automated rollback mechanisms.
  • Automated rollback failed due to the corruption of the SDP metadata. At 22:47 UTC we initiated the process to manually update the service configuration which bypasses the SDP system, and the entire operation completed by 23:59 UTC.
  • By 00:23 UTC enough backend service instances returned to a healthy state to reach normal service operational parameters.
  • All service instances with residual impact were recovered by 02:25 UTC.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to) the following:

We have already completed

  • Fixed the latent code defect in the Azure AD backend SDP system.
  • Fixed the existing rollback system to allow restoring the last known-good metadata to protect against corruption.
  • Expand the scope and frequency of rollback operation drills.

The remaining steps include

  • Apply additional protections to the Azure AD service backend SDP system to prevent the class of issues identified here.
  • Expedite the rollout of Azure AD backup authentication system to all key services as a top priority to significantly reduce the impact of a similar type of issue in the future.
  • Onboard Azure AD scenarios to the automated communications pipeline which posts initial communication to affected customers within 15 minutes of impact.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: https://aka.ms/AzurePIRSurvey

So authentication operations were affected, with the impact varying by region: in Asia the success rate was 72% for the first 120 minutes of the incident, then dropped to 32% as business-hours traffic peaked. Service was largely restored by 00:23 UTC on September 29.

The root cause: a service update intended for an internal validation test ring (a test environment) crashed the Azure AD backend services on startup. Normally that update would never have gone further, but a latent code defect in the Azure AD backend's Safe Deployment Process (SDP) system let it bypass the normal validation process, so the problematic update was deployed straight into production.
The production environment has proper isolation boundaries: Azure AD is a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers worldwide. Changes normally go first to a validation ring containing no customer data, then to an inner ring containing only Microsoft users, and finally to the production environment (with customer data), rolled out in phases across five rings over several days. Deployment metadata apparently drives which rings get targeted.
This time, a latent defect affecting the system's ability to interpret that deployment metadata meant the SDP system failed to correctly target the validation test ring; as a result, all rings, including the ones holding customer data, were targeted and deployed to concurrently (and then crashed).
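
To make that failure mode concrete, here is a minimal sketch, assuming an entirely hypothetical metadata format, ring names, and fallback behavior (this is not Microsoft's actual SDP code): if the targeting step defaults to "all rings" whenever it cannot interpret the metadata, a single parsing defect is enough to push a test build everywhere at once.

  # Hypothetical sketch of metadata-driven ring targeting; the ring names,
  # metadata fields, and fallback behavior are illustrative assumptions.
  RINGS = ["validation", "microsoft-internal", "prod-ring-1", "prod-ring-2", "prod-ring-3"]

  def resolve_target_rings(metadata: dict) -> list:
      """Return the rings a build should be deployed to, per its deployment metadata."""
      try:
          # Expected path: the metadata names exactly one ring to start the phased rollout.
          ring = metadata["rollout"]["initial_ring"]
          if ring not in RINGS:
              raise ValueError(f"unknown ring: {ring!r}")
          return [ring]
      except (KeyError, ValueError):
          # Dangerous fallback: if the metadata cannot be interpreted, target everything.
          # Combined with a latent defect, a permissive default like this turns a test
          # deployment into a concurrent rollout to all rings, production included.
          return list(RINGS)

  broken_metadata = {"rollout": {"initial_rng": "validation"}}  # garbled key the parser cannot interpret
  print(resolve_target_rings(broken_metadata))  # -> every ring at once

A fail-closed default (deploy to nothing when the metadata cannot be interpreted) is presumably the kind of "additional protections" item listed later under the remaining steps.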

Since the update crashed, an automated rollback was attempted within minutes to revert the change, but the same latent SDP defect had corrupted the deployment metadata, so it didn't work. They had to fall back to a manual rollback process, which significantly extended the time needed to mitigate the issue.
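
Roughly why the automation stalled, as a hypothetical sketch with made-up names and structure: if rollback reads the same metadata store that the defect corrupted, automation has nothing trustworthy to act on unless a separately protected last-known-good copy exists, which is what one of the completed repair items below describes.

  # Hypothetical sketch: automated rollback that depends on deployment metadata.
  # A checksummed last-known-good copy lets automation proceed even when the live
  # metadata is corrupted; without it, operators are forced into a manual rollback.
  import hashlib
  import json
  from typing import Optional

  def checksum(doc: dict) -> str:
      return hashlib.sha256(json.dumps(doc, sort_keys=True).encode()).hexdigest()

  def automated_rollback(live_metadata: dict, expected_sha: str,
                         last_known_good: Optional[dict]) -> str:
      if checksum(live_metadata) == expected_sha:
          return live_metadata["previous_build"]      # normal automated path
      if last_known_good is not None:
          return last_known_good["previous_build"]    # restore from the protected copy
      # Live metadata is corrupted and no protected copy exists: automation is stuck.
      raise RuntimeError("deployment metadata corrupted; manual rollback required")

  good = {"previous_build": "build-1041", "current_build": "build-1042"}
  good_sha = checksum(good)
  corrupted = {"previous_buidl": "build-1041"}        # corruption: key garbled
  print(automated_rollback(corrupted, good_sha, last_known_good=good))  # -> "build-1041"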

Monitoring detected the unhealthy condition within the first 5 minutes and engineering was engaged to start troubleshooting; in parallel, steps were taken to minimize customer impact and speed up mitigation (proactively scaling out some Azure AD services to handle the load expected once a mitigation was applied, and failing over certain workloads to a backup Azure AD authentication system).
After the root cause was established, the automated rollback was initiated, failed, and they switched to a manual rollback; the full manual rollback took roughly two hours to complete, and the service instances with residual impact recovered over the following couple of hours.

The following countermeasures have already been completed:

  • Fixed the latent code defect in the Azure AD backend SDP system
  • Fixed the rollback system so the last known-good metadata can be restored, protecting against corruption
  • Expanded the scope and frequency of rollback operation drills

The remaining steps:

  • Apply additional protections to the Azure AD backend SDP system
  • Expedite the rollout of the Azure AD backup authentication system to all key services as a top priority, to significantly reduce the impact of a similar issue in the future
  • Onboard Azure AD scenarios to the automated communications pipeline that posts an initial notification to affected customers within 15 minutes of impact

That about sums it up. A number of things went wrong together, which is what dragged the incident out.
If you have anything to say, leave feedback here: https://aka.ms/AzurePIRSurvey
