This happened in the early hours of the morning Japan time, but it was a fairly large-scale outage.
Preliminary RCA – Authentication errors across multiple Microsoft services (Tracking ID LN01-P8Z)
Summary of impact: Between 19:00 UTC (approx) on March 15, 2021, and 09:25 UTC on March 16, 2021 customers may have encountered errors performing authentication operations for any Microsoft and third-party applications that depend on Azure Active Directory (Azure AD) for authentication.
Azure Admin Portal, Teams, Exchange, Azure KeyVault, SharePoint, Storage and other major applications have recovered. Any customers experiencing residual impact will continue to receive updates regarding these via their Azure Service Health notifications.
Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
Mitigation: Service telemetry identified the problem, and the engineering team was automatically engaged. The key removal operation was identified as the cause, and the key metadata was rolled back to its prior state at 21:05 UTC.
Applications need to pick up the rolled back metadata and refresh their caches with the correct metadata. Time to mitigation for individual applications varies due to a variety of server implementations that handle caching differently. Azure Admin Portal, Teams, Exchange, Azure Key Vault, SharePoint and other major applications have recovered. A subset of Storage resources experienced residual impact due to cached metadata, and we pushed an update to invalidate these entries and force a refresh. This process completed and mitigation for the residually impacted customers was declared at 09:25 UTC.
Azure AD is in a multi-phase effort to apply additional protections to the backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem. The first phase does provide protections for adding a new key, but the remove key component is in the second phase which is scheduled to be finished by mid-year. A previous Azure AD incident occurred on September 28th, 2020 and both incidents are in the class of risks that will be prevented once the multi-phase SDP effort is completed.
Next Steps: We understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In the September incident we indicated our plans to “apply additional protections to the Azure AD service backend SDP system to prevent the class of issues identified here.”
- The first phase of those SDP changes is finished, and the second phase is in a very carefully staged deployment that will finish mid-year. The initial analysis does indicate that once that is fully deployed, it will prevent the type of outage that happened today, as well as the related incident in September 2020. In the meantime, additional safeguards have been added to our key removal process which will remain until the second phase of the SDP deployment is completed.
- In that September incident we also referred to our rollout of Azure AD backup authentication. That effort is progressing well. Unfortunately, it did not help in this case as it provided coverage for token issuance but did not provide coverage for token validation as that was dependent on the impacted metadata endpoint.
The Root Cause Analysis investigation relating to this incident is ongoing, and a full RCA will be published when this is completed, or if any other substantive details emerge in the interim.
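According to the incident status, the outage ran from 04:00 to 18:25 JST on March 16, which is quite a long time. In practice, I think authentication itself had largely recovered by around 08:00 JST, with a few areas such as Storage still looking unhealthy.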
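The UTC times in the RCA are easy to misread by a day when converting to Japan time, so here is a minimal sketch of the conversion. The helper name `utc_to_jst` is my own, and JST is modeled as a fixed UTC+9 offset (Japan has no daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed offset of UTC+9 with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")


def utc_to_jst(year, month, day, hour, minute):
    """Convert a UTC wall-clock time to the equivalent JST datetime."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(JST)


# Timestamps taken from the preliminary RCA quoted above.
events = [
    ("impact start", utc_to_jst(2021, 3, 15, 19, 0)),
    ("metadata rollback", utc_to_jst(2021, 3, 15, 21, 5)),
    ("mitigation declared", utc_to_jst(2021, 3, 16, 9, 25)),
]

for label, ts in events:
    print(f"{label}: {ts:%Y-%m-%d %H:%M} JST")
```

Note that 19:00 UTC on March 15 already falls on March 16 in Japan.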
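Roughly speaking: Azure AD authentication failed, which affected dependent third-party applications as well as the portal, Teams, Exchange, Key Vault, Storage and so on.
Preliminary root cause: an error occurred in the rotation of the keys that Azure AD uses for cryptographic signing operations in OpenID and other standard identity protocols.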
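As part of standard security hygiene, an automated system removes keys that are no longer in use. One particular key had been marked to be retained longer than usual to support a complex cross-cloud migration, but the system ignored that mark and removed the key anyway (there was a bug that caused the "retain" state to be ignored).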
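The RCA does not describe how the cleanup automation is actually implemented, so the following is only a hypothetical sketch of this class of bug. The `SigningKey` fields (`last_used`, `retain`), the function names, and the 30-day threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    kid: str
    last_used: datetime
    retain: bool = False  # hypothetical flag: keep this key regardless of age


def keys_to_remove_buggy(keys, now, max_idle):
    """Age-only check: the retain flag is never consulted."""
    return [k.kid for k in keys if now - k.last_used > max_idle]


def keys_to_remove_fixed(keys, now, max_idle):
    """Same age check, but keys marked retain are always skipped."""
    return [k.kid for k in keys if not k.retain and now - k.last_used > max_idle]


now = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)
keys = [
    SigningKey("old-unused", now - timedelta(days=90)),
    SigningKey("migration-key", now - timedelta(days=90), retain=True),
    SigningKey("current", now - timedelta(days=1)),
]
max_idle = timedelta(days=30)  # illustrative threshold, not from the RCA

print(keys_to_remove_buggy(keys, now, max_idle))  # includes the retained key
print(keys_to_remove_fixed(keys, now, max_idle))  # retained key is skipped
```

The point is just that a time-based cleanup needs an explicit guard for exceptional keys, and that the guard is exactly the kind of rarely exercised path where a bug can hide.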
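Azure AD publishes metadata about its signing keys at a public location, and that metadata changed at 19:00 UTC. ⇒ Applications that read the metadata from Azure AD picked up the new version and stopped trusting tokens and assertions signed with the removed key. ⇒ Users could no longer access those applications.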
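Once the key removal was identified as the cause, the metadata was rolled back to its previous state about two hours later, at 21:05 UTC (06:05 JST). Each application then had to fetch the rolled-back metadata and start using the correct version, so the time to recover differed by implementation, depending on whether metadata is cached and how long the cache lives. Major applications such as Key Vault and the Azure Portal recovered. A subset of Storage resources stayed impacted for longer, so an update was pushed to force a refresh, and the incident finally ended at 09:25 UTC.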
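To see why recovery time depends on caching, here is a minimal sketch of a time-to-live cache for published signing keys. It is not how any Microsoft service is implemented; the `SigningKeyCache` class, the injected `fetch` callable, and the 24-hour TTL are assumptions for illustration only.

```python
from typing import Callable, Dict, Optional


class SigningKeyCache:
    """Caches published signing keys, keyed by key ID (kid)."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_seconds: float):
        self._fetch = fetch  # hypothetical callable returning {kid: public_key}
        self._ttl = ttl_seconds
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def invalidate(self) -> None:
        """Drop the cached entries so the next lookup refetches the metadata."""
        self._fetched_at = None

    def get(self, kid: str, now: float) -> Optional[str]:
        """Return the key for kid, refetching only when the cache has expired."""
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._keys = self._fetch()
            self._fetched_at = now
        return self._keys.get(kid)


# Published metadata right after the key was removed.
published = {"current": "pk-current"}
cache = SigningKeyCache(lambda: dict(published), ttl_seconds=24 * 3600)

print(cache.get("migration-key", now=0))      # None: the bad metadata is now cached
published["migration-key"] = "pk-migration"   # the rollback restores the key
print(cache.get("migration-key", now=3600))   # None: cache is still within its TTL
cache.invalidate()                            # forced refresh
print(cache.get("migration-key", now=3600))   # pk-migration
```

A service with a short TTL recovers soon after the rollback on its own, while one with a long TTL keeps rejecting valid tokens until its entries expire or are explicitly invalidated, which matches the pushed update described for Storage.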
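Work was already underway, in phases, to apply additional protections to the backend Safe Deployment Process (SDP) in order to prevent this class of risk. The first phase covers protection when adding a key, while protection for removing a key (the cause this time) was to be included in the second phase, which was scheduled to be finished by mid-2021. (Last September's outage falls into the same class, and it seems both would have been prevented had these protections been in place.)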
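Following the September outage, the first phase of the SDP changes is finished, and the second phase is continuing its careful staged rollout. As for Azure AD backup authentication, it covered token issuance but not token validation, because validation depended on the impacted metadata endpoint, so it did not help this time.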
Updated 2021.03.19
The RCA has been updated: the timeline is now more detailed and there are more Next Steps items.
RCA – Authentication errors across multiple Microsoft services (Tracking ID LN01-P8Z)
Summary of Impact: Between 19:00 UTC on March 15, 2021 and 09:37 UTC on March 16, 2021, customers may have encountered errors performing authentication operations for any Microsoft services and third-party applications that depend on Azure Active Directory (Azure AD) for authentication. Mitigation for the Azure AD service was finalized at 21:05 UTC on 15 March 2021. A growing percentage of traffic for services then recovered. Below is a list of the major services with their extended recovery times:
22:39 UTC 15 March 2021 Azure Resource Manager.
01:00 UTC 16 March 2021 Azure Key Vault (for most regions).
01:18 UTC 16 March 2021 Azure Storage configuration update was applied to first production tenant as part of safe deployment process.
01:50 UTC 16 March 2021 Azure Portal functionality was fully restored.
04:04 UTC 16 March 2021 Azure Storage configuration change applied to most regions.
04:30 UTC 16 March 2021 the remaining Azure Key Vault regions (West US, Central US, and East US 2).
09:25 UTC 16 March 2021 Azure Storage completed their recovery and we declared the incident fully mitigated.
Root Cause and Mitigation: Azure AD utilizes keys to support the use of OpenID and other Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC on 15 March 2021, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
Service telemetry identified the problem, and the engineering team was automatically engaged. At 19:35 UTC on 15 March 2021, we reverted deployment of the last backend infrastructure change that was in progress. Once the key removal operation was identified as the root cause, the key metadata was rolled back to its prior state at 21:05 UTC.
Applications then needed to pick up the rolled back metadata and refresh their caches with the correct metadata. The time to mitigate for individual applications varies due to a variety of server implementations that handle caching differently. A subset of Storage resources experienced residual impact due to cached metadata. We deployed an update to invalidate these entries and force a refresh. This process completed and mitigation for the residually impacted customers was declared at 09:37 UTC on 16 March 2021.
Azure AD is in a multi-phase effort to apply additional protections to the backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem. The first phase does provide protections for adding a new key, but the remove key component is in the second phase which is scheduled to be finished by mid-year. A previous Azure AD incident occurred on September 28th, 2020 and both incidents are in the class of risks that will be prevented once the multi-phase SDP effort is completed.
Next Steps: We understand how incredibly impactful and unacceptable this incident is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In the September incident, we indicated our plans to “apply additional protections to the Azure AD service backend SDP system to prevent the class of issues identified here.”
- The first phase of those SDP changes is finished, and the second phase is in a very carefully staged deployment that will finish mid-year. The initial analysis does indicate that once that is fully deployed, it will prevent the type of outage that happened today, as well as the related incident in September 2020. In the meantime, additional safeguards have been added to our key removal process which will remain until the second phase of the SDP deployment is completed.
- In that September incident we also referred to our rollout of Azure AD backup authentication. That effort is progressing well. Unfortunately, it did not help in this case as it provided coverage for token issuance but did not provide coverage for token validation as that was dependent on the impacted metadata endpoint.
- During the recent outage we did communicate via Service Health for customers using Azure Active Directory, but we did not successfully communicate for all the impacted downstream services. We have assessed that we have tooling deficiencies that will be addressed to enable us to do this in the future.
- We should have kept customers more up to date with our investigations and progress. We identified some differences in detail and timing across Azure, Microsoft 365 and Dynamics 365 which caused confusion for customers using multiple Microsoft services. We have a repair item to provide greater consistency and transparency across our services.
Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey at https://aka.ms/AzurePIRSurvey .
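On the last two points, the latest information and the timing of the details varied from service to service, and it was hard to tell whether a status page simply had not been updated yet, so I hope to see this improved going forward.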