Yesterday it was an Office 365 (Exchange Online?) outage; today it looks like a Front Door outage.
Today's outage also appears to have affected Office 365, Teams, and many other services.
Multiple Services – Mitigated
Summary of Impact: Between 00:57 and 03:40 UTC on 20 Nov 2019, customers and services utilizing Azure Front Door (AFD) services were impacted by an infrastructure service failure. This resulted in a loss of connectivity to multiple services reliant on AFD.
Azure Front Door services provide Edge caching and network entry point services to the Microsoft global network. This issue impacted a large percentage of Microsoft Services, though not all services were impacted. Many impacted services were able to initiate fail over from the AFD platform, providing immediate mitigation to their customers.
Preliminary Root Cause: During a recent periodic deployment, initial safety checks did not detect the issue and prevent the rollout. Monitoring detected the issue once service failure was experienced and alerted engineers.
Mitigation: Engineers immediately initiated deployment rollback procedures to correct the underlying Azure Front Door issue. This was completed at 02:40 UTC on 20 Nov 2019, at which point impacted services began recovering.
A detailed root cause analysis will be published within 72 hours.
So for roughly three hours(?), from the morning through around noon Japan time, various services were having problems (I was asleep).
The cause was in Azure Front Door: during a routine periodic update, the initial safety checks apparently failed to catch the problem, the rollout went wrong, and the outage followed. They seem to have rolled back right away and recovered, but since AFD is the service sitting at the edge entry point, the impact was wide. (Conversely, I wonder whether services that don't go through the CDN, or that run standalone in a single region, had no connectivity problems at all.)
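Just as a thought experiment on that last point, here is a minimal Python sketch of what "failing over away from the AFD endpoint" could look like at the application level. The hostnames and health paths are entirely hypothetical, and real services would more likely switch DNS or Traffic Manager profiles rather than probe from client code like this.

```python
import urllib.request
import urllib.error

EDGE_ENDPOINT = "https://example-app.azurefd.net/health"               # hypothetical AFD endpoint
ORIGIN_ENDPOINT = "https://example-app-japaneast.example.com/health"   # hypothetical regional origin


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers the health probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def pick_endpoint() -> str:
    """Prefer the edge (AFD) path; fall back to the regional origin when the edge fails."""
    if is_healthy(EDGE_ENDPOINT):
        return EDGE_ENDPOINT
    return ORIGIN_ENDPOINT


if __name__ == "__main__":
    print("Routing traffic to:", pick_endpoint())
```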
The RCA will probably come out before long.
The RCA is out. (Confirmed 2019/11/23 1:22)
RCA – Multiple Services – Downstream impact from Azure Front Door
Summary of Impact: Between 00:56 and 03:40 UTC on 20 Nov 2019, multiple services across Microsoft including Azure, Microsoft 365 and Microsoft Power Platform leveraging the Azure Front Door (AFD) service experienced availability issues resulting from high request failure rates. During this event, some impacted services were able to divert traffic away from the AFD service to mitigate impact for them.
One of the impacted services was the Azure Status Page at https://status.azure.com. Engineering executed the failover plan to the secondary hosting location, but this resulted in a delay in status communication changes. Communications were successfully delivered via Azure Service Health, available within the Azure management portal.
Root Cause: Azure Front Door services provide network edge caching and web acceleration services to many of Microsoft’s SaaS services, in addition to the optimization offering direct to Azure customers. A routine, periodic deployment was released through our validation pipeline that, when combined with specific traffic patterns, caused service-wide, intermittent HTTP request failures for all services utilizing the AFD service.
Investigation into the faulting behavior revealed that the combination of a sequenced code deployment, a configuration deployment and specific traffic patterns triggered a dormant code bug that instigated the platform to crash. These deployed changes were tested before being shipped to the broader cloud; however, the specific traffic pattern was not observed during test and pilot phases.
Azure Front Door deploys to over one hundred points of presence (PoPs) around the globe and deploys customer configuration globally to each of these PoPs, enabling customers to quickly make changes to their service. This is done to ensure customers are able to promptly remove regional components out of specification and update configuration for network security services to mitigate attacks. Through a staged deployment, these changes passed validation and service health-checks. Having passed these validations, propagation to global PoPs was quick, by design, to meet the aforementioned service objectives. After propagation, the fault triggering behavior was instigated only by specific traffic patterns, that occurred after the deployment had completed.
This resulted in impacted customers experiencing a high, but intermittent, rate of web request failures globally while accessing shared services across the Azure and Office platforms.
Mitigation: Global monitoring detected the issue and engaged engineers at 01:04 UTC. Engineers confirmed the multiple sources of the issue to be primarily triggered by the configuration deployment and identified a fix for the issue by 01:27 UTC. Engineers immediately initiated deployment rollback procedures to return the service to a healthy state; this rolled out quickly, progressively and completely to all global platforms by 02:40 UTC. Many of the Microsoft SaaS impacted services were able to initiate failover away from the AFD service, providing mitigation to customers while the underlying AFD mitigation was deployed.
Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- Verify that the fix deployed globally to AFD, during mitigation, is a stable release and will remain in place until all internal reviews of this issue have been completed.
- Review all service change management processes and practices to help ensure appropriate deployment methods are used.
- Review the change validation process to identify components and implement changes, required to increase test traffic diversity, improving the scope of trigger and test code paths.
- Prioritize deployment of a component independent automated recovery process so impacted deployments, like that experienced during this incident, are automatically returned to the last-known-good (LKG) state at a component layer, quickly and without manual intervention, to help reduce time to mitigate and scope of impact.
- Investigate and remediate the delay experienced with publishing communications to the Azure Status Page during the impact window.
Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey https://aka.ms/HLMF-R88
It also mentions that the Azure status site at status.azure.com was affected. The communication delay looks like a side effect of failing over to the secondary hosting location. Azure Service Health, which you can see inside the Azure portal, apparently stayed working(?).
The root cause: the combination of a sequenced code deployment, a configuration deployment, and a specific traffic pattern apparently triggered a dormant code bug that crashed the platform. That traffic pattern did not exist (was not observed) during the test and pilot phases.
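To make the "dormant bug plus unseen traffic pattern" idea concrete, here is a purely made-up toy example (nothing to do with AFD's actual code): the buggy branch is only reached for a request shape that the synthetic test traffic never produces, so validation passes and the crash only shows up against real traffic after the rollout.

```python
def handle_request(path: str, headers: dict) -> str:
    # Branch newly enabled by the latest (hypothetical) config deployment.
    if headers.get("Accept-Encoding", "").count(",") >= 3:
        # Dormant bug: this path assumes a "Range" header is always present.
        start = int(headers["Range"].split("=")[1].split("-")[0])  # KeyError on some real traffic
        return f"partial response from byte {start}"
    return f"full response for {path}"


# Synthetic test/pilot traffic: none of it hits the buggy branch, so validation passes.
test_traffic = [
    ("/index.html", {"Accept-Encoding": "gzip"}),
    ("/app.js", {"Accept-Encoding": "gzip, br"}),
]
assert all(handle_request(p, h).startswith("full") for p, h in test_traffic)

# A traffic pattern only seen in production; uncommenting this line raises KeyError,
# i.e. the "crash" appears only after the global rollout has completed.
# handle_request("/video.mp4", {"Accept-Encoding": "gzip, br, deflate, zstd"})
```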
When the new configuration was rolled out in stages, it passed validation and the service health checks and was then deployed to the global PoP locations; only afterward did the specific traffic pattern trigger the failure.
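The "Next Steps" item about automated last-known-good (LKG) recovery maps to something like the following sketch, written under my own assumptions (the PoP names, batch sizes, and the health_check stub are all invented): roll the change out to PoPs in stages, health-check after each stage, and automatically revert every touched PoP to the LKG version if anything looks wrong.

```python
from dataclasses import dataclass


@dataclass
class PoP:
    name: str
    version: str = "lkg"  # every PoP starts on the last-known-good config


def health_check(pop: PoP) -> bool:
    """Stub probe; a real check would watch live request failure rates per PoP."""
    return True


def staged_rollout(pops: list[PoP], new_version: str, stages: list[int]) -> bool:
    lkg = pops[0].version
    deployed: list[PoP] = []
    start = 0
    for size in stages:
        batch = pops[start:start + size]
        start += size
        for pop in batch:
            pop.version = new_version
            deployed.append(pop)
        if not all(health_check(pop) for pop in deployed):
            # Automated recovery: return every touched PoP to LKG, no manual steps.
            for pop in deployed:
                pop.version = lkg
            return False
    return True


if __name__ == "__main__":
    pops = [PoP(f"pop-{i:03d}") for i in range(100)]  # AFD runs 100+ PoPs globally
    ok = staged_rollout(pops, new_version="v2", stages=[1, 9, 30, 60])
    print("rollout succeeded" if ok else "rolled back to LKG")
```

Of course, as the RCA notes, the failure only appeared under traffic that arrived after the deployment had already propagated everywhere, which is exactly why they want this recovery to be automatic at the component layer rather than gated on the rollout-time health checks alone.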
If you have feedback on this incident, you can submit it at https://aka.ms/HLMF-R88.