Azure Active Directory – Authentication Errors – Resolved
Summary of impact: Between 08:30 and 11:30 UTC on 06 Apr 2018, a subset of Azure Active Directory customers in East Asia and Europe may have experienced difficulties when attempting to authenticate into resources dependent on Azure Active Directory. Downstream impact was reported by a number of Azure services, for which customers may have experienced the following: Backup: failures for the registration of new containers and backup/restore operations. StorSimple: new device registration failures and StorSimple management/communication failures. Azure Bot Service: bots reporting as unresponsive. Visual Studio Team Services: higher execution times and failures while getting AAD tokens in multiple regions. Media Services: authentication failures. Azure Site Recovery: new registrations and VM replications may also have failed.
Preliminary root cause: Engineers determined that instances of a backend service responsible for processing authentication requests became unhealthy preventing requests from completing.
Mitigation: Engineers performed a recovery of the impacted backend service.
Next steps: Engineers will continue to investigate to establish the full root cause; a full root cause analysis will be available within 72 hours.
It appears that Azure AD customers primarily in East Asia and Europe were affected. The symptom was authentication failures in some services that use Azure AD authentication (such as Media Services, Backup, Bot Service, VSTS, and Office 365). The cause seems to have been that backend instances responsible for processing authentication requests became unhealthy.
RCA – Azure Active Directory – Authentication Errors
Summary of impact: Between 08:18 and 11:25 UTC on 06 Apr 2018, a subset of customers may have experienced difficulties when attempting to authenticate into resources with Azure Active Directory (AAD) dependencies, with the primary impact experienced for resources located in the Asia, Oceania, and European regions. This stemmed from incorrect data mappings in two scale units, which degraded the authentication service for impacted customers, affecting approximately 2.5% of tenants. Downstream impact was reported by some Azure services during the impact period; customers may have experienced the following:
1. Backup: failures for the registration of new containers and backup/restore operations.
2. StorSimple: new device registration failures and StorSimple management/communication failures.
3. Azure Bot Service: bots reporting as unresponsive.
4. Visual Studio Team Services: higher execution times and failures while getting AAD tokens in multiple regions.
5. Media Services: authentication failures.
6. Azure Site Recovery: new registrations and VM replications may also have failed.
We are aware that other Microsoft services, outside of Azure, were impacted. Those services will communicate to customers via their appropriate channels.
Root cause and mitigation: Due to a regression introduced in a recent update to our data storage service, which was applied to a subset of our replicated data stores, data objects were moved to an incorrect location in a single replicated data store in each of the two impacted scale units. These changes were then replicated to all the replicas in each of the two scale units. After the changes replicated, Azure AD frontend services were no longer able to access the moved objects, causing authentication and provisioning requests to fail.
Only a subset of Azure AD scale units were impacted due to the nature of the defect and the phased update rollout of the data storage service. During the impact period, authentication and provisioning failures were contained to the impacted scale units. As a result, approximately 2.5% of tenants experienced authentication failures.
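The failure mode described above can be illustrated with a toy sketch (these names and the partitioning scheme are illustrative, not actual Azure AD internals): a storage-layer regression writes objects to the wrong location, replication faithfully copies the wrong placement to every replica, and frontends, which compute where an object *should* be, can no longer find it, so lookups fail even though the data still exists.

```python
PARTITIONS = 4

def partition_for(key: str) -> int:
    # Deterministic stand-in for the frontend's location calculation.
    return sum(key.encode()) % PARTITIONS

def write(store, key, value, regressed=False):
    p = partition_for(key)
    if regressed:
        p = (p + 1) % PARTITIONS  # the regression: object lands one partition off
    store[p][key] = value

def read(store, key):
    # Frontends always look in the computed partition.
    return store[partition_for(key)].get(key)

# A scale unit with three replicas of a partitioned store.
replicas = [[{} for _ in range(PARTITIONS)] for _ in range(3)]

# The regressed update writes a tenant's auth object to the wrong place...
write(replicas[0], "tenant-42/credential", "secret", regressed=True)

# ...and replication copies the wrong placement to every replica.
for r in replicas[1:]:
    for p in range(PARTITIONS):
        r[p].update(replicas[0][p])

# Every replica now fails the lookup, even though the object exists somewhere:
print([read(r, "tenant-42/credential") for r in replicas])   # [None, None, None]
print(any("tenant-42/credential" in p for p in replicas[0]))  # True
```

This is why containment tracked the phased rollout: only scale units whose stores received the regressed update produced misplaced objects, and within those units every replica returned the same failed lookup.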
1. 08:18 UTC – Authentication failures when authenticating to Azure Active Directory detected across a subset of tenants in Asia-Pacific and Oceania.
2. 08:38 UTC – Automated alerts notified engineers about the incident in the APAC and Oceania regions.
3. 09:11 UTC – Authentication failures when authenticating to Azure Active Directory detected across a subset of tenants in Europe.
4. 09:22 UTC – Automated alerts notified engineers about the incident in Europe; engineers were already investigating as part of the earlier alerts.
5. 10:45 UTC – Underlying issue was identified and engineers started evaluating mitigation steps.
6. 11:21 UTC – Mitigation steps applied to impacted scale units.
7. 11:25 UTC – Mitigation and service recovery confirmed.
Next steps: We understand the impact this incident caused our customers. We apologize, and we are committed to making the necessary improvements to the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Isolate and deprecate replicas running the updated version of the data store service [Complete]
2. Develop and deploy a fix to eliminate the regression [In Progress]
3. Improve telemetry to detect unexpected movement of data objects to an incorrect location [In Progress]
4. Improve resiliency by updating the data storage service to prevent impact should similar changes occur in data object location [In Progress]
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: https://survey.microsoft.com/698785