On May 2, 2019 (around 6:00 a.m. on May 3 JST), a connectivity outage hit Azure, and in fact the entire Microsoft network.
The Azure status page posted the following information (as of 13:00 JST, 2019.05.03):
Network Connectivity – DNS Resolution
Summary of impact: Between 19:43 and 22:35 UTC on 02 May 2019, customers may have experienced intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc). Most services were recovered by 21:30 UTC with the remaining recovered by 22:35 UTC.
Preliminary root cause: Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.
Mitigation: To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A detailed RCA will be provided within approximately 72 hours.
The cause appears to have been a DNS delegation issue for Microsoft-managed services (App Service, Azure AD, SQL Database, and so on).
During a migration from their legacy DNS system to Azure DNS, some domains for Microsoft services were updated incorrectly, which made those services unreachable. It was not a problem with Azure DNS itself, nor with customer DNS records.
The delegation problem itself was apparently fixed quickly, but depending on the system you had to wait for the DNS record's TTL to expire, and some applications needed a restart. If a client could resolve the record fresh, bypassing its cache, recovery could be immediate.
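The "wait for the cache to expire" behavior can be sketched as a toy negative cache. This is a hypothetical, heavily simplified model of what a resolver does (real resolvers follow RFC 2308 and are far more involved); the names and TTL are made up for illustration:

```python
class NegativeCache:
    """Toy model of a resolver cache that also remembers negative answers."""

    def __init__(self, ttl, clock):
        self.ttl = ttl        # how long an answer (including NXDOMAIN) is kept
        self.clock = clock    # injected time source, for determinism
        self.cache = {}       # name -> (answer, expiry_time)

    def resolve(self, name, upstream):
        now = self.clock()
        answer, expiry = self.cache.get(name, (None, 0))
        if now < expiry:
            return answer     # still serving the cached (possibly stale) answer
        answer = upstream(name)  # cache expired: ask upstream again
        self.cache[name] = (answer, now + self.ttl)
        return answer

# Simulated timeline: the broken delegation is fixed shortly after t=0,
# but this resolver cached NXDOMAIN at t=0 with a 300-second TTL.
t = 0
cache = NegativeCache(ttl=300, clock=lambda: t)

upstream_fixed = False
def upstream(name):
    return "1.2.3.4" if upstream_fixed else "NXDOMAIN"

print(cache.resolve("db.example.net", upstream))  # NXDOMAIN, now cached
upstream_fixed = True                             # the delegation is corrected...
t = 100
print(cache.resolve("db.example.net", upstream))  # ...but the stale NXDOMAIN persists
t = 301
print(cache.resolve("db.example.net", upstream))  # cache expired: correct answer
```

This is why some systems recovered instantly (no cached negative answer) while others stayed broken until their cache expired or the process was restarted.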
In my own environment I saw both kinds of cases: services whose DNS records started resolving correctly early on, even while the outage was still in full swing, so they were fine; and Functions apps that needed a restart.
An App Service used purely for static website hosting with a custom domain, with no dependency on other Azure services, was completely unaffected, although you could no longer view its information in the Azure portal.
I'm not sure which dependent service dragged them down, but SharePoint in Office 365 was unreachable, and posting to Microsoft Teams was flaky during the outage. (Some news sites even reported it as a Microsoft 365 outage.)
Update (2019.05.06)
The RCA has been published.
RCA – Network Connectivity – DNS Resolution
Summary of impact: Between 19:29 and 22:35 UTC on 02 May 2019, customers may have experienced connectivity issues with Microsoft cloud services including Azure, Microsoft 365, Dynamics 365 and Azure DevOps. Most services were recovered by 21:40 UTC with the remaining recovered by 22:35 UTC.
Root cause: As part of planned maintenance activity, Microsoft engineers executed a configuration change to update one of the name servers for DNS zones used to reach several Microsoft services, including Azure Storage and Azure SQL Database. A failure in the change process resulted in one of the four name servers’ records for these zones to point to a DNS server having blank zone data and returning negative responses. The result was that approximately 25% of the queries for domains used by these services (such as database.windows.net) produced incorrect results, and reachability to these services was degraded. Consequently, multiple other Azure and Microsoft services that depend upon these core services were also impacted to varying degrees.
More details: This incident resulted from the coincidence of two separate errors. Either error by itself would have been non-impacting:
1) Microsoft engineers executed a name server delegation change to update one name server for several Microsoft zones including Azure Storage and Azure SQL Database. Each of these zones has four name servers for redundancy, and the update was made to only one name server during this maintenance. A misconfiguration in the parameters of the automation being used to make the change resulted in an incorrect delegation for the name server under maintenance.
2) As an artifact of automation from prior maintenance, empty zone files existed on servers that were not the intended target of the assigned delegation. This by itself was not a problem as these name servers were not serving the zones in question. Due to the configuration error in change automation in this instance, the name server delegation made during the maintenance targeted a name server that had an empty copy of the zones. As a result, this name server replied with negative (nxdomain) answers to all queries in the zones. Since only one out of the four name server’s records for the zones was incorrect, approximately one in four queries for the impacted zones would have received an incorrect negative response.
DNS resolvers may cache negative responses for some period of time (negative caching), so even though erroneous configuration was promptly fixed, customers continued to be impacted by this change for varying lengths of time.
Mitigation: To mitigate the issue, Microsoft engineers corrected the delegation issue by reverting the name server value to the previous setting. Engineers verified that all responses were then correct, and the DNS resolvers began returning correct results within 5 minutes. Some applications and services that accessed the incorrect values and cached the results may have experienced longer restoration times until the expiration of the incorrect cached information.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1) Additional checks in the code that performs nameserver updates to prevent unintended changes [in progress].
2) Pre-execution modeling to accurately predict the outcome of the change and detect potential problems before execution [in progress].
3) Improve per-zone, per-nameserver monitors to immediately detect changes that cause one nameserver’s drift from the others [in progress].
4) Improve DNS namespace design to better allow staged rollouts of changes with lower incremental impact [in progress].
So the root cause: as part of a change to update one of the name servers for DNS zones used to reach several Microsoft services such as Azure Storage and SQL Database, the change process failed, and one of the four name servers' records for those zones ended up pointing to a DNS server that held blank zone data and returned negative responses. As a result, roughly 25% of queries for the related domains (such as database.windows.net) got incorrect results, and clients could no longer reach those services.
The direct cause was a misconfigured parameter in the automation used to make the delegation change for the name server update. As part of the maintenance, only one of the four servers was updated, and that update went wrong.
The second factor: as an artifact of earlier maintenance automation, empty zone files existed on servers that were not the intended target of the delegation. That by itself was harmless, since those servers were not serving the zones in question, but because of the misconfigured change, the delegation ended up pointing at a name server holding one of those empty zones. As a result, it answered every query in the zones with NXDOMAIN.
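The "one bad server out of four, roughly 25% of queries fail" arithmetic is easy to sanity-check with a toy simulation. The server names here are hypothetical, and a naive stub resolver that picks a name server uniformly at random is assumed (real resolvers weigh servers by response time, so 25% is only the naive expectation):

```python
import random

NAME_SERVERS = ["ns1", "ns2", "ns3", "ns4"]  # four NS records for the zone
BROKEN = {"ns1"}  # the one delegation that points at an empty zone

def query(name):
    # Naive stub resolver: pick one of the zone's name servers at random.
    ns = random.choice(NAME_SERVERS)
    return "NXDOMAIN" if ns in BROKEN else "A 1.2.3.4"

random.seed(0)
trials = 100_000
failures = sum(query("database.windows.net") == "NXDOMAIN" for _ in range(trials))
print(f"failure rate: {failures / trials:.3f}")  # close to 0.25, the RCA's figure
```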
The mistake itself was corrected fairly quickly, but because DNS resolvers can cache negative responses such as NXDOMAIN (negative caching), the impact lingered for varying lengths of time.
As countermeasures, the following are in progress: additional checks to prevent unintended changes, pre-execution modeling to detect problems before a change runs, improved per-zone and per-name-server monitoring, and a better DNS namespace design (for example, staged rollouts that limit the incremental impact of a change).
This also took down XBox Live almost completely at one point. It has since recovered, but matchmaking is still unstable and things still seem a bit off.