Microsoft has acknowledged that a nearly five-hour worldwide outage in its network infrastructure was caused by an incorrect configuration change that its network engineers made to WAN routers.
Microsoft explained that changing the IP address of the WAN routers was the cause of the denial of service issue.
As part of planned maintenance of the WAN routers, the company gave its network engineers instructions for modifying the router configuration. After an incorrect configuration change procedure was applied, Microsoft's WAN routers began disconnecting from other routers in the network in a cascade. Those devices updated their routing tables and, to optimize data flow on the WAN, stopped forwarding to Microsoft's autonomous systems and the company's traffic management systems.
After analyzing the situation and looking into the issue, the company's network engineers manually rolled back the previously made adjustments. After some time, they successfully restored Microsoft cloud services' functionality.
Following the incident, Microsoft has restricted such work. Any modification to the configuration of network devices that does not comply with the company's recommendations for secure configuration changes must be blocked.
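A rule of this kind can be pictured as a simple gate that rejects a change request unless it meets every requirement. The sketch below is only an illustration: the `ChangeRequest` fields and the three checks (peer review, lab testing, a rollback plan) are hypothetical assumptions, since Microsoft's actual requirements are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    """A proposed network configuration change (all fields are hypothetical)."""
    device: str
    description: str
    peer_reviewed: bool = False
    tested_in_lab: bool = False
    rollback_plan: str = ""


def policy_violations(change: ChangeRequest) -> List[str]:
    """Return the list of assumed policy checks that the change fails."""
    problems = []
    if not change.peer_reviewed:
        problems.append("change was not peer reviewed")
    if not change.tested_in_lab:
        problems.append("change was not tested before rollout")
    if not change.rollback_plan.strip():
        problems.append("no rollback plan provided")
    return problems


def is_allowed(change: ChangeRequest) -> bool:
    """A change is allowed only when it violates none of the checks."""
    return not policy_violations(change)


# Example: a change with no review, no testing, and no rollback plan is blocked.
risky = ChangeRequest(device="wan-router-1", description="change IP address")
safe = ChangeRequest(
    device="wan-router-1",
    description="change IP address",
    peer_reviewed=True,
    tested_in_lab=True,
    rollback_plan="restore previous address from saved configuration",
)
```

Here `is_allowed(risky)` is false and `is_allowed(safe)` is true. Returning the list of violations rather than a bare yes or no lets the engineer see what is missing.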
Microsoft engineers brought all major services back up four hours after the worldwide outage began. The problem affected hundreds of millions of users around the world. The company attributed the unusual situation to a problem in its network infrastructure.
Enterprise and consumer customers could not access Azure, Microsoft 365, Microsoft Teams, Exchange Online, SharePoint Online, OneDrive for Business, and Microsoft Graph resources. Xbox game services, Minecraft servers, and VS Code resources were also unavailable.
After the outage began, Microsoft said it was troubleshooting the network systems that had caused disruptions to its cloud services. The company's engineers isolated the network configuration issue and rolled back the recent changes to the network infrastructure that had led to the access problem for users around the globe.
Microsoft has added additional capacity to its network infrastructure as part of ongoing network repairs in order to speed up the restoration process.
The company's main cloud services resumed functioning four hours after the failure. The company did not explain how a planned change to its border router network configuration was allowed to cause a global network problem.
Qrator network specialists discovered that the AS8075 autonomous system lost communication with 47 other ASNs, including AS701 UUNET (Verizon). The disconnections did not happen all at once but in several waves, as confirmed by Microsoft engineers.
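An observation like this can be reproduced in principle by comparing successive snapshots of an autonomous system's neighbors. The following minimal sketch assumes the neighbor sets are already available as sets of AS numbers. The function names and the sample data are hypothetical: only AS701 comes from the report above, and the other numbers are private-use ASNs used as placeholders.

```python
from typing import List, Set


def lost_neighbors(before: Set[int], after: Set[int]) -> Set[int]:
    """Return ASNs that were neighbors in `before` but are missing in `after`."""
    return before - after


def waves_of_loss(snapshots: List[Set[int]]) -> List[Set[int]]:
    """For each pair of consecutive snapshots, list the neighbors lost in that step."""
    return [
        lost_neighbors(prev, curr)
        for prev, curr in zip(snapshots, snapshots[1:])
    ]


# Illustrative data only, not measurements: neighbor sets of one AS
# taken at three points in time.
snapshots = [
    {701, 64512, 64513, 64514},  # before the incident
    {701, 64512, 64514},         # after a first wave
    {64514},                     # after a second wave
]

waves = waves_of_loss(snapshots)
total_lost = set().union(*waves)
```

With this sample data, the first wave drops one neighbor and the second drops two more, so `total_lost` contains three ASNs. Real measurements would come from BGP routing data rather than hand-written sets.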