Zscaler Inc.

31/07/2024 | News release | Distributed by Public on 31/07/2024 23:55

Microsoft Outage Detected by Zscaler Digital Experience (ZDX)

At 11:00 UTC on July 30, 2024, Zscaler Digital Experience (ZDX) saw a substantial, unexpected drop in the ZDX Score for Microsoft services around the globe. Upon analysis, we found HTTP 503 errors pointing to a Microsoft outage, with the ZDX heatmap clearly showing the impact on a global scale.

ZDX dashboard indicating a widespread Microsoft outage

ZDX enables customers to proactively identify and quickly isolate service issues, giving IT teams confidence in the root cause and reducing both mean time to detect (MTTD) and mean time to resolve (MTTR).

The ZDX Incident Dashboard uses ML models to detect problems in applications, Wi-Fi, Zscaler data centers, last-mile and intermediate ISPs, and endpoints, with automated AI-powered correlation. The dashboard lists incidents that have occurred in the last two weeks, with details on who was impacted, when, and where.

The Incidents Dashboard below captured the issue across the entire data path and identified the outage as an "application" issue. In the incident details page, you can drill down to further understand the area of impact, epicenter, who is affected, and where.

ZDX Score highlights Microsoft outage

Visible on the ZDX admin portal dashboard, the ZDX Score summarizes the experience of all users in an organization across all applications, locations, and cities on a scale of 0 to 100, where lower scores indicate a poorer user experience. The score adjusts to the time period and filters selected in the dashboard.

The dashboard shows that the ZDX Score for the Microsoft probes dropped to Poor during the outage window of approximately 2 hours. From within ZDX, service desk teams can easily see that the service degradation isn't limited to a single location or user and quickly begin analyzing the root cause.

ZDX dashboard showing Microsoft global issues
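
As a rough illustration of the "single location or widespread" check described above, the following Python sketch flags locations whose score falls in a poor band and tests whether most locations are affected. It is not ZDX code: the function names, the `poor_below=34` cutoff, and the `min_fraction` value are assumptions chosen for the example.

```python
from typing import Dict, List


def degraded_locations(scores: Dict[str, float], poor_below: float = 34) -> List[str]:
    """Return the locations whose score is under the (assumed) poor cutoff."""
    return sorted(loc for loc, score in scores.items() if score < poor_below)


def is_widespread(scores: Dict[str, float],
                  poor_below: float = 34,
                  min_fraction: float = 0.5) -> bool:
    """True when at least `min_fraction` of locations are in the poor band."""
    if not scores:
        return False
    bad = degraded_locations(scores, poor_below)
    return len(bad) / len(scores) >= min_fraction


if __name__ == "__main__":
    # Hypothetical per-location scores on the 0 to 100 scale.
    sample = {"London": 20, "New York": 25, "Tokyo": 30, "Sydney": 80}
    print(degraded_locations(sample))
    print(is_widespread(sample))
```

A result of `True` suggests the problem is not tied to one site, which is the signal a service desk would use to look beyond a local network fault.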

Also in the ZDX dashboard, "Web Probe Metrics" show, on a timeline with response times, how users were affected when trying to reach Microsoft applications. In this case, the server responded with 503 (Service Unavailable) errors, indicating that it was not ready to handle requests.

ZDX Web Probe Metrics indicating 503 errors
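
To make the idea of a web probe concrete, here is a minimal Python sketch that requests a URL, records the HTTP status and response time, and labels a 503 separately from other failures. It uses only the standard library and is not how ZDX implements its probes; `ProbeResult`, `probe`, and `classify` are hypothetical names for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]  # None when no HTTP response was received
    elapsed_ms: float


def classify(result: ProbeResult) -> str:
    """Map a probe result to a coarse label."""
    if result.status is None:
        return "unreachable"
    if result.status == 503:
        return "service_unavailable"
    if 500 <= result.status < 600:
        return "server_error"
    if 400 <= result.status < 500:
        return "client_error"
    return "ok"


def probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Fetch `url` once and record the status code and elapsed time."""
    start = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        status = err.code
    except OSError:
        # DNS failure, refused connection, timeout: no HTTP response at all.
        status = None
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(url, status, elapsed_ms)
```

Calling `classify(probe(url))` on a schedule and plotting the labels over time would give a simplified version of the timeline view described above.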

ZDX can quickly identify the root cause of user experience issues with AI-powered root cause analysis. This spares IT teams the labor of sifting through fragmented data and troubleshooting, helping accelerate resolution and keep employees productive.

With a simple click in the ZDX dashboard, you can analyze a score, and ZDX will provide insight into potential issues. As you can see, in the case of this Microsoft outage, ZDX highlights that the application is impacted while the network itself is fine.

ZDX AI-powered root cause analysis indicates the reason for the outage

When an application outage occurs, many IT teams initially suspect network issues as the underlying problem. However, as demonstrated above, AI-powered root cause analysis confirmed that the issue was at the application level, not in the network transport. This can be corroborated by examining the Cloud Path metrics from the user to the destination.

ZDX Cloud Path showing full end-to-end data path
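
The reasoning above (application errors combined with a healthy network path point to the application) can be sketched as a simple triage rule. The Python below is a simplified illustration, not ZDX's root cause analysis; the function name and the loss and latency thresholds are assumptions.

```python
from typing import Optional


def likely_fault_domain(http_status: Optional[int],
                        loss_pct: float,
                        latency_ms: float,
                        max_loss_pct: float = 2.0,
                        max_latency_ms: float = 150.0) -> str:
    """Guess where a problem sits from one web probe and one path measurement.

    The thresholds are illustrative defaults, not vendor-defined values.
    """
    network_ok = loss_pct <= max_loss_pct and latency_ms <= max_latency_ms
    app_ok = http_status is not None and http_status < 500
    if app_ok and network_ok:
        return "healthy"
    if not network_ok:
        # A degraded path has to be ruled out before blaming the application.
        return "network"
    # The path looks fine, yet the server failed or never answered.
    return "application"


if __name__ == "__main__":
    # 503 responses over a clean path, as in the outage described here.
    print(likely_fault_domain(503, loss_pct=0.0, latency_ms=45.0))
```

Checking the network first mirrors the corroboration step in the text: only when the path metrics look normal is the failure attributed to the application.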

Additionally, ZDX AI-powered analysis and dynamic alerts enable IT teams to swiftly differentiate between optimal and degraded user experiences by setting smart alerts for deviations in observed metrics. ZDX provides the ability to compare two points in time to discern differences between them. This feature helps teams identify what constitutes a good vs. poor user experience by visually emphasizing the disparities across application, network, and device metrics.
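
Comparing two points in time amounts to diffing two metric snapshots and ranking the changes. The Python sketch below shows one way to do that under stated assumptions: the metric names and values are invented for illustration, and `compare_snapshots` is a hypothetical helper, not a ZDX API.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, float, float, float]  # (metric, before, after, percent change)


def compare_snapshots(before: Dict[str, float], after: Dict[str, float]) -> List[Change]:
    """Rank metrics present in both snapshots by absolute percent change."""
    changes: List[Change] = []
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if old == 0:
            continue  # percent change is undefined for a zero baseline
        pct = (new - old) / old * 100.0
        changes.append((name, old, new, pct))
    changes.sort(key=lambda change: abs(change[3]), reverse=True)
    return changes


if __name__ == "__main__":
    # Invented values for a good and a poor point in time.
    good = {"page_fetch_ms": 400.0, "latency_ms": 40.0, "cpu_pct": 30.0}
    poor = {"page_fetch_ms": 2000.0, "latency_ms": 42.0, "cpu_pct": 33.0}
    for name, old, new, pct in compare_snapshots(good, poor):
        print(f"{name}: {old} -> {new} ({pct:+.0f}%)")
```

In this invented example the application metric moves far more than the network or device metrics, which is the kind of disparity the comparison view is meant to surface.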

According to the Microsoft status page, the outage was reported as running from 11:45 UTC to 14:00 UTC, which correlates with the ZDX data above. Microsoft services started to recover quickly, and Microsoft reported the issue resolved by 19:43 UTC.

Source: Microsoft

ZDX alerting proactively notified our customers about end-user issues, automatically opening incidents through our service desk integration (e.g., ServiceNow) well before users reported problems. From a single dashboard, customers could swiftly pinpoint the problem as a Microsoft issue rather than an internal network outage, thus conserving valuable IT resources.

Zscaler Digital Experience effectively identified a Microsoft outage and its underlying cause, reassuring our customers that the issue was neither localized to a single area nor related to their networks or devices, preventing significant business disruption.

Try Zscaler Digital Experience today

ZDX enables IT teams to oversee digital experiences from the user's point of view, enhancing performance and quickly resolving issues related to applications, networks, and devices. To discover how ZDX can benefit your organization, please get in touch with us.