Service Incident - May 27, 2026 - Support and Chat | Pod 25

Zendesk

A two-hour outage on Pod 25 on May 27, 2026 exposed a serious weakness in Zendesk's internal infrastructure automation. Between 05:00 and 07:00 UTC, an automated load-balancing job triggered an unusually large data redistribution (hundreds of gigabytes) that overwhelmed the event-processing layer and cascaded into customer-visible delays of up to five minutes across chat, messaging, and ticket assignment. The incident is notable because it was caused not by external traffic spikes or resource exhaustion, but by internal automation operating without adequate guardrails. This raises a fundamental question for teams managing Zendesk at scale: how many other automated processes are running with insufficient controls, and what is the blast radius if they malfunction?

The root cause analysis points to a systemic gap between infrastructure design and operational safeguards. Zendesk's response, disabling the balancing job and adding controls after the fact, addresses the symptom rather than the underlying architectural assumption that internal operations can be safely automated without circuit breakers or magnitude thresholds. For CX leaders, the incident carries immediate implications. Teams relying on real-time chat and messaging for customer engagement experienced degradation during what was likely business hours in multiple regions, directly affecting SLA compliance and customer experience metrics. That this affected only a "small subset of customers" on a single pod offers limited comfort: in a pod-based architecture, a similar incident on another pod would hit a different set of customers.
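To make "circuit breakers or magnitude thresholds" concrete, here is a minimal Python sketch of a guard that an automated rebalancing job could consult before moving data. It is an illustration only and not Zendesk's implementation: `RebalanceGuard`, `run_rebalance`, the `move` callback, and the 50 GiB limit in the usage note are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

GIB = 1024 ** 3


class GuardrailTripped(Exception):
    """Raised when a planned operation is refused by the guard."""


@dataclass
class RebalanceGuard:
    # Hypothetical limits; real values would come from measured capacity.
    max_bytes_per_run: int
    max_consecutive_failures: int = 3
    consecutive_failures: int = 0
    is_open: bool = False  # True once the breaker has tripped

    def check(self, planned_bytes: int) -> None:
        """Refuse the run if the breaker is open or the move is too large."""
        if self.is_open:
            raise GuardrailTripped("circuit open; manual reset required")
        if planned_bytes > self.max_bytes_per_run:
            raise GuardrailTripped(
                f"planned move of {planned_bytes} bytes exceeds "
                f"limit of {self.max_bytes_per_run} bytes"
            )

    def record(self, success: bool) -> None:
        """Track outcomes; open the breaker after repeated failures."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.is_open = True


def run_rebalance(
    guard: RebalanceGuard,
    planned_bytes: int,
    move: Callable[[int], bool],
) -> bool:
    """Run `move` only if the guard allows it, then record the outcome."""
    guard.check(planned_bytes)
    try:
        ok = move(planned_bytes)
    except Exception:
        guard.record(False)
        raise
    guard.record(ok)
    return ok
```

With `RebalanceGuard(max_bytes_per_run=50 * GIB)`, a planned move of 300 GiB raises `GuardrailTripped` before any data is touched, and repeated failed runs stop further automated attempts until someone resets the guard. The design choice is that the job must state its intended magnitude up front, so an unusually large redistribution is rejected at planning time rather than discovered through downstream overload.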

Zendesk's prevention roadmap of adding safeguards, improving monitoring, and tuning capacity is reactive rather than proactive. The incident also fits a pattern of recent service disruptions, including a message delivery incident on May 19 that affected all pods. For support leaders evaluating platform reliability, the question becomes whether Zendesk's infrastructure is scaling faster than its operational maturity. Teams should audit their incident response procedures and consider whether their SLAs adequately account for internal automation failures, a category of outage that is difficult to predict and likely to become more common as platforms grow in complexity.