In a year marked by unprecedented digital disruption, we turn to Chloe Maraina, a leading Business Intelligence expert renowned for her ability to translate complex data into clear, compelling narratives. As we reflect on the major outages of 2025, Chloe helps us look beyond the headlines to understand the intricate patterns of failure that impacted services from Asana to AWS. Our conversation explores the hidden dangers of configuration changes, the illusion of “all-green” dashboards, and the complex, often invisible, dependencies that define our modern digital infrastructure, offering critical insights into building a more resilient future.
Configuration changes were a recurring cause of major outages in 2025. What makes these updates so prone to error, and what practical steps, such as staged rollouts, can teams take to de-risk this process? Please walk us through a best-practice deployment.
It’s a classic case of a small change causing a massive ripple effect. In today’s incredibly complex systems, it’s almost impossible to test for every single interaction that might occur. We saw this with Asana in February, where a configuration change overloaded their server logs, triggering a cascade of server restarts. It’s like tipping over a single domino and watching the whole line fall. The best practice to de-risk this is to move away from the “big bang” deployment model. Instead, you adopt a staged rollout. You release the change to a small, controlled segment of your infrastructure first, watch it closely, and only proceed when you’re confident it’s stable. Asana themselves adopted this very model after their back-to-back incidents. This approach turns a potential catastrophe into a manageable, localized issue you can quickly roll back.
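To make that walk-through concrete, here is a minimal sketch of the staged-rollout loop, assuming hypothetical deploy_to, healthy, and roll_back helpers supplied by a team’s own tooling; the stage sizes and soak time are illustrative, not Asana’s actual process.

```python
import time

# Illustrative stage sizes and soak time; real values depend on the service.
STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet receiving the change
SOAK_SECONDS = 300                  # how long to observe each stage before expanding

def staged_rollout(change_id, deploy_to, healthy, roll_back):
    """Roll a change out in waves, reverting at the first unhealthy stage.

    deploy_to(change_id, fraction), healthy(fraction) and roll_back(change_id)
    stand in for whatever deploy tooling and health checks a team already has.
    """
    for fraction in STAGES:
        deploy_to(change_id, fraction)
        time.sleep(SOAK_SECONDS)        # let error rates, log volume, restarts surface
        if not healthy(fraction):
            roll_back(change_id)        # the blast radius stops at this stage
            return False                # a localized, reversible event, not an outage
    return True                         # fully deployed
```

The structural point is that a bad change reveals itself while it only touches a sliver of the fleet, where rolling back is routine rather than newsworthy.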
Incidents at services like Slack and Spotify showed that network connectivity can appear healthy while core backend functions are failing. How can operations teams look past misleading “all-green” dashboards to pinpoint the true source of a problem? What specific metrics are most crucial here?
This is one of the most frustrating scenarios for an operations team. The dashboards are a sea of green, with no latency and no packet loss, yet users can’t send messages or play music. We saw this play out for a grueling nine hours at Slack. Their front door was wide open, but the machinery inside was grinding to a halt because of a database overload. The key is to stop treating the network as the only source of truth. When network vitals look pristine but the application is failing, you have to pivot your investigation to the backend services immediately. It’s not about one magic metric; it’s about correlating signals. Are API calls timing out? Are database queries queuing up? Looking at the application layer and backend performance is what ultimately led investigators to the true source, a fact Slack later confirmed.
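As a sketch of that correlation mindset, the snippet below combines network vitals with application and database signals before deciding where to look; the metric names and thresholds are assumptions for the example, not Slack’s monitoring setup.

```python
from dataclasses import dataclass

@dataclass
class Vitals:
    packet_loss_pct: float    # network layer
    p99_latency_ms: float     # network layer
    api_timeout_rate: float   # application layer: share of API calls timing out
    db_queue_depth: int       # backend: queries waiting on the database

def triage(v: Vitals) -> str:
    """Correlate layers instead of trusting the green network dashboard alone."""
    network_clean = v.packet_loss_pct < 0.1 and v.p99_latency_ms < 200
    backend_strained = v.api_timeout_rate > 0.05 or v.db_queue_depth > 1_000
    if network_clean and backend_strained:
        return "network is fine: pivot to backend services and the database"
    if not network_clean:
        return "investigate the network path"
    return "no anomaly across the layers sampled"

# An 'all-green' network with a struggling backend points straight past the dashboard.
print(triage(Vitals(packet_loss_pct=0.0, p99_latency_ms=45,
                    api_timeout_rate=0.12, db_queue_depth=4200)))
```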
Zoom’s services became unreachable when its DNS records vanished from top-level domains, while a Cloudflare outage was tied to BGP route announcements disappearing. What are the key vulnerabilities in these foundational internet systems, and how can organizations better prepare for failures that occur outside their own infrastructure?
These incidents are terrifying because the problem lies completely outside your own four walls. With Zoom, their infrastructure was perfectly healthy and running, ready to answer requests. But their name server records, the very things that tell the internet how to find them, simply disappeared from the top-level domain. It was like their address was erased from the global map; no one could find their front door. Similarly, Cloudflare’s BGP routes vanished, making their DNS resolver unreachable. The vulnerability here is a dependency on core internet protocols that we often take for granted. You can’t control the TLD nameservers or the global BGP table, but you can build resilience by having robust monitoring that looks up the dependency chain, not just at your own servers. This way, you can at least diagnose the problem instantly, even if the fix is out of your hands.
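One cheap way to start looking up the dependency chain is to probe, from a vantage point outside your own network, whether your public names still resolve at all. A minimal standard-library sketch, with example hostnames only:

```python
import socket

# Hypothetical external names to probe from outside your own infrastructure.
EXTERNAL_NAMES = ["zoom.us", "one.one.one.one"]

def resolvable(hostname: str) -> bool:
    """Return True if public DNS can still map the name to an address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

for name in EXTERNAL_NAMES:
    status = "resolvable" if resolvable(name) else "NOT resolvable: suspect delegation or upstream DNS"
    print(f"{name}: {status}")
```

A fuller version would query the TLD’s nameservers directly for the delegation records and watch route visibility from external looking glasses, but even this crude probe tells you within seconds that the fault is upstream of your own servers.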
The AWS DynamoDB outage demonstrated how a failure in a single region can cascade globally through non-obvious dependencies. How can architects better map and understand these complex dependency chains, and what strategies can they use to build more resilient, less centralized systems?
The AWS incident was a masterclass in the hidden dangers of centralization. Everyone thought of US-EAST-1 as just another region, but it turned out to be the linchpin for global services like IAM. When it failed, the impact rippled outwards, taking down major customers worldwide for over 15 hours. The problem is that these critical dependency chains often aren’t obvious from standard architecture diagrams. Architects need to go deeper, conducting failure-mode analysis that asks, “What happens if this single component disappears?” The strategy for resilience is to actively design against these single points of failure. This means avoiding hard dependencies on a single regional endpoint for global services and truly distributing not just your application, but also its core control and management functions.
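That “what happens if this single component disappears?” question can be asked mechanically once dependencies are written down as a graph. A minimal sketch over an entirely hypothetical dependency map:

```python
from collections import defaultdict

# Entirely hypothetical dependency map: service -> components it hard-depends on.
DEPENDS_ON = {
    "checkout":     {"payments-api", "sessions"},
    "payments-api": {"global-iam"},
    "sessions":     {"regional-store-us-east-1"},
    "global-iam":   {"regional-store-us-east-1"},
    "dashboard":    {"global-iam"},
}

def blast_radius(failed_component: str) -> set:
    """Everything that transitively hard-depends on the failed component."""
    dependents = defaultdict(set)            # invert the edges: component -> dependents
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].add(service)

    impacted, frontier = set(), [failed_component]
    while frontier:
        node = frontier.pop()
        for svc in dependents[node]:
            if svc not in impacted:
                impacted.add(svc)
                frontier.append(svc)
    return impacted

print(blast_radius("regional-store-us-east-1"))
# Every service in the map, because a single regional store sits under the global control plane.
```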
An outage at Commonwealth Bank saw its mobile app, website, and ATMs all fail simultaneously. When multiple customer channels go down at once, what diagnostic process should a team follow to immediately identify the likely failure of a shared backend dependency?
When you see a simultaneous failure across completely different channels like that, it’s a huge diagnostic clue. The mobile app, the website, and the ATMs all have different front-end technologies and user interfaces. The odds of them all failing independently with a UI issue at the exact same moment are practically zero. This pattern immediately allows you to eliminate entire categories of problems. You don’t waste time looking at the app code or the website’s front-end servers. Instead, your focus should instantly shift to what they all have in common: the shared backend. The diagnostic process becomes a hunt for that single common dependency that underpins every customer touchpoint, which is precisely what was suspected in the Commonwealth Bank incident.
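That elimination step can be expressed as a simple intersection over a dependency inventory. A minimal sketch using hypothetical channel-to-backend mappings, not Commonwealth Bank’s actual architecture:

```python
# Hypothetical mapping of customer channels to backend dependencies.
CHANNEL_DEPS = {
    "mobile_app": {"api-gateway", "core-banking", "auth"},
    "website":    {"web-frontend", "core-banking", "auth"},
    "atm":        {"atm-switch", "core-banking"},
}

def shared_suspects(failing_channels):
    """Intersect the dependencies of every failing channel.

    Independent front ends rarely break at the same instant, so whatever they
    all share is where the investigation should start.
    """
    deps = [CHANNEL_DEPS[c] for c in failing_channels]
    return set.intersection(*deps) if deps else set()

print(shared_suspects(["mobile_app", "website", "atm"]))
# -> {'core-banking'}: the single shared backend becomes the prime suspect
```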
Cloudflare experienced an intermittent, global instability because a bad configuration file was loaded on staggered five-minute cycles. How does this type of rolling, intermittent issue complicate troubleshooting, and what specific monitoring techniques can help teams detect and resolve such problems?
Intermittent issues are the ghosts in the machine; they are incredibly difficult to troubleshoot. Unlike a hard “lights-off” outage, the service flickers. Users report problems, but by the time you investigate, everything looks fine again. In Cloudflare’s case, their proxies were refreshing a bad configuration on staggered five-minute cycles. This meant that at any given moment, only a fraction of their infrastructure was failing. This rolling pattern makes troubleshooting far harder because the symptoms are inconsistent and geographically scattered. To catch this, you need high-fidelity monitoring that can spot subtle, repetitive patterns across your entire global footprint. You’re not looking for a single massive failure, but a correlated, rhythmic blip that points to a systemic issue like a staggered rollout gone wrong.
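One way to surface that rhythmic blip is to autocorrelate a globally aggregated error-rate series and look for a peak at a fixed lag. A minimal sketch on synthetic data, assuming 30-second samples:

```python
# Synthetic, globally aggregated error-rate series sampled every 30 seconds;
# a five-minute cycle therefore shows up as a spike every 10 samples.
SAMPLE_SECONDS = 30
errors = [0.30 if i % 10 == 0 else 0.01 for i in range(120)]

def autocorrelation(series, lag):
    """Correlation of the series with itself shifted by `lag` samples."""
    n = len(series) - lag
    a, b = series[:n], series[lag:]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    std_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (std_a * std_b) if std_a and std_b else 0.0

for lag in (4, 10, 17):
    print(f"lag {lag * SAMPLE_SECONDS / 60:.1f} min: {autocorrelation(errors, lag):+.2f}")
# The sharp peak at the 5-minute lag is the signature of a staggered, cyclic refresh.
```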
What is your forecast for network and application resilience in the coming year?
I believe we’re moving past the illusion that we can prevent every failure. The systems we’ve built are simply too complex. The focus in the coming year will shift decisively from prevention to rapid detection and recovery. We’ll see more organizations embracing techniques like staged rollouts, not as a nice-to-have, but as a fundamental part of their deployment pipeline. The real competitive advantage won’t be having perfect uptime, but rather minimizing the time between when a problem is detected and when it’s resolved. It’s about building operational muscle and deep architectural understanding so that when the inevitable failure occurs, the recovery is so swift and graceful that the customer barely notices. That is the new gold standard for resilience.
