Is Platform Engineering the Secret to AI Incident Success?

Chloe Maraina has spent the better part of her career translating the chaos of big data into clear, visual narratives that drive business intelligence. As an expert with a deep vision for the future of data management, she has seen firsthand how organizations often trip over the “shiny object” of new technology while ignoring the structural integrity of their systems. Her perspective is particularly relevant now, as the industry grapples with the transition from traditional observability to AI-driven incident management. She advocates for a “platform-first” mentality, arguing that the success of any autonomous agent is entirely dependent on the quality of the human-designed ecosystem it inhabits. In this discussion, we explore the reality behind the AI hype and why the most successful tech giants are focusing on their foundational “mental models” rather than just the latest algorithms.

The conversation centers on the hard-earned lessons from industry leaders who have spent years consolidating tools and refining manual workflows before even considering the implementation of AI agents. A major theme is the rejection of the “AI magic” narrative; instead, experts emphasize that incident management is an ecosystem of code ownership, documentation, and rigorous guardrails. We also delve into the “Service Reliability Hierarchy,” which posits that monitoring and a deep understanding of production must precede any advanced automation. Finally, the dialogue addresses the psychological and professional risks of over-reliance on AI, including the potential “blunting” of critical thinking skills among engineers and the innovative ways companies are managing data logistics to keep humans in control.

Many organizations spend years refining incident workflows and consolidating observability tools before ever touching AI; what does this tell us about the current state of platform maturity?

The journey for a company like Krafton Inc. highlights that true efficiency isn’t bought off a shelf; it is forged through nearly a decade of trial and error. They spent nine years meticulously refining their incident management, moving from a fragmented mess of five different observability tools into a streamlined ecosystem within Datadog. This wasn’t just about software updates; it was about shifting the mental model of their SRE teams to focus on a unified response ecosystem rather than disparate tools. The results they achieved are staggering, as their total incident volume plummeted from 107 in 2024 to just 24 by 2026. Watching the time to detect an incident shrink from an average of 8.8 minutes down to a lightning-fast 1.6 minutes proves that the foundation matters more than the “magic” of the AI itself.
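To put those figures in proportion, the relative improvement can be worked out directly. The short sketch below uses only the numbers quoted above; the helper name `percent_reduction` is an illustrative choice, not something from Krafton's tooling.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Figures quoted in the interview: incident count and mean time to detect.
incident_reduction = percent_reduction(107, 24)
detect_reduction = percent_reduction(8.8, 1.6)

print(f"Incident volume: {incident_reduction:.1f}% lower")  # 77.6% lower
print(f"Time to detect:  {detect_reduction:.1f}% lower")    # 81.8% lower
```

In other words, both incident volume and detection time fell by roughly four fifths over the period described.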

When we look at the Service Reliability Hierarchy, why is it so critical to establish a base of monitoring before moving toward autonomous agents?

The team at Getswish AB experienced every engineer’s nightmare back in 2021 when a major outage was reported in the national press before their own providers even knew it was happening. That kind of public visibility creates an intense, almost physical pressure to fix the underlying architecture, leading them to embrace the pyramid where monitoring sits as the absolute bedrock. You simply cannot skip steps by trying to automate performance testing or complex release procedures if you haven’t mastered how production actually runs on a day-to-day basis. By 2024, they were able to detect and resolve another high-profile incident within five hours, a testament to the GitOps and SRE practices they spent years bootstrapping. Their focus on the OODA loop—Observe, Orient, Decide, Act—ensures that when AI agents like Bits AI are eventually introduced, they are consuming high-quality, curated runbooks rather than feeding on chaotic, unorganized data.

There is a recurring fear that autonomous AI might take critical actions that are difficult to roll back; how are experts balancing the need for speed with production reliability?

The consensus among seasoned DevOps engineers is that while we want to reach full autonomy someday, today’s AI can still make catastrophic errors in judgment. If an agent takes an automated action on a production system that is hard to undo, the risk to reliability is simply too high to justify the speed. This is why companies are currently prioritizing AI assistance over full autonomy, focusing on tasks that help human responders work faster without taking the steering wheel. Krafton uses tools like the MCP server and Pup CLI to give agents access to context data in Kubernetes and Jira, but they keep the human in the loop for the final decision. The mean time to repair has dropped from 53.5 minutes to 10.3 minutes precisely because the AI is drafting postmortems and debugging root causes, but the humans are still the ones providing the guardrails.
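One way to picture the "assist, but keep the human in the loop" stance is an approval gate in front of anything that changes production. The sketch below is a minimal illustration under stated assumptions: `ProposedAction`, `execute_with_guardrail`, and the `mutates_production` flag are hypothetical names, and nothing here reflects how Krafton, the MCP server, or Pup CLI actually work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    """An action suggested by an AI agent (hypothetical structure)."""
    description: str
    mutates_production: bool  # True if the action changes a live system


def execute_with_guardrail(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], None],
) -> str:
    """Run assistive actions directly; require human approval for the rest."""
    if action.mutates_production and not approve(action):
        # The human responder declined, so nothing touches production.
        return "rejected"
    run(action)
    return "executed"


if __name__ == "__main__":
    log = []
    # Drafting a postmortem is read-only, so it runs without a prompt.
    draft = ProposedAction("draft postmortem", mutates_production=False)
    print(execute_with_guardrail(draft, approve=lambda a: False, run=log.append))
    # A restart changes production, so the (simulated) human decides.
    restart = ProposedAction("restart deployment", mutates_production=True)
    print(execute_with_guardrail(restart, approve=lambda a: False, run=log.append))
```

The design choice is that the agent never decides for itself whether a change is safe: the approval callback stands in for the human responder, and a declined action simply never runs.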

Some engineers have raised concerns about “blunting the sharpness of our skills” as AI takes over log formatting and data generation; is there a risk of losing our critical thinking?

This is a profound concern that resonates with many in the field, as the shift toward AI-driven log management could lead to a loss of discipline in how developers generate output. If an AI takes care of all the formatting and detail, there is a legitimate worry that engineers will stop thinking critically about the underlying logic of their code. We run the risk of becoming overly dependent on agents, essentially handing over our agency and letting our problem-solving muscles atrophy. While these solutions allow us to iterate faster and ship more code, we must be careful that we aren’t trading deep technical understanding for a superficial boost in velocity. The goal should be to use AI to enhance our capabilities, not to replace the intellectual rigor that defines high-level engineering.

How are large institutions like US Bank innovating to ensure that developers still have the power to control their data even within an automated environment?

At an enterprise scale, not every piece of data deserves a “first-class seat” in a high-cost observability platform, which is why teams are building custom routing solutions to manage retention costs. US Bank developed a proof-of-concept application called “Kickflip” which puts the control back into the hands of application teams, allowing them to enable debug logs only when they are specifically needed. This is a brilliant way to balance the massive scale of logs routed to repositories like Amazon S3 with the need for high-powered AI features like Bits AI when a crisis occurs. It acknowledges that while automation handles the bulk of the work, the application team must retain the ability to “kick” the system into a higher gear of observability when things go wrong. It’s a shift from passive monitoring to active, intentional data management that keeps the human element at the center of the strategy.
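The routing idea can be sketched in a few lines. This is not Kickflip's implementation, which the interview does not describe; `LogRouter`, the destination labels, and the per-service toggle are all assumptions made for illustration. The sketch assumes debug logs go to low-cost archive storage by default and reach the observability platform only for services whose team has switched debug on.

```python
from dataclasses import dataclass, field

ARCHIVE = "archive"              # assumed low-cost store, e.g. object storage
OBSERVABILITY = "observability"  # assumed higher-cost, searchable platform


@dataclass
class LogRouter:
    """Hypothetical router with a per-service debug toggle."""
    debug_enabled: set = field(default_factory=set)

    def enable_debug(self, service: str) -> None:
        self.debug_enabled.add(service)

    def disable_debug(self, service: str) -> None:
        self.debug_enabled.discard(service)

    def route(self, service: str, level: str) -> str:
        """Pick a destination for one log record."""
        if level.upper() != "DEBUG":
            return OBSERVABILITY
        # Debug logs stay in the archive unless the team has opted in.
        return OBSERVABILITY if service in self.debug_enabled else ARCHIVE


if __name__ == "__main__":
    router = LogRouter()
    print(router.route("payments", "DEBUG"))  # archive
    router.enable_debug("payments")           # the team flips the switch
    print(router.route("payments", "DEBUG"))  # observability
```

The point of the toggle is that the application team, not the platform, decides when the extra detail is worth the cost.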

What is your forecast for the role of platform engineers as AI agents become more prevalent in incident management?

I believe we are entering an era where the platform engineer’s role shifts from being a “firefighter” to being an “architect of intelligence.” Over the next few years, the focus will move away from manual troubleshooting and toward the curation of the high-quality data environments that AI agents require to function safely. We will see a massive push to bootstrap internal documentation and runbooks, not just for humans, but as “training data” for the agents that will eventually handle low-level incidents. Ultimately, the engineers who thrive will be those who can build the most robust guardrails and mental models, ensuring that AI remains a powerful tool for acceleration rather than a source of unmanaged risk. The future isn’t about the AI replacing the human; it’s about the human building a platform so solid that the AI can finally be effective.
