Chloe Maraina is a powerhouse in the world of business intelligence and data science, renowned for her ability to transform cold, hard numbers into vivid visual narratives. As an expert in data management and integration, she has spent years navigating the complexities of enterprise infrastructure, making her a vital voice in the conversation surrounding the rise of autonomous AI agents. In this discussion, we explore the high-stakes evolution of AgenticOps, a field where the promise of self-healing systems clashes with the messy reality of architectural reliability and the “deterministic dilemma.” We dive deep into the shifting trust levels among IT professionals, the daunting physical scale of next-generation data centers, and the economic shift toward specialized small language models that could finally bridge the gap between human oversight and full machine autonomy.
Current data indicates that IT operations has become the primary frontier for AI agents, yet many professionals are still hesitant to give these systems full control. What is driving this massive wave of adoption, and where do we currently stand on the spectrum between simple alerts and true autonomy?
It is a fascinating time to be in the trenches of IT infrastructure because we are seeing a massive psychological shift in how teams perceive automation. According to recent Omdia research involving 400 IT and cybersecurity professionals across North America, a staggering 81% identified IT operations as the main area where they are using, piloting, or planning to deploy AI agents. This outpaces even data analytics and business intelligence, which sat at 62% in the same study, showing that the pressure to manage complex systems now outweighs AI’s more traditional analytics use cases. When you look at the spectrum of autonomy, the change is even more dramatic; we’ve seen a complete collapse in the “alerts only” category, which plummeted from a significant portion of the market to a tiny 1.2% in just a year. Today, 60% of professionals are comfortable with AI providing recommendations that a human then triggers manually, while the group ready for fully autonomous actions has jumped to 28.5%. It feels like a high-stakes tightrope walk where the sheer volume of work is finally forcing administrators to lean on these agents just to keep their heads above water.
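To make the autonomy spectrum concrete, here is a minimal Python sketch of the three levels described above. The names `AutonomyLevel` and `handle_finding` are hypothetical and invented for illustration; they do not come from the Omdia study or from any product.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Three points on the autonomy spectrum (names are illustrative)."""
    ALERT_ONLY = 1                # agent only reports the problem
    RECOMMEND_MANUAL_TRIGGER = 2  # agent proposes a fix, a human triggers it
    FULL_AUTONOMY = 3             # agent applies the fix on its own


def handle_finding(level: AutonomyLevel, finding: str, remediation: str,
                   approved: bool = False) -> str:
    """Return what the agent is permitted to do at the given level."""
    if level == AutonomyLevel.ALERT_ONLY:
        return f"alert: {finding}"
    if level == AutonomyLevel.RECOMMEND_MANUAL_TRIGGER and not approved:
        return f"recommend: {remediation} (awaiting human approval)"
    # Either a human approved the recommendation or the agent is fully autonomous.
    return f"execute: {remediation}"
```

In this sketch, the middle level is simply the fully autonomous path with a human approval flag placed in front of it, which is one way to read the gradual migration between categories.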
Despite that growing comfort, there are some chilling stories about agents going rogue and causing catastrophic damage to production environments. How do we reconcile the urgent need for speed with the very real risks of corruption and data loss in long, multi-step workflows?
The hesitation we see is grounded in some very visceral, real-world horror stories that keep IT directors up at night, such as agents accidentally wiping out entire email inboxes or deleting production databases along with their backups. The data from Microsoft Research really underscores these fears, showing that across complex 20-step document-based workflows, agents based on frontier models actually corrupted the documents roughly 25% of the time. What’s even more frustrating for architects is that adding an “agentic harness” or extra tools to these models actually made their performance 6% worse on average, according to some benchmarks. This creates a massive trust barrier, with nearly 60% of professionals citing the time needed to validate and become comfortable with the technology as their biggest hurdle. You can almost feel the collective intake of breath when a team considers letting an agent handle remediation on its own; the fear of a “rogue” AI is a heavy weight that prevents full-scale adoption even when the potential benefits are massive.
We are hearing about a $5 trillion buildout to support AI, with data centers moving toward gigawatt scale and network traffic expected to explode. In such an environment, is the traditional “human in the loop” model even sustainable?
The reality is that we are rapidly approaching a point where the sheer scale of the infrastructure will simply outrun the human capacity to monitor it. With data centers expanding to gigawatt scale and massive rack-scale hardware like Nvidia’s Vera Rubin set to ship later this year, the complexity is becoming almost incomprehensible to the human mind. Cisco executives have already estimated that network traffic associated with AI will triple in just the next three years, creating a data deluge that no human team could ever sort through in real time. We are reaching a tipping point where keeping a human in every loop will actually start to slow down the recovery of mission-critical systems, making the infrastructure less resilient overall. Experts are already warning that in the next 12 to 18 months, as we hit the need for megawatt racks, we will have to remove the human safety net just to maintain the speed required for these systems to function. It is a terrifying prospect for some, but when you’re dealing with server clusters spread across multiple gigawatt data centers, the machine is the only thing fast enough to keep pace with itself.
To solve these reliability issues, some propose wrapping AI agents in deterministic workflows, but others fear this will lead us back to the static, limited systems of the past. How do we find the balance between rigid rules and the adaptability that makes AI valuable?
This is the “deterministic dilemma” that is currently dividing the industry: do we constrain the AI with strict runbooks, or do we let it innovate? Some lead architects suggest that the best path is to keep agents inside deterministic workflows where the reasoning handles the orchestration, but a separate, non-AI system handles the actual execution and validation. This approach ensures that every action is bounded, observable, and reversible, which is essential for preventing a small model error from escalating into a site-wide outage. However, the risk is that we recreate the problems of the first generation of AIOps, where the rules were so static and manual that they couldn’t keep up with the changing nature of the cloud. The true promise of AgenticOps is its ability to be adaptable, to gather its own data and make conclusions about what to do next without waiting for a human to update a rule. If we lean too hard into determinism, we lose the speed and flexibility that we bought into in the first place, turning our advanced AI back into a glorified script.
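As a rough illustration of the "bounded, observable, and reversible" idea, the Python sketch below separates what an agent proposes from what a non-AI executor will actually run. Everything here is an assumption made for the example: `ProposedAction`, `DeterministicExecutor`, and the `drain_node` step are hypothetical names, not part of any vendor's product or of the architectures discussed in the interview.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]               # e.g. {"node-a": "active"}
Step = Callable[[State, str], None]  # a function that mutates state for a target


@dataclass(frozen=True)
class ProposedAction:
    """What the reasoning layer (the agent) asks for. Hypothetical shape."""
    name: str
    target: str


@dataclass
class DeterministicExecutor:
    """Non-AI layer: runs only allowlisted steps, logs them, can undo them."""
    allowed: Dict[str, Tuple[Step, Step]]  # action name -> (apply, undo)
    audit_log: List[str] = field(default_factory=list)
    undo_stack: List[Tuple[Step, str]] = field(default_factory=list)

    def execute(self, action: ProposedAction, state: State) -> bool:
        # Bounded: anything outside the allowlist is refused, not attempted.
        if action.name not in self.allowed:
            self.audit_log.append(f"REJECTED {action.name} {action.target}")
            return False
        apply, undo = self.allowed[action.name]
        apply(state, action.target)
        # Reversible: remember how to undo this step.
        self.undo_stack.append((undo, action.target))
        # Observable: every decision leaves an audit record.
        self.audit_log.append(f"EXECUTED {action.name} {action.target}")
        return True

    def rollback(self, state: State) -> None:
        # Undo executed steps in reverse order.
        while self.undo_stack:
            undo, target = self.undo_stack.pop()
            undo(state, target)
            self.audit_log.append(f"ROLLED_BACK {target}")


def drain_node(state: State, node: str) -> None:
    state[node] = "drained"


def undrain_node(state: State, node: str) -> None:
    state[node] = "active"


def build_executor() -> DeterministicExecutor:
    return DeterministicExecutor(allowed={"drain_node": (drain_node, undrain_node)})
```

The trade-off described above shows up directly in the `allowed` table: the agent can never do more than the table permits, so a static table reproduces the rigidity of first-generation AIOps unless someone keeps it current.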
The cost of AI tokens is becoming a major roadblock for many enterprises, leading to a surge in interest for specialized small language models. How are these smaller models and new observability platforms changing the economic equation for AgenticOps?
The industry is waking up to the fact that you simply cannot scale an operation if every minor IT transaction requires an expensive call to a massive frontier LLM. When you have millions of transactions happening across a global infrastructure, the cost of those tokens becomes a crushing burden that can kill the ROI of an entire project. This has led to a “new hope” in the form of specialized, lower-cost models like Galileo’s Luna or the specialized architectures being rolled out by Datadog and IBM. These small language models are designed to be foundational elements of observability, acting as a “judge” for agent behavior without the massive price tag of a general-purpose model. I’ve seen organizations that have five licenses for every major LLM vendor and still can’t solve their specific problems because they lack the specialized focus required for infrastructure. By using these specialized, open-weight models, companies can achieve the same level of accuracy but do it dramatically faster and cheaper, which is the only way AgenticOps becomes viable at a massive scale.
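The economics can be sketched with simple arithmetic. The Python example below assumes a two-tier setup in which a small specialized model scores every event and only low-confidence events are escalated to a frontier model. The tier names, the per-token prices, and the 0.8 threshold are placeholders chosen for illustration, not real pricing from any vendor mentioned here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in dollars


# Placeholder prices for illustration only.
SMALL = ModelTier("small-specialized", 0.0002)
FRONTIER = ModelTier("frontier-general", 0.01)


def tiered_cost(confidences: List[float], tokens_per_event: int = 1000,
                threshold: float = 0.8) -> float:
    """Cost when the small model sees every event and the frontier
    model is called only for events scored below the threshold."""
    units = tokens_per_event / 1000
    small_cost = len(confidences) * units * SMALL.cost_per_1k_tokens
    escalated = sum(1 for c in confidences if c < threshold)
    frontier_cost = escalated * units * FRONTIER.cost_per_1k_tokens
    return small_cost + frontier_cost


def frontier_only_cost(n_events: int, tokens_per_event: int = 1000) -> float:
    """Baseline: every event goes straight to the frontier model."""
    return n_events * (tokens_per_event / 1000) * FRONTIER.cost_per_1k_tokens
```

Under these assumed prices, four events with small-model confidences of 0.95, 0.9, 0.5, and 0.99 cost roughly $0.0108 with tiering versus $0.04 when every event goes to the frontier model. The real ratio depends entirely on actual pricing and on how often the small model needs to escalate.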
For a large organization looking to achieve a “self-healing” IT environment, what does the actual roadmap look like, and what are the foundational requirements that most teams overlook?
Achieving a self-healing environment is the ultimate goal, but it is the capstone of a very long and often painful multi-year journey of data management. You cannot simply drop an AI agent into a messy environment and expect it to work; it requires a massive effort to centralize data collection on standards like OpenTelemetry and consolidate dozens of separate observability tools into a single repository. For example, a major mortgage lender like Fannie Mae is only now reaching a place where they can implement self-healing because they spent years on data consolidation and hygiene. Tracking 100% of agentic activity presupposes that your logs are clean and your data is accessible, which adds its own layer of cost and complexity. It takes a high level of organizational maturity to admit that the “AI magic” only works if the boring work of data management is done first. Without that foundation, you are essentially asking an agent to navigate a forest in the dark, but with clean data, you can finally implement those high-value use cases without the constant fear of failure.
What is your forecast for AgenticOps?
Over the next 18 months, I anticipate a massive acceleration in the deployment of autonomous orchestration as organizations realize that manual oversight is the primary bottleneck in their $5 trillion infrastructure buildouts. We will see a shift away from “generalist” AI agents toward specialized small language models that can perform root cause analysis and self-healing with a much higher degree of precision and a fraction of the token cost. While the “NoOps” vision will remain out of reach for most, the organizations that have prioritized data hygiene and central observability will finally break through the trust barrier. We will likely see the first truly autonomous, gigawatt-scale data centers where AI agents manage everything from power distribution to network rerouting with zero human intervention. It will be a period of intense trial and error, but the move toward specialized, intent-based automation is now an unstoppable force in the enterprise landscape.
