According to Forbes, the recent failure of AWS's US-EAST-1 region knocked out major platforms including Snapchat, Reddit, Fortnite, and financial apps for several hours, highlighting system fragility. A 2025 New Relic report puts the median cost of major outages at nearly $2 million per hour, making faster detection and recovery a financial imperative. While 87% of organizations say their AIOps investments have met or exceeded expectations, only 12% have achieved full enterprise-wide deployment, held back by data quality and integration challenges. Gaurav Toshniwal, CEO of Sherlocks.ai, explains that AIOps value comes from cutting alert noise and speeding fixes, which translates to higher retention, lower churn, and better customer satisfaction scores. Riverbed’s global survey shows persistent barriers, including infrastructure complexity, are slowing broader AI adoption across IT operations.
The ROI Problem
Here’s the thing about AIOps: everyone wants it, but nobody can quite prove it’s working. The tools promise to cut through alert noise and fix problems faster, but when performance improves, it’s often unclear what actually drove the change. Is it the AI? Better data? Improved workflows? Most companies can’t easily separate these factors.
Toshniwal’s company tries to solve this by benchmarking mean time to detect (MTTD) and mean time to resolve (MTTR) before and after deployment. It also tracks what percentage of issues gets automatically triaged or resolved through its recommendations. But this level of measurement is rare across the industry. Basically, we’re in that awkward phase where everyone’s buying the tools but few can clearly articulate the return.
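The before-and-after benchmarking described above is simple enough to sketch. The snippet below is illustrative only: the incident records, field names, and numbers are invented for the example, not drawn from Sherlocks.ai or any real platform. It computes median time to detect and resolve for two periods, plus the share of incidents that were auto-triaged.

```python
from statistics import median

# Hypothetical incident records: (minutes_to_detect, minutes_to_resolve).
# All values are made up for illustration.
incidents_before = [(20, 180), (35, 240), (15, 95)]
incidents_after = [(6, 70), (9, 110), (4, 45)]
auto_triaged_after = [True, True, False]  # did the tool triage it automatically?

def mttd_mttr(incidents):
    """Median time to detect and median time to resolve, in minutes."""
    return (median(d for d, _ in incidents), median(r for _, r in incidents))

mttd_b, mttr_b = mttd_mttr(incidents_before)
mttd_a, mttr_a = mttd_mttr(incidents_after)

# Report the improvement as a percentage reduction in each median.
print(f"MTTD: {mttd_b} -> {mttd_a} min ({100 * (1 - mttd_a / mttd_b):.0f}% faster)")
print(f"MTTR: {mttr_b} -> {mttr_a} min ({100 * (1 - mttr_a / mttr_b):.0f}% faster)")
print(f"Auto-triage rate: {100 * sum(auto_triaged_after) / len(auto_triaged_after):.0f}%")
```

Medians rather than means keep one marathon outage from skewing the benchmark, which matters when the sample of incidents is small.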
Different Companies, Different Realities
Startups tend to see value faster because they deploy rapidly and face frequent incidents. Automation lets smaller teams stay reliable without adding heavy operational overhead. But larger enterprises? That’s a whole different story.
Legacy systems, overlapping vendor tools, and dependence on a few key engineers make reliability both harder to measure and far more expensive when it fails. The real ROI for big companies isn’t just automation—it’s about turning implicit knowledge into explicit, reusable intelligence. When those experienced engineers leave or become unavailable, critical context disappears. AIOps becomes a way to preserve hard-earned expertise.
The Accountability Push
After the AWS outage, even major financial institutions started rethinking how they track performance and risk. Christer Holloman noted in Forbes that they’re exploring multi-cloud strategies to limit exposure. The message is clear: with downtime costs climbing, executives want evidence that their tech investments actually add business value.
Toshniwal thinks we need a “reliability scorecard” that tracks detection speed, fix times, change failure rate (how often updates cause incidents), and avoided downtime. Consistent benchmarks would make results easier to compare and bring transparency to the market. And let’s be honest: that transparency is long overdue.
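A scorecard like the one Toshniwal describes could be as simple as one comparable record per reporting period. This is a minimal sketch of that idea; the class name, metrics, and quarterly figures are assumptions for illustration, not a published standard.

```python
from dataclasses import dataclass

@dataclass
class ReliabilityScorecard:
    """One comparable reliability record per reporting period (illustrative)."""
    mttd_minutes: float            # median time to detect
    mttr_minutes: float            # median time to resolve
    change_failure_rate: float     # fraction of deploys that cause incidents
    avoided_downtime_hours: float  # estimated downtime prevented

    def summary(self) -> str:
        return (f"detect {self.mttd_minutes:.0f}m | resolve {self.mttr_minutes:.0f}m | "
                f"change failures {self.change_failure_rate:.0%} | "
                f"avoided downtime {self.avoided_downtime_hours:.1f}h")

# Two hypothetical quarters, before and after an AIOps rollout.
q1 = ReliabilityScorecard(20, 180, 0.15, 2.0)
q2 = ReliabilityScorecard(6, 70, 0.08, 11.5)
print("Q1:", q1.summary())
print("Q2:", q2.summary())
```

The point of a fixed schema is that two vendors, or two quarters, produce numbers that line up column for column, which is exactly the comparability the market currently lacks.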
The Next Frontier
AIOps has reached a turning point. The rush of investment is giving way to a more disciplined phase where proof matters more than promise. As Alois Reitbauer, chief technology strategist at Dynatrace, noted, observability is shifting from reporting application health to informing business decisions.
If the last decade was about seeing systems more clearly, the next one will be about understanding them deeply enough to act in real time. The state of observability is evolving from reactive monitoring to predictive prevention. Reliability will sit at the center of business strategy as the clearest sign that data, not guesswork, runs the show. The question is no longer whether AI can run operations—it’s whether it can make them smarter, faster, and more accountable.
