The (Near) Future of AIOps

Although we have some ways to go before living the ultimate dream of self-driving datacenters, AIOps, the application of artificial intelligence algorithms to aid with infrastructure operations and DevOps in the tech industry, has found its way into the limelight, and more importantly, into reality. As a steadily increasing acceptance emerges in the operations (AKA “ops”) market for the use of intelligent insights, analytics & platform providers are seeking to cash in on and win the battle for an AIOps monopoly. In fact, there are multiple monopolies up for grabs in the diverse space of pain-points in today’s semi-automated, multi-cloud environments.

A brief history of Ops

The infrastructure operations space, over the past couple of decades, has seen a LOT of changes. These changes have been heavily influenced by high-impact technical innovations by virtualization leaders like VMware, as well as timely interest and adoption of these technologies by the market. Take a look at this rough timeline of the successful technical innovations that came out of VMware in the virtual infrastructure space:

The introduction of the ESX hypervisor (and others like it) in the early 2000s made it possible for enterprises of various shapes and sizes to adopt the notion of a virtual infrastructure, changing the face of IT management and operations. Business-critical applications shifted from running directly on physical machines to running on virtual machines that could be powered on or off, cloned, or better yet, live migrated between the physical hosts they resided on. The system administrator and ops professional would manage the virtual environment using a console, either via the command line or via a GUI like vCenter.

In the second half of that decade, we saw more automation and centralized control for on-premises data-centers come to life. DRS, HA, vApps – all were efforts to offer the IT and operations market more features to reliably control and operate their virtual environments. The first form of vRealize Operations was released in 2011, moving seamless programmability further up the stack while solving more ops pain-points in the context of the whole enterprise stack. More and more enterprises were deploying multiple data-centers; some were starting to adopt the public cloud for some of their workloads. They needed more automation. VMware’s vROps, Nutanix’s Calm, and countless other solutions emerged to solve the burning need.

Over the past 5 years, VMware has offered solutions to help manage and operate clouds across multiple providers – namely, its own vSphere stack and Amazon Web Services. Outside of the VMware ecosystem, several “multi-cloud” solutions have appeared in monitoring (Datadog, SignalFx, AppDynamics), analytics, cloud cost optimization (Beam, CloudHealth), and cloud management platforms.

As the clock ticks on, it becomes apparent that the next step in this evolution has got to be beyond policy-based automation.

Enter AI

In a previous post on Cloud Management Platforms, I walked through the traits of a viable CMP that is likely to succeed, namely –

  • a unified control plane
  • ease of deployment
  • metering, monitoring, cost control
  • policy-driven automation and governance
  • cloud-specific feature integration, and
  • migration & disaster recovery

While this version of a CMP solves a lot of problems that operators of an ever-changing multi-cloud environment face today, it is only one more step toward the dream of complete autonomy. Cloud admins are able to focus more on setting (deterministic) policies than manually operating low-level knobs, but they still need to be (manually) observant and decisive. As the era of AI looms above us, the possibilities of relinquishing more control to the cloud operations solution, thereby allowing the cloud operator to go yet another step up the stack, become more real than ever.

So what are the new pain-points?

As businesses get more digitized, IT operations teams need to be as efficient as possible. Some of the issues faced today with a deterministic & policy-based cloud operations platform are:

Too much noise in the data

Although most IT operations teams have access to a powerful data ingestion pipeline and a viable analytics tool at their disposal, it is still hard to get meaningful and accurate insights out of the data. This can be a huge problem, considering how “big” the data these systems handle is getting.

Anomaly Detection

Analytics tools show you the data, but the detection of anomalies with the naked eye is no easy feat. Usually, IT operations teams want to make sure the infrastructure is functioning OK, and their customers, both internal and external, are able to use the software and services they own without any technical issues. This single use-case alone places great emphasis on why outlier and anomaly detection can be a game-changer. It would eliminate some important reasons for a user to manually monitor the environment.
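To make the idea concrete, here is a minimal sketch of one of the simplest forms of outlier detection: flag any sample that sits more than a few standard deviations away from the mean of the samples just before it. The function name, window size, threshold, and the CPU numbers are all illustrative assumptions on my part, not the method any particular AIOps product uses.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Hypothetical helper for illustration only.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Made-up CPU utilization samples (percent), with one obvious spike.
cpu_percent = [41, 43, 42, 44, 42, 43, 41, 97, 42, 43]
print(find_anomalies(cpu_percent))
```

A trailing window keeps the baseline local, so a slow drift in load is not flagged, but it also means a spike pollutes the baseline for the next few samples. Real systems would likely need something more robust (seasonality-aware models, median-based statistics), which is exactly where the "AI" part comes in.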

Hard to predict future issues

Solving issues after-the-fact is a primary function. But complementary to that is the task of mitigating the possibility of an issue before it happens. Big Data affords us the ability to recognize patterns beyond what can be seen by the naked eye, but today’s analytics tools haven’t fully tackled that use-case. Users need a reliable way to be informed of the likelihood of an issue, so they can prevent it from occurring.
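As a rough illustration of what "being informed of the likelihood of an issue" could look like at its most basic, the sketch below fits a straight line to recent samples of a metric and estimates how many more sampling intervals remain before the trend crosses a limit. The helper names, the disk-usage numbers, and the 90% limit are assumptions made up for this example; a straight-line fit is a deliberately naive stand-in for a real forecasting model.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept,
    where x is the sample index 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def samples_until_threshold(ys, threshold):
    """Estimate how many sampling intervals after the latest sample the
    fitted trend reaches `threshold`. Returns None if the trend is flat
    or falling, and 0.0 if the trend has already crossed the threshold."""
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    return max(crossing_x - (len(ys) - 1), 0.0)

# Made-up daily disk usage samples (percent used).
disk_used_pct = [60, 62, 64, 66, 68]
print(samples_until_threshold(disk_used_pct, threshold=90))
```

With one sample per day, the returned number reads as "days until the disk hits the limit if nothing changes". It ignores bursts, seasonality, and confidence intervals, all of which a production-grade predictor would have to handle.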

Root-cause analysis

Using a metrics-based monitoring tool like the Datadog Agent will help with the “finger-pointing” problem (that is, figuring out which part of the software stack is responsible for an issue). But you’d still need to SSH into the VM yourself to figure out what happened. Either that, or you’ll also need to configure and use a log management tool. Once you get your hands on the logs, you have to pray that the log verbosity level is high enough to help you understand what happened. Then you may need to talk to the rest of your teammates to figure out if someone else has seen a similar issue. I’ll stop here, but needless to say, root-cause analysis can be time-consuming, and the price businesses end up paying for that downtime is generally too high for comfort.

An Industry in the making

For some incumbents, like VMware, AIOps is a foray they can easily slide into. But far more exciting is the prospect of young companies venturing out and trying to solve the problem in better and unique ways, tackling one problem at a time. The problems outlined above make clear that this is a market waiting to be tapped into. Current tools push the boundaries, and yet there is no out-of-the-box solution for root-cause analysis or infrastructure failure prediction. Attempts are being made, but the playing field is still wide open.

While AIOps is only a stepping stone towards the ultimate dream of an “autonomous” ops world, there is potentially a huge amount of money to be made before we move past this phase.

