The artificial intelligence for IT operations (AIOps) market is projected to grow from $13.5 billion in 2020 to more than $40 billion in 2026, according to Mordor Intelligence. This growth reflects the increasing importance of continuous availability, or ensuring a business’s critical apps and services are always on and performing well.
Achieving high availability has become an absolute business imperative — just ask Slack or Facebook what happens when services go down. Downtime can cause enterprises to lose revenue (sometimes to the tune of millions of dollars per hour), halt internal operations and compromise customer loyalty.
But while maintaining availability is critical, avoiding service outages in our complex and distributed IT ecosystems is also very difficult. That is, it’s difficult without the right tools.
This is where the strategic use of AIOps can help. A next-generation AIOps solution can help DevOps and site reliability engineering (SRE) teams improve service reliability by detecting potential issues early in the incident lifecycle, before they impact the business. And a well-implemented tool can streamline the incident response by identifying who should respond, giving that team context to determine the right course of action and recognizing patterns to ensure those issues don’t happen again.
By reducing downtime and improving business continuity, AIOps is rapidly becoming the solution modern businesses can’t afford to live without. But there’s a rather large caveat: Successful outcomes are contingent on good data.
Garbage in, garbage out
The difference between a successful AIOps outcome and a failed one lies in the tool’s setup and implementation. Some people expect that they can buy an AIOps solution, throw data at the technology and it magically works. The reality is: Tech teams need to orchestrate AIOps solutions — and all AI-driven technology, for that matter — to yield successful business outcomes.
The best outcomes typically occur when an AIOps provider helps the customer create an AIOps strategy before implementing the tool. What’s the problem? What’s the budget? How can the technology solve the issue at hand?
In many cases, AIOps vendors help clients orchestrate the technology to solve their particular problems. They may help tech teams understand the difference between good and bad data, choose the correct data and set expectations.
If implemented correctly, AIOps can help DevOps and SRE teams resolve incidents confidently, saving time for more high-value tasks. If implementation falters, well, people will find truth behind the old adage of “garbage in, garbage out.”
How much data do you need?
Successful AI-driven outcomes are often associated with enterprise-wide, multibillion-dollar projects and big data. In reality, most modern businesses already produce enough data to reap the benefits of AIOps adoption, and the companies themselves don’t have to be particularly large. As long as the AIOps tool has access to quality data, the amount of data required can be surprisingly small.
For example, one of the most active AIOps customers I’ve worked with also has one of the smallest tech teams. To be clear, this client has applied modern DevOps practices to eliminate toil by automating every manual process possible and has thus maintained a svelte IT department. As a result, the fully implemented AIOps solution does a lot of heavy lifting behind the scenes, with astounding success.
How can you get better data?
Google’s Site Reliability Engineering book describes how to improve data quality and which data is most important to monitoring. The overarching principle: Keep it simple. Collecting more data than you need leads to confusion and complexity, which causes problems.
Google uses four specific consumer-facing metrics, what it calls the “golden signals,” to monitor how well an app or service is performing:
- Latency: the time it takes to service a request, tracked separately for successful and failed requests
- Traffic: the total demand across the network
- Errors: the rate of failed requests
- Saturation: the load on services and networks
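As a rough illustration of the four signals above, here is a minimal Python sketch that summarizes one monitoring window of request records. The `Request` record, the `golden_signals` function and the `capacity_rps` parameter are all hypothetical names chosen for this example, and treating saturation as traffic divided by an assumed capacity is a simplification, not Google's definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # True if the request succeeded


def golden_signals(requests, window_s, capacity_rps):
    """Summarize one monitoring window into the four golden signals.

    capacity_rps is an assumed, pre-measured ceiling on requests per second.
    """
    ok = [r.duration_ms for r in requests if r.ok]
    failed = [r.duration_ms for r in requests if not r.ok]
    traffic_rps = len(requests) / window_s
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_failed_ms": mean(failed) if failed else None,
        "traffic_rps": traffic_rps,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Simplification: how close current traffic is to assumed capacity.
        "saturation": traffic_rps / capacity_rps,
    }


window = [
    Request(100.0, True),
    Request(140.0, True),
    Request(30.0, False),
    Request(120.0, True),
]
signals = golden_signals(window, window_s=2, capacity_rps=10)
```

Keeping failed-request latency separate matters because fast failures would otherwise drag the average down and make the service look healthier than it is.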
While Google’s golden signals may work for some businesses, they certainly are not a solution for all. After all, AIOps can fulfill a broad range of IT use cases.
Instead of throwing all available data at a particular problem, businesses should figure out their own golden signals. What are the business’s pain points? Which metrics can measure these pain points?
But that’s just the signal (or Service Level Indicator, in SRE language). It tells you what has happened, not why it happened. Conventional wisdom states that you should limit your data collection to only the golden signals, as everything else is noise. That’s true in terms of problem identification, but the other telemetry can provide context, or insight into why the problem occurred. This is where AIOps helps. By clustering the contextual telemetry with the golden signals, you can identify causality rapidly, without an increase in ticket or paging volume.
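To make the idea concrete, here is a small Python sketch that attaches contextual events to a golden-signal alert. It uses plain time-window correlation as a stand-in for the clustering an AIOps tool would perform; the `Event` record, the `cluster_context` function and the five-minute default window are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: float     # seconds since some reference point
    source: str   # e.g. "deploy", "db", "latency-sli"
    message: str


def cluster_context(signal_alerts, context_events, window_s=300):
    """Attach context events seen up to window_s before each signal alert."""
    incidents = []
    for alert in sorted(signal_alerts, key=lambda e: e.ts):
        related = [
            e for e in context_events
            if 0 <= alert.ts - e.ts <= window_s
        ]
        # One incident per signal alert; context rides along instead of paging.
        incidents.append({"alert": alert, "context": related})
    return incidents


alerts = [Event(1000, "latency-sli", "p95 latency above target")]
context = [
    Event(900, "deploy", "checkout v2 rolled out"),
    Event(200, "db", "routine maintenance finished"),
    Event(980, "db", "connection pool exhausted"),
]
incidents = cluster_context(alerts, context)
```

Only the golden-signal alert creates an incident. The deploy and the connection pool event are attached as context, while the much older maintenance event is left out, so paging volume stays flat.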
Then, it’s a case of making sure the data is clean, complete and structured. With empty data streams, the AIOps tool can’t apply its machine learning (ML) capabilities. Just as important, computers like consistent, structured data. In fact, ML relies on consistent features, essentially independent variables, to produce models and make accurate predictions.
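One way to picture "clean, complete and structured" is a small gate in front of the tool that rejects incomplete records and coerces the rest into one consistent shape. The Python sketch below is hypothetical: the field names in `REQUIRED_FIELDS` and the `clean_records` function are assumptions for this example, not part of any particular AIOps product.

```python
# Assumed schema: every record needs these fields, coerced to these types.
REQUIRED_FIELDS = {"timestamp": float, "service": str, "metric": str, "value": float}


def clean_records(raw_records):
    """Split raw records into usable rows and rejects with a reason."""
    clean, rejected = [], []
    for rec in raw_records:
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
        if missing:
            rejected.append((rec, f"missing: {', '.join(missing)}"))
            continue
        try:
            row = {f: cast(rec[f]) for f, cast in REQUIRED_FIELDS.items()}
        except (TypeError, ValueError):
            rejected.append((rec, "wrong type"))
            continue
        # Normalize naming so " Checkout " and "checkout" become one feature value.
        row["service"] = row["service"].strip().lower()
        clean.append(row)
    return clean, rejected


raw = [
    {"timestamp": "1700000000", "service": " Checkout ", "metric": "latency_ms", "value": "182.5"},
    {"timestamp": 1700000060, "service": "checkout", "metric": "latency_ms", "value": None},
    {"timestamp": 1700000120, "service": "checkout", "metric": "latency_ms", "value": "n/a"},
]
clean, rejected = clean_records(raw)
```

Recording why a record was rejected, rather than silently dropping it, makes it easier to fix the upstream source that produced it.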
What are the benefits?
Providing an AIOps tool with targeted, clean and structured data can have expansive benefits — it can essentially do a business’s data science without having a data scientist on staff! The tool works by ingesting and normalizing data across siloed technology stacks while artificial intelligence (AI) and ML analyze this information to determine the system’s normal operating behaviors. The solution then organizes the data, giving DevOps and SRE teams a 360-degree view across the entire production stack from one central system of engagement.
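How a given tool learns "normal" is vendor-specific, but a simple stand-in conveys the idea: compare each new reading with a trailing baseline and flag large deviations. The sketch below uses a z-score over the previous five points; the `find_anomalies` function, the baseline size and the threshold of three standard deviations are assumptions for illustration, not how any specific product works.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=5, threshold=3.0):
    """Return indexes of points far from the trailing baseline (z-score test)."""
    anomalies = []
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Skip flat baselines, where a z-score is undefined.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
anomalies = find_anomalies(latency_ms)
```

Here only the 250 ms reading at index 7 is flagged. Note that the spike then inflates the baseline for the points that follow it, which is one reason production systems tend to use more robust statistics than a plain mean and standard deviation.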
The AIOps solution also reduces event noise, isolating only those alerts relevant to solving pertinent issues. And by automatically enriching data, it provides additional context to the alerts it surfaces. This context helps DevOps and SRE teams quickly understand and resolve disruptive incidents.
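As a simplified picture of noise reduction and enrichment, the Python sketch below collapses repeated alerts for the same service and check, then adds an owner to each alert it surfaces. The `reduce_noise` function, the alert fields, the `owners` lookup and the ten-minute window are all hypothetical; real tools use far richer correlation than this.

```python
def reduce_noise(alerts, owners, window_s=600):
    """Collapse repeated alerts per (service, check) and add owner context."""
    last_seen = {}
    surfaced = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["check"])
        previous = last_seen.get(key)
        last_seen[key] = alert["ts"]
        if previous is not None and alert["ts"] - previous < window_s:
            continue  # repeat of an alert that is already being handled
        # Enrichment: attach the owning team so responders know who to involve.
        surfaced.append({**alert, "owner": owners.get(alert["service"], "unassigned")})
    return surfaced


alerts = [
    {"ts": 0, "service": "checkout", "check": "http_5xx"},
    {"ts": 60, "service": "checkout", "check": "http_5xx"},
    {"ts": 120, "service": "checkout", "check": "http_5xx"},
    {"ts": 130, "service": "search", "check": "disk_full"},
]
owners = {"checkout": "payments-team"}
surfaced = reduce_noise(alerts, owners)
```

Four raw alerts become two surfaced ones. Because the window restarts with every repeat, a continuously firing check stays suppressed until it has been quiet for the full window.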
A robust AIOps tool with sufficient data also takes an algorithmic approach to root cause analysis. With root cause analysis, DevOps and SRE teams know where to begin troubleshooting and can start diagnosing the problem as soon as they open an incident ticket. With deep diagnosis, teams can speed up their incident response and, perhaps more importantly, fix these root causes to improve the operating model.
As businesses roll out innovations at increasing velocity, consumers and internal teams count on these innovative apps and services to work seamlessly. And AIOps is the contemporary technology that’s driving improvements in availability. But while the benefits are powerful, the key to achieving them is good data.
As Moogsoft’s chief evangelist, Richard Whitehead brings a keen sense of what is required to build transformational solutions. A former CTO and technology VP, Richard brought new technologies to market and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through its Series A, providing product and market guidance. He served on the advisory boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents and is considered dangerous with JavaScript.