It’s a no-brainer. Proactive ops systems can figure out issues before they become disruptive and can make corrections without human intervention.
For instance, an ops observability tool, such as an AIOps tool, sees that a storage system is producing intermittent I/O errors, a sign that the storage system is likely to suffer a major failure soon. Data is automatically migrated to another storage system using predefined self-healing processes, and the failing system is shut down and marked for maintenance. No downtime occurs.
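Under the hood, this kind of automation is usually just a rule that watches a leading indicator and triggers a predefined runbook. Here's a minimal sketch in Python, assuming a hypothetical per-volume error-rate feed and placeholder remediation steps; a real AIOps platform would supply its own detection models and APIs.

```python
# A minimal sketch of a proactive self-healing rule. The metrics feed and the
# remediation calls are placeholders, not any particular vendor's API.
from dataclasses import dataclass

IO_ERROR_THRESHOLD = 5    # intermittent errors per minute treated as a warning sign
SUSTAINED_MINUTES = 10    # how long the elevated error rate must persist

@dataclass
class VolumeHealth:
    volume_id: str
    io_errors_per_minute: list[float]   # recent per-minute error counts, oldest first

def likely_to_fail(health: VolumeHealth) -> bool:
    """Flag a volume whose intermittent I/O errors have stayed elevated."""
    recent = health.io_errors_per_minute[-SUSTAINED_MINUTES:]
    return len(recent) == SUSTAINED_MINUTES and all(e >= IO_ERROR_THRESHOLD for e in recent)

def remediate(health: VolumeHealth) -> None:
    """Run the predefined self-healing steps: migrate data, then drain the volume."""
    # Placeholders for whatever migration and maintenance APIs you actually use.
    print(f"migrating data off {health.volume_id}")
    print(f"marking {health.volume_id} for maintenance")

if __name__ == "__main__":
    volume = VolumeHealth("vol-0abc", [0, 1, 6, 7, 5, 8, 6, 9, 5, 7, 6, 5])
    if likely_to_fail(volume):
        remediate(volume)   # no human paged, no downtime
```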
These types of proactive processes and automations occur thousands of times an hour, and the only way you’ll know that they are working is a lack of outages caused by failures in cloud services, applications, networks, or databases. We know all. We see all. We track data over time. We fix issues before they become outages that harm the business.
It’s great to have this technology to get our downtime to near zero. However, like anything, there are good and bad aspects that you need to consider.
Traditional reactive ops technology is just that: It reacts to failure and sets off a chain of events, including messaging humans, to correct the issue. When something stops working, we determine the root cause and fix it, either with an automated process or by dispatching a human.
The downside of reactive ops is the downtime. We typically don't know there's an issue until we have a complete failure; that's just the nature of the reactive process. We are not monitoring the details of the resource or service, such as I/O for storage. We focus on the binary question: Is it working or not?
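For contrast, here's a minimal sketch of that reactive pattern, assuming nothing more than an HTTP health endpoint and a placeholder paging hook. The check can only report up or down, so the first signal arrives after users are already affected.

```python
# A minimal sketch of the reactive pattern: a binary health check plus a page.
import urllib.request
import urllib.error

def is_up(health_url: str, timeout: float = 3.0) -> bool:
    """Binary check: the service either answers 200 or it doesn't."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def page_on_call(service: str) -> None:
    # Placeholder for your paging integration (PagerDuty, Opsgenie, etc.).
    print(f"paging on-call: {service} is down")

if __name__ == "__main__":
    if not is_up("https://example.com/healthz"):
        page_on_call("storage-gateway")   # by now, users are already seeing errors
```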
I’m not a fan of cloud-based system downtime, so reactive ops seems like something to avoid in favor of proactive ops. However, in many of the cases I see, even when you’ve purchased a proactive ops tool, that tool’s observability systems may not be able to see the details needed for proactive automation.
Major hyperscaler cloud services (storage, compute, database, artificial intelligence, etc.) expose fine-grained monitoring, with ongoing visibility into metrics such as I/O utilization and CPU saturation. Much of the other technology you use on cloud-based platforms may offer only primitive APIs into its internal operations and can tell you only whether it is working or not. As you may have guessed, proactive ops tools, no matter how good, won’t do much for these cloud resources and services.
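The practical difference shows up in what a proactive rule can even be written against. Here is a small illustrative sketch, with both interfaces invented for the example: a fine-grained API exposes the internal metrics a proactive check needs, while a primitive API can only answer up or down.

```python
# Illustrative only: the gap between fine-grained and primitive operational APIs.
from typing import Protocol

class FineGrainedAPI(Protocol):
    def cpu_saturation(self) -> float: ...   # 0.0-1.0, sampled continuously
    def io_utilization(self) -> float: ...   # 0.0-1.0

class PrimitiveAPI(Protocol):
    def is_up(self) -> bool: ...             # all the service will tell you

def proactive_check(svc: FineGrainedAPI) -> bool:
    """Only possible when the service exposes its internal metrics."""
    return svc.cpu_saturation() > 0.9 or svc.io_utilization() > 0.95

def reactive_check(svc: PrimitiveAPI) -> bool:
    """All you can do with a primitive API: notice the failure after it happens."""
    return not svc.is_up()
```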
I’m finding that more of these types of systems run on public clouds than you might think. We’re spending big bucks on proactive ops with no ability to monitor the internal systems that will provide us with indications that the resources are likely to fail.
Moreover, a public cloud resource, such as a major storage or compute system, is already monitored and operated by the provider. You’re not in control of the resources provided to you in a multitenant architecture, and the cloud providers do a very good job of running proactive operations on your behalf. They see issues with hardware and software resources long before you will and are in a much better position to fix things before you even know there is a problem. Even with a shared responsibility model for cloud-based resources, the providers take it upon themselves to keep their services running continuously.
Don’t get me wrong: Proactive ops is the way to go. The trouble is that in many instances, enterprises are making huge investments in proactive cloudops with little ability to leverage it. Just saying.