Downtime: It happens to the best of us--no matter what tools we use.
Many businesses faced serious challenges this week as Amazon’s AWS S3 Service went down. In instances like this, downtime can’t be avoided and comes with a real financial cost. But what about the rest of the time?
Obviously, downtime is something to be avoided, and the good news is that there are plenty of things you can do to avoid unnecessary downtime for your business and protect yourself against risk!
Today I’m going to give you some best practices that a business of any size can adopt to protect themselves against unnecessary downtime.
Proactive Patch Management
So, let’s say you have a major outage today and you get on the phone, seeking to escalate it to the manufacturer. They respond, "You're not at the latest patch. Get it upgraded."
Here’s the problem: you're now loading the latest patch based purely on speculation. Not only are you frustrated that the manufacturer refuses to escalate your case, you now must endure additional downtime to load that patch. Next comes the waiting game. You're anxiously waiting to see if the initial outage will rear its ugly head again. Meanwhile, you're hoping that if it does, it doesn’t happen at a critical time for your business.
By proactively staying current, you're protecting your system from unplanned outages and security vulnerabilities. There's a real cost associated with downtime, whether it be the resources to diagnose the problem and restore service, executive escalations, or even missing a call from a new prospect. Remaining current with your patches reduces the likelihood of such outages by closing the holes already identified in the field, and it helps ensure that your case receives immediate traction with manufacturers. This saves you from having to endure multiple outages before you can get to the root of the problem.
So what brought Amazon's AWS S3 service down? A typo, of all things. Changes to the environment, expected and unexpected, can have a huge impact on the stability of your system. Just about any physical move or software move between users, editions, and integrations of software or equipment has the potential to cause problems.
Change management is a big area that, if followed, can make a huge difference in avoiding unnecessary downtime. Even our largest, most complex customers with high levels of change management in their organization don't follow it to a tee. I understand why!
It can feel like a process that only slows things down, because it keeps people from making changes quickly. Waiting a week to meet with the change advisory board for approval can be frustrating for something as simple as adding a user. The reality is most organizations don't do it well, and even those with a robust plan in place deviate from it often.
While it is best to have a change management process in place, it's essential to trust your support partner to help with these changes, even on a day-to-day basis! This will ensure that you're documenting the steps you plan to take and creating a fallback plan before making a change.
If those changes don't work, we know how to reverse them quickly. Additionally, we could have a full test plan to make sure that once we've made the changes, we don't impact any other areas of your business. Both things could significantly reduce the downtime experienced during changes.
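For the technically inclined, here is a rough Python sketch of that idea: make the change, run the test plan, and fall back if anything fails. The function name `apply_change` and its three arguments are hypothetical, made up for illustration. It is not how any particular support provider or tool works.

```python
from typing import Callable


def apply_change(apply: Callable[[], None],
                 test_plan: Callable[[], bool],
                 fallback: Callable[[], None]) -> bool:
    """Apply a change, run the test plan, and reverse the change on failure.

    Returns True if the change was kept, False if it was reversed.
    Assumes the fallback is safe to run even if the change only partly applied.
    """
    try:
        apply()
        if test_plan():
            return True
    except Exception:
        pass  # an error during the change or the tests counts as a failure
    fallback()
    return False
```

The point of the sketch is that the fallback step is written before the change is made, so reversing it is one call away instead of a late-night scramble.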
Maybe a robust change management process is not a fit for your business. That doesn’t mean there isn’t something else you can do to prevent or minimize downtime. One easy step with a huge impact is to create a shared changelog that covers all of your systems. A changelog is a simple text document, kept in a common folder. When anyone makes changes to your system or network, they log what they changed and when in a clean diary format.
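If you would rather script it than type it by hand, a changelog like this takes only a few lines of Python. This is a minimal sketch under assumptions: the field layout (timestamp, person, system, change) and the names `format_entry` and `log_change` are invented for illustration, and any consistent format will do.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def format_entry(who: str, system: str, change: str, when: datetime) -> str:
    """Return one diary-style line: when, who, which system, what changed."""
    return f"{when:%Y-%m-%d %H:%M} | {who} | {system} | {change}"


def log_change(path: Path, who: str, system: str, change: str,
               when: Optional[datetime] = None) -> str:
    """Append one entry to the shared changelog file and return the line written."""
    entry = format_entry(who, system, change, when or datetime.now())
    with path.open("a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return entry
```

A call such as `log_change(Path("CHANGELOG.txt"), "jsmith", "core-switch", "Moved port 12 to VLAN 20")` appends one dated line to the file, which is all a support provider needs to see what changed and when.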
It's not uncommon for customers to make changes in their network that have adverse effects on their communication environment. When you run into issues like this, a changelog can be the difference between diagnosing problems quickly and three days of, you guessed it, unnecessary downtime. Sharing this list of changes to your environment will give your support provider some visibility into what may have gone on in your system.
Funny enough, you’d be surprised at how often customers hide this stuff for fear of being charged! But at what cost? By documenting and sharing system changes we can correct problems faster, and your business suffers less! Plus, most support providers are a lot less likely to charge you for a quick look and a “you ought to go back in and reverse that.”
But if we spent three days troubleshooting? You bet.
Another great area to invest in is the ability to detect outages as they happen so you can quickly resolve them. I mean, we’re talking about technology. Problems do happen and will happen, especially as technology gives us more feature functionality and fun applications for end users. As technology becomes more complicated, things tend to break a little more, especially when it comes to software.
For example, let’s say that your business is an eight-to-five operation and you’re not open on weekends, and something happens in your system on Friday at 9 pm. You won't know about that problem until you get into the office Monday morning!
What happens if that application is down and can’t send an alert? What happens if the building goes dark? With Advanced Monitoring, you'll have the necessary pieces in place to detect and resolve problems when traditional alerts fail. Also, all of this can happen before the start of business.
The alternative to this, of course, is that you get into the office Monday morning and realize things aren't working. So, you call BrantTel or your support provider, who starts looking at it remotely and determines that they need to dispatch. Two hours later, after fighting through traffic, they're on site with parts to repair the problem. At this point, you’ve been down for nearly half a day, and potentially even longer.
So, of course, you can minimize both the risk and the downtime to your business by having some Advanced Monitoring in place.
What is Advanced Monitoring?
Advanced Monitoring services, like BrantTel’s TotalView Monitoring and Management Platform, give you peace of mind in knowing that somebody is keeping an eye on your applications. Many systems can send SNMP trap alerts to notify us of a problem 24/7. This alert is, of course, very handy. But it also allows us to take the necessary steps to correct that problem before you even know it's there. And that might be after hours, right?
It’s pretty cool what this sort of service can cover, from basic ICMP monitoring to more advanced protocol validation, which confirms that a system is functioning at the application level through HTTPS connections, SNMP queries, raw TCP socket checks, and much more.
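To make one of those checks concrete, here is a minimal Python sketch of a raw TCP socket check using only the standard library. The function name `tcp_port_open` and the three-second timeout are assumptions for illustration. This is not how TotalView or any other product implements its checks.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A check like this goes one step past a ping: a server can answer ICMP while the service on it is no longer accepting connections. It still only proves the port is listening, which is why application-level checks such as HTTPS requests sit on top of it.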
In addition to these features, there's the ability to monitor any network devices that exist. We can receive heartbeats from our monitoring tools to identify if a site happens to lose power. So, when the system can't send out alerts, we're notified by our monitoring tool that something is going on. We'll know we've lost contact with our probe and, in turn, can get the ball rolling and get the problem solved.
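The heartbeat idea is simple enough to sketch in a few lines of Python. Instead of waiting for an alert, the monitoring side flags any site that has gone quiet for too long. The name `silent_sites` and the five-minute threshold are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def silent_sites(last_heartbeat: Dict[str, datetime], now: datetime,
                 max_silence: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return, in sorted order, the sites whose last heartbeat is older than max_silence."""
    return sorted(site for site, seen in last_heartbeat.items()
                  if now - seen > max_silence)
```

Because the check runs on the monitoring side, it still works when the site has no power and cannot send anything at all. Silence itself becomes the alert.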
Whether you're a small business owner looking to protect your growing company or an Enterprise IT Manager reducing risk for a large organization, these steps can help you keep your essential services running and keep you open for business. As is often the case, taking a few minutes to schedule a patch, or a few seconds to add to a changelog, can save you from a world of hurt when things go wrong. Moreover, Advanced Monitoring can protect your investment from unseen risks of all sorts.
If you have an essential technology of any kind, there is likely a system to monitor it--use it. Who wants to be blindsided Monday morning at 8 am?