How To Find (and Eliminate) Single Points of Failure

Posted by Steven Vigeant on 5/22/15 2:10 PM

IT is the lifeblood of nearly every company. And much like in the cardiovascular system that pumps blood through the human body, a single point of failure in your IT network can create big problems down the road.

 

A single point of failure (SPOF) can be generally defined as any non-redundant part of a system that, if dysfunctional, would cause the entire system to fail. In the world of IT, this can be anything from a faulty switch to an ISP outage.

 

SPOFs are common among companies that don’t have a massive IT budget and need to keep costs down. However, with IT playing such an integral role in day-to-day operations, companies can’t afford to ignore potential single points of failure.

 

In order to identify the single points of failure in your organization, let’s look at three common places they tend to show up.

 

Hardware:

Hardware is the most readily recognized source of single points of failure. If any piece of hardware (whether it’s on the server side or the user side) fails or is damaged without a backup or failover to seamlessly take its place, you have a single point of failure. Examples range from a server or storage device without a redundant power supply to a lone firewall, with no failover unit, connecting you to the internet.

 

Services/Providers:

Your ISP or offsite data storage location can be a single point of failure if you don’t have redundancies built in to account for problems on their end. If a service provider experiences a technical problem or emergency outage that directly impacts your operations, how are you set up to continue doing business? That’s why we often recommend that companies that need internet access at all times (whether because they use cloud services, need to access data offsite, or simply rely on constant access to email or VoIP to be productive) invest in a backup internet provider in case their main ISP goes down.

 

People:

People are the most commonly overlooked single points of failure in most organizations. It’s easy to hand IT responsibilities over to a solo IT consultant or someone in your organization and say, “you take it from here.” But by handing the responsibility off to one person, you’ve actually created a single point of failure should anything ever happen to that person.

 

Once you understand how to seek out single points of failure, it’s surprising how quickly you’ll begin to identify weak links that you never saw before. If business continuity is important to you (and it should be), then consider conducting a SPOF audit to identify potential failure points.

 

Conducting A Single Point of Failure Audit

Much of this process comes down to documenting the many pieces of your IT infrastructure. This might sound like a tedious task, but here are a few tips to help you get past the technical jargon and conduct your own high-level audit.

 

Step 1: Establish Stakeholders

This is especially important for companies that don’t have a fully staffed IT department. Assign the role of IT stakeholder to one person in the organization who will be responsible for keeping your technology functions up and running and for working with your IT service provider. Remember, while this person is responsible for overseeing the IT operations, there should also be a secondary stakeholder who understands the functionality of your systems just as well and is kept in the loop regarding any changes.

 

Step 2: Document Your IT and Communication Systems

This should be a “50,000 ft. view” document that gives you an immediate visual of every technical component and its purpose. This document should include information on:

  • Your ISP
  • Email provider
  • Cloud service providers
  • Switching and network infrastructure
  • Local servers and storage devices
  • Etc.

 

If it’s connected to your network, document it. Alongside this document should be detailed information on each piece of equipment, including its age and the status of any support contracts. This serves as a general document that any IT stakeholder can access in the event of an emergency and quickly gauge what’s not functioning and how to fix it.
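One way to keep this per-component detail is as a small structured inventory rather than free-form notes. The following is only a minimal sketch: the `Component` fields, the `needs_attention` helper, the sample entries, and the five-year age threshold are all illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Component:
    """One line of the IT inventory (hypothetical field names)."""
    name: str
    category: str                    # e.g. "ISP", "switch", "server"
    purpose: str
    purchased: Optional[date]        # None for services with no purchase date
    support_expires: Optional[date]  # None if there is no support contract


def needs_attention(components, today, max_age_years=5):
    """Return names of components that lack active support or exceed the age limit.

    The 5-year default is an assumed threshold; pick one that fits your hardware.
    """
    flagged = []
    for c in components:
        unsupported = c.support_expires is None or c.support_expires < today
        too_old = (
            c.purchased is not None
            and (today - c.purchased).days > max_age_years * 365
        )
        if unsupported or too_old:
            flagged.append(c.name)
    return flagged


# Sample entries, invented purely for illustration.
inventory = [
    Component("Primary ISP", "ISP", "Internet access", None, date(2026, 1, 1)),
    Component("Core switch", "switch", "Office LAN", date(2012, 3, 1), None),
    Component("File server", "server", "Shared storage", date(2023, 6, 1), date(2026, 6, 1)),
]

print(needs_attention(inventory, today=date(2025, 1, 1)))
```

With these sample entries, only the core switch is flagged, because it has no support contract and is past the assumed age limit.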

 

 

Step 3: Identify Potential SPOFs and Mitigate Risk

Once you have a detailed network diagram in place, identify the points in your system where you don’t have any redundancies. This could be a single router that you don’t have a replacement for, or a cloud-based service whose outage would disrupt your operations.
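If your documentation records the role each component plays, a first pass at spotting missing redundancy can be automated. This is a rough sketch under a simplifying assumption: any role served by exactly one component is treated as a potential SPOF. The `find_spofs` function and the inventory entries are hypothetical.

```python
from collections import defaultdict

# (component, role it fulfils): sample data invented for illustration.
inventory = [
    ("ISP A", "internet uplink"),
    ("Firewall 1", "firewall"),
    ("Firewall 2", "firewall"),
    ("Router 1", "routing"),
    ("Hosted email", "email"),
]


def find_spofs(pairs):
    """Return {role: component} for every role served by exactly one component."""
    by_role = defaultdict(list)
    for component, role in pairs:
        by_role[role].append(component)
    return {role: names[0] for role, names in by_role.items() if len(names) == 1}


for role, component in find_spofs(inventory).items():
    print(f"Potential SPOF: {role} depends solely on {component}")
```

Two components listed under the same role only count as redundant if one can actually take over for the other, so treat the result as a starting list to review by hand, not a verdict.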

 

Consider what kind of continuity you can put around each point of failure. Ask the “what if” questions: should this device or service fail, what is the process to remedy the issue and get the outage resolved? What is the business impact? For instance, if you notice that you have no spare or backup hard drive for the servers in your server closet, determine whether it’s better to keep cold spares or set up hot failovers. Or if your ISP experiences outages, should you consider having a second ISP on hand, or is there a better option?
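To decide which points of failure to address first, it can help to put a rough number on the business impact. The sketch below assumes a very simple model (estimated hours to restore multiplied by estimated cost per hour of downtime); the `FailurePoint` class, the `rank_by_impact` helper, and every figure in the example are placeholders to be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    hours_to_restore: float  # estimated time to recover with no spare or failover
    cost_per_hour: float     # estimated cost to the business per hour of outage

    @property
    def outage_cost(self) -> float:
        """Rough cost of one outage under this simple model."""
        return self.hours_to_restore * self.cost_per_hour


def rank_by_impact(points):
    """Sort failure points from most to least costly outage."""
    return sorted(points, key=lambda p: p.outage_cost, reverse=True)


# Placeholder estimates, invented for illustration only.
candidates = [
    FailurePoint("Single ISP", hours_to_restore=4, cost_per_hour=500),
    FailurePoint("Router with no spare", hours_to_restore=24, cost_per_hour=500),
    FailurePoint("Server drive with no spare", hours_to_restore=8, cost_per_hour=300),
]

for p in rank_by_impact(candidates):
    print(f"{p.name}: estimated outage cost {p.outage_cost:.0f}")
```

A ranking like this does not account for how likely each failure is, so it works best as a conversation starter when weighing cold spares against hot failovers or a second ISP.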

 

An extensive audit of your current infrastructure will likely reveal many single points of failure. Some are obvious, and some are not. Bringing in an expert to help you conduct a thorough audit and identify potential pitfalls can ensure that your continuity program is up-to-date and give you peace of mind.

 

If you don’t have a full-time IT person who can help you run through a SPOF audit, consider working with a managed IT provider.  To learn more about how an outside IT partner can help you evaluate and support your IT needs, download our free guide: The Ultimate Small Business Guide to IT Outsourcing.
