Strategies for ensuring uptime in critical IT systems

Unplanned downtime can cost millions and damage reputations. Learn how data center operators can improve uptime through tools, technologies, and redundancy strategies, as outlined in recent industry research and expert insights.

What you’ll learn:

  1. How much unplanned downtime can cost companies, including its effects on brand reputation and customer retention.
  2. Why high availability is now a baseline expectation, which metrics are table stakes, and how closely they are tied to CIO performance.
  3. How tools like building management systems and computerized maintenance management systems play a vital role in tracking, maintaining, and quickly responding to issues in data center operations.
  4. How redundancy and planning can reduce risk, and why proactive measures like staff training and robust incident response plans matter.

No one likes trying to access a critical business system and instead landing on the dreaded service disruption notification.

One hundred percent uptime—or even “five-nines” availability, referring to 99.999% accessibility—is difficult, at best, to achieve. Data center operators have to juggle maintenance and upgrades while accounting for unpredictable external events like network outages and extreme weather.
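
For a sense of scale, an availability percentage maps directly onto a yearly downtime budget. The minimal Python sketch below (illustrative only, not tied to any particular SLA or measurement window) shows how 99.999% availability works out to roughly five minutes of allowed downtime per year.

```python
# Rough downtime budget implied by an availability target (illustrative only).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability (e.g., 0.99999)."""
    return (1 - availability) * MINUTES_PER_YEAR

for availability in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{availability:.3%} availability -> "
          f"{downtime_budget_minutes(availability):,.1f} min/year of downtime")
# 99.000% availability -> 5,259.6 min/year of downtime
# 99.900% availability -> 526.0 min/year of downtime
# 99.990% availability -> 52.6 min/year of downtime
# 99.999% availability -> 5.3 min/year of downtime
```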

Yet they’re far from helpless. Here are some basics on downtime, and how data center operators can set themselves up for success.

Damaging your rep

Daniel Saroff, a group vice president at the consultancy IDC, wrote for CIO that strong uptime performance (along with other metrics like operational efficiency, security, and compliance) has become “table stakes for most IT departments”—in other words, the bare minimum necessary to compete. Saroff cited IDC polling showing that systems availability was tied with IT project delivery as one of the top three metrics used to measure CIO success in 2024, arguing that the focus on operational metrics reflects a lingering mindset that “still pigeonholes IT as a support function rather than a strategic driver of innovation.”

Failure to maintain consistent uptime can have serious effects—recent research by enterprise resilience firm Splunk indicated that Forbes Global 2000 companies lose around $400 billion per year to unplanned downtime, or around $9,000 a minute. According to the report, average unplanned downtime due to cybersecurity issues amounts to about 466 hours annually for a Global 2000 firm, while app- or infrastructure-related downtime stacked up slightly lower, at an average of 456 hours annually.

The authors of the Splunk report noted that while uptime at these firms generally stood at “multiple 9s” (i.e., 99% or higher), costs are still high considering they can stack across numerous systems. Just shy of three out of ten respondents to Splunk’s research said they had lost clients due to downtime, while 44% reported brand damage.

“A major IT service outage always makes front-page news,” Todd Traver, vice president for IT optimization and strategy at industry certification body Uptime Institute, told IT Brew. According to Uptime’s 2024 survey of data center owners and operators, 54% said their most recent outage had cost over $100,000.

Tools and technologies

According to Traver, two systems are crucial for ensuring the reliability of infrastructure within a data center. The first is a building management system, which uses probes distributed across a data center’s operational technology (OT) to alert personnel in the control center to any alarming readings. The other is the computerized maintenance management system (CMMS), which serves as a single point of reference for data on any given device in the building.

“Basically, they gather all of the data about those thousands of assets, so when you know when the device was installed, the commissioning records, when it was tested—when it was brand new, what it was capable of doing, all the maintenance history on that device,” Traver said. He added they also often catalog any processes or procedures associated with the equipment.

CMMS platforms often use barcodes or QR codes, allowing technicians to quickly look up the history of any component they may run into—important for both preventative and predictive maintenance. Uptime’s 2024 survey found that 54% of the most recent impactful incidents or outages reported by respondents involved on-site power issues, and an additional 13% involved cooling issues, highlighting the importance of tracking components of OT systems.
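
As a rough illustration of what that scan-and-look-up workflow buys a technician, here is a minimal, hypothetical sketch in Python; the field names and sample asset are invented for the example and don’t reflect any particular CMMS vendor’s schema.

```python
# Hypothetical sketch of the kind of asset lookup a CMMS enables when a
# technician scans a barcode or QR tag. All names and data are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Asset:
    asset_id: str                                   # value encoded in the tag
    installed: str                                  # commissioning date
    rated_capacity: str                             # what it could do when new
    maintenance_log: list = field(default_factory=list)
    procedures: list = field(default_factory=list)  # associated procedures/SOPs

# In a real deployment this registry would live in the CMMS database.
REGISTRY = {
    "UPS-0042": Asset(
        asset_id="UPS-0042",
        installed="2019-03-14",
        rated_capacity="500 kVA",
        maintenance_log=["2023-06-01: battery string replaced"],
        procedures=["UPS bypass procedure"],
    ),
}

def lookup(scanned_id: str) -> Optional[Asset]:
    """Return the full history for a scanned tag, or None if it isn't catalogued."""
    return REGISTRY.get(scanned_id)

print(lookup("UPS-0042"))
```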

Redundancy

Traver said data centers can be rated for their resiliency across four tiers. The most basic certification, Tier I, requires data centers to have basic power and cooling systems with no guarantee of uninterrupted service. The Uptime Institute doesn’t rate those centers highly, Traver said, and Tier II data centers (which feature increased redundancy) don’t count for much either.

For a data center to score a Tier III certification, Traver said, Uptime experts examine the design of electrical and mechanical systems to ensure every single component can undergo routine maintenance without affecting IT load. To get to Tier IV, data centers have to be fault tolerant; that means their backup systems kick in automatically in the event of a failure, without requiring manual switching. These tests can take days to perform, in part because data centers vary dramatically in design, infrastructure, and components.
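
The fault-tolerance idea behind Tier IV can be pictured in software terms: when the primary path fails, a backup takes over without anyone flipping a switch. The toy Python sketch below is purely conceptual; it is not the Uptime Institute’s test procedure, and real Tier IV fault tolerance is built from redundant electrical and mechanical paths rather than code.

```python
# Toy illustration of automatic failover: if the primary path faults, the load
# is served from the backup with no manual switching. Purely conceptual.
def primary_feed() -> str:
    # Simulate a fault on the primary path for the sake of the example.
    raise RuntimeError("primary feed lost")

def backup_feed() -> str:
    return "power from backup feed"

def serve_load() -> str:
    """Fall back to the backup automatically when the primary path fails."""
    try:
        return primary_feed()
    except RuntimeError:
        return backup_feed()

print(serve_load())  # -> "power from backup feed"
```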

Uptime researchers wrote in the 2024 report that factors like training and incident response plans may deserve higher priority from operators than added redundancy, as respondents felt many incidents didn’t need to happen.

“Four in five operators believe their most recent significant downtime incidents were preventable with better management, processes, or configuration,” they concluded. Traver noted communications gaps between IT and facilities teams are often a problem.

App-related or software outages are also a major concern, whether it’s a glitch or the sudden loss of a crucial dependency. Traver said Uptime performs digital resiliency assessments, which look at the entire tech stack to see how applications are designed and implemented. While such tests aren’t part of their tier system, he noted he’d seen banking systems crash due to unanticipated software issues.

They all put “millions of dollars into designing data centers and millions of dollars into architecting the IT systems, but they didn’t understand some of the simple dependencies that took them down hard,” Traver added. “They were down and they were completely blindsided by it.”

Taking action

“Unfortunately, typically, clients bring us in after the fact, after they’ve had a major outage,” Traver told IT Brew.

Most commonly, he said, Uptime finds that upgrades or vendor changes in key components like the building management system or CMMS lead to a “splintered” knowledge base. For example, operating procedures or capacity and commissioning records can become hard to find. The problem is exacerbated by turnover, as new hires take a “couple years until they’re 100% up to speed” and documentation is often spotty, Traver added.

IT and facilities teams can avoid self-inflicted downtime by sharing an integrated change control system—a formal process in which all teams are notified of upcoming maintenance, rollouts, or other disruptions—to ensure both teams are aware of possible conflicts, Traver said. It’s also important for operators not only to have an incident response plan, but for it to be well-documented and establish clear plans for escalation and authority.

“Without that, if you have a site go down, or the site is just not operating properly, and there’s not a clear line of authority, who’s making the decisions?” Traver said. “Yes or no, are we doing this, who’s communicating to who? If that’s not all documented and understood and practiced, it’s a mess.”
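
To make the integrated change-control idea concrete, here is a minimal, hypothetical Python sketch of a change record that both IT and facilities must acknowledge before work proceeds; the class and field names are invented for illustration and don’t describe any specific tool.

```python
# Hypothetical sketch of an integrated change-control record shared by IT and
# facilities. All names are illustrative; this is not a specific product's API.
from dataclasses import dataclass, field

TEAMS = ("IT", "Facilities")  # every team that must be aware of planned work

@dataclass
class ChangeRequest:
    summary: str
    window: str                                    # planned maintenance window
    systems_affected: list
    acknowledged_by: set = field(default_factory=set)

    def acknowledge(self, team: str) -> None:
        self.acknowledged_by.add(team)

    def is_cleared(self) -> bool:
        # Work proceeds only once every team has acknowledged the change.
        return all(team in self.acknowledged_by for team in TEAMS)

cr = ChangeRequest("Replace CRAH unit fan", "Sat 02:00-06:00", ["Cooling zone B"])
cr.acknowledge("IT")
print(cr.is_cleared())        # False: facilities hasn't signed off yet
cr.acknowledge("Facilities")
print(cr.is_cleared())        # True: both teams are aware of the work
```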

