When John Lee clocked in at the University of Illinois’s Grainger College of Engineering at 7:30am just a few Fridays ago, the IT manager’s computer wasn’t working—and he was about to learn that approximately 2,500 machines on campus weren’t working either.
After unexpectedly seeing the “blue screen of death,” along with other lifeless computers around the office, Lee drove home to gather his personal machine and his thoughts, communicating with IT leaders about the major disruption. (One email’s subject line summed up the task at hand: “All hands on deck.”)
The culprit: a faulty CrowdStrike update that caused IT outages at banks, airlines, and hospitals, leading to delayed flights, blank billboards, and canceled surgeries. While chaos reigned across the globe, Lee and his university team executed a plan from their disaster-recovery playbook, a rollout that included prioritizing fixes and strategically distributing helpers across campus to find as many BSODs as possible. The result: a steady, machine-by-machine recovery effort.
“I think, for the most part, this was something that we were ready for,” Lee said.
What happened?! On Friday, July 19, at 04:09 UTC, CrowdStrike released a configuration update that led to Windows system crashes on hosts that “were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC,” according to details shared by the company.
Using reporting tools, the Grainger College of Engineering IT team discovered that, as of 8:00am on Friday, it had 2,464 impacted machines.
Room for error. To address the downed devices, the U of I turned a conference room into a situation room—a tense location at first, as the university’s IT managers grasped the level of disruption and what to do about it, according to Lee. “Friday morning was pretty rough,” he told IT Brew.
By lunchtime, Lee said, spirits improved as the team determined the gameplan:
First: Keep some support techs onsite to handle walk-in visits and remote, over-the-phone support. Next: Send many of the remaining 45 to 50 IT staff members out across campus.
The situation room became a watering hole of sorts, home to system administrators testing out fixes and IT leaders communicating over Teams channels with members on the ground.
A triage plan prioritized which assets to restore first. VIP devices, like department-head computers, business-office systems, and tech required for Friday classes, topped the to-do list. (The team assisted the dean via text message while he was on a flight, according to Lee.)
“We had most people actually out and touching computers, trying to go into the different offices and buildings, roaming the halls and looking for anyone who needed help,” Lee said.
The fix is in. Repairing the affected computers required hands on keys: restarting a device in Safe Mode, running a command, finding a file in the CrowdStrike directory, and deleting it.
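For the curious, here’s a minimal sketch of that delete step, based on the workaround CrowdStrike published for the faulty “Channel File 291” update; the directory path and C-00000291*.sys filename pattern come from that public guidance, not from the university’s own tooling, and in practice technicians ran the equivalent commands by hand from Safe Mode.

```python
# Minimal sketch of the Channel File 291 cleanup step (run from Safe Mode with
# admin rights). The path and C-00000291*.sys pattern follow CrowdStrike's
# publicly posted workaround; this is an illustration, not the university's tooling.
import os
from pathlib import Path

CROWDSTRIKE_DIR = (
    Path(os.environ.get("WINDIR", r"C:\Windows")) / "System32" / "drivers" / "CrowdStrike"
)

def remove_bad_channel_files(directory: Path = CROWDSTRIKE_DIR) -> list[Path]:
    """Delete channel files matching the faulty update and return what was removed."""
    removed = []
    for candidate in directory.glob("C-00000291*.sys"):
        candidate.unlink()  # the host must then be rebooted normally
        removed.append(candidate)
    return removed

if __name__ == "__main__":
    deleted = remove_bad_channel_files()
    print(f"Removed {len(deleted)} file(s); reboot to finish recovery.")
```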
The tougher challenge, Lee said, involved remote assistance: the campus IT teams could not take over affected computers, and instead had to talk clients through the reboot, which often meant retrieving BitLocker recovery keys.
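As an illustration of that key-retrieval step, here’s a hedged sketch assuming recovery passwords are escrowed in Active Directory, a common setup the article doesn’t confirm; the server, service account, and computer names below are placeholders.

```python
# Hedged sketch: looking up a machine's BitLocker recovery password from
# Active Directory, assuming keys are escrowed there. Uses the ldap3 library;
# the domain controller, credentials, and computer DN are placeholders.
from ldap3 import Server, Connection, SUBTREE

def fetch_recovery_passwords(computer_dn: str, conn: Connection) -> list[str]:
    """Return BitLocker recovery passwords stored under a computer object."""
    # msFVE-RecoveryInformation objects are child objects of the computer account.
    conn.search(
        search_base=computer_dn,
        search_filter="(objectClass=msFVE-RecoveryInformation)",
        search_scope=SUBTREE,
        attributes=["msFVE-RecoveryPassword"],
    )
    return [str(entry["msFVE-RecoveryPassword"]) for entry in conn.entries]

if __name__ == "__main__":
    server = Server("dc.example.edu")                    # placeholder domain controller
    conn = Connection(server, user="EXAMPLE\\helpdesk",  # placeholder service account
                      password="...", auto_bind=True)
    keys = fetch_recovery_passwords(
        "CN=LAB-PC-042,OU=Workstations,DC=example,DC=edu", conn)  # placeholder DN
    print(keys)  # read the 48-digit password to the caller over the phone
```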
Who got struck: About 25% of Fortune 500 companies experienced a service disruption due to the July 19 outages, resulting in an estimated combined loss of $5.4 billion, according to a report from cyber insurer Parametrix.
R. Hal Baker, SVP and Chief Digital Information Officer at WellSpan Health, participated in an org-wide effort to reactivate “a little over 3,000 computers” following the CrowdStrike clatter.
“I had 70 people volunteering at the drop of a hat to give up their whole Saturday and come help us,” Baker told IT Brew.
Lee, too, called on weekend volunteers. As of July 23, according to Bobbi Hardy, communications and customer relations coordinator at the University of Illinois’s Grainger College of Engineering, the school had 530 unaddressed systems.
“This [cyberevent] reminded most of us that we need to do our practice of our emergency plans. So even though we’ll have a post-mortem, we should also then have another exercise, down the road, and keep practicing how we get together in a room [to determine] who does what and how we organize ourselves to respond better in the future,” Hardy said.
In an emailed statement to IT Brew, Kevin Benacci, senior director of corporate communications at CrowdStrike, wrote: “We are grateful for the support and partnership of our customers as we worked together to remediate the recent Channel Update 291 incident. Our teams immediately mobilized to support customers, initially via manual remediation, to restore systems as quickly as possible. We subsequently introduced automated techniques and deployed teams to assist customers with recovery efforts, while providing continual and transparent updates. We can confirm that almost all systems have restored operations and are back online, and we will continue to work tirelessly to help ensure every customer is remediated.”