In the wake of the July 19 global IT outage, Duke’s Office of Information Technology worked around the clock to get hundreds of critical services back online.
At 6:40 a.m., the University community woke up to a DukeALERT stating there had been a “global IT outage … affecting Windows-based computer systems.” CrowdStrike, a widely used cybersecurity company, had issued a faulty software update that crashed Windows devices.
Duke was one of many institutions affected by the outage, one of the largest in history, alongside universities, hospitals and police stations around the world.
At noon, a second DukeALERT assured the community that the OIT team was “making progress” recovering from the outage. According to the second alert, “most critical systems [had] been brought back online, and nearly two-thirds of computers/laptops [had] been restored.”
OIT was responsible for restoring the University’s critical servers and core systems, as well as aiding students, faculty and staff in recovering their personal devices.
OIT’s response by the numbers
“Roughly 6,600 devices at the University were affected,” said Nick Tripp, OIT’s interim chief information security officer. According to Tripp, an “affected” device was one that received the file from CrowdStrike and crashed because of it.
Complicating the situation, 40,656 health system devices received the file. According to Tripp, the health system was aware of 21,973 crashed and nonfunctional devices, which amounted to roughly 37% of workstations and 32% of servers.
“For both the University and the health system, the only way to fix this issue was to actually put hands on these devices, which is pretty much the most painful way to fix any IT problem,” Tripp said. This made the outage an “all hands on deck situation” for OIT staff, who reported receiving 968 service requests in three days.
“There were things that needed to be fixed so that business at the University could go on the next day,” Tripp said.
Staff began working by 2 a.m. on July 19 to recover affected devices. Within an hour, the team had identified the issue and could deploy the CrowdStrike-issued solution, according to Tripp. The team first prioritized recovering the University’s critical services and devices used by the Duke University Police Department.
According to Tripp, most critical services and OIT Windows devices were back online by 6 a.m. on the day of the outage. By that point, the team had also briefed departmental and local OIT support teams.
Within the first three days after the outage, the OIT team helped fix hundreds of devices, both in person and over the phone.
“To do this over the phone and coach someone through fixing this, fixing their own device, that is a much harder task,” Tripp said.
Tripp noted that the July outage did not occur at “the peak of the academic calendar,” meaning OIT was under less intense pressure to restore service to University devices. However, he noted that it was critical to get health system devices online and ready for care the next day.
As of July 26, over 90% of the health system’s affected devices have been fixed. Tripp estimates that 550 devices — mostly laptops and desktops — remain to be fixed.
Executive Vice President Daniel Ennis sent a July 26 email to some members of the Duke community commending OIT staff for working proactively and efficiently to recover from the outage. He shared that 22,389 devices from both the University and health system had been restored by July 21 — only two days after the outage began.
“We are deeply grateful to our IT support staff who quickly pivoted from their regular duties and projects to meet the needs of this crisis,” Ennis wrote. “I am proud to work with such a dedicated community of people.”
Preventing another crash
While many global institutions — including universities and airlines — were forced to suspend operations, the OIT team’s response successfully mitigated the effects of the unprecedented outage.
Owen Astrachan, associate director of undergraduate studies and professor of the practice of computer science, said the outage was far-reaching because CrowdStrike is “ubiquitous” when it comes to cybersecurity technology. According to Astrachan, when companies use CrowdStrike, they tend to use the service across all devices. Otherwise, a computer virus on just one device could impact the entire network.
To prevent a repeat, Astrachan suggested that companies first deploy software updates to a “smaller sample” of devices.
“Normally when you make a fix in your software, you test it, and you would test it not by rolling it out globally,” Astrachan said. “First, you test it on some subset of what you were going to do to make sure that there were no problems.”
Despite the outage, Astrachan still views CrowdStrike as an effective cybersecurity service. He noted that switching to a different provider would be a time-intensive and costly process for an institution like Duke.
“What was ineffective was [CrowdStrike’s] software update and testing, but the product itself seems to still be regarded as effective at what it does,” Astrachan said. “I don’t know if there was an opening for other companies to come in and try to do that.”
Ava Littman is a Trinity sophomore and an associate news editor for the news department.