Industries critical to everyday life – including aviation, finance, healthcare, and shipping – have, for the most part, slowly recovered from the global impact of the faulty software update issued by CrowdStrike, which, according to Microsoft, affected some 8.5 million devices worldwide. Although that figure represents less than 1% of all Windows machines, the effect was catastrophic. Within hours, over 5,000 flights had been cancelled around the world and, in the UK, the government convened its COBRA emergency committee.
The Absence of a Contingency Plan
This incident brings to mind the failure of the National Air Traffic Services (NATS) system on August 28, 2023, which affected over 700,000 passengers. According to the Civil Aviation Authority (CAA), the failure was triggered when the NERL flight plan processing system was unable to handle the flight plan data for a single flight from Los Angeles to Paris. Both the primary and secondary systems generated critical errors and entered maintenance mode, cutting off the flow of data to air traffic controllers. The workaround was to input flight data manually, which reduced capacity to only 60 flight plans per hour, compared with the usual 800. With NATS managing around 2.5 million flights annually in UK airspace, the impact was severe.
How could both Plans A and B fail, leaving NATS reliant on a Plan C that operated at only 7.5% of normal capacity?
Lessons from Past Failures
In recent years, several technology failures have illustrated the same issue. In early 2023, service outages disrupted both United and Hawaiian Airlines, and a failure in an FAA database triggered a national ground stop that halted all take-offs. Meanwhile, a backup failure at the New York Stock Exchange led to abnormal market swings on January 25, when systems incorrectly continued the previous day’s trading. In November of that year, an IT failure left half of Australia without phone service when Optus, the country’s second-largest telecom provider, was down for 12 hours. More recently, BT was fined £17.5 million after a network fault left thousands of 999 calls unanswered for over 10 hours.
The Case for Robust Backup Plans
All of these incidents point to one clear conclusion: a reliable Plan B, one that is regularly tested to prove it works when the primary system fails, is too often missing.
In many offices, the default advice for any IT problem is a standing joke: “Have you tried turning it off and on again?” But this humorous response reflects a much larger issue. For non-critical infrastructure, restarting a device might work as a backup plan. However, when the stakes are national, affecting emergency services or critical infrastructure, an effective Plan B is a necessity, not a luxury.
The Dangers of Over-Reliance on Technology
With technology advancing so rapidly, it seems that organisations – and even entire nations – have come to trust these systems almost implicitly. This reliance is likely to increase as AI technologies are integrated more widely. Already, AI is seen as a cure-all that promises to foresee and forestall a host of issues, which may leave less room for investment in robust, destruction-tested Plan Bs.
The Urgency of a Well-Constructed Plan B
In the era of digital miracles, it might seem old-fashioned to advocate for robust Plan Bs. However, as more of our lives and livelihoods come to depend on digital infrastructure, these failures, and their impacts, are likely to grow. A proper Plan B, one that is well designed, tested, and reliable, is more essential than ever.