BugZero | Part 1: CrowdStrike Outage Exposes Critical Gap in IT Operational Resilience

Incident Response, IT Operational Resilience, Third-Party Risk Management

The CrowdStrike outage in July 2024 exposed a critical weakness in IT operational resilience, impacting millions of users and causing significant disruptions across various sectors. While most IT organizations have robust systems to handle security vulnerabilities, they often lack similar tools for managing operational bugs, leaving them vulnerable to unexpected outages. This article explores the disparity between vendor security and operational risk management, highlighting the challenges faced by IT Operations teams and offering proactive strategies to address vendor-caused outages, such as automated bug tracking, enhanced vendor communication, and leveraging specialized tools like BugZero.

Eric DeGrass

September 1st, 2024

“Your PC ran into a problem and needs to restart.”

This message appeared on an estimated 8.5 million Windows computers on July 19th, 2024, and, unfortunately, a simple restart didn’t fix the problem. Flights were canceled, vital health information wasn’t accessible, and even 911 calls went unanswered. Rumors of a cyberattack spread, but they were quickly dispelled by an announcement from the CEO of cybersecurity firm CrowdStrike that a software bug in a recent update to their Falcon kernel sensor was causing the issue.

The root cause appeared to be an insufficient testing process, and CrowdStrike's reputation and stock have suffered accordingly. However, the damage caused by the outage extended far beyond CrowdStrike. Customers of CrowdStrike, like Delta Air Lines, paid the price in IT staff time to reboot computers, lost productivity and revenue, and, in Delta’s case, a class-action lawsuit from their customers.

Software bugs are always going to slip through IT vendor testing processes. With the devastation from the CrowdStrike outage, one question is fresh on every IT leader's mind: how can their teams mitigate vendor-caused outages in the future?   

A litany of articles is ricocheting across the internet, spewing common answers to this question: 

“Develop better incident response plans.” 
“Improve testing procedures for vendor updates.” 
“Avoid single point of failure vendors.” 
“Improve vendor risk assessment processes.” 
“Update vendor contracts to ensure vendor accountability.”

Beyond these solutions, there is also a generally accepted understanding that software bugs happen, and outages are the cost of doing business.

We disagree.    

In this article, we want to shed light on an overlooked problem in IT operational resiliency: the proactive detection and prioritization of IT vendor operational bugs. Specifically, we’ll cover:

The imbalance in IT vendor security vs. operational risk management
The impact on IT Operations teams  
Methods to proactively address vendor operational bugs

The Imbalance in IT Vendor Security vs. Operational Risk   

Every IT vendor publishes newly discovered security vulnerabilities (CVEs), and IT organizations spend $10 billion every year on dedicated security systems that consolidate these CVEs, assess their risk, and help IT Security teams prioritize and remediate them.   

IT vendors also publish newly discovered operational bugs. However, most of the IT world is only able to address these bugs after an outage has occurred. IT Operations teams are then tasked with cleaning up the vendor’s mess, and in some cases, unfairly take the blame.   

According to Uptime Institute, publicly reported outages caused by IT vendors represent 25% of all outages. The reported cost to the enterprise is more than $300k per hour, which adds up to tens of millions of dollars in financial impact annually.

This significant imbalance between vendor security and operational risk mitigation is clearly a problem. Given the fresh reminder from CrowdStrike of the devastation operational bugs can cause, we must ask ourselves why we aren’t able to do more to proactively address outage risk from operational bugs.

The Impact on IT Operations 

The Head of IT Operations at a Fortune 500 bank is jolted awake by a 3 a.m. phone call reporting that their core retail banking software is down due to a Microsoft bug. She and her team spend the next 14 hours researching the issue on Microsoft’s portal and other technical forums online, assessing possible solutions, and implementing the fix that will hopefully get the software back online.

When everything is back up and running, her team’s heroics go unnoticed, as executives are mostly concerned with the $4.2 million in lost revenue from the 14 hours of downtime (14 hours at roughly $300k per hour). The IT world is full of thankless work, but some say IT Operations takes the brunt of it.

Today, IT Operations teams are being asked for operational resiliency yet are forced to work with one hand tied behind their back. They can rely on established systems to proactively address CVEs, but without the tools to identify and assess operational bugs, outages will continue to take their toll on companies and the people who keep them running.

Methods to Proactively Address Vendor Operational Bugs 

Rather than waiting for the next outage to address operational bugs, IT Operations leaders are taking a few approaches to be proactive and stay ahead of outages.

Team Cadence for Tracking Core Vendor Announcements 

Fortune 500 IT organizations may have anywhere from 20 to 40 critical IT vendors. Ninety-nine percent of them publish announcements about bugs as they are discovered and resolved. Depending on capacity, IT Operations should identify a few of the most crucial vendors and set up a system and process for tracking bugs as they are announced. The process must include a way to check which bugs impact which configuration items (CIs) and a method of prioritization.
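To make the matching and prioritization step concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `VendorBug` and `ConfigurationItem` record shapes, the vendor and product names, the exact-version matching rule, and the severity-times-criticality score. A real process would draw these from the vendor's bug notices and from your own CMDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorBug:
    """One announced vendor bug (hypothetical record shape)."""
    bug_id: str
    vendor: str
    product: str
    affected_versions: frozenset
    severity: int  # assumed scale: 1 (low) to 4 (critical)


@dataclass(frozen=True)
class ConfigurationItem:
    """One CI from the inventory (hypothetical record shape)."""
    name: str
    vendor: str
    product: str
    version: str
    business_criticality: int  # assumed scale: 1 (low) to 3 (high)


def match_bugs_to_cis(bugs, cis):
    """Return (bug, ci, score) tuples for affected CIs, highest score first.

    A CI is treated as affected when vendor, product, and version all match.
    The score is an illustrative heuristic, not a standard formula.
    """
    matches = []
    for bug in bugs:
        for ci in cis:
            same_product = (bug.vendor, bug.product) == (ci.vendor, ci.product)
            if same_product and ci.version in bug.affected_versions:
                score = bug.severity * ci.business_criticality
                matches.append((bug, ci, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Made-up example data.
bugs = [
    VendorBug("BUG-1", "ExampleDB", "server", frozenset({"2.1", "2.2"}), severity=4),
    VendorBug("BUG-2", "ExampleOS", "kernel", frozenset({"5.0"}), severity=2),
]
cis = [
    ConfigurationItem("core-banking-db", "ExampleDB", "server", "2.2", business_criticality=3),
    ConfigurationItem("intranet-wiki-db", "ExampleDB", "server", "2.1", business_criticality=1),
    ConfigurationItem("build-host", "ExampleOS", "kernel", "4.9", business_criticality=2),
]

ranked = match_bugs_to_cis(bugs, cis)
for bug, ci, score in ranked:
    print(f"{score:>2}  {bug.bug_id}  {ci.name}")
```

Even this toy version shows why the manual approach strains at scale: every new announcement has to be compared against every CI, and the ranking is only as good as the version and criticality data behind it.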

In most cases, this process is highly time-intensive, which is why it may not be realistic for more than a handful of key vendors. It also requires consistency and accountability across the team. One other challenge with this method is that vendor bugs are announced constantly, so even the most frequent cadence can miss risks that require immediate action.

Dedicated Account Managers from Core Vendors 

Larger organizations might find it valuable to invest in dedicated vendor-provided resources that are familiar with your infrastructure and can advise on prioritization of patches and outstanding risks. The cost can be substantial, but it can offload some of the manual work around risk mitigation to someone with dedicated knowledge of that vendor’s activity. 

While this expertise can be beneficial, the complexity of both the bugs and your internal infrastructure still makes it difficult to sift through announcements and identify the highest risks. Vendor account managers also depend on some degree of manual cadence and internal communication, which can miss risks that require immediate attention.

Automated Consolidation and Filtering 

The number of core vendors and the volume of bug announcements mean that any degree of manual process will undoubtedly miss risks. Exploring automated solutions to ingest vendor bug data and prioritize it within your IT workstreams is necessary. Start with your most critical vendor data, identify how bugs are announced, and develop automation to pull data into one central repository. A centralized and normalized view of all vendor bugs can then enable filtering and more advanced logic to prioritize bugs before they cause an outage.
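The consolidation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any particular product works: the two raw record shapes (`VendorA`, `VendorB`), their field names, and the four-level severity scale are all hypothetical, and a real pipeline would first fetch each vendor's notices from wherever that vendor publishes them.

```python
from dataclasses import dataclass
from datetime import date

SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed common scale


@dataclass(frozen=True)
class NormalizedBug:
    """Common shape that every vendor record is mapped into."""
    vendor: str
    bug_id: str
    product: str
    severity: str
    published: date
    summary: str


def normalize_vendor_a(raw):
    """Hypothetical VendorA format: numeric severity, 1 = most severe."""
    levels = {1: "critical", 2: "high", 3: "medium", 4: "low"}
    return NormalizedBug("VendorA", raw["id"], raw["product"], levels[raw["sev"]],
                         date.fromisoformat(raw["published"]), raw["title"])


def normalize_vendor_b(raw):
    """Hypothetical VendorB format: textual priority, different field names."""
    return NormalizedBug("VendorB", raw["defectNumber"], raw["component"],
                         raw["priority"].lower(),
                         date.fromisoformat(raw["releaseDate"]), raw["headline"])


NORMALIZERS = {"VendorA": normalize_vendor_a, "VendorB": normalize_vendor_b}


def consolidate(feeds):
    """Merge all feeds into one repository keyed by (vendor, bug_id)."""
    repo = {}
    for vendor, records in feeds.items():
        for raw in records:
            bug = NORMALIZERS[vendor](raw)
            repo[(bug.vendor, bug.bug_id)] = bug  # a re-announcement overwrites
    return repo


def filter_bugs(repo, products, min_severity="high"):
    """Keep bugs for products you run, at or above a severity threshold."""
    threshold = SEVERITY_ORDER.index(min_severity)
    hits = [b for b in repo.values()
            if b.product in products
            and SEVERITY_ORDER.index(b.severity) >= threshold]
    # Most severe first; among equal severities, most recent first.
    return sorted(hits, key=lambda b: (SEVERITY_ORDER.index(b.severity), b.published),
                  reverse=True)


# Made-up example feeds.
feeds = {
    "VendorA": [
        {"id": "A-100", "product": "storage-array", "sev": 1,
         "published": "2024-07-01", "title": "Controller reboot under load"},
        {"id": "A-101", "product": "storage-array", "sev": 4,
         "published": "2024-07-03", "title": "Cosmetic UI glitch"},
    ],
    "VendorB": [
        {"defectNumber": "B-7", "component": "hypervisor", "priority": "High",
         "releaseDate": "2024-07-02", "headline": "Host hang after patch"},
        {"defectNumber": "B-8", "component": "backup-agent", "priority": "Critical",
         "releaseDate": "2024-07-04", "headline": "Restore job fails silently"},
    ],
}

repo = consolidate(feeds)
shortlist = filter_bugs(repo, products={"storage-array", "hypervisor"})
for bug in shortlist:
    print(bug.severity, bug.vendor, bug.bug_id, bug.summary)
```

Keeping one normalizer per vendor confines each vendor's quirks to a single function, so adding a vendor means adding one mapping rather than touching the filtering logic.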

Automation is likely the only realistic way to scale operational risk management. Solutions like BugZero can offer this out of the box, either through their free Operational Defect Database or through an enterprise license. The enterprise license adds integrations with new vendors, real-time correlation with your IT inventory, customizable risk scoring, and a seamless integration with ServiceNow so IT Ops teams can prioritize the highest risks alongside all other workstreams.

Closing the IT Risk Gap 

The biggest gap in IT risk management is operational defects from your vendors. Addressing this gap requires a combination of proactive monitoring, efficient bug tracking, and technology that can manage the complexity and volume of updates and vendor interactions. By adopting a holistic approach to managing vendor bugs, companies can significantly reduce availability risk and protect their operations from unforeseen disruptions. BugZero is the only software solution that’s tackling this problem.

Learn more about how BugZero is the New Standard in IT Risk Mitigation in part 2 of this article.