How did the CrowdStrike 2024 Outage Happen?

Following the recent CrowdStrike 2024 outage, Planit, a global provider of testing services, has emphasised the crucial role of thorough testing before software is released.

The incident, triggered by a corrupted update to a CrowdStrike Windows system driver, resulted in the dreaded blue screen of death (BSOD) on affected machines and left them stuck in a boot loop. The problem was compounded on devices utilising Windows' BitLocker disk encryption, which complicated removal of the faulty update.

BitLocker, widely used by corporations to secure disk contents, posed an additional challenge: in many cases the recovery keys stored in Active Directory were themselves inaccessible, because the servers holding them had also been hit by the crash.

Experts speculate that the update responsible for the error was not subjected to routine pre-release regression checks, including testing in a sandbox environment. This oversight meant potential issues with the update went undetected, ultimately resulting in extensive disruption for organisations and everyday people across the world.
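
One practical form that check can take is a sandboxed smoke test: feed the candidate content file to the component that will consume it, in an isolated environment, and block the release if the consumer crashes or hangs. The sketch below is illustrative only, since CrowdStrike's actual pipeline and file formats are not public; the names content_parser.py and candidate_content_update.bin are hypothetical placeholders.

```python
import subprocess
import sys

# Hypothetical placeholders: "content_parser.py" stands in for whichever component
# consumes the update on the endpoint, and "candidate_content_update.bin" is the
# artefact produced by the build under test.
PARSER_CMD = [sys.executable, "content_parser.py"]
CANDIDATE_FILE = "candidate_content_update.bin"


def test_candidate_parses_without_crashing():
    """Smoke test: the consumer must load the candidate file and exit cleanly."""
    result = subprocess.run(
        PARSER_CMD + [CANDIDATE_FILE],
        capture_output=True,
        timeout=60,  # a hang is treated as a failure, not a pass
    )
    assert result.returncode == 0, (
        f"consumer crashed on candidate file (exit {result.returncode}): "
        f"{result.stderr.decode(errors='replace')}"
    )
```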

So how did this situation escalate to the point of disrupting local supermarket checkouts? The likely cause lies in CrowdStrike’s build-to-deploy process, where essential integrity tests were apparently skipped.
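
A build-to-deploy pipeline can also gate releases on basic integrity checks of the artefact itself before it is ever pushed to an endpoint. The sketch below assumes a hypothetical content-file layout (a four-byte magic header, a declared record count, and fixed-size records); it is a minimal example of the kind of check meant, not a description of CrowdStrike's actual format.

```python
import struct
from pathlib import Path

MAGIC = b"CSCF"                  # hypothetical 4-byte magic header
HEADER = struct.Struct("<4sI")   # assumed layout: magic + declared record count
RECORD_SIZE = 64                 # assumed fixed record size in bytes


def integrity_check(update_path: Path) -> list[str]:
    """Return a list of problems found in a candidate update file; empty means pass."""
    data = update_path.read_bytes()

    if not data:
        return ["file is empty"]
    if data.count(0) == len(data):
        return ["file contains only null bytes"]
    if len(data) < HEADER.size:
        return ["file is too short to contain a header"]

    problems = []
    magic, record_count = HEADER.unpack_from(data)
    if magic != MAGIC:
        problems.append(f"unexpected magic header: {magic!r}")

    expected_len = HEADER.size + record_count * RECORD_SIZE
    if len(data) != expected_len:
        problems.append(
            f"header declares {record_count} records, so the file should be "
            f"{expected_len} bytes, but it is {len(data)} bytes"
        )
    return problems


if __name__ == "__main__":
    issues = integrity_check(Path("candidate_content_update.bin"))
    if issues:
        raise SystemExit("BLOCK RELEASE: " + "; ".join(issues))
    print("candidate update passed integrity checks")
```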

However, the responsibility doesn't lie solely with CrowdStrike. While CrowdStrike released an unchecked update, many of its clients also lacked effective patch management strategies to mitigate the impact of problematic patches on mission-critical systems.

Many organisations seem to have blindly trusted third-party updates from CrowdStrike, allowing automatic deployment across their entire infrastructure. This approach, fraught with risk, highlights the importance of staging updates. Deploying them first to less critical assets and gradually to more critical ones can help catch and block faulty updates before they cause significant damage.
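
A minimal sketch of that staging idea, assuming hypothetical host groups and placeholder deployment and monitoring hooks, might look like this:

```python
import time

# Hypothetical rollout rings, ordered from least to most critical assets.
ROLLOUT_RINGS = [
    ("canary", ["test-vm-01", "test-vm-02"]),
    ("internal-it", ["it-laptop-07", "it-laptop-12"]),
    ("non-critical", ["branch-kiosk-01", "branch-kiosk-02"]),
    ("mission-critical", ["pos-checkout-01", "pos-checkout-02"]),
]

FAILURE_THRESHOLD = 0.01   # halt if more than 1% of a ring's hosts report problems
SOAK_SECONDS = 60 * 60     # let each ring run for an hour before promoting further


def deploy_to(hosts: list[str]) -> None:
    """Placeholder: push the update via your endpoint-management tooling."""
    print(f"deploying update to {len(hosts)} hosts: {hosts}")


def failure_rate(hosts: list[str]) -> float:
    """Placeholder: query monitoring for crash or boot-loop reports from these hosts."""
    return 0.0


def staged_rollout() -> bool:
    for ring_name, hosts in ROLLOUT_RINGS:
        deploy_to(hosts)
        time.sleep(SOAK_SECONDS)
        rate = failure_rate(hosts)
        if rate > FAILURE_THRESHOLD:
            print(f"ring '{ring_name}' failure rate {rate:.1%}: halting rollout")
            return False
        print(f"ring '{ring_name}' healthy, promoting to next ring")
    return True


if __name__ == "__main__":
    staged_rollout()
```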

Patch management is also a cornerstone of ITIL in many organisations, but another reason this outage occurred is that the CrowdStrike driver update was delivered as a channel update, which cannot be staged in the way signature updates can.

Additionally, Windows loads third-party system control components into kernel space without thorough sanity checks, which exacerbated the issue. Once loaded, these components effectively become part of the Windows kernel, and if they fail, they can crash the entire system.
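
The defensive principle here is that a component running in kernel space should validate externally supplied content before acting on it, and fail safely if that content does not match what it expects. The snippet below illustrates the principle only, in user-space Python as an analogy; the field count and record structure are hypothetical, not CrowdStrike's.

```python
EXPECTED_FIELD_COUNT = 20   # hypothetical: what this build of the component understands


class ContentValidationError(Exception):
    """Raised so malformed content is rejected instead of propagating into a crash."""


def read_field(record: list[str], index: int) -> str:
    """Access a field only after confirming the record actually contains it."""
    if len(record) != EXPECTED_FIELD_COUNT or not 0 <= index < len(record):
        # In kernel space an unchecked access here can take down the whole machine;
        # a defensive component rejects the content and keeps the system running.
        raise ContentValidationError(
            f"record has {len(record)} fields, expected {EXPECTED_FIELD_COUNT} "
            f"(requested index {index})"
        )
    return record[index]


if __name__ == "__main__":
    good = [f"value-{i}" for i in range(EXPECTED_FIELD_COUNT)]
    bad = good + ["unexpected-extra-field"]
    print(read_field(good, 5))        # works: record matches the expected shape
    try:
        read_field(bad, 5)
    except ContentValidationError as err:
        print(f"rejected malformed content: {err}")
```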

The financial repercussions of this incident are stark. CrowdStrike, once valued at US$83 billion, saw its market capitalisation plummet to US$58 billion, a loss of US$25 billion from the release of a single corrupted file.

This raises the question: why was thorough testing skipped?

Organisations often face tough decisions balancing cost against risk. As IT budgets shrink or are redirected towards cybersecurity, testing before code is deployed to production is often deprioritised. Last week's outage starkly demonstrates the critical nature of rigorous testing and quality engineering practices: without them, billions, if not trillions, of dollars in revenue can be jeopardised by a single faulty update.

Planit recognises the significance of robust testing in preventing such incidents and is committed to helping organisations mitigate these risks going forward. By subjecting updates to rigorous testing, organisations can identify and address potential logic errors, vulnerabilities, and compatibility issues before deployment.

This proactive approach ensures the highest level of quality and reliability for all updates, safeguarding against costly outages and their associated financial and reputational consequences.

What was the key takeaway from the CrowdStrike outage?

Prioritise testing and ensure coverage up to and into production release builds. Testing should never be an afterthought – your business, your clients’ businesses, privacy, and lives are at stake.

http://www.planit.com/