By Raja Bhattal, CTO | Mechdyne Corporation
As Europe and Asia logged into work on the morning of July 19, 2024, they were met with a nasty shock: a global IT outage. Soon, major news outlets across the globe were reporting that a software update from a widely adopted cybersecurity provider, CrowdStrike, had gone awry and was crashing Windows systems. From airports in Hong Kong to banks in Brazil, even notable corporate players like Amazon Web Services felt the enormous pinch of suddenly being without proper IT and endpoint support.
Unfortunately, this disruption quickly became a direct public safety concern: 911 dispatch centers in multiple metropolitan areas across the United States reported limited or downed operations, while some hospitals and healthcare systems had to put procedures and surgeries on hold. This put some patients at great personal risk, with one American family reporting that an emergency surgery was delayed because of the failure.
At this point, CrowdStrike responded by issuing a public set of remediation instructions, which made the rounds on social media and in email inboxes alike. However, IT personnel found applying these instructions to be a greater challenge than anticipated: because the fix typically required booting affected machines into safe mode and removing the faulty file, many found that only those with physical access to machines could deploy it, leaving remote resolution out of the question.
As I watched these events unfold in real time, and as they affected some of our own clients who relied on CrowdStrike, I realized something that may sound counterintuitive: by relying solely on technology, with no manual fallback, much of the world was left stranded, in the dark, and unable to carry out day-to-day tasks as normal.
Here, I want to break down the top three reasons we've seen as an organization why something like this happens, and how you can prepare for any similar emergency in the future in a way that leaves you empowered rather than stranded.
Lesson 1: Every organization needs a solid set of "lights-out" management capabilities.
As defined by Gartner, a lights-out recovery effort is “the management of a remote (and largely unmanned) recovery data center [using] remote management software.” This idea derives from the concept of setting up a server room, then shutting off the lights and letting your IT technicians work remotely on said server. It’s a crucial part of planning for emergency or disaster situations, which can include anything from natural disasters to an everyday power outage caused by construction.
In the case of CrowdStrike, lights-out recovery planning would have been the key to avoiding most issues in real time. For organizations relying on a vendor like CrowdStrike for endpoint management, a lights-out maturity assessment is key to this kind of advance planning, and can include analyzing:
- Network infrastructure access
- Virtual infrastructure
- Server infrastructure
This kind of assessment leads to strategic planning that gives you multiple avenues for empowerment and proactivity, no matter what the emergency may be. And it can ensure that you retain your full remote management capabilities on the standards your company sets, rather than on terms dictated by an outside force.
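To make the idea concrete, a maturity assessment like this can be captured in a simple scoring model. The checklist items and weights below are hypothetical illustrations, not a Gartner or Mechdyne standard; a real assessment would be tailored to your environment.

```python
# Hypothetical lights-out maturity scoring sketch; the checklist items
# and weights are illustrative, not an industry standard.

CHECKLIST = {
    "oob_server_console": 3,   # IPMI/iLO/iDRAC console independent of the OS
    "remote_power_cycle": 3,   # can reboot hardware without on-site staff
    "remote_boot_media": 2,    # can mount recovery ISOs remotely
    "oob_network_access": 2,   # out-of-band console access to switches/routers
    "virtual_console": 1,      # hypervisor-level console for VMs
}

def lights_out_score(capabilities: set[str]) -> float:
    """Return a 0.0-1.0 maturity score for the capabilities present."""
    total = sum(CHECKLIST.values())
    have = sum(w for item, w in CHECKLIST.items() if item in capabilities)
    return have / total

# A site with only virtual consoles scores low; one with out-of-band
# hardware access scores much higher.
print(lights_out_score({"virtual_console"}))
print(lights_out_score({"oob_server_console", "remote_power_cycle",
                        "remote_boot_media"}))
```

The weighting reflects the CrowdStrike lesson: capabilities that work when the operating system itself will not boot are worth the most.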
Lesson 2: Get a pilot program in place, and make sure to test it fully within the context of your internal processes long before an emergency has the chance to arise.
Speaking of organizational standards, merely having a pilot program in place is not enough on its own.
Listen, balancing zero-day attack exposure against a proper change-control process and an internal patch pilot strategy is genuinely challenging. It takes time to test vendor updates in your unique environment, which delays the deployment of critical updates to mission-critical systems. This recent global event is a stark reminder of how interconnected we are, and that tried-and-true practices still have a place in the modern IT landscape.
The CrowdStrike incident has painfully reminded us that organizations cannot blindly trust third-party vendors to deliver error-free, bug-free products. Many organizations will be forced to revisit a staggered (pilot) approach in order to evaluate patches in a much more controlled manner. Organizations will need to wrestle with balancing security expediency against protecting the operational status of mission-critical systems.
There is a lot that goes into designing, implementing, and maintaining an adequate pilot program that effectively verifies an "update" will not negatively impact your business operations. It is not just an IT task and responsibility; it requires investment in both infrastructure and people. For organizations struggling to implement a solid pilot program, I encourage business leaders to collaborate with your IT teams and external partners to better understand the options and quickly implement a program based on your unique business requirements.
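One way to reason about a staggered pilot is as a sequence of deployment rings with a health gate between each. The sketch below is a simplified illustration under assumed names (the ring definitions and the health-check hook are placeholders), not a drop-in tool or any vendor's actual mechanism.

```python
# Simplified ring-based rollout sketch. Ring names and the deploy/health
# hooks are hypothetical placeholders for your own tooling.

from typing import Callable

RINGS = ["lab", "it_pilot", "early_adopters", "broad_fleet"]

def staged_rollout(deploy: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> list[str]:
    """Deploy ring by ring, halting at the first ring that fails its
    health gate. Returns the rings that passed."""
    completed = []
    for ring in RINGS:
        deploy(ring)
        if not healthy(ring):   # gate: stop before the blast radius grows
            break
        completed.append(ring)
    return completed

# Example: a patch that breaks in the IT pilot never reaches the fleet.
deployed_to: list[str] = []
result = staged_rollout(
    deploy=deployed_to.append,
    healthy=lambda ring: ring != "it_pilot",  # simulate a pilot failure
)
```

The design choice worth noting is the gate between rings: the bad update still lands somewhere, but only on a small, recoverable population that IT can reach.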
Lesson 3: Don’t wait for an emergency to happen. Assume it will happen and use scenario analyses to your advantage.
As I saw the CrowdStrike news gain more traction, I wondered just how many organizations could have sidestepped the worst of the disruption by having the results of a scenario analysis in hand and full preparation for any circumstance that could arise.
Weather, economic shocks, or global IT outages: these are just a few of the many obstacles or emergencies that can absolutely impact your company's ability to digitally tap into day-to-day work, regardless of industry or sector. Through a thoughtfully executed scenario analysis, you can strategize a custom solution that leaves no stone unturned in any emergency you and your organization may encounter.
At Mechdyne, I'm always thrilled to see how our talent is able to work with our clients to flesh out the strategic scenarios that could have some kind of negative impact on day-to-day work. We work together with our clients to identify:
- Tech footprints
- Current capabilities and their future trajectory
- The level of investment in a tech stack
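A lightweight way to prioritize the output of an analysis like this is a likelihood-times-impact ranking. The scenarios and 1-5 scores below are invented for illustration and are not a real client's risk register.

```python
# Illustrative scenario ranking; the scenarios and 1-5 scores are invented.

scenarios = [
    {"name": "vendor patch outage",   "likelihood": 3, "impact": 5},
    {"name": "regional power loss",   "likelihood": 2, "impact": 4},
    {"name": "severe weather",        "likelihood": 4, "impact": 3},
    {"name": "hardware supply delay", "likelihood": 3, "impact": 2},
]

def rank_scenarios(items: list[dict]) -> list[dict]:
    """Sort scenarios by risk score (likelihood x impact), highest first."""
    return sorted(items, key=lambda s: s["likelihood"] * s["impact"],
                  reverse=True)

for s in rank_scenarios(scenarios):
    print(s["name"], s["likelihood"] * s["impact"])
```

Even a rough ranking like this forces the conversation the CrowdStrike outage demanded: which scenarios deserve mitigation investment first.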
Proactively Preparing for Anything and Everything
While some organizations and support desks are still feeling the pinch of the debacle, the worst seems to be over for CrowdStrike and its clients. And while this was certainly an unfortunate situation, I also see it as a chance to examine some truly brutal realities and make positive changes from the lessons learned. By knowing how and when to test and plan for emergencies, and by fully accepting that emergencies will occur, organizational and IT leaders alike can work together to head off any situation before it strikes.
Ready to get your own emergency plan in place? We’re here to help. Contact us today to get prepared for anything that comes your way.