Our Vulnerability: CrowdStrike Outage, Infrastructure Fragility and Subscription Software

The CrowdStrike outage has me thinking about how fragile our digital infrastructure is and how vulnerable we are. We depend on digital technologies and infrastructures for nearly everything we do individually and as a society. Our most basic services, including water, electricity, and transportation, to name just a few, depend on the stability and reliability of digital infrastructure. This outage was caused by a mistake in a software update. Imagine what happens when there is a full-scale cyber attack.

As I was writing this, I received the following text from my CFO:

From … - The DC leadership trip team is stuck in DC after waiting 6 hours at the airport. The flight is canceled. They are one of the thousands of flights canceled due to MS meltdown today. I am going to tell … and … to use WCA card to book a hotel. We have 13 in this trip.

In addition to good backups, I am thinking through how best to protect my digital assets and my family in the event of a significant and prolonged disruption of essential services. This is probably something we all need to be thinking about. Not in a paranoid manner, but a prudent one.

The catastrophic situation reflects the fragility and deep interconnectedness of the internet. Numerous security practitioners told WIRED that they anticipated or even worked with clients to attempt to protect against a scenario where defense software itself caused cascading failures as a result of malicious exploitation or human error, as is the case with CrowdStrike. “This is an incredibly powerful illustration of our global digital vulnerabilities and the fragility of core internet infrastructure,” says Ciaran Martin, a professor at the University of Oxford and the former head of the UK’s National Cyber Security Center.


The impact of subscription software

This brings me to software. It is no longer a product we buy and install under the assumption that all the bugs have been massaged out of the system. Rather, today software is packaged as a service, requiring repeat subscription fees and regular maintenance – such as this CrowdStrike update – to fix bugs that are expected to emerge during its use. As a service, software has been assetized: Annual subscription fees generate far more revenues than product sales, while giving companies significantly more control over how their software is used.

Other companies are or have already gone down this assetization route. Software in tractors can stop farmers from doing repairs on their own machines, requiring them to pay the tractor manufacturers instead. Software in automobiles means car owners are increasingly being asked to pay for heated seats and other basic functions. Software updates by printer manufacturers can brick the printer, rendering it inoperable, if generic ink is used. And so on.

Opinion: CrowdStrike-Microsoft outage is what happens when software becomes a subscription - The Globe and Mail.

ADDENDUM From the Washington Post:

Consumers of technology expect software to perform, and it usually does. But that invites complacency and digital illiteracy: We don’t remember anyone’s phone number because on a smartphone you just tap the name and the call goes through. We don’t carry cash because everyone takes plastic.

Life in the 21st century is pretty magical — until it’s not.

Marcus fears that society will become even more vulnerable as we rely increasingly on artificial intelligence. On X, he wrote: “The world needs to up its software game massively. We need to invest in improving software reliability and methodology, not rushing out half-baked chatbots. An unregulated AI industry is a recipe for disaster.”

The AI revolution — which did not come up a single time during the June presidential debate between President Biden and former president Donald Trump — is poised to make these systems even more interdependent and opaque, making human society more vulnerable in ways no one can fully predict.

4 Likes

I have a few thoughts on this:

The first thought is that our digital infrastructure should not be a monoculture, meaning most people should not be solely on Windows. There should be a good mix of Mac and Linux for end users.

A prairie consisting of hundreds of different kinds of grasses is much more resilient to ecological stressors than a monoculture of wheat, for instance.

Second, in some cases paper (analog) is better than digital. I usually do my scheduling in my paper planner, which I just lost. It’s faster than using the digital calendar because I have so many small appointments a day. I usually keep it with me constantly, but it’s summer, there’s no AC at home, and I think I’m not getting enough sleep. So, how do you back up paper? :)

Third, I do agree that a slower release cycle would be better than subscription as a service. One nice thing about Agenda is they don’t come out with lots of features at once. It takes a long time for them to come out with another major release.

5 Likes

This is the risk of technological progress. People are distanced more and more from understanding how the things they use work, especially from the idea of being able to fix them, but this has been a slippery slope since the late 1800s.

On the topic of Macs and Linux being unaffected by today’s issue: there was a time about 10 to 15 years ago when Microsoft tried to lock down the Windows NT kernel and exclude third-party extensions. This would have made Windows far more secure, but they were prevented from doing so by the courts because it was deemed “anti-competitive”.

Bullshit like this leaves Windows more vulnerable, and with consolidation in security software, it increases the risk of something like today hitting a significant number of companies.

3 Likes

So true! One thing many of us here who started our computer journeys in the early 80s or before have in common is that the technology grew up with us. Sticking strictly to technology, many people don’t understand what goes on in their computers because of all the layers of abstraction that we didn’t have. Many of us endured the frustration of messing around with jumpers or the joy of trying to figure out how to program our Sound Blaster cards and modems to automate prank calls (well, I don’t personally know anybody who tried that, but I’ve heard stories).

Remember when computer manuals used to come with tables explaining every component in the machine, its functionality, ranges, and the like?

I’m not trying to go down memory lane, only to make the point that many of us started at ground zero and have historical knowledge about the innards of the machine that ordinary users don’t have today. Obviously, sysadmins, low-level developers, and the like still learn all this today.

But that lack of historical insight makes it difficult to solve, and sometimes even comprehend, problems like these. We didn’t have to work for it because we lived it.

2 Likes

I don’t see an alternative to subscriptions. Antivirus/security software has never been a one-time purchase. The last system I managed checked for updates every hour, so we had to pay for support in order to keep our computers up to date.

The same is true of other systems: our email protection server, our firewalls, etc. Just one of our call center servers sold for around $50,000 and still cost $1,700/month for support.

If we didn’t have a support contract with IBM for our midrange computer and had a problem, we could still get support and pay for time and materials. But customers with a support contract would always have priority. Until an IBM technician actually started work, he could be called away, even if he was already at our location.

Subscriptions aren’t new; they’ve just been known as support contracts for the last 60 or 70 years.

3 Likes

That’s easier said than done. Cloud- or server-based solutions are the only way that Windows, Mac, and Linux users can run the same software.

More than 60% of the customer cores on Azure are running Linux, but they are just virtual machines running on Windows servers.

1 Like

There is no piece of software to which this applies; all software has bugs.

12 Likes

After seeing too many “hot takes” on Mastodon (and no, I am not putting this post in the same bucket), I couldn’t help but post this:

There is no bogeyman here… just people who didn’t understand risk. True of the perpetrators and true of the users of their services.

3 Likes

That used to be true, but more and more locally installed cross-platform end-user software continues to become available thanks to frameworks like Electron, and that trend is likely to accelerate as more resource-efficient options emerge and become popular.

1 Like

The popularity of Electron for this just means it is another single point of failure. There is no silver bullet.

2 Likes

I didn’t say it was a silver bullet. I was only addressing WayneG’s sentence that I quoted.

That said, as more cross-platform frameworks emerge, there will be less risk that a user relies on only one of them. Even now, they’re making users less dependent on a single platform, which is another single point of failure.

Btw, your point sheds a rather negative light on the ideal some Mac users have of only using apps coded in Swift.

1 Like

I believe we should start by deciding what we need to protect, so that we can plan how to protect it.

In my opinion, the problem is not subscription services but whether my software will still operate during, for example, an internet disruption.
There are multiple levels of protection, and you can go as far as you can afford, such as keeping printed copies of digital photos as well. So it is a comfort-cost-benefit analysis of what you believe is crucial vs. important vs. nice-to-have.

In my opinion, here are some categories I need protected:

  1. Shelter (a secondary place to live that you can move to quickly, such as the home of close relatives)
  2. Power (backup power)
  3. Finances (some backup cash, for example)
  4. Transportation
  5. Health (usually covered by organizations such as the government)
  6. Food
  7. Finally, digital systems

By the way, in the last week I have witnessed people en masse affected by all of the above, between a massive thunderstorm and today’s outage.

Much of the software we used was Windows-only for decades and was rewritten to run in a browser about six years ago. As one vendor put it, “we got tired of having to rewrite our software every time Microsoft updated their OS”.

1 Like

It was strange to have this in the news, forcing out the previous headlines here, which were all about the first report of the UK Covid Inquiry. This report was about the UK’s readiness and planning for a pandemic and, more generally, for any major, disruptive event.

That strange juxtaposition prompted a few thoughts:

  • we are already long past the point where individuals or communities can be self-sufficient, especially if they want access to any of the many benefits of a modern civilisation. My guess is that tipping point happened somewhere around 1400 for most of the planet. We depend on social and technological systems whether we want to or not.
  • any system will fail, often in ways and at times which no one can easily predict
  • the effects of any failure will have consequences that are very difficult to predict, but are likely to include further, cascading failures
  • resilience to failure means having sufficient resources (people, places, equipment, transport etc.) of sufficient quality already in reserve, and practised plans to mobilise those resources and deploy them flexibly and effectively depending on need
  • it also means having effective recovery plans and resources for important or critical systems (e.g. backups, fall-backs, parallel systems etc.)

This was all well understood and moderately well followed when the Cold War (and fairly recent experience of the Second World War) meant that there were obvious, fundamental threats. But we’ve now had many decades, in almost all Western countries, of a determined wish to shrink “government”, cut public spending by making “efficiency savings”, remove “unelected bureaucrats” and the like. Many critical systems and much critical infrastructure are in private ownership, and corporations cannot or will not invest in reserves that everyone hopes will never be called on. The Covid report recommends structures and processes for the UK government to build back some of that lost resilience, not least because so much was wasted (and probably corrupted) by trying to scramble together a response in a hurry.

Of course, individuals, families and communities should make sensible plans to contribute to their own resilience and flexibility, but unless you are very rich indeed, you can’t plan to replace a failed airline flight or crashed airport or broken railway system. If we want better responses when systems fail, we need to build more reserves into those systems, and we all have to be willing to pay for those reserves.

2 Likes

And just to add: the over-reliance on Windows and a single security software supplier is all about “efficiency” defined in a simplistic way as “paying the least possible”.

1 Like

From media snippets, it seems the update was pushed out to thousands of Windows desktops, if not more. A null-pointer error? Whatever it was, it caused the “blue screen” seen in many of the media photos. That seems to mean that thousands of computers, if not more, have to be fixed by hand: how do you reach a machine over the network when it is “blue-screened”?

All that aside, there are many odd things about the whole affair that will require a lot of investigation. We will probably never know.

We have friends who were planning to visit us after transatlantic flights yesterday … they have not arrived, and there is no ETA.

Yeah, from what I’ve read it’s because a lot of systems auto-update overnight, so there was no way to prevent so many devices being affected once the update had hit external networks/devices.

There’s a HUGE question, from my point of view, about how the catastrophic nature of the bug wasn’t picked up during quality checks and testing, but that aside, the developers have no control over how people implement updates.

And from the corporate world’s point of view, of course we (as a society) usually choose to auto-update security packages at night. It’s the quietest time and least likely to inconvenience business.

@Bmosbacker Science fiction has of course been pondering this problem for decades, but one of the most recent iterations is Louis de Bernières’s new book, Light Over Liskeard, which looks at how a cyber failure would affect society at some indeterminate point in the near future.

[There’s a separate interesting point here: what was once the purview of science fiction, always treated as a lesser literary genre for spurious reasons, is now being picked up by mainstream “serious literature” novelists. It speaks to a cultural shift in society, I think (we are now living in “the future” and what was once side-lined fantasy now affects us all).]

1 Like

It’s been well reported that a routine update included a bad configuration file. It really does seem to be human error (something like someone clicking on the wrong file), not caught (as it should have been) by even a simple checklist procedure, adequate testing before release, or rolling release with rapid response to any initial problems reported.
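To make the “rolling release” idea concrete, here is a rough Python sketch of how such a rollout could work. It is only an illustration under my own assumptions: `staged_rollout`, `apply_update`, the stage percentages and the 2% failure threshold are hypothetical names and numbers, not anything CrowdStrike or any other vendor is known to use.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 50, 100),
    max_failure_rate: float = 0.02,
) -> dict:
    """Push an update in growing waves and halt if a wave fails too often.

    Everything here is hypothetical, for illustration only.
    `apply_update(host)` is assumed to return True when the host is
    healthy after the update and False otherwise.
    """
    updated: list[str] = []
    failed: list[str] = []
    done = 0  # number of hosts attempted so far

    for percent in stage_percents:
        # Cumulative number of hosts attempted by the end of this stage.
        target = min(len(hosts), max(1, len(hosts) * percent // 100))
        wave = hosts[done:target]
        if not wave:
            continue

        wave_failures = 0
        for host in wave:
            if apply_update(host):
                updated.append(host)
            else:
                failed.append(host)
                wave_failures += 1
        done = target

        # Stop before the next, larger wave if this one went badly.
        if wave_failures / len(wave) > max_failure_rate:
            return {"halted": True, "updated": updated,
                    "failed": failed, "untouched": list(hosts[done:])}

    return {"halted": False, "updated": updated,
            "failed": failed, "untouched": list(hosts[done:])}


# Simulate an update that breaks every machine it touches.
fleet = [f"host-{i}" for i in range(1000)]
result = staged_rollout(fleet, apply_update=lambda host: False)
print(result["halted"], len(result["failed"]), len(result["untouched"]))
# prints: True 10 990
```

In this toy run, the bad update breaks the 10 machines in the first 1% wave and the other 990 are never touched. The design choice is simply that each wave has to pass a health check before a larger one starts.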

There’s nothing odd about this at all. I’ve worked in enough places where no-one realised that “the way we always did things” was going to bite us one day, until it did. As we say in Blighty, “cock-up is always more likely than conspiracy”, and it can be just as damaging.

1 Like

It’s almost never a conspiracy; 99% of the time it’s just incompetence :joy: We attribute far too much intelligence, diligence and forward-planning to the average human being.

And I say that coming from a field dealing with the biggest conspiracy successfully executed in the history of humankind. Most of the time, humans are just not that scheming :joy: A well-executed conspiracy is the exception, not the rule.

4 Likes