Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

MicroWave@lemmy.world · 2 months ago


Poem_for_your_sprog@lemmy.world · 2 months ago

Why do news outlets keep calling it a Microsoft outage? It’s only a CrowdStrike issue, right? Microsoft doesn’t have anything to do with it?

Rekhyt@lemmy.world · 2 months ago

It was a CrowdStrike-triggered issue that only affected Microsoft Windows machines. CrowdStrike on Linux didn’t have issues and Windows without CrowdStrike didn’t have issues. It’s appropriate to refer to it as a Microsoft-CrowdStrike outage.

Poem_for_your_sprog@lemmy.world · 2 months ago

I guess Microsoft-CrowdStrike is fair, since the OS doesn’t have any kind of protection against a shitty antivirus destroying it.

I keep seeing articles that just say “Microsoft outage”, even on major outlets like CNN.

Dran@lemmy.world · 2 months ago

To be clear, an operating system in an enterprise environment should have mechanisms to access and modify core system functions. Guard-railing anything that could cause an outage like this would make Microsoft a monopoly provider in any service category that requires this kind of access to work (antivirus, auditing, etc). That is arguably worse than incompetent IT departments hiring incompetent vendors to install malware across their fleets resulting in mass-downtime.

The key takeaway here isn’t that Microsoft should change Windows to prevent this, it’s that Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.

Echo Dot@feddit.uk · 2 months ago

“Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.”

I guarantee someone in their IT department raised the point of not just downloading updates. I can guarantee they advised testing them first, because any borderline-competent IT professional knows this stuff. I can also guarantee they were ignored.

ricecake@sh.itjust.works · 2 months ago

Also, part of the issue is that the update rolled out in a way that bypassed deployments having auto updates disabled.

You did not have the ability to disable this type of update or control how it rolled out.

https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/

Their fix for the issue includes “slow rolling their updates”, “monitoring the updates”, “letting customers decide if they want to receive updates”, and “telling customers about the updates”.

Delta could have done everything by the book regarding staggered updates and testing before deployment and it wouldn’t have made any difference at all. (They’re an airline so they probably didn’t but it wouldn’t have helped if they had).
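For anyone unsure what “staggered updates” means in practice, here is a minimal sketch, not anything CrowdStrike or Delta actually runs. The host names, the `deploy` callback and the `is_healthy` check are all made up for illustration. The idea is just to push to a small canary ring first and stop if that ring comes back unhealthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy ring by ring and halt after the first ring with an unhealthy host.

    Returns every host that received the update, including the failed ring.
    """
    updated: list[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        # Health gate: do not move on to the next ring if this one broke.
        if not all(is_healthy(host) for host in ring):
            break
    return updated


# Hypothetical fleet: one canary, then progressively larger rings.
rings = [["canary-1"], ["app-1", "app-2"], ["app-3", "app-4", "app-5"]]

# Simulate a bad update that takes down the canary.
crashed = {"canary-1"}
touched = staged_rollout(
    rings,
    deploy=lambda host: None,                 # stand-in for the real push
    is_healthy=lambda host: host not in crashed,
)
print(touched)  # only the canary ring was touched
```

This only helps if the vendor’s update path actually goes through a gate like this, which is the whole problem in this case.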

corsicanguppy@lemmy.ca · 2 months ago

“Delta could have done everything by the book”

Except pretty much every paragraph in ISO27002.

That book?

Highlights include:

ops procedures and responsibilities
change management (ohh. That’s a good one)
environmental segregation for safety (ie don’t test in prod)
controls against malware
INSTALLATION OF SOFTWARE ON OPERATIONAL SYSTEMS
restrictions on software installation (ie don’t have random fuckwits updating stuff)

…etc. like, it’s all in there. And I get it’s super-fetch to do the cool stuff that looks great on a resume, but maybe, just fucking maybe, we should be operating like we don’t want to use that resume every 3 months.

External people controlling your software rollout by virtue of locking you into some cloud bullshit for security software, when everyone knows they don’t give a shit about your apps’ security or your SLA?

Glad Skippy’s got a good looking resume.

ricecake@sh.itjust.works · 2 months ago

Yes, that book. Because the software indicated to end users that they had disabled or otherwise asserted appropriate controls on the system updating itself and its update process.

That’s sorta the point of why so many people are so shocked and angry about what went wrong, and why I said “could have done everything by the book”.

As far as the software communicated to anyone managing it, it should not have been doing updates, and CrowdStrike didn’t advertise that it updated certain definition files outside of the exposed settings, nor did they communicate that those changes were happening.

Pretend you’ve got a nice little fleet of servers. Let’s pretend they’re running some vaguely responsible Linux distro, like CentOS or Ubuntu.
Pretend that nothing updates without your permission, so everything is properly by the book. You host local repositories that all your servers pull from so you can verify every package change.
Now pretend that, unbeknownst to you, Canonical or Red Hat had added a little thing to dnf or apt to let it install really important updates really fast, and it didn’t pay any attention to any of your configuration files, not even the setting that says “do not under any circumstances install anything without my express direction”.
Now pretend they use this to push out a kernel update that patches your kernel into a bowl of lukewarm oatmeal and reboots your entire fleet into the abyss.
Is it fair to say that the admin of this fleet is a total fuckup for using a vendor that, up until this moment, was generally well regarded and presented no real reason to doubt while being commonly used? Even though they used software that connected to the Internet, and maybe even paid for it?

People use tools that other people build. When the tool does something totally insane that they specifically configured it not to, it’s weird to just keep blaming them for not doing everything in-house. Because what sort of asshole airline doesn’t write their own antivirus?