Crowdstruck (Windows Outage) - Computerphile

  • Published 15 Oct 2024
  • Nearly nine million Windows machines were taken out by the Crowdstrike problem in July 2024, but why was the impact so problematic? Dr Steve Bagley and Dr Mike Pound of the University of Nottingham discuss the problem.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan...
    Thank you to Jane Street for their support of this channel. Learn more: www.janestreet...

COMMENTS • 1K

  • @james_chatman
    @james_chatman 2 months ago +923

    I got dragged into this and I'm now at 48 hours of overtime. Thanks CrowdStrike.

    • @jklax
      @jklax 2 months ago +5

      @Nigelfarij I was about to say

    • @FrietjeOorlog
      @FrietjeOorlog 2 months ago +59

      @Nigelfarij Tell that to the taxman.

    • @sunefred
      @sunefred 2 months ago +7

      That's crazy. What's your patch rate per hour? How many machines?

    • @Artifactorfiction
      @Artifactorfiction 2 months ago

      Ujjjj😢😢😢😢😢😢😢😢😢😢😢

    • @rationalbushcraft
      @rationalbushcraft 2 months ago +16

      Did you guys get the USB Microsoft created to automatically fix it? What is cool is that the WinPE USB drive just boots into safe mode and runs the repair.cmd file it creates. I am keeping this, as it will be easy to change that batch file and have it do other things in the future if I want to.

  • @luicecifer
    @luicecifer 2 months ago +763

    "Well, well, well. Tell me, young gentlemen, why is it always you two when something bad happens??"

    • @throwaway6478
      @throwaway6478 2 months ago +19

      Because we rule the world, and a one in a billion chance is next Tuesday for us.

    • @SubTroppo
      @SubTroppo 2 months ago +4

      I am reminded of Cheech & Chong, but high on technology. I mean man, what can you do?

    • @reallyWyrd
      @reallyWyrd 2 months ago +6

      "It's a gift." -- the 4th Doctor

    • @Nicolas-L-F
      @Nicolas-L-F 2 months ago

      Well put

    • @nahco3994
      @nahco3994 2 months ago +6

      That's a bit unfair, isn't it? Crowdstrike managed to crash tons of Linux systems with the exact same software this April. Same software (Falcon), same problem (kernel panic). Only nobody made a big deal about it back then. Dr. Bagley even mentions it briefly in the video.

  • @leighhaynes
    @leighhaynes 2 months ago +214

    McAfee did something similar several years ago. A bad definition quarantined core system files. The McAfee CTO from that era is now CEO at Crowdstrike.

    • @somethinglikethat2176
      @somethinglikethat2176 2 months ago +68

      To borrow a comment from elsewhere "real men test in production on a Friday"

    • @acrazydurian
      @acrazydurian 2 months ago +17

      A fine example of "failing up"

    • @alvintollah
      @alvintollah 2 months ago +4

      1 time is a mistake to be learned from. 2 times are a pattern of behaviour, signalling deeper flaws.

  • @TheAnonymmynona
    @TheAnonymmynona 2 months ago +293

    So there were 3 separate failures from Crowdstrike.
    1. The kernel Driver didn't have proper input validation
    2. The Channel File was broken
    3. The testing was so abysmal that they didn't notice before sending the update out to customers.

    • @torbjornlindh5108
      @torbjornlindh5108 2 months ago +38

      It’s quite scary that they get their kernel driver signed, despite it not meeting the standard of validating all input! That’s a systemic problem with their entire solution! (Well, so is the third, but testing is not how you build quality into the system, so I think the first is the fatal flaw.)

    • @jbird4478
      @jbird4478 2 months ago +32

      4. They didn't even notice that every client that updated went down, or at least they didn't respond. How that is even possible is beyond me. Their entire product is based on monitoring systems, but it took them hours to respond, and that was after Google had called them out for the chaos everywhere.

    • @SkandiaAUS
      @SkandiaAUS 2 months ago +15

      I think #3 is the worst and why their share price is tanking. Such an utter lack of responsibility to Yolo this into prod.

    • @ReverendTed
      @ReverendTed 2 months ago +18

      It does call into question the WHQL testing that allowed the driver to be signed, which does push some degree of responsibility back to Microsoft.

    • @jimfoye1055
      @jimfoye1055 2 months ago +4

      @ReverendTed Bingo.

  • @oourdumb
    @oourdumb 2 months ago +400

    The real worry is the lack of QA at Enterprise companies. A state actor infiltrating one of these orgs would be absolutely devastating.

    • @SuperWolfkin
      @SuperWolfkin 2 months ago +44

      The real issue and worry is a monoculture. This sort of problem will always happen. Someone is always going to be affected, and there's always going to be a cohort of people who are unfairly affected by things that are out of their control. The problem is that the cohort here happens to be extremely big because there's a monoculture of this type of software. Monopolies lead to monocultures, and monocultures lead to unique weaknesses. This unique weakness was able to take out millions of computers all around the world because everyone was using this software. We need more companies in this space. Even now, after this has happened, everyone basically has to look to Crowdstrike because that's who everyone uses. It sounds like there's no competitive alternative.

    • @vincei4252
      @vincei4252 2 months ago +2

      It has been and still is devastating. Didn't need the boogieman to show this.

    • @BongoBaggins
      @BongoBaggins 2 months ago +5

      If you can think of it, someone has already done it.

    • @NoahSpurrier
      @NoahSpurrier 2 months ago

      There are probably already some bad actors out there. Just look at the catastrophic instances of espionage inside the CIA. See Robert Hanssen and Aldrich Ames.

    • @sandwich2473
      @sandwich2473 2 months ago +6

      Agile!!!!!!!!!
      I love Agile development practices!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  • @solimm4sks510
    @solimm4sks510 2 months ago +428

    Heh the BSOD at 0:40 is cool
    "For more information about this issue and possible fixes, do not ask us"

    • @DailyFrankPeter
      @DailyFrankPeter 2 months ago +38

      But it's about as helpful as a genuine one!

    • @T_GingerDude5416
      @T_GingerDude5416 2 months ago +26

      also LEET% complete

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 2 months ago +8

      @T_GingerDude5416 All hail 1337!

    • @telebubba5527
      @telebubba5527 2 months ago +3

      Haven't come across that for years. Had totally forgotten what it looks like.

    • @crazymonkeyVII
      @crazymonkeyVII 2 months ago +1

      Could've been a genuine message from M$ then!

  • @wcmatthysen
    @wcmatthysen 2 months ago +172

    The problem is rolling out an update (that might not have been tested so well) TO EVERYONE ON THE PLANET AT THE SAME TIME. I can't believe Crowdstrike is operating like this. If you did a phased roll-out to a couple of smaller customers initially, and then monitored whether the updates had any glaring issues, this whole situation could have been averted.
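
The phased roll-out idea in the comment above can be sketched in a few lines of Python. This is only an illustration under assumed names: `phased_rollout`, the `deploy` callback (which installs the update on one host and reports whether the host is still healthy) and the ring layout are all hypothetical, and nothing here reflects how CrowdStrike actually distributes updates.

```python
from typing import Callable, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy ring by ring and stop once a ring exceeds the failure budget.

    `deploy(host)` is a hypothetical callback: it installs the update on one
    host and returns True if the host passes its health check afterwards.
    Returns the hosts that were updated successfully.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            # Halt here: the later (larger) rings never receive the update.
            break
    return updated
```

With a small first ring, a bad update only reaches the handful of machines in that ring before the roll-out halts; the trade-off is that urgent updates reach the last ring later.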

    • @ChrisM541
      @ChrisM541 2 months ago +26

      That's the nuts & bolts of it. Zero QC/QA before release. In an unregulated industry, this is damningly the norm.

    • @lever2k
      @lever2k 2 months ago +9

      I can't believe huge customers don't have a tiered approach to allowing patches to be deployed.

    • @Jai-xj7vy
      @Jai-xj7vy 2 months ago +6

      @lever2k What company do you work at that tiers endpoint protection updates? Never heard of such a thing. Crowdstrike may not even offer that capability.

    • @rolfs2165
      @rolfs2165 2 months ago +8

      @lever2k That's assuming the software even allows tiered deployment and doesn't expect _everything_ (including the main server) to be working on the same version - and any machine that isn't updated yet can only connect to update.

    • @TjPhysicist
      @TjPhysicist 2 months ago +10

      @lever2k Based on what I've been hearing from others online: a lot of companies **do** have a tiered approach for updates, including Crowdstrike, but this update, deemed by Crowdstrike to be very critical, ignored ALL such settings and was deployed unilaterally to everything.

  • @IstasPumaNevada
    @IstasPumaNevada 2 months ago +65

    "As I said online, you should just go outside and enjoy the sunshine."
    Okay, but what are people in the U.K. supposed to do?

    • @QuantumHistorian
      @QuantumHistorian 2 months ago +13

      Shots fired. But not seen in the UK, because of the dense cloud cover.

    • @blucat4
      @blucat4 2 months ago +1

      😄

  • @bilalsadiq1450
    @bilalsadiq1450 2 months ago +113

    If Dr Bagley and Dr Pound had a podcast, I'd definitely listen to them talk for hours lol.

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 2 months ago +9

      "The IT podcast with Bagley and Pound" Does that sound interesting to you?

    • @learningCodingWithMe
      @learningCodingWithMe 2 months ago

      @paulmichaelfreedman8334 Oh yeah, it does

    • @Turbo3032
      @Turbo3032 2 months ago +4

      A Computerphile podcast as a sister podcast to the Numberphile Podcast would be amazing!

    • @whathappenedman
      @whathappenedman 2 months ago +3

      Fr. I like listening to them speak

    • @scottydawg1234567
      @scottydawg1234567 2 months ago +3

      @paulmichaelfreedman8334 Yes, actually.

  • @adityavardhanjain
    @adityavardhanjain 2 months ago +60

    I was waiting for this video with extreme excitement for the last 2 days. I jumped on YouTube as soon as I saw the notification.

  • @BruceAngus
    @BruceAngus 2 months ago +34

    I was stuck in Atlanta's airport because of this. It was absolute madness, and everyone who talked about it, either from the airline or passengers, said it was a Microsoft issue. That's all most people are going to remember.

    • @0LoneTech
      @0LoneTech 2 months ago +6

      That's not entirely wrong. Microsoft did bless this software with the privileges to do whatever it likes to the entire system. They're in turn blaming this on the EU, but the EU only mandated that they give security software the same level of access their own has; it was Microsoft's choice to make that this risky. Then there's the trust placed in Crowdstrike; they're likely selected for being a known name, never mind they ran a previous company into the ground in this particular manner. It's like the hotel manager decided to install an entry counter in their front door and nobody asked why it's also a guillotine.

  • @satysin630
    @satysin630 2 months ago +153

    Nice touch with the 13.37% in the BSOD 😁

    • @3Ppaatt
      @3Ppaatt 2 months ago +3

      Hardly related, but I'll always remember my daughter was born at 13:37

    • @satysin630
      @satysin630 2 months ago +1

      @3Ppaatt That's because your daughter is elite!

    • @laurenlarmour2220
      @laurenlarmour2220 2 months ago

      Haven't seen a leet reference in the wild in ages. Two thumbs up!

    • @lanarkorras4411
      @lanarkorras4411 13 days ago

      @3Ppaatt Sorry you're not related though. 😜

  • @LunarcomplexMain
    @LunarcomplexMain 2 months ago +232

    I swear this is only the beginning for tech companies that are losing valued senior staff over the many, many decades...

    • @DoubleOhSilver
      @DoubleOhSilver 2 months ago +30

      Honestly I see why. This career is mostly miserable and the pay seems to be going down.

    • @kaseyboles30
      @kaseyboles30 2 months ago +55

      Senior staff who, in this case, probably cautioned against allowing code to run in kernel space before it's tested on a test system, because that's a fast track to exactly what happened. Senior staff who are likely tired of their expertise being ignored by suits who cannot comprehend that anything outside their niche might matter.

    • @vincei4252
      @vincei4252 2 months ago

      Losing? They think they can do things cheaper elsewhere and AI can replace everyone. I wish them luck in the wars to come. Yes, this was a fun career, and all I've seen is degradation of quality of life on a massive scale, where everything is micromanaged by 100% non-technical types. I don't miss it at all.

    • @vincei4252
      @vincei4252 2 months ago +24

      @DoubleOhSilver YouTube censored my comment. Wanted to say that I totally concur with the sentiment. Not only is it miserable, the hiring process that is adopted across the board seems to be nonsensical hazing rituals that do not map to real world problems or realistic development tasks and activities. The golden age is well and truly over.

    • @Abdega
      @Abdega 2 months ago +16

      Especially the ones who are losing senior staff who know the ins and outs of the product, and replacing them with “Business guy who does business things and doesn’t need to know how the technology works”

  • @era_s
    @era_s 2 months ago +35

    "If you put everything on the cloud, and then the cloud's not there, you've got nothing."

    • @kevinmcfarlane2752
      @kevinmcfarlane2752 2 months ago

      The clouds have multiple redundancies though, depending on how much the customer is willing to pay.

    • @tadeob_
      @tadeob_ 2 months ago +1

      What if the cloud and its redundancies were affected? 😮

  • @vincei4252
    @vincei4252 2 months ago +198

    In the modern version of Battlestar Galactica, Admiral Adama absolutely refused to have Galactica networked to other systems and ships in the fleet because of the risks to their critical IT systems. Yet here we are, allowing a rootkit to operate unconstrained on millions of machines. Fun times ahead.

    • @MrJegerjeg
      @MrJegerjeg 2 months ago +2

      Wow, I thought exactly the same! 😃

    • @evannibbe9375
      @evannibbe9375 2 months ago +4

      A lot of the computers that businesses give out to employees (such as ATM screens and point-of-sale devices) are so cheap that they become completely useless without a network connection (like a Chromebook), so the system was working “correctly enough”: it detected a problem in those (theoretically) cheap end computers and cut them off from the network. The failure was that the wrong thing was found to be a threat, and all those end computers were cut off.

    • @rolfs2165
      @rolfs2165 2 months ago +2

      @evannibbe9375 "Oops, it's all malware."

    • @thefrub
      @thefrub 2 months ago +18

      @evannibbe9375 I'm amazed, literally everything you just said in that comment is wrong. It's like I just watched Calvin's dad explain computers.

    • @ivonakis
      @ivonakis 2 months ago +1

      And kernel level anticheat is a thing ...

  • @minxythemerciless
    @minxythemerciless 2 months ago +101

    The guilty in this instance are both CrowdStrike and their Customer Security Managers.
    CrowdStrike has a history of shipping stuff that breaks systems, most recently their Linux product.
    The Customers said: Yes CrowdStrike just put whatever you want on our systems without monitoring. And by the way, we have no adequate disaster recovery plan.
    As a corollary, letting CrowdStrike put stuff on your systems also allows bad people to compromise CrowdStrike and deliver unlimited hurt.
    If I was a baddie I'd spend my every effort to subvert CrowdStrike!

    • @ipadista
      @ipadista 2 months ago +3

      There will most likely be a lot of QA positions opening at CrowdStrike in the aftermath of this. Bad actors just need to get one of "their guys" in through that recruitment process.

    • @LimitedWard
      @LimitedWard 2 months ago +12

      @ipadista I'd sooner expect more attorney positions to open up before QA

    • @justgame5508
      @justgame5508 2 months ago +2

      What an awful take

    • @haqvor
      @haqvor 2 months ago +6

      @justgame5508 Welcome to the corporate mindset. Protection against liability is more important than delivering a working product. Who do you think the company is prepared to pay the most, the lawyers or the engineers? That reflects how they value their respective services.

    • @jbird4478
      @jbird4478 2 months ago +1

      @lintfordpickle Yeah, but when our security software screws up it will a) first crash the test machine, which would block the rest from receiving the update, and b) if that somehow fails, our system would allow us to reboot with a previous system snapshot. To see these massive and vital organizations not have _any_ backup plans while putting full trust in an external company is mind-boggling.

  • @piranniayt
    @piranniayt 2 months ago +131

    Perfect storm: no fuzz testing of the driver code, no staged deployment, no OS blue/green boot partition

    • @Ash_18037
      @Ash_18037 2 months ago +4

      No, not really. A perfect storm implies the issue was due to various timing / bad luck factors, i.e. it lessens the culpability of ClownStrike. Each of the issues you mention was just plain incompetence.

    • @baumkuchen6543
      @baumkuchen6543 2 months ago +1

      I am afraid there was no testing at all in this mess. Everything points to that...

    • @draoi99
      @draoi99 2 months ago +4

      Third Party apps operating in kernelspace... FFS

    • @colinhobbs7265
      @colinhobbs7265 2 months ago

      @draoi99 All operating systems do this. If you are saying FFS about that, you don't know how computers work. Yes, including macOS.

  • @jeraldbottcher1588
    @jeraldbottcher1588 2 months ago +8

    This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment, patch deployment and software deployment, the whole thing, on both workstations and servers. We tested our patches extensively before pushing them out to the entire population of the environment. This first included a sandbox environment, then a select user / system environment, then we would stage our patches out over several hours so if something happened we could back out before catastrophe struck. And honestly, sometimes we would find problems with the patches, and we would be able to immediately stop, suspend and even back out.
    Yes, we would use 3rd party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix, normal business. We would never shirk our procedures: test first, then deploy. To me this is a total failure of IT Governance and a failure to maintain standards. (IT Governance is setting and maintaining standards and policies for the IT infrastructure.)

  • @BigMcLargeChungus
    @BigMcLargeChungus 2 months ago +33

    I think it's important to point out that Crowdstrike did the same thing back in April but it affected Linux machines (causing kernel panic).

    • @Techmagus76
      @Techmagus76 2 months ago +10

      But there wasn't much talk about that. Why? Probably because nearly all distros have a rollback mechanism for booting a previous working kernel.

    • @heinzk023
      @heinzk023 2 months ago +3

      Maybe CrowdStrike's management thinks and acts like Boeing's?

    • @nosuchthing8
      @nosuchthing8 2 months ago +1

      Really??

    • @ChrisM541
      @ChrisM541 2 months ago

      And they caused a massive c*ck-up a few years ago. Seems they are 'too big' to fail.

    • @sinaghaderi9184
      @sinaghaderi9184 2 months ago

      @Techmagus76 Because no one installs an anti-virus on Linux.

  • @CheddarKungPao
    @CheddarKungPao 2 months ago +97

    When talking about this incident it's worth remembering that hospitals were affected and people may have died because of this. So it's all well and good to say when everything goes down, go outside and touch grass. But also, we do need to think seriously about whether we're doing enough to ensure software safety. We take it way less seriously than, for example, car safety. When a new model of car comes out it has to go through all kinds of testing to ensure its safety. But we are doing nothing to ensure software safety; we are just 100% trusting the vendors. I've been a software engineer professionally for 25 years and have long thought that the current approach is madness, and incidents like this one only make me more sure we need to have standards that all critical system software meets in its development, deployment and implementation.

    • @Nadia1989
      @Nadia1989 2 months ago +10

      Someone left a message in a Spanish dev stream saying their aunt had a miscarriage and couldn't be operated on because all the hospital computers had BSOD'ed. She had an emergency procedure hours later.

    • @SuperWolfkin
      @SuperWolfkin 2 months ago +13

      100% true. It's definitely a big deal that this incident took down not just school computers or corporate businesses but hospitals that need them to keep people alive. People were missing their medications. For some people, like me, missing medication means you end up throwing up for a couple of nights; for other people the consequences can be much more dire.
      At the end of the day, as technology begins to run more and more of our lives, I do agree there's nothing you can do to prevent hospitals from being part of the affected class. These things will happen, and hospitals will be affected just like any other computerized business. The problem is that we don't need to have so many hospitals affected in a single incident. That is purely the result of a monoculture, which is the result of monopolistic practices, which is a result of the form of capitalism that we have in North America and its effects around the world.
      And that's just on a philosophical level, without even approaching all the specific problems that could have been prevented in this case.

    • @mohammednazir3249
      @mohammednazir3249 2 months ago

      bro is secretly working for the government

    • @jismeraiverhoeven
      @jismeraiverhoeven 2 months ago +10

      While I agree with your statement, digitalization also played a huge role in this. Nowadays everything needs to be "smart", even things that don't make sense, like refrigerators. If those hospitals had alternatives to the computers they used (for example, paper copies of documents alongside digital versions), this would have hurt them far less. We are too dependent on digital computers.

    • @tyrand
      @tyrand 2 months ago +6

      Anyone using this horseshit on hospital computers needs sacking

  • @Arthur-1337
    @Arthur-1337 2 months ago +164

    The frowny face is absolutely necessary

    • @user-yv6xw7ns3o
      @user-yv6xw7ns3o 2 months ago +3

      Yes I agree. Absolutely necessary, even if not strictly so :(

    • @ICanDoThatToo2
      @ICanDoThatToo2 2 months ago

      I dunno, I'm starting to like 😉👍

    • @phizc
      @phizc 2 months ago +2

      @ICanDoThatToo2 Any of these would work too:
      🤪 🤯 🥳 🥶 😱 💀 💩 🍐 🌋 🆘️ 🏳
      Or an animation:
      🤣
      😂 🔫
      😅 🔫
      🥺🔫
      🤯💥🔫
      🧠💀

    • @blucat4
      @blucat4 2 months ago

      If Mike Pound says it, it must be true. Therefore you are wrong! 😁

  • @kaseyboles30
    @kaseyboles30 2 months ago +27

    The fix is simple: do not push untested code onto live systems where it will run as part of a must-run-to-boot kernel-level driver. Run it on a test system first. And never trust a 'security company' that says you should do otherwise (except in rare cases, such as a very bad zero-day being exploited, where it's a gamble either way). If they allowed this for a run-of-the-mill non-emergency update, then they don't know cyber security and safety well enough to protect a home gaming system, let alone major systems. This goes past gross incompetence to the point where I wouldn't blame anyone for suspecting malice. Though I personally think it was "we don't screw up, we stop screw-ups" level hubris.

    • @ChrisM541
      @ChrisM541 2 months ago +6

      EXACTLY!
      Unfortunately, this braindead policy of offloading all QC/QA onto the end user is being practiced by an increasing majority of devs...all thanks to/empowered by The Internet. Software development is the most uncontrolled, unregulated industry in existence. Governments MUST act...before it really is too late!

    • @haqvor
      @haqvor 2 months ago +4

      I quote Grey's law: "Any sufficiently advanced incompetence is indistinguishable from malice."
      It doesn't really matter if Crowdstrike did it out of malice or just cut corners to cheap out on development costs. They sell a product that is obviously not robust enough to be used on mission-critical systems, and they have made the decision to risk their customers' business to make more money for themselves.
      In turn, Microsoft allows their OS to hard crash due to a faulty third-party driver. That cannot be tolerated on mission-critical systems, so a large part of the blame goes to them as well. The end users seem to be pretty naive as well; they have hopefully learnt the expensive lesson on how not to build infrastructure.

    • @BillAnt
      @BillAnt 2 months ago

      There's also a small chance that the files got corrupted during the transfer to a CDN which served the corrupted update to millions of computers. We shall see....

  • @wily_rites
    @wily_rites 2 months ago +23

    Software running in the kernel pretending to be a driver, when in reality it is a parser, what could go wrong?

  • @mfaizsyahmi
    @mfaizsyahmi 2 months ago +4

    Seeing two academicians discuss this issue is so refreshing. So many ideas thrown back and forth.

  • @WilliamLeeSims
    @WilliamLeeSims 2 months ago +105

    The CrowdStrike bug was what Y2K wished it could be.

    • @ZiggyGrok
      @ZiggyGrok 2 months ago +27

      Fortunately we fixed Y2K before it could cause this chaos. If we had done nothing, it would've been far far more devastating.

    • @davidmcgill1000
      @davidmcgill1000 2 months ago +4

      @ZiggyGrok Y2K only affected those that were too lazy to add 2 more characters to their dates. If your code was vulnerable, it was terrible code to begin with.

    • @nosuchthing8
      @nosuchthing8 2 months ago +8

      The world was not as interconnected then, either.

    • @AySz88
      @AySz88 2 months ago +10

      @davidmcgill1000 You realize that non-programmers use two digits for years too? A lot of it was a (lack of) standards issue, not just code

    • @davidioanhedges
      @davidioanhedges 2 months ago +14

      @davidmcgill1000 Too lazy... No, they were using software originally designed when memory was small and expensive, and saving two characters per entry won them pay rises.
      There were huge and expensive efforts put in to check and update to get around the issues many years later, and so almost nothing happened, but that doesn't mean there wasn't a problem.

  • @blenderpanzi
    @blenderpanzi 2 months ago +22

    Windows can in fact boot with the failing driver automatically disabled the next time, except for drivers that are marked as absolutely necessary for booting itself, and this driver is marked as such.

    • @irql2
      @irql2 2 months ago

      Nah, it wasn't marked as boot critical; common talking point though. Doesn't change anything though: unless you get to a desktop, Windows considers it a failed boot. Do that 3x and you end up in the recovery console.

    • @grokitall
      @grokitall 2 months ago +1

      @irql2 Yes it was, but the decision as to whether it can be downgraded should be Microsoft's.
      Just because they want it to prevent booting if it cannot start does not mean that Windows cannot start without it.

    • @irql2
      @irql2 2 months ago

      @grokitall Stop parroting talking points and go look at how the driver is configured in the registry. People are super confident about things and won't even verify when it's very easy to do.

    • @grokitall
      @grokitall 2 months ago +4

      @irql2 According to retired Microsoft engineer Dave Plummer, they had it marked as boot critical according to his sources.
      I have no reason to doubt his statement.
      Despite how unimpressed I am with various choices Microsoft has made, I have no reason to doubt the quality of their engineers. That is why I am sure they are capable of determining if it is actually boot critical when the driver is being signed.
      I am also sure that they are capable of writing code which will use that determination to downgrade the driver and disable it if it is too broken to boot, and to check if it is stuck in a boot loop.
      For any OS, as long as you can get to startup and use the net, you can fix the driver with an update without having to manually log in to all the locked-down machines.
      The fact that they have not bothered to implement such a measure when this has happened before is disappointing.

    • @irql2
      @irql2 2 months ago

      @grokitall Thanks for confirming you won't even go look and you'll just parrot whatever anyone says. David is wrong too, and he would admit it if he looked. We're human, it happens... He probably doesn't have a dump to go and check.
      And honestly it doesn't matter.
      What's more concerning is how confidently wrong people are, and that they have no interest in learning anything that wasn't hand-delivered to them by some source they consider trustworthy. This is a huge problem, and our political climate is evidence enough of this.
      If you had asked "How do I verify this?", since you obviously don't know or even care to, I would have shared that information with you so that you could be more informed on the topic... but nah, polly wants a cracker instead.
      For those that are interested in learning, csagent's Start value is set to 1, meaning it's just another driver; it's not special in regards to booting. If it were, you'd get a 7b on boot. This entire interaction is disappointing. What happened to the days when people went "Oh yea? Show me".
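
For anyone wanting to check a driver's start type themselves, the `Start` value lives under `HKLM\SYSTEM\CurrentControlSet\Services\<service name>` and takes the documented values 0 to 4. The sketch below decodes that value in Python; the helper names are made up for illustration, and the service name to pass in (the comment above refers to it as "csagent") is taken from the thread and not verified here.

```python
# Documented Windows service start types (the "Start" registry value).
START_TYPES = {
    0: "SERVICE_BOOT_START (loaded by the boot loader)",
    1: "SERVICE_SYSTEM_START (loaded during kernel initialization)",
    2: "SERVICE_AUTO_START",
    3: "SERVICE_DEMAND_START",
    4: "SERVICE_DISABLED",
}


def describe_start(value: int) -> str:
    """Translate a raw Start value into its symbolic name."""
    return START_TYPES.get(value, f"unknown start type {value}")


def read_start_value(service: str) -> int:
    """Read the Start value for a service or driver. Works on Windows only."""
    import winreg  # imported here so the module still loads on other systems

    path = rf"SYSTEM\CurrentControlSet\Services\{service}"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        value, _value_type = winreg.QueryValueEx(key, "Start")
    return value
```

On a Windows machine, `describe_start(read_start_value(name))` would print the start type for whatever service name is supplied.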

  • @3Ppaatt
    @3Ppaatt 2 months ago +4

    Working for a Bank we had drills where we simulated losing our systems for a few hours and had to do everything (and I mean every conceivable thing we might be asked to do in a normal day) without any computers. Including driving physical records to central processing locations.

  • @stco2426
    @stco2426 2 months ago +3

    Enjoyed this. Glad I watched the recent 'Dave's Garage' video where he explained the problem. Here I got a good understanding of the wider consequence management. Well worth watching both, I think.

  • @bbellefson
    @bbellefson 2 months ago +21

    Typical "Management Bug?" A CrowdStrike engineer or two urges more testing before release. Some executive then pounds the conference table and shouts, "No more f**king EXCUSES! I want that update NOW gawdammit!"

    • @wcmatthysen
      @wcmatthysen 2 months ago

      Yeah, and I want it rolled out to everyone, NOW!!! Phased roll-outs are for pussies!

    • @aixtom979
      @aixtom979 2 months ago +13

      Especially seeing that the CEO of CrowdStrike *now* was the CTO at McAfee back *then*, when McAfee brought down XP machines by deleting Windows core files in 2010. The common factor is the manager.

  • @tocsa120ls
    @tocsa120ls 2 months ago +19

    Crowdstrike did more harm to its clients, and to the Western world, than it could ever have possibly prevented for the entire duration of its existence as a company. How they ONLY lost 20% of their share value is mind-boggling.

    • @AlBoulley
      @AlBoulley 2 місяці тому +1

      Love the point you've made.

    • @nicostigliano6393
      @nicostigliano6393 2 місяці тому

      You said the most obvious thing

    • @Valgween
      @Valgween 2 місяці тому

      robot movie pfp

    • @tocsa120ls
      @tocsa120ls 2 місяці тому

      @@nicostigliano6393 nobody's saying it out loud tho

  • @m4rt_
    @m4rt_ 2 місяці тому +34

    The new update to CrowdStrike falcon included some corrupted channel files (they contained just zeroes instead of the intended data), and because the core driver that loaded the channel files didn't do enough input validation, it continued on using the messed up channel files, and this revealed a bug that likely had been there for a while. The bug caused the driver to attempt to dereference a null pointer, which caused the BSOD.

    • @David-bi6lf
      @David-bi6lf 2 місяці тому +7

      Yeah and probably crowd strike have not fixed the bug because it would require a new release of the driver and that would have to go again through the Microsoft WHQL signing process which the use of these channel files seeks to avoid.

    • @MatthijsvanDuin
      @MatthijsvanDuin 2 місяці тому +6

      Note that this corruption claim is afaik coming from one random twitter user and has been denied by Crowdstrike who says there was a logic error in the updated rules file that caused the problem. It seems extremely unlikely to me that crowdstrike does no validation on these files given that they're being updated frequently on a huge number of machines and are therefore liable to get corrupted (due to power failures and such) on a regular basis.

    • @MatthijsvanDuin
      @MatthijsvanDuin 2 місяці тому +3

      I found a twitter post from someone that the problematic channel file was _not_ zero-filled on any of the systems he had to manually fix that day.

  • @daanwilmer
    @daanwilmer 2 місяці тому +4

    Thanks for being the first source I found that actually explains what crowdstrike is and what went wrong here, and nice to hear some nuance and perspective as well.

    • @IceMetalPunk
      @IceMetalPunk 2 місяці тому

      If you want a little more detail: apparently, the definition file they pushed out left some index entries uninitialized, so some memory addresses that were meant to hold pointers ended up with junk data that, when dereferenced, pointed to invalid memory locations.

    • @Tahgtahv
      @Tahgtahv 2 місяці тому

      @@IceMetalPunk Thanks, this is the best explanation I've heard so far. IMNSHO, the software should have been written in such a way such that the definitions don't directly map to memory. Then when you create data structures in memory, they always point to something valid. But nobody asked me.
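A minimal sketch of the idea in this reply: check every index in a definition at load time, so nothing invalid can be followed later. This is not CrowdStrike's format or code; `Rule`, `load_rules`, `evaluate` and the dictionary layout are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


class DefinitionError(ValueError):
    """Raised when a definition refers to data that does not exist."""


@dataclass(frozen=True)
class Rule:
    name: str
    field_indexes: tuple[int, ...]  # positions into the input field list


def load_rules(raw_rules: list[dict], field_count: int) -> list[Rule]:
    """Build rules, rejecting any index outside 0..field_count-1 up front."""
    rules = []
    for raw in raw_rules:
        indexes = tuple(raw["fields"])
        for i in indexes:
            if not isinstance(i, int) or not 0 <= i < field_count:
                raise DefinitionError(
                    f"rule {raw['name']!r}: field index {i!r} is out of range"
                )
        rules.append(Rule(raw["name"], indexes))
    return rules


def evaluate(rule: Rule, fields: list[str]) -> list[str]:
    """Safe because load_rules already validated every index."""
    return [fields[i] for i in rule.field_indexes]
```

The design choice is that a bad definition is refused as a whole when it is loaded, instead of failing halfway through evaluation.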

    • @alazarbisrat1978
      @alazarbisrat1978 2 місяці тому

      @@Tahgtahv I think what you're talking about is Rust. But apparently there were numerous cracks in the program even before then, caused by the same QA issues that caused this crash; the crash was just everything finally falling apart.

  • @zhandanning8503
    @zhandanning8503 2 місяці тому +19

    when the computer goes down, that is a sign to photosynthesize, nice

    • @Abdega
      @Abdega 2 місяці тому +1

      It’s thunderstorming where I’m at so I’d have to wait

  • @DragoniteSpam
    @DragoniteSpam 2 місяці тому +3

    A number of years ago Tom Scott did a fun talk called "Single Point of Failure." I think about that sometimes.

  • @PE4Doers
    @PE4Doers 2 місяці тому +5

    I am a recently retired Cyber Security (though being heavily involved in Computer Security for over 30 years, and a software developer for 20 years prior to that, I prefer the traditional names of Computer or Systems Security) Compliance Officer. Although the systems I monitored were involved with critical infrastructure and not open to regular users of business systems, they were still peripherally dependent on many such systems. Since I was a stickler for avoiding the Cloud and third-party security products, my former employer has taken steps to ensure I never know if they were severely affected by the CrowdStruck (accepting the pun) event.
    The real issue is something you two gentlemen mentioned but did not go deeply into. What if there were malicious embeds (i.e. spies) working for that organization, or for Windows System development? We would not be facing a bad day or so; it could have been lights-out until every critical system was completely rebuilt and data backups restored. I can understand why discussion of that scenario would be avoided, but should it be avoided? If I were a critically ill patient in the hospital I would want to know so I could prepare for the aftermath.

  • @sunefred
    @sunefred 2 місяці тому +31

    Falcon is using definition files which are NOT part of the WHQL process, which Falcon obviously is! I don't know how this works on Linux or Mac, but maybe it should not be allowed for Windows driver makers to deliver _anything_ to the kernel that does not go through the WHQL certification.

    • @roippi3985
      @roippi3985 2 місяці тому +15

      This is the part that’s wild for me. WHQL is supposed to be this Highest Level Of Scrutiny thing, and somehow WHQL reviewed this workaround to inject arbitrary runtime behavior without requiring WHQL recertification and said F It Ship It.

    • @IceMetalPunk
      @IceMetalPunk 2 місяці тому +11

      My only suspicion is that someone, somewhere thought requiring WHQL for definition files could delay definitions too long when new vulnerabilities are discovered and need to be monitored. Like, "if we do WHQL on every definition, by the time it gets released, so many people could be affected by this exploit!"

    • @sunefred
      @sunefred 2 місяці тому

      @@IceMetalPunk I think that's the reason, and I can't say I have any insights in the WHQL process to tell you how long the process normally is. Would be interested to know though, do you know? I would imagine most of it is automated.

    • @playground2137
      @playground2137 2 місяці тому

      Yeah that is an important part that they didn’t mention, I think.

    • @bierrollerful
      @bierrollerful 2 місяці тому +2

      Maybe definition files do not contain any code and are thus exempt from WHQL process? It could be that the definition file was simply corrupted and unreadable and the kernel driver crashed when trying to read it.

  • @lenwe33
    @lenwe33 2 місяці тому +65

    13.37% complete... ISWYDT 🙃

    • @blackholesun4942
      @blackholesun4942 2 місяці тому +1

      What does that mean

    • @alazarbisrat1978
      @alazarbisrat1978 2 місяці тому

      @@blackholesun4942 I see what you did there

    • @playground2137
      @playground2137 2 місяці тому +6

      @@blackholesun4942 I am not sure which part you didn’t get. The custom blue screen of death (BSOD) is something they fabricated. 1337 is often used in gamer culture to mean LEET (or elite rather), usually indicating something like highly skilled (1337 player for instance). ISWYDT: I see what you did there. So it is used a bit ironically here, because it was of course not a skilled update. Hope that helps.

    • @jeremytrees7266
      @jeremytrees7266 2 місяці тому

      ​@@blackholesun4942 🏴‍☠️

    • @JonBrase
      @JonBrase 2 місяці тому +2

      ​@@playground2137TBF, 1337 is specifically turn-of-the-millennium gamer culture (late GenX, elder millennial). I'm not sure I've even seen younger millennials using it, let alone Gen Z.

  • @mythofechelon
    @mythofechelon 2 місяці тому +2

    As someone who led the deployment of EDR and EPP to 18,000+ endpoints last year, agents are absolutely installed on Windows servers, yes. Updates like this that don’t go through change control are a calculated risk for more up-to-date protections. Problem is that the risk mitigation is that the vendor does testing and releases competently..

  • @Moose_33
    @Moose_33 2 місяці тому +10

    Yesssssss, twas waiting for this. You beautiful channel you. The dynamic duo returns

  • @pnwlady
    @pnwlady 2 місяці тому +3

    Are there no standards for deploying updates that run in the kernel?

  • @eructationlyrique
    @eructationlyrique 2 місяці тому +26

    Linux has a feature that allows the sandboxing of channel updates using eBPF, although Crowdstrike doesn't use it yet. In theory, that could have prevented the BSODs had Windows had a similar feature.
    Also, I don't necessarily agree that Windows is blameless here. While Crowdstrike is definitely at fault, Windows did certify their driver, and that validation somehow didn't include testing for corrupted or invalid channel files. There's no reason the driver should blindly trust those files without validation.

    • @reybontje2375
      @reybontje2375 2 місяці тому +1

      Yeah, Microsoft also allows eBPF, but it's in an alpha, very early state. Also, the people opining that "this isn't a Windows' issue" are right to a degree, but when you realize that there are design deficiencies around how Microsoft handles drivers, it can only be said, "they're right to a degree," especially when you can specify kernel command line options to disable drivers that are acting bad, or have a fallback initramfs that doesn't load the CrowdStrike driver, which Windows doesn't really allow.
      I believe that CrowdStrike is also on the eBPF design foundation alongside some other industry giants like Apple, Google, Microsoft, etc. I think CrowdStrike also uses eBPF for Linux in their newer agent after the debacle back in March/April with Debian.

    • @JonBrase
      @JonBrase 2 місяці тому +2

      My understanding is that CrowdStrike does use some type of interpreted code in their definition files, which would imply that there was some bug in the interpreter (or code downstream of it) that allowed a null-pointer dereference through (or made a null pointer dereference on its own).

    • @TheFPSPower
      @TheFPSPower 2 місяці тому +5

      @@reybontje2375 Windows does have self-recovery functions for bad acting drivers, but they do not work on boot drivers and Crowdstrike's driver is a boot driver so the system is not allowed to boot if it crashes by design unless you use safe mode.

    • @JonBrase
      @JonBrase 2 місяці тому +1

      @@forbidden-cyrillic-handle Lol. Your username.

    • @sinaghaderi9184
      @sinaghaderi9184 2 місяці тому

      But who would install this on Linux? I've never seen a Linux server with anti-virus or EDR. It sounds dumb.

  • @Vospi
    @Vospi 2 місяці тому

    Very enjoyable format of two people discussing. Sounds less monotonous, too. Great job.

  • @lborate3543
    @lborate3543 2 місяці тому +27

    My local pub went down.. no fish and chips for me..

    • @jklax
      @jklax 2 місяці тому +3

      No cash in hand?

    • @Abdega
      @Abdega 2 місяці тому +20

      “This was a phishing attack and a chip level attack?”
      “No, no… the cash register system is down thanks to broken Windows update”
      “They broke your windows and stole your cash?!”
      “No, the money is still here!”
      “Okay, I’ll just pay you in cash then”
      “I can’t do that! The register is locked unless the computer tells it to open! Besides, each purchase is required to update the inventory as well”
      “I don’t see what the Tories have to do with anything in this case”
      “… I don’t have time for your Monty Python shenanigans”
      “I’d think this stuff would be programmed in C and not Python”
      “GET OUT!”

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 2 місяці тому +4

      @@Abdega 😂

    • @KarimY-119
      @KarimY-119 2 місяці тому

      in my local pub i can order by sending a SMS to their fax. cash-only place

    • @dhillaz
      @dhillaz 2 місяці тому +5

      ​@@Abdega When the best comment is buried in a thread

  • @lis6502
    @lis6502 2 місяці тому +7

    Crowdstruck? We gave this overtime event a codename of 'clownstrike'

  • @rooboy69
    @rooboy69 2 місяці тому +3

    Crowdstrike didn't do any validation control (or not enough) in their driver to check the .sys file before running it, to confirm it wasn't just full of null values etc.

  • @vincentfiestada
    @vincentfiestada 2 місяці тому +1

    Finally, FINALLY, some informed and cogent commentary on this issue that isn't just "Tech influencer says Windows is a mess and this would never happen in Linux or macOS"

  • @phasm42
    @phasm42 2 місяці тому +17

    Crowdstrike sounds like a nickname for Mustangs 😅

  • @TS6815
    @TS6815 2 місяці тому +1

    These IT disasters always have the upside of flushing Dr Pound and Dr Bagley out of whatever else they’re up to, to give us these great explanations!

  • @sunefred
    @sunefred 2 місяці тому +3

    Its going to be very interesting to see what Crowdstrike learns from this. One thing they didn't seem to use is a canary or blue/green deployment scheme. Hoping for some enlightening blog-posts on the topic eventually.

    • @vincei4252
      @vincei4252 2 місяці тому +4

      nothing. The guy in charge oversaw something exactly similar when he was at McAfee

    • @spartanj2957
      @spartanj2957 2 місяці тому +1

      Microsoft, CS, BlackRock, the WEF and more are tied together. Was no accident.

  • @HubrisInc
    @HubrisInc 2 місяці тому +2

    Never fails, something big happens in the field of cybersec, we can guarantee that we'll get a Computerphile video starring Dr Bagley &/or Dr Pound :)

  • @tubehellcat
    @tubehellcat 2 місяці тому +5

    😂 the example bluescreen at around 0:36 , 13.37% 😂 love it 😁

  • @Scum42
    @Scum42 2 місяці тому +1

    Every time there's some outage, or bug, or virus big enough to get in the news, I get excited about the inevitable computerphile video explaining it.

  • @rubenreyes2000
    @rubenreyes2000 2 місяці тому +17

    You didn't mention that in order to install kernel drivers, the code needs to be submitted to Microsoft's to be tested, approved and digitally signed. As you mentioned, the bug was not present in the main kernel, but in the "channel files" that are updates without following that same process. It is not clear to me if those "channel files" are code or just configuration, but maybe Microsoft is partially at fault here for allowing these channel files in the first place, or for not sufficiently checking the kernel driver had the necessary logic to gracefully crash without taking down the entire system.

    • @throwaway6478
      @throwaway6478 2 місяці тому +6

      Clownstrike apparently uses a P-code interpreter to sneak unsigned code into their driver. You'd be a millionaire by Saturday if you invented a heuristic that can reliably detect a P-code interpreter and/or the P-code itself (which of course can be in any format the writer desires) running in kernel mode.

    • @nosuchthing8
      @nosuchthing8 2 місяці тому +1

      As I understand it, if something fails in ring zero or kernel mode, the entire OS goes down.

    • @TheFPSPower
      @TheFPSPower 2 місяці тому +1

      @@throwaway6478 In this case it's not that hard, it's a new file getting loaded from system32, the kernel knows every file you open so you could absolutely block unsigned files in system folders from loading, but as they said it would interfere with competing products so they can't do that, they signed an agreement to allow kernel drivers to work.

    • @ChrisM541
      @ChrisM541 2 місяці тому

      There are exceptions to requiring to get your code MS Certified - code that needs to respond to Day 0 attacks don't need certified, for obvious speed reasons. Fortunately/unfortunately.

    • @irql2
      @irql2 2 місяці тому

      the "bug" was in csagent.sys, thats the driver that was referencing an invalid memory address. Important to note that.

  • @jjdawg9918
    @jjdawg9918 2 місяці тому +2

    I can't find one YouTuber talking about proper sysadmin practices at the enterprise level that would have caught this before getting rolled out. I have never worked at a company where PCs weren't locked down from software installs and every update (even ones from MS) wasn't tested by local QA before being rolled out to the enterprise PCs. Unbelievable that airlines are being run this way. Unless CrowdStrike installed some rootkit that bypasses all these processes, I'm shocked at the state of sloppiness in IT.

    • @egria
      @egria 2 місяці тому +1

      I am trying to voice the same thing, but not even tech guys understand. CS Falcon updates bypass everything, but I still don't understand how admins allow live updates on supposedly closed systems like airports, banks, POS etc. And the loophole seems to be the same Windows update server being used for both live and testing, or just a plain network connection to the outside world to allow CS Falcon updates so that it can prevent zero-day security issues. It is just absurd!

  • @TechSY730
    @TechSY730 2 місяці тому +3

    UPDATE: Thanks tma2001 letting me know the zero file was not the cause. And in fact there is validation in place. The error was somewhere else.
    So the below is inaccurate
    Seems it was a lack of input validation.
    Apparently the root cause of the crash was that one of the files in the definition update was just a file filled with zeros for whatever reason. Leading to a null pointer dereference (which always crashes, by design)
    But that makes me go like: Input validation anyone?! Does CrowdStrike Falcon fail to at least make sure the definition file makes sense as a definition file before blindly following its directions?

    • @necuz
      @necuz 2 місяці тому +1

      Everyone who is even remotely competent knows to put headers on files, network packets and the like. A magic byte or two and some metadata goes a long way when validating.
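A minimal sketch of the header idea described here: a magic value plus a little metadata, checked before the payload is trusted. The layout, the `CHNL` magic and the function names are all invented for illustration and are not CrowdStrike's actual channel file format.

```python
import struct
import zlib

MAGIC = b"CHNL"                   # hypothetical magic bytes
HEADER = struct.Struct("<4sHII")  # magic, version, payload length, CRC32


def pack(payload: bytes, version: int = 1) -> bytes:
    """Prefix a payload with a header describing it."""
    header = HEADER.pack(MAGIC, version, len(payload), zlib.crc32(payload))
    return header + payload


def parse(blob: bytes) -> tuple[int, bytes]:
    """Return (version, payload), or raise ValueError if anything is off."""
    if len(blob) < HEADER.size:
        raise ValueError("file is shorter than the header")
    magic, version, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("payload length does not match the header")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return version, payload
```

With a check like this, a file consisting only of zero bytes is rejected at the magic test. It would not catch a well-formed file whose contents are logically wrong.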

    • @tma2001
      @tma2001 2 місяці тому +1

      no that was a red herring - for some people it wasn't all zeros and CS confirmed in a technical blog post that null bytes in the channel file were not the cause. There are many possible reasons why it was a file of zeros for some folks - pre-allocated ahead of time before updated or wiped clean as a post processing step for security.
      Valid channel files have a magic signature at the beginning and they actually contain code in the form of byte code for a VM interpreter in the actual kernel driver. The logic error was in the byte code. Of course this means the actual driver can have gone through WHQL but is actually a dynamic entity.

    • @TechSY730
      @TechSY730 2 місяці тому

      @@tma2001 Ooh, thanks for the correction. I hadn't heard any technical detail updates since the original 0'ed file finding

    • @tma2001
      @tma2001 2 місяці тому +1

      @@TechSY730 you were not alone - I too was confused by what little folks had to go on initially. None of it made any sense!
      There is a full explanation by the Cloud Architect B Shyam Sundar on Medium that breaks it down.

  • @Shoey
    @Shoey 2 місяці тому +1

    that os/house/hotel analogy was really good!

  • @michipeka9973
    @michipeka9973 2 місяці тому +7

    "Dave's Garage", a former Microsoft software engineer, just did a video on what he thinks happened here. Very comprehensive and very clear.
    He also speaks extensively that this was possible because Crowdstrike works in kernel mode.

    • @murzilkastepanowich5818
      @murzilkastepanowich5818 2 місяці тому +2

      why would anyone want to watch that scammer?

    • @cidercreekranch
      @cidercreekranch 2 місяці тому +3

      @@murzilkastepanowich5818 WTF?

    • @michipeka9973
      @michipeka9973 2 місяці тому +4

      @@murzilkastepanowich5818 Sorry, I am not aware about any of that or don't even know what you are talking about. Just found about it yesterday, the video in question seems fine and basically makes some of the same points as this one, but is a bit more detailed.

    • @murzilkastepanowich5818
      @murzilkastepanowich5818 2 місяці тому +2

      @@cidercreekranch your wholesome 100 big le epic reddit content creator aint that wholesome 100 eh?

    • @Razzy_D9111
      @Razzy_D9111 2 місяці тому

      @@murzilkastepanowich5818 take your meds

  • @shaneedwards3144
    @shaneedwards3144 2 місяці тому

    How long were the systems down? 1 or 2 days? Were they down for 1 day and it caused 4 or 5 days delays? Or were they still trying to recover the network 4 or five days later?

  • @akashaabeysundara8454
    @akashaabeysundara8454 2 місяці тому +12

    1:13 if that hotel is like linux then the guests would carry their own air conditioners 😂

    • @SanderEvers
      @SanderEvers 2 місяці тому +4

      and smart guests will build their own hotel next to the original, with only a small difference.

    • @davidioanhedges
      @davidioanhedges 2 місяці тому +4

      Linux can run CrowdStrike, and had a worryingly similar issue a few weeks ago, since it was in the kernel there was nothing Linux could do either... But only on a couple of distros and only if you had installed Falcon CS ...

    • @dhillaz
      @dhillaz 2 місяці тому +2

      Room key is not in the sudoers file. This incident will be reported.

    • @timsmith2525
      @timsmith2525 2 місяці тому +1

      And to get your room cleaned, the instructions would be, "Run make, look for any errors, and correct them."

  • @paulyardley383
    @paulyardley383 2 місяці тому +1

    What happened to the testing? I could understand if it was an edge case, but this seems to have impacted all machines. How was it missed?

  • @SyphistPrime
    @SyphistPrime 2 місяці тому +10

    It also doesn't help that Microsoft took away the key combo to tell the OS to boot into safe mode on startup. If that was a thing I'm sure this would've been at least a bit smoother.

    • @throwaway6478
      @throwaway6478 2 місяці тому +3

      It amazes me how many of you don't know about bootmenupolicy legacy.

    • @SyphistPrime
      @SyphistPrime 2 місяці тому +4

      @@throwaway6478 because I don't specialize in the black box that is Windows. Also why should I have to dig through layers of archaic settings to change this when it's a sensible default?

    • @throwaway6478
      @throwaway6478 2 місяці тому +4

      @@SyphistPrimeYou use an operating system where you have to edit dotfiles to configure your mouse. 🤣

    • @irql2
      @irql2 2 місяці тому +4

      @@SyphistPrime oh stop it, you're not reading the source code for linux to figure out how something works, no one does that... you "can" do it, but thats not a thing an average person does. You're reading documentation just like people do with windows. Stop it.

    • @SyphistPrime
      @SyphistPrime 2 місяці тому +3

      @@irql2 The documentation on Linux is leagues better than Windows. There's so many undocumented and hidden features in Windows where as with Linux it's all out in the open. Also I have read bits of source code when AUR packages failed to compile. I've very much used that to help fix issues with PKGBUILDs and compiler errors. It's not usually necessary to read source code because all the documentation is out in the open, unlike Windows.

  • @fatonaoladimeji9697
    @fatonaoladimeji9697 2 місяці тому +1

    I would have listened to these guys talk about it for an hour

  • @miravlix
    @miravlix 2 місяці тому +3

    Not seeing much understanding of administration. A system I was admining involves testing updates before they get installed on the live environment and with this many computers, you don't install it on all of them at the same second, you install it in segments and don't continue until you have successfully restarted the first batch of computers.
    This is all about GREED admining: they didn't want to pay for doing it properly. My way of admining was developed in the 19xx; we have INTENTIONALLY dropped security to save money.
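The segmented install described above can be sketched roughly like this. It is only an illustration under assumed names: `deploy` and `healthy` are hypothetical callbacks standing in for whatever actually pushes an update and confirms a machine restarted cleanly, and the stage sizes are arbitrary.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stage_sizes: Sequence[int] = (1, 10, 100),
) -> tuple[list[str], bool]:
    """Update hosts in growing stages; stop as soon as a stage looks unhealthy.

    Returns the hosts that received the update and whether the rollout finished.
    """
    updated: list[str] = []
    sizes = list(stage_sizes)
    start = 0
    while start < len(hosts):
        # Once the listed stage sizes run out, the last stage takes the rest.
        size = sizes.pop(0) if sizes else len(hosts) - start
        stage = hosts[start:start + size]
        for host in stage:
            deploy(host)
            updated.append(host)
        # Do not continue until every machine in this batch came back fine.
        if not all(healthy(host) for host in stage):
            return updated, False
        start += len(stage)
    return updated, True
```

With a faulty update and a first stage of one machine, only that one machine is affected before the rollout halts.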

    • @egria
      @egria 2 місяці тому

      Yep, admin practices is the key and not a particular bug. Live updates in closed system is big NO no matter what sweet voice of software vendor tells you. And the most common phrase nowadays is: "it is for you security" - be it the people or the machines.

    • @egria
      @egria 2 місяці тому

      Some companies had staging environments, but they use the same Windows update server for both live and staging/testing, so this update just bypassed software-enforced policies and went live. Those are my speculations, gathered from admins sharing their cases. No in-depth public case analysis yet. Hush practice for reputation.

  • @kyrpichko
    @kyrpichko 2 місяці тому +1

    I read somewhere that this config file containing the update C-00000...sys contained only 0 (zero)s and was an empty file which caused the BSODs.
    Is this true?

    • @0LoneTech
      @0LoneTech 2 місяці тому

      Parts of the rumour mill say no; the file shouldn't have loaded if it was only 0, and mostly 0 is normal. Crowdstrike have indicated they pushed incorrect code, not merely corrupt, but the fact remains their architecture creates the opportunity for this level of fault.

  • @stefanreindel9888
    @stefanreindel9888 2 місяці тому +11

    Wondering how it got past QA?
    Seems like installing the update on a docker instance or vm would have found this bug.

    • @ytechnology
      @ytechnology 2 місяці тому +6

      Also, how was rollout conducted? Normally it would be tiered / staggered to minimize damage from faulty code. I haven't found any confirmation, but this looked like a "big bang" release.

    • @Tahgtahv
      @Tahgtahv 2 місяці тому +4

      @@ytechnology It sounded like from the video, what they pushed out was definition files, and not code per se? Normally I would not expect that kind of thing to cause a kernel panic, so maybe they didn't either. Hopefully, this incident will make them take a hard look at how they do/deploy things in the future, no matter what it is.

    • @MrThebigcheese75
      @MrThebigcheese75 2 місяці тому

      Friday update before the holidays strikes. Just like Friday built cars. Just push into production and go down the pub, will deal with problems when we get back.

    • @muhdiversity7409
      @muhdiversity7409 2 місяці тому

      QA is a cost center. Everyone is getting rid of that. Why not have the devs responsible for QA, oh and deploying the stuff to the customers and datacenters. The above is not a joke, I've lived it for 5 years now.

    • @ChrisM541
      @ChrisM541 2 місяці тому

      "Wondering how it got past QA?" - there was none. This industry is unregulated. The mentality is "push now, patch later". Maybe governments will finally wake up to the certainty of more timebombs.

  • @HopliteSecurity
    @HopliteSecurity 2 місяці тому +1

    Computer Phile is amazing!
    I love your content and calm but casual demeanor. Your explanations and ability to break things down is superb!
    Keep it up 🙏🙏🙏🙂❤️

  • @paranic7
    @paranic7 2 місяці тому +4

    There is a bottle of water under the desk !

  • @snack711
    @snack711 Місяць тому

    are these updates not tested beforehand? if so many machines were affected, should this not have been easily detected beforehand?

  • @---ox1lg
    @---ox1lg 2 місяці тому +34

    "There's no problem with Microsoft. There's no problem with Windows."

    • @shiroyasha_007
      @shiroyasha_007 2 місяці тому

      Perhaps 😢

    • @ChuckleDuck
      @ChuckleDuck 2 місяці тому +2

      lol, lmao even.

    • @yurisebastiao1872
      @yurisebastiao1872 2 місяці тому +8

      It's actually right .... only those windows machines with Crowd strike software were affected by such zero day attack (self attack actually, more like a buggy one:😂)

    • @yurisebastiao1872
      @yurisebastiao1872 2 місяці тому +1

      They've created their own zero day attack by not testing pieces of codes in their software update release. 😂

    • @titaniummechanism3214
      @titaniummechanism3214 2 місяці тому +2

      nothing wrong...
      ...other than the usual stuff

  • @HelloKittyFanMan
    @HelloKittyFanMan 2 місяці тому

    OK, so when the Facebook thing happened while nobody was inside the HQ building or whatever, how did they finally fix that? Break down a door frame or a window, or...?

  • @lambda653
    @lambda653 2 місяці тому +9

    8:42 It can happen and indeed DOES happen on mac and particularly linux machines but the difference is those operating systems have safety mechanisms in place so that mass IT outages like the kind that just occurred can't fail to the point of individually booting every single device into safe mode and deleting a driver file. As you said, there was a kernel panic error on clownstrike's linux distributions, yet it didn't crash the world's infrastructure because the error was handled correctly. So microsoft should be at fault in some part for not providing these error handling systems.

    • @Formalec
      @Formalec 2 місяці тому +4

      This could be exactly as bad for linux machine if the driver is at ring 0.

    • @ipadista
      @ipadista 2 місяці тому

      @@Formalec the x86 family supports four rings, but for reasons Linux didn't continue the tradition used in VMS and some other contemporary mini computer operating systems, where kernel is ring 0, drivers are ring 1 and shared libraries are in ring 2. Choosing to do the same as NT did, skipping rings 1 & 2 only leaving kernel and user processes. Since essentially nothing uses more than ring 0 & 3 nowadays most new CPU designs only implement 2 rings

    • @JonBrase
      @JonBrase 2 місяці тому +1

      Linux allows you to specify a kernel command line from the bootloader, and you can blacklist individual drivers in the kernel command line, so recovery would be simpler.

    • @ipadista
      @ipadista 2 місяці тому +2

      @@JonBrase Same as with BSoDs, you would still need some techie typing in the fix at the Console. On cloud servers, it could be automated, same as with BSoD fixes, but I doubt it could be done on standalone machines

    • @genehenson8851
      @genehenson8851 2 місяці тому +2

      Mac has not allowed kernel level access since Big Sur.

  • @taniasofiatrujillorivera4852
    @taniasofiatrujillorivera4852 2 місяці тому

    So, if you have stocks on Crowdstrike should you sell it or just hold it

  • @minigunnboy21
    @minigunnboy21 2 місяці тому +42

    This is like 9/11 for computerphile

    • @Elesario
      @Elesario 2 місяці тому +2

      Not sure how you're making that comparison. The issue wasn't even malicious. I'd hesitate to compare an act of extreme violence causing the loss of so many lives, so much pain and misery, to a technical mistake that at worst is very expensive and financially damaging to many, but is mostly just at the level of a strong inconvenience.

    • @alazarbisrat1978
      @alazarbisrat1978 2 місяці тому +2

      @@Elesario the hospitals tho

  • @steevf
    @steevf 2 місяці тому +2

    It's ironic that a bit of software intended to prevent a system from getting taken out ends up taking out the system.

  • @steveftoth
    @steveftoth 2 місяці тому +6

    "Sorry Elon"? Never apologize to that man.

  • @paultasker7788
    @paultasker7788 2 місяці тому

    Finally, a really good explanation of crowdstrike and what it does and what went wrong.

  • @spookycode
    @spookycode 2 місяці тому +3

    Honestly I would have called it crowdstroke :p

  • @aungthuhein007
    @aungthuhein007 2 місяці тому

    Should Crowdstrike have much more rigorous testing before they deploy anything? Shouldn't there be a way to detect this on a VM or something as a test?

  • @TimothyWhiteheadzm
    @TimothyWhiteheadzm 2 місяці тому +6

    "They may have implemented something badly, we don't know". Yes, we do know. It happened, therefore they implemented something badly. This sort of thing is why we have canary deployments, and apparently they have the infrastructure for that, and allow customers to have settings for which computers get updates first in order to validate them, but they also have some updates that simply ignore those settings, and this was one of them. Yes, they 'implemented something badly'.

    • @alazarbisrat1978
      @alazarbisrat1978 2 months ago +2

      it was definition files not the drivers themselves that broke so it's held under less scrutiny

    • @TimothyWhiteheadzm
      @TimothyWhiteheadzm 2 months ago +2

      @@alazarbisrat1978 'Held under less scrutiny' by whom? The reality is that it crashed computers, and this isn't the first time similar updates by Crowdstrike have caused crashes (including on Linux). The fact that they know this is a possibility but failed to implement proper testing before pushing out to everyone means they 'implemented something badly'.

    • @alazarbisrat1978
      @alazarbisrat1978 2 months ago

      @@TimothyWhiteheadzm they didn't know that would happen, sorta how this ever got out in the first place. but companies always neglect QA, it's just how it is. and also definition files themselves couldn't do any of this without a huge screw-up so they're not as important to defend, but had they tested it there would be no problem. some programmers just prefer to test after failure tho, just a complete miss

    • @0LoneTech
      @0LoneTech 2 months ago

      @@alazarbisrat1978 What makes this remarkable is that the entire purpose of this product and company is to address that QA neglect. They've demonstrated they're among the worst at the one thing they're claiming to do better.

    • @alazarbisrat1978
      @alazarbisrat1978 2 months ago

      @@0LoneTech not really, most companies do that, just that this one was widespread and broke something fundamental. they just got unlucky with their neglect and this slip-up got all the way and broke everything. legend has it that there have been many other issues in their code over time that went totally unnoticed and only now caused catastrophic failure
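
Editor's sketch of the canary deployment idea raised at the top of this thread: push an update to a small slice of machines, check their health, and only then widen the rollout. This is a minimal illustration, not CrowdStrike's actual mechanism; the names `rollout_bucket`, `hosts_for_stage`, `staged_rollout`, and the `deploy` and `healthy` callbacks are all hypothetical.

```python
import hashlib


def rollout_bucket(host_id: str, buckets: int = 100) -> int:
    """Map a host to a stable bucket in [0, buckets) using a hash of its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def hosts_for_stage(hosts, percent):
    """Return the hosts whose bucket falls below the given rollout percentage."""
    return [h for h in hosts if rollout_bucket(h) < percent]


def staged_rollout(hosts, stages, deploy, healthy):
    """Deploy stage by stage; stop as soon as any host in a stage is unhealthy.

    `deploy` and `healthy` are caller-supplied callbacks (assumptions here).
    Returns (completed, hosts_that_received_the_update).
    """
    done = set()
    for percent in stages:
        batch = [h for h in hosts_for_stage(hosts, percent) if h not in done]
        for host in batch:
            deploy(host)
            done.add(host)
        if not all(healthy(host) for host in batch):
            return False, sorted(done)  # halt: the damage stays in this slice
    return True, sorted(done)
```

With stages such as `(1, 10, 50, 100)`, a faulty update that crashes every machine it touches would be stopped after the first slice instead of reaching the whole fleet.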

  • @PowerShellWizard
    @PowerShellWizard 2 months ago

    As an ex-MS employee who worked on Windows, I appreciate what was said at 7:42 :)

  • @NoahSpurrier
    @NoahSpurrier 2 months ago +3

    The cure was worse than the disease.

  • @m4rt_
    @m4rt_ 2 months ago +1

    "Anything that can go wrong will go wrong.."
    - Murphy's Law
    Another one I like is the variation of Murphy's law from Interstellar:
    "Anything that can happen will happen."

    • @ChrisM541
      @ChrisM541 2 months ago

      Murphy also says...
      "Remove QC/QA and you're f*d !!"

  • @johnhudson9167
    @johnhudson9167 2 months ago +4

    Loving how social media is making comp sci lecturers get trendy haircuts and dress properly 😂

    • @AlanCanon2222
      @AlanCanon2222 2 months ago

      Never, I say! NEVER! *puts on sandals over socks*

  • @Tomcat-rj5tp
    @Tomcat-rj5tp 2 months ago +1

    My school's coding club was faster to respond than our IT helpdesk, and they were more helpful too. They posted a document with detailed step-by-step instructions, while IT just said "come see us." Thankfully I got rid of Falcon at the end of spring semester, as we're not required to have it over summer break.

  • @rodolphenemr9064
    @rodolphenemr9064 2 months ago +3

    Been waiting for this 🍿

  • @LaurentBonnaud
    @LaurentBonnaud 2 months ago +2

    On Linux, EDR software can use the eBPF kernel subsystem to probe system activity. And by design, an eBPF program cannot take down the Linux kernel.

  • @dgo4490
    @dgo4490 2 months ago +3

    It's been obvious for a while now - MS does NOT DO software testing, nor Crowdstruck evidently. They are delegating the testing straight to the end user. They pushed a bad binary to an "on-the-fly" update, and after the updated binary was first touched, it crashed the system. That's criminal negligence, brought to you by industry's greatest security providers.

  • @Amonimus
    @Amonimus 2 months ago

    Can't you partition in Windows and set the boot order manually?

  • @tscoffey1
    @tscoffey1 2 months ago +5

    Apple has the luxury of being able to force changes to their OS like that because only a minuscule percentage of the world infrastructure relies on it. Microsoft must remain backwards compatible as best they can with their OS upgrades precisely because they aren't a tiny player in this arena.

  • @ianflint4610
    @ianflint4610 2 months ago +1

    The wider issue is that, while Windows acts in a way to mitigate the consequences of a malicious act (which this failed update mimicked), there has seemingly been no thought given to how to manage, contain and recover from such a problem when it is happening at scale on massive numbers of end-points at a very rapid rate. The 'infection' spreads far faster than it can be contained. Microsoft's kernel code policy on top of CrowdStrike's error has exacerbated the problem.
    The impact isn't a theoretical one, it is real with potentially life threatening consequences (like the Highways Agency being unable to control Smart motorways when their displays were not reflecting what signs were saying and they couldn't change them - that left people in Refuges being unable to rejoin live motorway lanes). It has exposed many weaknesses.

  • @chipndahla
    @chipndahla 2 months ago +3

    This was not very well informed, with a lot of info lacking and some facts clearly missing. Much better videos already out there. That being said, normally a fan ❤️

  • @rogeratygc7895
    @rogeratygc7895 2 months ago

    Would it be easier to boot (each machine manually) with a "live Linux distro" disk and then delete the offending file?
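
Editor's sketch of the question above, under stated assumptions: the widely published manual workaround was to delete files matching `C-00000291*.sys` from `Windows\System32\drivers\CrowdStrike`. Assuming the Windows volume is readable from the live system (not locked by BitLocker) and mounted at some path, the same step could look roughly like this. The function names are hypothetical, the directory casing may differ on a real mount, and the default is a dry run that deletes nothing.

```python
from pathlib import Path

# Directory and file pattern from the publicly circulated workaround.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_PATTERN = "C-00000291*.sys"


def find_bad_channel_files(windows_root: Path) -> list[Path]:
    """List matching channel files under a mounted Windows volume."""
    target = windows_root / CROWDSTRIKE_DIR
    if not target.is_dir():
        return []
    return sorted(target.glob(BAD_PATTERN))


def remove_bad_channel_files(windows_root: Path, dry_run: bool = True) -> list[Path]:
    """Return the matching files; delete them only when dry_run is False."""
    found = find_bad_channel_files(windows_root)
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

This still means visiting each machine by hand, so it does not change the scale of the clean-up; it only swaps Safe Mode for a live disk.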

  • @IsYitzach
    @IsYitzach 2 months ago +8

    12:50 don't apologize to Elon. He deadnames one of his kids. If he can do that, you can deadname his company. The best he's going to get out of me is ex-Twitter.

    • @spht9ng
      @spht9ng 2 months ago +2

      And then uses his child as a culture war pawn publicly. Gross

  • @josef596
    @josef596 2 months ago +1

    Currently having the same problem with Bitdefender. Goes to BSOD on every reboot.

  • @choleralul
    @choleralul 2 months ago +3

    Thanks Lord Targaryen

  • @Asidders
    @Asidders 2 months ago +1

    I love listening to these engaged guys 😁

  • @flammungous3068
    @flammungous3068 2 months ago +4

    CrowdStrike definitely is installed on servers. It took down our VPN so we couldn't work from home. Yeah, we got off mildly.

    • @David-bi6lf
      @David-bi6lf 2 months ago

      Indeed I have seen stories of companies with it installed on domain controllers and those domain controllers being the only database for bitlocker keys. That's a double whammy, your bit-lockered endpoints can't be fixed till the domain controllers are back up.

    • @Bpinator
      @Bpinator 2 months ago

      @@David-bi6lf Oh yeah, it's on every DC for the companies I work with (which typically makes sense). But having your BitLocker keys all stored behind BitLocker, it's actually insane.

  • @dunebasher1971
    @dunebasher1971 2 months ago

    I'm just fixated on why the 2-shot is 25fps and the close-ups are 50fps. Presuming different settings on different cameras, as the close-ups are clearly shot on separate cameras in different positions and not just punch-ins on a 4K master shot.

    • @Computerphile
      @Computerphile  2 months ago

      The wide shot is on a camera not capable of 4k50p, so it was a choice between frame rate and resolution. Perhaps I should have chosen frame rate: then most resolutions would have worked, whereas by choosing resolution, all resolutions suffer the frame rate issue. -Sean (Usually that camera is pointed at a piece of paper, so it's less noticeable, and the resolution is handy for reframing the shot of the paper.)

    • @dunebasher1971
      @dunebasher1971 2 months ago

      @@Computerphile Yeah, these videos don't need to be 4K :) For anything like this, 720/1080p50 always looks better.

  • @Slarti
    @Slarti 2 months ago +6

    My goodness, that must be simply the worst analogy I have ever heard for an operating system 🤣

  • @Ancient_Hoplite
    @Ancient_Hoplite 2 months ago

    What do you think of the WHQL certification process, seeing as it was totally bypassed by the way the uncertified channel configuration files were being used by the certified driver?

    • @Mar184
      @Mar184 2 months ago +1

      With certification of channel files being impossible or impractical in a timely manner, I'd say the WHQL certification process has to be more restrictive, so that it ensures the program being certified behaves safely on all possible inputs of such files. That makes the certification process of the program itself longer, but means the channel files need not be certified at all. That's the only way I see of making this a safe system without certifying each channel file.
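
Editor's sketch of what "behave safely on all possible inputs" could mean in practice: a loader that validates every count and length in a content file before using it, and keeps the previous good configuration if anything is off. The file layout here (a `CHNL` magic, a 16-bit field count, length-prefixed fields) is invented for illustration and is not CrowdStrike's real channel file format.

```python
import struct

MAGIC = b"CHNL"        # hypothetical file signature
EXPECTED_FIELDS = 3    # hypothetical number of fields the consumer reads


class InvalidContentFile(ValueError):
    """Raised when a content file fails validation."""


def parse_content_file(data: bytes) -> list[bytes]:
    """Parse a length-prefixed content file, checking every bound first."""
    if len(data) < 6 or data[:4] != MAGIC:
        raise InvalidContentFile("bad or missing header")
    (count,) = struct.unpack_from("<H", data, 4)
    if count != EXPECTED_FIELDS:
        raise InvalidContentFile(f"expected {EXPECTED_FIELDS} fields, got {count}")
    fields = []
    offset = 6
    for _ in range(count):
        if offset + 2 > len(data):
            raise InvalidContentFile("truncated length prefix")
        (length,) = struct.unpack_from("<H", data, offset)
        offset += 2
        if offset + length > len(data):
            raise InvalidContentFile("field runs past end of file")
        fields.append(data[offset:offset + length])
        offset += length
    if offset != len(data):
        raise InvalidContentFile("unexpected trailing bytes")
    return fields


def load_or_keep_previous(data: bytes, previous: list[bytes]) -> list[bytes]:
    """Fall back to the last known-good fields instead of failing on bad input."""
    try:
        return parse_content_file(data)
    except InvalidContentFile:
        return previous
```

The design choice is that a malformed file is treated as a rejected update rather than a fatal error, which matters most when the consumer runs somewhere a crash takes the whole machine down.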