That's a bit unfair, isn't it? Crowdstrike managed to crash tons of Linux systems with the exact same software this April. Same software (Falcon), same problem (kernel panic). Only nobody made a big deal about it back then. Dr. Begley even mentions it briefly in the video.
McAfee did something similar several years ago. A bad definition quarantined core system files. The McAfee CTO from that era is now CEO at Crowdstrike.
Did you guys get the USB microsoft created to automatically fix it? What is cool is the winpe usb drive just boots into safe mode and runs repair.cmd file it creates. I am keeping this as it will be easy to change that batch file and have it do other things in the future if I want to.
So there were 3 seperate failures from Crowdstrike. 1. The kernel Driver didn't have proper input validation 2. The Channel File was broken 3. The testing was so abysmal that they didn't notice before sending the update out to customers.
It’s quite scary that they get their kernel driver signed, despite it not meeting the standard of validating all input! That’s a systemic problem with their entire solution! (Well, so is the third, but testing is not you build quality into the system, so I think the first is the fatal flaw.)
4. They didn't even notice that every client that updated went down, or at least they didn't respond. How that is even possible is beyond me. Their entire product is based on monitoring systems, but it took them hours to respond, and that was after Google had called them out for the chaos everywhere.
The problem is rolling out an update (that might not have been tested so well) TO EVERYONE ON THE PLANET AT THE SAME TIME. I can't believe Crowdstrike is operating like this. If you did a phased roll-out to a couple of smaller customers initially, and then monitored whether the updates didn't have any glaring issues this whole situation could have been averted.
@@lever2k what company do you work at that tiers endpoint protection updates? Never heard of such a thing. Crowdstrike may not even offer that capability.
@@lever2k That's assuming the software even allows tiered deployment and doesn't expect _everything_ (including the main server) to be working on the same version - and any machine that isn't updated yet can only connect to update.
@@lever2k based on what i've bbeen hearing from others online: a lot of companies **do** have tiered approach for updates, including crowdstrike, but this update - deemed by crowdstrike to be very critical, ignored ALL such settings and was deployed unilaterally to everything.
The real issue and worry is a monoculture. This sort of problem will always happen. Someone is always going to be affected and there's always going to be a cohort of people who are unfairly affected by things that are out of their control. The problem is the cohort here happens to be extremely big because of there's a monoculture of this type of software monopolies lead to monocultures and monocultures lead to unique weaknesses. This unique weakness was able to take out. You know millions of computers all around the world cuz everyone was using this software. We need more companies in this space. Even now the fact that after this happens, everyone basically have to look to crowdstrike because that's who everyone uses. It sounds there's no competitive alternative
There are probably already some bad actors out there. Just look at the catastrophic instances of espionage inside the CIA. See Robert Hanssen and Aldrich Ames.
I was stuck in Atlantas airport because of this. It was absolute madness and everyone that talked about it, either from the airline or passengers, said it was a Microsoft issue. That's all most people are going to remember.
That's not entirely wrong. Microsoft did bless this software as permitted the privileges to do whatever to the entire system. They're in turn blaming this on EU, but EU only mandated they provide access to security software at the same level their own has; it's Microsoft's choice to make that this risky. Then there's the trust placed in Crowdstrike; they're likely selected for being a known name, never mind they ran a previous company into the ground in this particular manner. It's like the hotel manager decided to install an entry counter in their front door and nobody asked why it's also a guillotine.
Senior staff that in case probably cautioned against allowing running code in kernel space before it's tested on a test system because that's a fast track to exactly what happened. Senior staff likely tired of their expertise being ignored by suits who cannot comprehend anything outside their niche might matter.
Losing? They think they can do things cheaper elsewhere and AI can replace everyone. I wish them luck in the wars to come. Yes, this was a fun career and all I've see is degradation of quality of life on a massive scale. Where everything is micromanaged by 100% non-technical types. I don't miss it at all.
@@DoubleOhSilver UA-cam censored my comment. Wanted to say that I totally concur with the sentiment. Not only is it miserable, the hiring process that is adopted across the board seems to be nonsensical hazing rituals that do not map to real world problems or realistic development tasks and activities. The golden age is well and truly over.
Especially the ones who are losing senior staff who know the ins and outs of the product, and replacing them with “Business guy who does business things and doesn’t need to know how the technology works”
This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment patch deployment, software deployment the whole thing on both Workstations and Servers. We tested our patches extensively before pushing them out to the entire population of the environment. This 1st included a sandbox environment, then a select user / system environment, then we would stage our patches out over several hours so if something happened we could back out before catastrophe struck. And honestly sometimes we would find problems with the patches, and we would be able to immediately stop, suspend and even back out. Yes we would use 3rd party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix, normal business. We would never shirk our procedures to test 1st, then deploy. To me this is a total failure of IT Governance and failure to maintain standards. (IT Governance is setting and maintaining standards and policies for the IT Infrastructure)
In the modern version of Battlestar Galactica, Admiral Adama absolutely refused to have Galactica networked to other systems and ships in fleet because of the risks to their it critical system. Yet here we are, allowing a root kit to operate unconstrained on millions of machines. Fun times ahead.
A lot of the computers that businesses give out to employees (such as ATM screens and point-of-sale devices) where those computers are so cheap that they become completely useless without a network connection (like a Chromebook), and so the system is working “correctly enough” that it detected a problem in those (theoretically) cheap end computers, and it cut them off of the network. The failure was that the wrong thing was found to be a threat, and all those end computers were cut off.
When talking about this incident it's worth remembering that hospitals were affected and she people may have died because of this. So it's all well and good to say when everything goes down, go outside and touch grass. But also, we do need to think seriously about whether we're doing enough to ensure software safety. We take it way less seriously than, for example, car safety. When a new model of car comes out it has to go through all kinds of testing to ensure its safety. But we are doing nothing to ensure software safety, we are just 100% trusting the vendors. I've been a software engineer professionally for 25 years and have long thought that the current approach is madness and incidents like this one only make more sure we need to have standards that all critical system software meets in its development, deployment and implementation.
Someone left a message in an Spanish dev stream saying their aunt had a miscarriage and couldn't be operated on because the all the hospital computers had BSOD'ed. She had an emergency procedure hours later.
100% true. It's definitely a big deal that this incident took down not just School computers or corporate businesses but hospitals that need them to keep people alive. people were missing their medications and for some people like me missing medication means you end up throwing up for a couple nights for other people the consequences can be much more dire. At the end of the day as technology begins to run more and more of our lives I do agree there's nothing you can do to prevent hospitals from being part of the affected class these things will happen and hospitals will be affected just like any other computerized business. The problem is we don't need to have so many hospitals affected in a single incident that is purely the result of a monoculture which is the result of monopolistic practices which is a result of the form of capitalism that we have in North America and its effects around the world. And that's just on a philosophical level without even approaching all the specific problems that could have been prevented in this case
while i agree with your statement, digitalization also played a huge role in this. nowadays everything needs to be "smart", even things that dont make sense like refrigirators. if those hospitals had alternatives to the computers they used (like for example have paper copies of documents alongside digital versions) this would have hurt them far less significantly. we are too dependant on digital computers
The guilty in this instance are both CrowdStrike and their Customer Security Managers. CrowdStrike has a history of shipping stuff that breaks systems, most recently their Linux product. The Customers said: Yes CrowdStrike just put whatever you want on our systems without monitoring. And by the way, we have no adequate disaster recovery plan. As a corollary, letting CrowdStrike put stuff on your systems also allows bad people to compromise CrowdStrike and deliver unlimited hurt. If I was a baddie I'd spend my every effort to subvert CrowdStrike!
There will most likely be a lot of QA positions opening on Crodstrike in the aftermath of this. Bad actors just need to get one of "their guys" in through that recruitment process.
@@justgame5508 welcome to the corporate mindset. Protection against liability is more important than delivering a working product. Who do you think the company is prepared to pay the most, the lawyers or the engineers? That reflects how they value their respective services.
@@lintfordpickle Yeah, but when our security software screws up it will a) first crash the test machine which would block the rest from receiving the update, and b) if that somehow fails our system would allow us to reboot with a previous system snapshot. To see these massive and vital organizations not have _any_ backup plans while putting full trust in an external company is mind boggling.
No not really, a perfect storm implies the issue was due to various timing / bad luck factors. ie It lessens the culpability of ClownStrike. Each of the issue you mention were just plain incompetence.
The fix is simple, do not push untested code onto live systems where it will run as part of a must run to boot kernel level driver. Run it on a test system first. And never trust a 'security company' who says you should do otherwise (except in rare cases, such as a very bad zero day being exploited where it's a gamble either way). If they allowed this for a run of the mill non-emergency update then they don't know cyber security and safety well enough to protect a home gaming system, let alone major systems. This goes past gross incompetence to the point where I wouldn't blame anyone from suspecting malice. Though I personally think it was "we don't screw up, we stop screw ups" level hubris.
EXACTLY! Unfortunately, this braindead policy of offloading all QC/QA onto the end user is being practiced my an increasing majority of devs...all thanks/empowered by The Internet. Software development is the most uncontrolled, unregulated industry in existence. Governments MUST act...before it really is too late!
I quote Grey's law: "Any sufficiently advanced incompetence is indistinguishable from malice." It doesn't really matter if Crowdstrike did it out of malice or just cut corners to cheap out on development costs. They sell a product that is obviously not robust enough to be used on mission critical systems and they have made the decision to risk their customers business to make more money for themselves. In turn Microsoft allows their OS to hard crash due to a faulty third party driver. That can not be tolerated on mission critical systems so a large part of the blame goes to them as well. The end users seems to be pretty naive as well, they have hopefully learnt the expensive lesson on how to not build infrastructure.
There's also a small chance that the files got corrupted during the transfer to a CDN which served the corrupted update to millions of computers. We shall see....
Thanks for being the first source I found that actually explains what crowdstrike is and what went wrong here, and nice to hear some nuance amd perspective as well.
If you want a little more detail: apparently, the definition file they pushed out left some index entries uninitialized, so some memory addresses that were meant to hold pointers ended up with junk data that, when dereferenced, pointed to invalid memory locations.
@@IceMetalPunk Thanks, this is the best explanation I've heard so far. IMNSHO, the software should have been written in such a way such that the definitions don't directly map to memory. Then when you create data structures in memory, they always point to something valid. But nobody asked me.
@@Tahgtahv I think what you're talking about is Rust. but apparently there were numerous cracks in the program even before then that was caused by the same QA issues that caused this current crash, the crash was just everything finally fell apart
Enjoyed this. Glad I watched the recent 'Dave's Garage' video where he explained the problem. Here I saw and got a good understanding of the wider consequence management. Well werth wathing both I think.
Windows can in fact boot with the failing driver automatically disabled the next time, except for drivers that are marked as absolutely necessary for booting itself, and this driver is marjed as such.
nah it wasnt marked as boot critical, common talking point though. Doesnt change anything though, unless you get to a desktop windows considers it a failed boot, do that 3x and you end up in the recovery console.
@@irql2 yes it was, but the decision as to if it can be downgraded should be Microsofts. just because they want it to prevent booting if it cannot start does not mean that windows cannot start without it.
@@grokitall stop parroting talking points and go look at how the driver is configured in the registry. People super confident about things and wont even verify when its very easy to do.
@@irql2 according to retired microsoft engineer dave plumer, they had it marked as boot critical according to his sources. i have no reason to doubt his statement. despite how unimpressed i am with various choices Microsoft has made, i have no reason to doubt the quality of their engineers. that is why i am sure they are capable of determining if it is actually boot critical when the driver is being signed. i am also sure that they are capable of writing code which will use that determination to down grade the driver and disable it if it is too broken to boot, and to check if it is stuck in a boot loop. for any os, as long as you can get to startup, and use the net, you can fix the driver with an update without having to manually login to all the locked down machines. the fact that they have not bothered to implement such a measure when this has happened before is disappointing.
@@grokitall Thanks for confirming you wont even go look and you'll just parrot whatever anyone says. David is wrong too and he would admit it if he looked. We're human, it happens... He probably doesnt have a dump to go and check. and honestly doesnt matter. Whats more concerning is how confidently wrong people and they have no interest in learning anything that wasnt hand delivered to them by some source they consider trustworthy. This is a huge problem and our political climate is evidence enough of this. If you would have asked "How do I verify this?" since you obviously don't know or even care to, I would have shared that information with you so that you could be more informed on the topic... but nah, polly wants a cracker instead. For those that are interested in learning, csagent's Start value is set to 1. Meaning its just another driver, its not special in regards to booting. If it were, you'd get a 7b on boot. This entire interaction is disappointing. What happened to the days when people went "Oh yea? Show me".
Working for a Bank we had drills where we simulated losing our systems for a few hours and had to do everything (and I mean every conceivable thing we might be asked to do in a normal day) without any computers. Including driving physical records to central processing locations.
Crowdstrike did more harm to its clients, and to the Western world, that it could ever have possibly prevented for the entire duration of its existence as a company. How they ONLY lost 20% of their share value is mind-boggling.
Linux has a feature that allows the sandboxing of channel updates using eBPF, although Crowdstrike doesn't use it yet. In theory, that could have prevented the BSODs had Windows had a similar feature. Also, I don't ncessarily agree that Windows is blameless here. While Crowstrike is definitely at fault, Windwos did certify their driver, and that validation somhow didn't include testing for corrupted or invalid channel files. There's no reason the driver should blindly trust those files without validation.
Yeah, Microsoft also allows eBPF, but it's in an alpha, very early state. Also, the people opining that "this isn't a Windows' issue" are right to a degree, but when you realize that there are design deficiencies around how Microsoft handles drivers, it can only be said, "they're right to a degree," especially when you can specify kernel command line options to disable drivers that are acting bad, or have a fallback initramfs that doesn't load the CrowdStrike driver, which Windows doesn't really allow. I believe that CrowdStrike is also on the eBPF design foundation alongside some other industry giants like Apple, Google, Microsoft, etc. I think CrowdStrike also uses eBPF for Linux in their newer agent after the debacle back in March/April with Debian.
My understanding is that CrowdStrike does use some type of interpreted code in their definition files, which would imply that there was some bug in the interpreter (or code downstream of it) that allowed a null-pointer dereference through (or made a null pointer dereference on its own).
@@reybontje2375 Windows does have self-recovery functions for bad acting drivers, but they do not work on boot drivers and Crowdstrike's driver is a boot driver so the system is not allowed to boot if it crashes by design unless you use safe mode.
Computer Phile is amazing! I love your content and calm but casual demeanor. Your explanations and ability to break things down is superb! Keep it up 🙏🙏🙏🙂❤️
Finally, FINALLY, some informed and cogent commentary on this issue that isn't just "Tech influencer says Windows is a mess and this would never happen in Linux or macOS"
Falcon is using definition files which are NOT part of the WHQL process which Falcon obviously is! I don't know how this works on Linux or MAC, but maybe it should not be allowed for Windows driver makers to deliver _anything_ to the kernel that does not go through the WHQL certification.
This is the part that’s wild for me. WHQL is supposed to be this Highest Level Of Scrutiny thing, and somehow WHQL reviewed this workaround to inject arbitrary runtime behavior without requiring WHQL recertification and said F It Ship It.
My only suspicion is that someone, somewhere thought requiring WHQL for definition files could delay definitions too long when new vulnerabilities are discovered and need to be monitored. Like, "if we do WHQL on every definition, by the time it gets released, so many people could be affected by this exploit!"
@@IceMetalPunk I think that's the reason, and I can't say I have any insights in the WHQL process to tell you how long the process normally is. Would be interested to know though, do you know? I would imagine most of it is automated.
Maybe definition files do not contain any code and are thus exempt from WHQL process? It could be that the definition file was simply corrupted and unreadable and the kernel driver crashed when trying to read it.
I am a recently retired Cyber Security (though being heavily involved in Computer Security for over 30-years, and a software developer for 20 years prior to that, I prefer the traditional names of Computer or Systems Security) Compliance Officer. Although the systems I monitored were involved with critical infrastructure and not open to regular users of business systems, they were still peripheral dependent on many such systems. Since I was a stickler for avoiding the Cloud and third-party security products, my former employer has taken steps to ensure I never know if they were severely affected by the CrowdStruck (accepting the pun) event. The real issue is something you two gentlemen mentioned but did not go deeply into. What if there were malicious embeds (i.e. spies) working for that organization, or for Windows System development? We would not be face a bad day or so, but it could been lights-out until every critical system were completely rebuilt and data backups restored. I can understand why discussion of that scenario would be avoided, but should it be avoided. If I were a critically ill patient in the hospital I would want to know so I could prepare for the aftermath.
@@ZiggyGrok Y2K only affected those that were too lazy to add 2 more characters to their dates. If your code was vulnerable, it was terrible code to begin with.
@@davidmcgill1000too lazy... No, using software originally designed when memory was small and expensive, and saving two characters per entry won them pay rises There were huge and expensive efforts put in to check and update to get around the issues many years later, and so near nothing happened, but it doesn't mean there wasn't a problem
The new update to CrowdStrike falcon included some corrupted channel files (they contained just zeroes instead of the intended data), and because the core driver that loaded the channel files didn't do enough input validation, it continued on using the messed up channel files, and this revealed a bug that likely had been there for a while. The bug caused the driver to attempt to dereference a null pointer, which caused the BSOD.
Yeah and probably crowd strike have not fixed the bug because it would require a new release of the driver and that would have to go again through the Microsoft WHQL signing process which the use of these channel files seeks to avoid.
Note that this corruption claim is afaik coming from one random twitter user and has been denied by Crowdstrike who says there was a logic error in the updated rules file that caused the problem. It seems extremely unlikely to me that crowdstrike does no validation on these files given that they're being updated frequently on a huge number of machines and are therefore liable to get corrupted (due to power failures and such) on a regular basis.
As someone who led the deployment of EDR and EPP to 18,000+ endpoints last year, agents are absolutely installed on Windows servers, yes. Updates like this that don’t go through change control are a calculated risk for more up-to-date protections. Problem is that the risk mitigation is that the vendor does testing and releases competently..
@@blackholesun4942I am not sure which part you didn’t get. The custom blue screen of death (BSOD) is something they fabricated. 1337 is often used in gamer culture to mean LEET (or elite rather). Usually indicating something like highly skilled (1337 player for instance). ISWYDT : I see what you did there. So it is used a bit ironically here, because it was of course not a skilled update. Hope that helps.
@@playground2137TBF, 1337 is specifically turn-of-the-millennium gamer culture (late GenX, elder millennial). I'm not sure I've even seen younger millennials using it, let alone Gen Z.
“This was a phishing attack and a chip level attack?” “No, no… the cash register system is down thanks to broken Windows update” “They broke your windows and stole your cash?!” “No, the money is still here!” “Okay, I’ll just pay you in cash then” “I can’t do that! The register is locked unless the computer tells it to open! Besides, each purchase is required to update the inventory as well” “I don’t see what the Tories have to do with anything in this case” “… I don’t have time for your Monty Python shenanigans” “I’d think this stuff would be programmed in C and not Python” “GET OUT!”
Its going to be very interesting to see what Crowdstrike learns from this. One thing they didn't seem to use is a canary or blue/green deployment scheme. Hoping for some enlightening blog-posts on the topic eventually.
Typical "Management Bug?" A CrowdStrike engineer or two urges more testing before release. Some executive then pounds the conference table and shouts, "No more f**king EXCUSES! I want that update NOW gawdammit!"
Especially seeing that the CEO of Crowstrike *now* was the CTO at McAffee back *then* , when McAffee brought down XP Machines by deleting Windows core files in 2010. The common factor ist the manager.
13:06 totally agree, we just need to US develop our technology. But we see how US monopoly all technologycal aspects, and any real competitor they ban out...
UPDATE: Thanks tma2001 letting me know the zero file was not the cause. And in fact there is validation in place. The error was somewhere else. So the below is inaccurate Seems it was a lack of input validation. Apparently the root cause of the crash was that one of the files in the definition update was just a file filled with zeros for whatever reason. Leading to a null pointer dereference (which always crashes, by design) But that makes me go like: Input validation anyone?! Does CrowdStrike Falcon fail to at least make sure the definition file makes sense as a definition file before blindy following its directions?
Everyone who is even remotely competent knows to put headers on files, network packets and the like. A magic byte or two and some metadata goes a long way when validating.
no that was a red herring - for some people it wasn't all zeros and CS confirmed in a technical blog post that null bytes in the channel file were not the cause. There are many possible reasons why it was a file of zeros for some folks - pre-allocated ahead of time before updated or wiped clean as a post processing step for security. Valid channel files have a magic signature at the beginning and they actually contain code in the form of byte code for a VM interpreter in the actual kernel driver. The logic error was in the byte code. Of course this means the actual driver can have gone through WHQL but is actually a dynamic entity.
@@TechSY730 you were not alone - I too was confused by what little folks had to go on initially. None of it made any sense! There is a full explanation by the Cloud Architect B Shyam Sundar on Medium website to breaks it down.
Crowdstrike didnt do any validation control(or not enough) in their Driver to check the .sys file before running it to confirm it wasnt just full of Null values etc.
I cant find one UA-camr talking about proper sysadmin practices at the enterprise level that would have caught this before getting rolled out. I have never worked at a company where PCs weren't locked down from software installs and every update (even ones from MS) were tested by local QA before rolling them out to your enterprise PCs. Unbelievable that airlines are being run this way. Unless Cloudstrike installed some rootkit that bypasses all these processes I'm shocked at the state of sloppiness in IT.
I am trying to voice out the same thing but not even tech guys understand. CS Falcon updates bypass everything but still i don't understand how admins allow live updates on supposedly closed system like airports, banks, POS etc. And the loophole seems like the same windows update server used fir both live and testing, or just plain network connection to outside world to allow CS Falcon updates so that it can prevent zero day security issues. It is just absurd!
You didn't mention that in order to install kernel drivers, the code needs to be submitted to Microsoft's to be tested, approved and digitally signed. As you mentioned, the bug was not present in the main kernel, but in the "channel files" that are updates without following that same process. It is not clear to me if those "channel files" are code or just configuration, but maybe Microsoft is partially at fault here for allowing these channel files in the first place, or for not sufficiently checking the kernel driver had the necessary logic to gracefully crash without taking down the entire system.
Clownstrike apparently uses a P-code interpreter to sneak unsigned code into their driver. You'd be a millionaire by Saturday if you invented a heuristic that can reliably detect a P-code interpreter and/or the P-code itself (which of course can be in any format the writer desires) running in kernel mode.
@@throwaway6478 In this case it's not that hard, it's a new file getting loaded from system32, the kernel knows every file you open so you could absolutely block unsigned files in system folders from loading, but as they said it would interfere with competing products so they can't do that, they signed an agreement to allow kernel drivers to work.
There are exceptions to requiring to get your code MS Certified - code that needs to respond to Day 0 attacks don't need certified, for obvious speed reasons. Fortunately/unfortunately.
My school's coding club was faster to respond than our IT helpdesk, and they were more helpful too. They posted a document with detailed step-by-step instructions, while IT just said "come see us." Thankfully I got rid of Falcon at the end of spring semester, as we're not required to have it over summer break.
Not seeing much understanding of administration. A system I was admining involves testing updates before they get installed on the live environment and with this many computers, you don't install it on all of them at the same second, you install it in segments and don't continue until you have successfully restarted the first batch of computers. This all about GREED admining, they didn't want to pay for doing to properly, my way of admining was developed in the 19xx, we have INTENTIONALLY dropped security to save money.
Yep, admin practices is the key and not a particular bug. Live updates in closed system is big NO no matter what sweet voice of software vendor tells you. And the most common phrase nowadays is: "it is for you security" - be it the people or the machines.
Some companies had staging environments but they use the same windows update server for both live and staging/testing so this update just bypassed software enforced policies and gone live. Those are mine speculations git from admins sharing their cases. Yet no in depth public case analysis. Hush practice fir reputation.
It also doesn't help that Microsoft took away the key combo to tell the OS to boot into safe mode on startup. If that was a thing I'm sure this would've been at least a bit smoother.
@@throwaway6478 because I don't specialize in the black box that is Windows. Also why should I have to dig through layers of archaic settings to change this when it's a sensible default?
@@SyphistPrime oh stop it, you're not reading the source code for linux to figure out how something works, no one does that... you "can" do it, but thats not a thing an average person does. You're reading documentation just like people do with windows. Stop it.
@@irql2 The documentation on Linux is leagues better than Windows. There's so many undocumented and hidden features in Windows where as with Linux it's all out in the open. Also I have read bits of source code when AUR packages failed to compile. I've very much used that to help fix issues with PKGBUILDs and compiler errors. It's not usually necessary to read source code because all the documentation is out in the open, unlike Windows.
00:03 Windows machines experienced widespread blue screens due to an operational error. 01:55 Windows utilizes safety mechanisms like blue screens to protect against critical failures. 03:43 Kernel-level code in Windows can cause serious errors if not managed properly. 05:32 Kernel mode software failures can severely disrupt essential services. 07:25 Microsoft's Windows systems faced critical issues due to a specific bug. 09:04 Mitigating system failures through advanced update mechanisms. 10:56 A genuine mistake led to significant issues, but damage could have been far worse. 12:42 Cloud dependency poses risks for individuals and organizations during outages. 14:24 Exploring advanced image recognition capabilities.
Apple has the luxury of being able to force changes to their OS like that because only a minuscule percentage of the world infrastructure relies on it. Microsoft must remain backwards compatible as best they can with their OS upgrades precisely because they aren't a tiny player in this arena.
Best resources/books Windows system internals (P1 & P2 ) usually takes 1 year to complete. Art of memory forensics ( Wiley for understand NT authority and kernel objects ) : also this available in previous books as well. Both are amazing books🔥🔥🔥
Linux can run CrowdStrike, and had a worryingly similar issue a few weeks ago, since it was in the kernel there was nothing Linux could do either... But only on a couple of distros and only if you had installed Falcon CS ...
"Anything that can go wrong will go wrong.." - Murphy's Law Another one I like is the variation of Murphy's law from Interstellar: "Anything that can happen will happen."
The wider issue is that, while Windows acts in a way to mitigate the consequences of a malicious act (which this failed update mimicked), there has seemingly been no thought into how to manage, contain and recover from such a problem when it is happening at scale on massive numbers of end-points at a very rapid rate. The rate of 'infection' is happening far faster than it can be contained. Microsoft's kernel code policy on top of Crowdstrikes error has exacerbated the problem. The impact isn't a theoretical one, it is real with potentially life threatening consequences (like the Highways Agency being unable to control Smart motorways when their displays were not reflecting what signs were saying and they couldn't change them - that left people in Refuges being unable to rejoin live motorway lanes). It has exposed many weaknesses.
8:42 It can happen and indeed DOES happen on mac and particularly linux machines but the difference is those operating systems have safety mechanisms in place so that mass IT outages like the kind that just occurred can't fail to the point of individually booting every single device into safe mode and deleting a driver file. As you said, there was a kernel panic error on clownstrike's linux distributions, yet it didn't crash the world's infrastructure because the error was handled correctly. So microsoft should be at fault in some part for not providing these error handling systems.
@@Formalec the x86 family supports four rings, but for reasons Linux didn't continue the tradition used in VMS and some other contemporary mini computer operating systems, where kernel is ring 0, drivers are ring 1 and shared libraries are in ring 2. Choosing to do the same as NT did, skipping rings 1 & 2 only leaving kernel and user processes. Since essentially nothing uses more than ring 0 & 3 nowadays most new CPU designs only implement 2 rings
Linux allows you to specify a kernel command line from the bootloader, and you can blacklist individual drivers in the kernel command line, so recovery would be simpler.
@@JonBrase Same as with BSoDs, you would still need some techie typing in the fix at the Console. On cloud servers, it could be automated, same as with BSoD fixes, but I doubt it could be done on standalone machines
Also, how was rollout conducted? Normally it would be tiered / staggered to minimize damage from faulty code. I haven't found any confirmation, but this looked like a "big bang" release.
@@ytechnology It sounded like from the video, what they pushed out was definition files, and not code per se? Normally I would not expect that kind of thing to cause a kernel panic, so maybe they didn't either. Hopefully, this incident will make them take a hard look at how they do/deploy things in the future, no matter what it is.
Friday update before the holidays strikes. Just like Friday built cars. Just push into production and go down the pub, will deal with problems when we get back.
QA is a cost center. Everyone is getting rid of that. Why not have the devs responsible for QA, oh and deploying the stuff to the customers and datacenters. The above is not a joke, I've lived it for 5 year now.
"Wondering how it got past QA?" - there was none. This industry is unregulated. The mentality is "push now, patch later". Maybe governments will finally wake up to the certainty of more timebombs.
There is a much simpler and pragmatic approach that I've used in places. Which is to simply not allow updates to critical IT infrastructure (DC, DNS, etc) until the update has gone out to a smaller group of endpoints first. Permit the update to 10% of end user compute estate before permitting it on all of it. Kernel mode drivers should go through a rigorous testing regime (WHQL, for example). The other problem was Crowdstrike configured their driver as a boot-start driver otherwise people could have used safe mode easily.
8:40 The point of using not-Windows isn't that the other OSs are impervious, but rather the fact that diversification *is* redundancy. Instead, the current landscape is still heavily Windows-centric and that is a bad thing if we're talking resiliency.
"They may have implemented something badly, we don't know". Yes, we do know. It happened, therefore they implemented something badly. This sort of thing is why we have canary deployments, and apparently they have the infrastructure for that, and allow customers to have settings for which computers get updates first in order to validate them, but they also have some updates that simply ignore those settings, and this one one of them. Yes, they 'implemented something badly'.
@@alazarbisrat1978 'Held under less scrutiny' by whom? The reality is that it crashed computers, and this isn't the first time similar updates by Crowdstrike have caused crashes (including on linux). The fact that they know this is a possibility but failed to implement proper testing before pushing out to everyone, means the 'implemented something badly'.
@@TimothyWhiteheadzm they didn't know that would happen, sorta how this ever got out in the first place. but companies always neglect QA, it's just how it is. and also definition files themselves couldn't do any of this without a huge screw-up so they're not as important to defend, but had they tested it there would be no problem. some programmers just prefer to test after failure tho, just a complete miss
@@alazarbisrat1978 What makes this remarkable is that the entire purpose of this product and company is to address that QA neglect. They've demonstrated they're among the worst at the one thing they're claiming to do better.
@@0LoneTech not really, most companies do that, just that this one was widespread and broke something fundamental. they just got unlucky with their neglect and this slip-up got all the way and broke everything. legend has it that there have been many other issues in their code over time that went totally unnoticed and only now caused catastrophic failure
12:50 don't apologize to Elon. He deadnames one of his kids. If he can do that, you can deadname his company. The best he's going to get out of me is ex-Twitter.
It's been obvious for a while now - MS does NOT DO software testing, nor Crowdstruck evidently. They are delegating the testing straight to the end user. They pushed a bad binary to an "on-the-fly" update, and after the updated binary was first touched, it crashed the system. That's criminal negligence, brought to you by industry's greatest security providers.
Something like this software is necessary: monitoring large networks for suspicious behavior. But letting the companies that make this software be privately owned and for profit removes points of accountability and introduces incentives to cut corners (increase profits by delivering an inferior product to a captured market). We can’t snap our fingers and prevent this from ever happening again, but we can improve the product by removing bad motivators (profit) from the equation.
"Dave's Garage" a former microsoft software engineer just did a video about what he thinks happened about this. Very comprehensive and very clear. He also speaks extensively that this was possible because Crowdstrike works in kernel mode.
@@murzilkastepanowich5818 Sorry, I am not aware about any of that or don't even know what you are talking about. Just found about it yesterday, the video in question seems fine and basically makes some of the same points as this one, but is a bit more detailed.
"Well, well, well. Tell me, young gentlemen, why is it always you two when something bad happened??"
Because we rule the world, and a one in a billion chance is next Tuesday for us.
I am reminded of Cheech & Chong, - but high on technology. I mean man, what can you do?
"It's a gift." -- the 4th Doctor
well put
That's a bit unfair, isn't it? Crowdstrike managed to crash tons of Linux systems with the exact same software this April. Same software (Falcon), same problem (kernel panic). Only nobody made a big deal about it back then. Dr. Begley even mentions it briefly in the video.
McAfee did something similar several years ago. A bad definition quarantined core system files. The McAfee CTO from that era is now CEO at Crowdstrike.
To borrow a comment from elsewhere "real men test in production on a Friday"
A fine example of "failing up"
1 time is a mistake to be learned from. 2 times are a pattern of behaviour, signalling deeper flaws.
I got dragged into this and I'm now at 48 hours of overtime. Thanks CrowdStrike.
@NigelfarijI was about to say
@Nigelfarij Tell that to the taxman.
Thats crazy. Whats your patch rate / hour? How many machines?
Ujjjj😢😢😢😢😢😢😢😢😢😢😢
Did you guys get the USB microsoft created to automatically fix it? What is cool is the winpe usb drive just boots into safe mode and runs repair.cmd file it creates. I am keeping this as it will be easy to change that batch file and have it do other things in the future if I want to.
"As I said online, you should just go outside and enjoy the sunshine."
Okay, but what are people in the U.K. supposed to do?
Shots fired. But not seen in the UK, because of the dense cloud cover.
😄
So there were 3 seperate failures from Crowdstrike.
1. The kernel Driver didn't have proper input validation
2. The Channel File was broken
3. The testing was so abysmal that they didn't notice before sending the update out to customers.
It’s quite scary that they get their kernel driver signed, despite it not meeting the standard of validating all input! That’s a systemic problem with their entire solution! (Well, so is the third, but testing is not you build quality into the system, so I think the first is the fatal flaw.)
4. They didn't even notice that every client that updated went down, or at least they didn't respond. How that is even possible is beyond me. Their entire product is based on monitoring systems, but it took them hours to respond, and that was after Google had called them out for the chaos everywhere.
I think #3 is the worst and why their share price is tanking. Such an utter lack of responsibility to Yolo this into prod.
It does call into question the WHQL testing that allowed the driver to be signed, which does push some degree of responsibility back to Microsoft.
@@ReverendTed Bingo.
Heh the BSOD at 0:40 is cool
"For more information about this issue and possible fixes, do not ask us"
But it's about as helpful as a genuine one!
also LEET% complete
@@T_GingerDude5416 All hail 1337!
Haven't come across that for years. Had totally forgotten how it looks like.
Could've been a genuine message from M$ then!
I was waiting for this video with extreme excitement for the last 2 days. I jumped on UA-cam as soon as I saw the notification.
We all were
The problem is rolling out an update (that might not have been tested so well) TO EVERYONE ON THE PLANET AT THE SAME TIME. I can't believe Crowdstrike is operating like this. If you did a phased roll-out to a couple of smaller customers initially, and then monitored whether the updates didn't have any glaring issues this whole situation could have been averted.
That's the nuts & bolts of it. Zero QC/QA before release. In an unregulated industry, this is damningly the norm.
I can't believe huge customers don't have a tiered approach to allowing patches to be deployed.
@@lever2k what company do you work at that tiers endpoint protection updates? Never heard of such a thing. Crowdstrike may not even offer that capability.
@@lever2k That's assuming the software even allows tiered deployment and doesn't expect _everything_ (including the main server) to be working on the same version - and any machine that isn't updated yet can only connect to update.
@@lever2k based on what i've bbeen hearing from others online: a lot of companies **do** have tiered approach for updates, including crowdstrike, but this update - deemed by crowdstrike to be very critical, ignored ALL such settings and was deployed unilaterally to everything.
The real worry is the lack of QA at Enterprise companies. A state actor infiltrating one of these orgs would be absolutely devastating.
The real issue and worry is a monoculture. This sort of problem will always happen. Someone is always going to be affected and there's always going to be a cohort of people who are unfairly affected by things that are out of their control. The problem is the cohort here happens to be extremely big because of there's a monoculture of this type of software monopolies lead to monocultures and monocultures lead to unique weaknesses. This unique weakness was able to take out. You know millions of computers all around the world cuz everyone was using this software. We need more companies in this space. Even now the fact that after this happens, everyone basically have to look to crowdstrike because that's who everyone uses. It sounds there's no competitive alternative
It has and still is devastating. Didn't need the boogieman to show this.
If you can think of it, someone has already done it.
There are probably already some bad actors out there. Just look at the catastrophic instances of espionage inside the CIA. See Robert Hanssen and Aldrich Ames.
Agile!!!!!!!!!
I love Agile development practices!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
If Dr Bagley and Dr Pound had a podcast, I'd definitely listen to them talk for hours lol.
"The IT podcast with Bagley and Pound" Does that sound interesting to you?
@@paulmichaelfreedman8334 oh yeah it does
A Computerphile podcast as a sister podcast to the Numberphile Podcast would be amazing!
Fr. I like listening to them speak
@@paulmichaelfreedman8334 Yes, actually.
I was stuck in Atlantas airport because of this. It was absolute madness and everyone that talked about it, either from the airline or passengers, said it was a Microsoft issue. That's all most people are going to remember.
That's not entirely wrong. Microsoft did bless this software as permitted the privileges to do whatever to the entire system. They're in turn blaming this on EU, but EU only mandated they provide access to security software at the same level their own has; it's Microsoft's choice to make that this risky. Then there's the trust placed in Crowdstrike; they're likely selected for being a known name, never mind they ran a previous company into the ground in this particular manner. It's like the hotel manager decided to install an entry counter in their front door and nobody asked why it's also a guillotine.
I swear this is only the beginning for tech companies that are losing valued senior staff over the many, many decades...
Honestly I see why. This career is mostly miserable and the pay seems to be going down.
Senior staff that in case probably cautioned against allowing running code in kernel space before it's tested on a test system because that's a fast track to exactly what happened. Senior staff likely tired of their expertise being ignored by suits who cannot comprehend anything outside their niche might matter.
Losing? They think they can do things cheaper elsewhere and AI can replace everyone. I wish them luck in the wars to come. Yes, this was a fun career and all I've see is degradation of quality of life on a massive scale. Where everything is micromanaged by 100% non-technical types. I don't miss it at all.
@@DoubleOhSilver UA-cam censored my comment. Wanted to say that I totally concur with the sentiment. Not only is it miserable, the hiring process that is adopted across the board seems to be nonsensical hazing rituals that do not map to real world problems or realistic development tasks and activities. The golden age is well and truly over.
Especially the ones who are losing senior staff who know the ins and outs of the product, and replacing them with “Business guy who does business things and doesn’t need to know how the technology works”
"If you put everything on the cloud, and then the cloud's not there, you've got nothing."
The clouds have multiple redundancies though, depending on how much the customer is willing to pay.
what if the could and its redundancies were affected?😮
This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment patch deployment, software deployment the whole thing on both Workstations and Servers. We tested our patches extensively before pushing them out to the entire population of the environment. This 1st included a sandbox environment, then a select user / system environment, then we would stage our patches out over several hours so if something happened we could back out before catastrophe struck. And honestly sometimes we would find problems with the patches, and we would be able to immediately stop, suspend and even back out.
Yes we would use 3rd party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix, normal business. We would never shirk our procedures to test 1st, then deploy. To me this is a total failure of IT Governance and failure to maintain standards. (IT Governance is setting and maintaining standards and policies for the IT Infrastructure)
In the modern version of Battlestar Galactica, Admiral Adama absolutely refused to have Galactica networked to other systems and ships in fleet because of the risks to their it critical system. Yet here we are, allowing a root kit to operate unconstrained on millions of machines. Fun times ahead.
Wow, I thought exactly the same! 😃
A lot of the computers that businesses give out to employees (such as ATM screens and point-of-sale devices) where those computers are so cheap that they become completely useless without a network connection (like a Chromebook), and so the system is working “correctly enough” that it detected a problem in those (theoretically) cheap end computers, and it cut them off of the network. The failure was that the wrong thing was found to be a threat, and all those end computers were cut off.
@@evannibbe9375 "Oops, it's all malware."
@@evannibbe9375 I'm amazed, literally everything you just said in that comment is wrong. It's like I just watched Calvin's dad explain computers
And kernel level anticheat is a thing ...
When talking about this incident it's worth remembering that hospitals were affected and she people may have died because of this. So it's all well and good to say when everything goes down, go outside and touch grass. But also, we do need to think seriously about whether we're doing enough to ensure software safety. We take it way less seriously than, for example, car safety. When a new model of car comes out it has to go through all kinds of testing to ensure its safety. But we are doing nothing to ensure software safety, we are just 100% trusting the vendors. I've been a software engineer professionally for 25 years and have long thought that the current approach is madness and incidents like this one only make more sure we need to have standards that all critical system software meets in its development, deployment and implementation.
Someone left a message in an Spanish dev stream saying their aunt had a miscarriage and couldn't be operated on because the all the hospital computers had BSOD'ed. She had an emergency procedure hours later.
100% true. It's definitely a big deal that this incident took down not just School computers or corporate businesses but hospitals that need them to keep people alive. people were missing their medications and for some people like me missing medication means you end up throwing up for a couple nights for other people the consequences can be much more dire.
At the end of the day as technology begins to run more and more of our lives I do agree there's nothing you can do to prevent hospitals from being part of the affected class these things will happen and hospitals will be affected just like any other computerized business. The problem is we don't need to have so many hospitals affected in a single incident that is purely the result of a monoculture which is the result of monopolistic practices which is a result of the form of capitalism that we have in North America and its effects around the world.
And that's just on a philosophical level without even approaching all the specific problems that could have been prevented in this case
bro is secretly working for the government
while i agree with your statement, digitalization also played a huge role in this. nowadays everything needs to be "smart", even things that dont make sense like refrigirators. if those hospitals had alternatives to the computers they used (like for example have paper copies of documents alongside digital versions) this would have hurt them far less significantly. we are too dependant on digital computers
Anyone using this horseshit on hospital computers needs sacking
The guilty in this instance are both CrowdStrike and their Customer Security Managers.
CrowdStrike has a history of shipping stuff that breaks systems, most recently their Linux product.
The Customers said: Yes CrowdStrike just put whatever you want on our systems without monitoring. And by the way, we have no adequate disaster recovery plan.
As a corollary, letting CrowdStrike put stuff on your systems also allows bad people to compromise CrowdStrike and deliver unlimited hurt.
If I was a baddie I'd spend my every effort to subvert CrowdStrike!
There will most likely be a lot of QA positions opening on Crodstrike in the aftermath of this. Bad actors just need to get one of "their guys" in through that recruitment process.
@@ipadistaI'd sooner expect more attorney positions to open up before QA
What an awful take
@@justgame5508 welcome to the corporate mindset. Protection against liability is more important than delivering a working product. Who do you think the company is prepared to pay the most, the lawyers or the engineers? That reflects how they value their respective services.
@@lintfordpickle Yeah, but when our security software screws up it will a) first crash the test machine which would block the rest from receiving the update, and b) if that somehow fails our system would allow us to reboot with a previous system snapshot. To see these massive and vital organizations not have _any_ backup plans while putting full trust in an external company is mind boggling.
I think it's important to point out that Crowdstrike did the same thing back in April but it affected Linux machines (causing kernel panic).
But not much talk about that, why probably because you have a rollback mechanism in booting previous working kernels in nearly all distros.
Maybe CrowdStrike's management thinks and acts like Boeing's?
Really??
And they've caused a massive c*ck-up a few years ago. Seems they are 'too big' to fail.
@@Techmagus76bcz no one install an anti-virus on linux.
Seeing two academicians discuss this issue is so refreshing. So many ideas thrown back and forth.
Perfect storm: no fuzzy testing the driver code, no staged deployment, no os blue/green boot partition
No not really, a perfect storm implies the issue was due to various timing / bad luck factors. ie It lessens the culpability of ClownStrike. Each of the issue you mention were just plain incompetence.
I am afraid there was not testing at all in this mess. Everything points out to that...
Third Party apps operating in kernelspace... FFS
@@draoi99All operating systems do this. If you are saying FFS about that you don't know how computers work. Yes, including MacOS.
Software running in the kernel pretending to be a driver, when in reality it is a parser, what could go wrong?
The fix is simple, do not push untested code onto live systems where it will run as part of a must run to boot kernel level driver. Run it on a test system first. And never trust a 'security company' who says you should do otherwise (except in rare cases, such as a very bad zero day being exploited where it's a gamble either way). If they allowed this for a run of the mill non-emergency update then they don't know cyber security and safety well enough to protect a home gaming system, let alone major systems. This goes past gross incompetence to the point where I wouldn't blame anyone from suspecting malice. Though I personally think it was "we don't screw up, we stop screw ups" level hubris.
EXACTLY!
Unfortunately, this braindead policy of offloading all QC/QA onto the end user is being practiced my an increasing majority of devs...all thanks/empowered by The Internet. Software development is the most uncontrolled, unregulated industry in existence. Governments MUST act...before it really is too late!
I quote Grey's law: "Any sufficiently advanced incompetence is indistinguishable from malice."
It doesn't really matter if Crowdstrike did it out of malice or just cut corners to cheap out on development costs. They sell a product that is obviously not robust enough to be used on mission critical systems and they have made the decision to risk their customers business to make more money for themselves.
In turn Microsoft allows their OS to hard crash due to a faulty third party driver. That can not be tolerated on mission critical systems so a large part of the blame goes to them as well. The end users seems to be pretty naive as well, they have hopefully learnt the expensive lesson on how to not build infrastructure.
There's also a small chance that the files got corrupted during the transfer to a CDN which served the corrupted update to millions of computers. We shall see....
Thanks for being the first source I found that actually explains what crowdstrike is and what went wrong here, and nice to hear some nuance amd perspective as well.
If you want a little more detail: apparently, the definition file they pushed out left some index entries uninitialized, so some memory addresses that were meant to hold pointers ended up with junk data that, when dereferenced, pointed to invalid memory locations.
@@IceMetalPunk Thanks, this is the best explanation I've heard so far. IMNSHO, the software should have been written in such a way such that the definitions don't directly map to memory. Then when you create data structures in memory, they always point to something valid. But nobody asked me.
@@Tahgtahv I think what you're talking about is Rust. but apparently there were numerous cracks in the program even before then that was caused by the same QA issues that caused this current crash, the crash was just everything finally fell apart
Enjoyed this. Glad I watched the recent 'Dave's Garage' video where he explained the problem. Here I saw and got a good understanding of the wider consequence management. Well werth wathing both I think.
The frowny face is absolutely necessary
Yes I agree. Absolutely necessary, even if not strictly so :(
I dunno, I'm starting to like 😉👍
@@ICanDoThatToo2 any of these would work too:
🤪 🤯 🥳 🥶 😱 💀 💩 🍐 🌋 🆘️ 🏳
Or an animation:
🤣
😂 🔫
😅 🔫
🥺🔫
🤯💥🔫
🧠💀
If Mike Pound says it, it must be true. Therefore you are wrong! 😁
Windows can in fact boot with the failing driver automatically disabled the next time, except for drivers that are marked as absolutely necessary for booting itself, and this driver is marjed as such.
nah it wasnt marked as boot critical, common talking point though. Doesnt change anything though, unless you get to a desktop windows considers it a failed boot, do that 3x and you end up in the recovery console.
@@irql2 yes it was, but the decision as to if it can be downgraded should be Microsofts.
just because they want it to prevent booting if it cannot start does not mean that windows cannot start without it.
@@grokitall stop parroting talking points and go look at how the driver is configured in the registry. People super confident about things and wont even verify when its very easy to do.
@@irql2 according to retired microsoft engineer dave plumer, they had it marked as boot critical according to his sources.
i have no reason to doubt his statement.
despite how unimpressed i am with various choices Microsoft has made, i have no reason to doubt the quality of their engineers. that is why i am sure they are capable of determining if it is actually boot critical when the driver is being signed.
i am also sure that they are capable of writing code which will use that determination to down grade the driver and disable it if it is too broken to boot, and to check if it is stuck in a boot loop.
for any os, as long as you can get to startup, and use the net, you can fix the driver with an update without having to manually login to all the locked down machines.
the fact that they have not bothered to implement such a measure when this has happened before is disappointing.
@@grokitall Thanks for confirming you wont even go look and you'll just parrot whatever anyone says. David is wrong too and he would admit it if he looked. We're human, it happens... He probably doesnt have a dump to go and check.
and honestly doesnt matter.
Whats more concerning is how confidently wrong people and they have no interest in learning anything that wasnt hand delivered to them by some source they consider trustworthy. This is a huge problem and our political climate is evidence enough of this.
If you would have asked "How do I verify this?" since you obviously don't know or even care to, I would have shared that information with you so that you could be more informed on the topic... but nah, polly wants a cracker instead.
For those that are interested in learning, csagent's Start value is set to 1. Meaning its just another driver, its not special in regards to booting. If it were, you'd get a 7b on boot. This entire interaction is disappointing. What happened to the days when people went "Oh yea? Show me".
Working for a Bank we had drills where we simulated losing our systems for a few hours and had to do everything (and I mean every conceivable thing we might be asked to do in a normal day) without any computers. Including driving physical records to central processing locations.
Yesssssss, twas waiting for this. You beautiful channel you. The dynamic duo returns
Crowdstrike did more harm to its clients, and to the Western world, that it could ever have possibly prevented for the entire duration of its existence as a company. How they ONLY lost 20% of their share value is mind-boggling.
Love the point you've made.
You said the most obvious thing
robot movie pfp
@@nicostigliano6393 nobody's saying it out loud tho
Very enjoyable format of two people discussing. Sounds less monotonous, too. Great job.
A number of years ago Tom Scott did a fun talk called "Single Point of Failure." I think about that sometimes.
Every time there's some outage, or bug, or virus big enough to get in the news, I get excited about the inevitable computerphile video explaining it.
Linux has a feature that allows the sandboxing of channel updates using eBPF, although Crowdstrike doesn't use it yet. In theory, that could have prevented the BSODs had Windows had a similar feature.
Also, I don't ncessarily agree that Windows is blameless here. While Crowstrike is definitely at fault, Windwos did certify their driver, and that validation somhow didn't include testing for corrupted or invalid channel files. There's no reason the driver should blindly trust those files without validation.
Yeah, Microsoft also allows eBPF, but it's in an alpha, very early state. Also, the people opining that "this isn't a Windows' issue" are right to a degree, but when you realize that there are design deficiencies around how Microsoft handles drivers, it can only be said, "they're right to a degree," especially when you can specify kernel command line options to disable drivers that are acting bad, or have a fallback initramfs that doesn't load the CrowdStrike driver, which Windows doesn't really allow.
I believe that CrowdStrike is also on the eBPF design foundation alongside some other industry giants like Apple, Google, Microsoft, etc. I think CrowdStrike also uses eBPF for Linux in their newer agent after the debacle back in March/April with Debian.
My understanding is that CrowdStrike does use some type of interpreted code in their definition files, which would imply that there was some bug in the interpreter (or code downstream of it) that allowed a null-pointer dereference through (or made a null pointer dereference on its own).
@@reybontje2375 Windows does have self-recovery functions for bad acting drivers, but they do not work on boot drivers and Crowdstrike's driver is a boot driver so the system is not allowed to boot if it crashes by design unless you use safe mode.
@@forbidden-cyrillic-handle Lol. Your username.
But who would install this on linux? I never seen a linux server with anti-virus or edr. it sounds dum.
Computer Phile is amazing!
I love your content and calm but casual demeanor. Your explanations and ability to break things down is superb!
Keep it up 🙏🙏🙏🙂❤️
when the computer goes down, that is a sign to photosynthesize, nice
It’s thunderstorming where I’m at so I’d have to wait
Finally, FINALLY, some informed and cogent commentary on this issue that isn't just "Tech influencer says Windows is a mess and this would never happen in Linux or macOS"
Falcon is using definition files which are NOT part of the WHQL process which Falcon obviously is! I don't know how this works on Linux or MAC, but maybe it should not be allowed for Windows driver makers to deliver _anything_ to the kernel that does not go through the WHQL certification.
This is the part that’s wild for me. WHQL is supposed to be this Highest Level Of Scrutiny thing, and somehow WHQL reviewed this workaround to inject arbitrary runtime behavior without requiring WHQL recertification and said F It Ship It.
My only suspicion is that someone, somewhere thought requiring WHQL for definition files could delay definitions too long when new vulnerabilities are discovered and need to be monitored. Like, "if we do WHQL on every definition, by the time it gets released, so many people could be affected by this exploit!"
@@IceMetalPunk I think that's the reason, and I can't say I have any insights in the WHQL process to tell you how long the process normally is. Would be interested to know though, do you know? I would imagine most of it is automated.
Yeah that is an important part that they didn’t mention, I think.
Maybe definition files do not contain any code and are thus exempt from WHQL process? It could be that the definition file was simply corrupted and unreadable and the kernel driver crashed when trying to read it.
Finally, a really good explanation of crowdstrike and what it does and what went wrong.
I am a recently retired Cyber Security (though being heavily involved in Computer Security for over 30-years, and a software developer for 20 years prior to that, I prefer the traditional names of Computer or Systems Security) Compliance Officer. Although the systems I monitored were involved with critical infrastructure and not open to regular users of business systems, they were still peripheral dependent on many such systems. Since I was a stickler for avoiding the Cloud and third-party security products, my former employer has taken steps to ensure I never know if they were severely affected by the CrowdStruck (accepting the pun) event.
The real issue is something you two gentlemen mentioned but did not go deeply into. What if there were malicious embeds (i.e. spies) working for that organization, or for Windows System development? We would not be face a bad day or so, but it could been lights-out until every critical system were completely rebuilt and data backups restored. I can understand why discussion of that scenario would be avoided, but should it be avoided. If I were a critically ill patient in the hospital I would want to know so I could prepare for the aftermath.
These IT disasters always have the upside of flushing Dr Pound and Dr Bagley out of whatever else they’re up to, to give us these great explanations!
The CrowdStrike bug was what Y2K wished it could be.
Fortunately we fixed Y2K before it could cause this chaos. If we had done nothing, it would've been far far more devastating.
@@ZiggyGrok Y2K only affected those that were too lazy to add 2 more characters to their dates. If your code was vulnerable, it was terrible code to begin with.
The world was not as interconnected then too.
@@davidmcgill1000 You realize that non-programmers use two digits for years too? A lot of it was a (lack of) standards issue, not just code
@@davidmcgill1000too lazy... No, using software originally designed when memory was small and expensive, and saving two characters per entry won them pay rises
There were huge and expensive efforts put in to check and update to get around the issues many years later, and so near nothing happened, but it doesn't mean there wasn't a problem
that os/house/hotel analogy was really good!
The new update to CrowdStrike falcon included some corrupted channel files (they contained just zeroes instead of the intended data), and because the core driver that loaded the channel files didn't do enough input validation, it continued on using the messed up channel files, and this revealed a bug that likely had been there for a while. The bug caused the driver to attempt to dereference a null pointer, which caused the BSOD.
Yeah and probably crowd strike have not fixed the bug because it would require a new release of the driver and that would have to go again through the Microsoft WHQL signing process which the use of these channel files seeks to avoid.
Note that this corruption claim is afaik coming from one random twitter user and has been denied by Crowdstrike who says there was a logic error in the updated rules file that caused the problem. It seems extremely unlikely to me that crowdstrike does no validation on these files given that they're being updated frequently on a huge number of machines and are therefore liable to get corrupted (due to power failures and such) on a regular basis.
I found a twitter post from someone that the problematic channel file was _not_ zero-filled on any of the systems he had to manually fix that day.
As someone who led the deployment of EDR and EPP to 18,000+ endpoints last year, agents are absolutely installed on Windows servers, yes. Updates like this that don’t go through change control are a calculated risk for more up-to-date protections. Problem is that the risk mitigation is that the vendor does testing and releases competently..
13.37% complete... ISWYDT 🙃
What does that mean
@@blackholesun4942 I see what you did there
@@blackholesun4942I am not sure which part you didn’t get. The custom blue screen of death (BSOD) is something they fabricated. 1337 is often used in gamer culture to mean LEET (or elite rather). Usually indicating something like highly skilled (1337 player for instance). ISWYDT : I see what you did there. So it is used a bit ironically here, because it was of course not a skilled update. Hope that helps.
@@blackholesun4942 🏴☠️
@@playground2137TBF, 1337 is specifically turn-of-the-millennium gamer culture (late GenX, elder millennial). I'm not sure I've even seen younger millennials using it, let alone Gen Z.
I would have listened to these guys talk about it for an hour
My local pub went down.. no fish and chips for me..
No cash in hand?
“This was a phishing attack and a chip level attack?”
“No, no… the cash register system is down thanks to broken Windows update”
“They broke your windows and stole your cash?!”
“No, the money is still here!”
“Okay, I’ll just pay you in cash then”
“I can’t do that! The register is locked unless the computer tells it to open! Besides, each purchase is required to update the inventory as well”
“I don’t see what the Tories have to do with anything in this case”
“… I don’t have time for your Monty Python shenanigans”
“I’d think this stuff would be programmed in C and not Python”
“GET OUT!”
@@Abdega 😂
in my local pub i can order by sending a SMS to their fax. cash-only place
@@Abdega When the best comment is buried in a thread
I mean crowdstrike issues aside, WHAT A CHANNEL! Thank you
Its going to be very interesting to see what Crowdstrike learns from this. One thing they didn't seem to use is a canary or blue/green deployment scheme. Hoping for some enlightening blog-posts on the topic eventually.
nothing. The guy in charge oversaw something exactly similar when he was at McAfee
Microsoft,CS,Black rock the WEF and more are tied together .was no accident
Never fails, something big happens in the field of cybersec, we can guarantee that we'll get a Computerphile video starring Dr Bagley &/or Dr Pound :)
Typical "Management Bug?" A CrowdStrike engineer or two urges more testing before release. Some executive then pounds the conference table and shouts, "No more f**king EXCUSES! I want that update NOW gawdammit!"
Yeah, and I want it rolled out to everyone, NOW!!! Phased roll-outs are for pussies!
Especially seeing that the CEO of Crowstrike *now* was the CTO at McAffee back *then* , when McAffee brought down XP Machines by deleting Windows core files in 2010. The common factor ist the manager.
As an Ex MS employee and one that worked at Windows, I appreciate what was said at 7:42 :)
😂 the example bluescreen at around 0:36 , 13.37% 😂 love it 😁
Oooh boy, you're guys are back. Finally!! ❤
Crowdstruck? We gave this overtime event a codename of 'clownstrike'
I love listening to these engaged guys 😁
Crowdstrike sounds like a nickname for Mustangs 😅
good one lol
13:06 totally agree, we just need to US develop our technology. But we see how US monopoly all technologycal aspects, and any real competitor they ban out...
UPDATE: Thanks tma2001 letting me know the zero file was not the cause. And in fact there is validation in place. The error was somewhere else.
So the below is inaccurate
Seems it was a lack of input validation.
Apparently the root cause of the crash was that one of the files in the definition update was just a file filled with zeros for whatever reason. Leading to a null pointer dereference (which always crashes, by design)
But that makes me go like: Input validation anyone?! Does CrowdStrike Falcon fail to at least make sure the definition file makes sense as a definition file before blindy following its directions?
Everyone who is even remotely competent knows to put headers on files, network packets and the like. A magic byte or two and some metadata goes a long way when validating.
no that was a red herring - for some people it wasn't all zeros and CS confirmed in a technical blog post that null bytes in the channel file were not the cause. There are many possible reasons why it was a file of zeros for some folks - pre-allocated ahead of time before updated or wiped clean as a post processing step for security.
Valid channel files have a magic signature at the beginning and they actually contain code in the form of byte code for a VM interpreter in the actual kernel driver. The logic error was in the byte code. Of course this means the actual driver can have gone through WHQL but is actually a dynamic entity.
@@tma2001 Ooh, thanks for the correction. I hadn't heard any technical detail updates since the original 0'ed file finding
@@TechSY730 you were not alone - I too was confused by what little folks had to go on initially. None of it made any sense!
There is a full explanation by the Cloud Architect B Shyam Sundar on Medium website to breaks it down.
We can probably thank Dave Plummer for making sure guys like this actually know how to explain the issue.
Are there no standards for deploying updates that run in the kernel?
Excellent discussion. I'm so glad I'm not in IT any more.
Crowdstrike didnt do any validation control(or not enough) in their Driver to check the .sys file before running it to confirm it wasnt just full of Null values etc.
I cant find one UA-camr talking about proper sysadmin practices at the enterprise level that would have caught this before getting rolled out. I have never worked at a company where PCs weren't locked down from software installs and every update (even ones from MS) were tested by local QA before rolling them out to your enterprise PCs. Unbelievable that airlines are being run this way. Unless Cloudstrike installed some rootkit that bypasses all these processes I'm shocked at the state of sloppiness in IT.
I am trying to voice out the same thing but not even tech guys understand. CS Falcon updates bypass everything but still i don't understand how admins allow live updates on supposedly closed system like airports, banks, POS etc. And the loophole seems like the same windows update server used fir both live and testing, or just plain network connection to outside world to allow CS Falcon updates so that it can prevent zero day security issues. It is just absurd!
You didn't mention that in order to install kernel drivers, the code needs to be submitted to Microsoft's to be tested, approved and digitally signed. As you mentioned, the bug was not present in the main kernel, but in the "channel files" that are updates without following that same process. It is not clear to me if those "channel files" are code or just configuration, but maybe Microsoft is partially at fault here for allowing these channel files in the first place, or for not sufficiently checking the kernel driver had the necessary logic to gracefully crash without taking down the entire system.
Clownstrike apparently uses a P-code interpreter to sneak unsigned code into their driver. You'd be a millionaire by Saturday if you invented a heuristic that can reliably detect a P-code interpreter and/or the P-code itself (which of course can be in any format the writer desires) running in kernel mode.
As I understand it, if something fails in ring zero or kernel mode, the entire OS goes down.
@@throwaway6478 In this case it's not that hard, it's a new file getting loaded from system32, the kernel knows every file you open so you could absolutely block unsigned files in system folders from loading, but as they said it would interfere with competing products so they can't do that, they signed an agreement to allow kernel drivers to work.
There are exceptions to requiring to get your code MS Certified - code that needs to respond to Day 0 attacks don't need certified, for obvious speed reasons. Fortunately/unfortunately.
the "bug" was in csagent.sys, thats the driver that was referencing an invalid memory address. Important to note that.
My school's coding club was faster to respond than our IT helpdesk, and they were more helpful too. They posted a document with detailed step-by-step instructions, while IT just said "come see us." Thankfully I got rid of Falcon at the end of spring semester, as we're not required to have it over summer break.
Not seeing much understanding of administration. A system I was admining involves testing updates before they get installed on the live environment and with this many computers, you don't install it on all of them at the same second, you install it in segments and don't continue until you have successfully restarted the first batch of computers.
This all about GREED admining, they didn't want to pay for doing to properly, my way of admining was developed in the 19xx, we have INTENTIONALLY dropped security to save money.
Yep, admin practices is the key and not a particular bug. Live updates in closed system is big NO no matter what sweet voice of software vendor tells you. And the most common phrase nowadays is: "it is for you security" - be it the people or the machines.
Some companies had staging environments but they use the same windows update server for both live and staging/testing so this update just bypassed software enforced policies and gone live. Those are mine speculations git from admins sharing their cases. Yet no in depth public case analysis. Hush practice fir reputation.
Big cyber-cockup
(a beat)
"Crowdstruck (Windows Outage) - Computerphile"
------
Right on time, thank you
It also doesn't help that Microsoft took away the key combo to tell the OS to boot into safe mode on startup. If that was a thing I'm sure this would've been at least a bit smoother.
It amazes me how many of you don't know about bootmenupolicy legacy.
@@throwaway6478 because I don't specialize in the black box that is Windows. Also why should I have to dig through layers of archaic settings to change this when it's a sensible default?
@@SyphistPrimeYou use an operating system where you have to edit dotfiles to configure your mouse. 🤣
@@SyphistPrime oh stop it, you're not reading the source code for linux to figure out how something works, no one does that... you "can" do it, but thats not a thing an average person does. You're reading documentation just like people do with windows. Stop it.
@@irql2 The documentation on Linux is leagues better than Windows. There's so many undocumented and hidden features in Windows where as with Linux it's all out in the open. Also I have read bits of source code when AUR packages failed to compile. I've very much used that to help fix issues with PKGBUILDs and compiler errors. It's not usually necessary to read source code because all the documentation is out in the open, unlike Windows.
00:03 Windows machines experienced widespread blue screens due to an operational error.
01:55 Windows utilizes safety mechanisms like blue screens to protect against critical failures.
03:43 Kernel-level code in Windows can cause serious errors if not managed properly.
05:32 Kernel mode software failures can severely disrupt essential services.
07:25 Microsoft's Windows systems faced critical issues due to a specific bug.
09:04 Mitigating system failures through advanced update mechanisms.
10:56 A genuine mistake led to significant issues, but damage could have been far worse.
12:42 Cloud dependency poses risks for individuals and organizations during outages.
14:24 Exploring advanced image recognition capabilities.
Apple has the luxury of being able to force changes to their OS like that because only a minuscule percentage of the world infrastructure relies on it. Microsoft must remain backwards compatible as best they can with their OS upgrades precisely because they aren't a tiny player in this arena.
This was a fantastic conversation! 😊
Been waiting for this 🍿
Best resources/books
Windows system internals (P1 & P2 ) usually takes 1 year to complete.
Art of memory forensics ( Wiley for understand NT authority and kernel objects ) : also this available in previous books as well.
Both are amazing books🔥🔥🔥
1:13 if that hotel is like linux then the guests would carry their own air conditioners 😂
and smart guests will build their own hotel next to the original, with only a small difference.
Linux can run CrowdStrike, and had a worryingly similar issue a few weeks ago, since it was in the kernel there was nothing Linux could do either... But only on a couple of distros and only if you had installed Falcon CS ...
Room key is not in the sudoers file. This incident will be reported.
And to get your room cleaned, the instructions would be, "Run make, look for any errors, and correct them."
The #1 video I’ve been most looking forward to!!!
The cure was worse than the disease.
"Anything that can go wrong will go wrong.."
- Murphy's Law
Another one I like is the variation of Murphy's law from Interstellar:
"Anything that can happen will happen."
Murphy also says...
"Remove QC/QA and you're f*d !!"
Honestly I would have called it crowdstroke :p
The wider issue is that, while Windows acts in a way to mitigate the consequences of a malicious act (which this failed update mimicked), there has seemingly been no thought into how to manage, contain and recover from such a problem when it is happening at scale on massive numbers of end-points at a very rapid rate. The rate of 'infection' is happening far faster than it can be contained. Microsoft's kernel code policy on top of Crowdstrikes error has exacerbated the problem.
The impact isn't a theoretical one, it is real with potentially life threatening consequences (like the Highways Agency being unable to control Smart motorways when their displays were not reflecting what signs were saying and they couldn't change them - that left people in Refuges being unable to rejoin live motorway lanes). It has exposed many weaknesses.
8:42 It can happen and indeed DOES happen on mac and particularly linux machines but the difference is those operating systems have safety mechanisms in place so that mass IT outages like the kind that just occurred can't fail to the point of individually booting every single device into safe mode and deleting a driver file. As you said, there was a kernel panic error on clownstrike's linux distributions, yet it didn't crash the world's infrastructure because the error was handled correctly. So microsoft should be at fault in some part for not providing these error handling systems.
This could be exactly as bad for linux machine if the driver is at ring 0.
@@Formalec the x86 family supports four rings, but for reasons Linux didn't continue the tradition used in VMS and some other contemporary mini computer operating systems, where kernel is ring 0, drivers are ring 1 and shared libraries are in ring 2. Choosing to do the same as NT did, skipping rings 1 & 2 only leaving kernel and user processes. Since essentially nothing uses more than ring 0 & 3 nowadays most new CPU designs only implement 2 rings
Linux allows you to specify a kernel command line from the bootloader, and you can blacklist individual drivers in the kernel command line, so recovery would be simpler.
@@JonBrase Same as with BSoDs, you would still need some techie typing in the fix at the Console. On cloud servers, it could be automated, same as with BSoD fixes, but I doubt it could be done on standalone machines
Mac has not allowed kernel level access since Big Sur.
This guy is great at explaining.
Wondering how it got past QA?
Seems like installing the update on a docker instance or vm would have found this bug.
Also, how was rollout conducted? Normally it would be tiered / staggered to minimize damage from faulty code. I haven't found any confirmation, but this looked like a "big bang" release.
@@ytechnology It sounded like from the video, what they pushed out was definition files, and not code per se? Normally I would not expect that kind of thing to cause a kernel panic, so maybe they didn't either. Hopefully, this incident will make them take a hard look at how they do/deploy things in the future, no matter what it is.
Friday update before the holidays strikes. Just like Friday built cars. Just push into production and go down the pub, will deal with problems when we get back.
QA is a cost center. Everyone is getting rid of that. Why not have the devs responsible for QA, oh and deploying the stuff to the customers and datacenters. The above is not a joke, I've lived it for 5 year now.
"Wondering how it got past QA?" - there was none. This industry is unregulated. The mentality is "push now, patch later". Maybe governments will finally wake up to the certainty of more timebombs.
There is a much simpler and pragmatic approach that I've used in places. Which is to simply not allow updates to critical IT infrastructure (DC, DNS, etc) until the update has gone out to a smaller group of endpoints first. Permit the update to 10% of end user compute estate before permitting it on all of it.
Kernel mode drivers should go through a rigorous testing regime (WHQL, for example). The other problem was Crowdstrike configured their driver as a boot-start driver otherwise people could have used safe mode easily.
There is a bottle of water under the desk !
8:40 The point of using not-Windows isn't that the other OSs are impervious, but rather the fact that diversification *is* redundancy. Instead, the current landscape is still heavily Windows-centric and that is a bad thing if we're talking resiliency.
"They may have implemented something badly, we don't know". Yes, we do know. It happened, therefore they implemented something badly. This sort of thing is why we have canary deployments, and apparently they have the infrastructure for that, and allow customers to have settings for which computers get updates first in order to validate them, but they also have some updates that simply ignore those settings, and this one one of them. Yes, they 'implemented something badly'.
it was definition files not the drivers themselves that broke so it's held under less scrutiny
@@alazarbisrat1978 'Held under less scrutiny' by whom? The reality is that it crashed computers, and this isn't the first time similar updates by Crowdstrike have caused crashes (including on linux). The fact that they know this is a possibility but failed to implement proper testing before pushing out to everyone, means the 'implemented something badly'.
@@TimothyWhiteheadzm they didn't know that would happen, sorta how this ever got out in the first place. but companies always neglect QA, it's just how it is. and also definition files themselves couldn't do any of this without a huge screw-up so they're not as important to defend, but had they tested it there would be no problem. some programmers just prefer to test after failure tho, just a complete miss
@@alazarbisrat1978 What makes this remarkable is that the entire purpose of this product and company is to address that QA neglect. They've demonstrated they're among the worst at the one thing they're claiming to do better.
@@0LoneTech not really, most companies do that, just that this one was widespread and broke something fundamental. they just got unlucky with their neglect and this slip-up got all the way and broke everything. legend has it that there have been many other issues in their code over time that went totally unnoticed and only now caused catastrophic failure
1:55 I always imagine the crashing OS saying something like "my goodness". But I like your version too.
12:50 don't apologize to Elon. He deadnames one of his kids. If he can do that, you can deadname his company. The best he's going to get out of me is ex-Twitter.
And then uses his child as a culture war pawn publicly. Gross
Thanks for explaining it so well. I love this channel!
Thanks Lord Targaryen
Bagley and Pound. What a great duo.
Sounds like a law firm.
@@dembro27 A get things done firm
"Sorry Elon"? Never apologize to that man.
Really amazing video, like always!
It's been obvious for a while now - MS does NOT DO software testing, nor Crowdstruck evidently. They are delegating the testing straight to the end user. They pushed a bad binary to an "on-the-fly" update, and after the updated binary was first touched, it crashed the system. That's criminal negligence, brought to you by industry's greatest security providers.
Something like this software is necessary: monitoring large networks for suspicious behavior. But letting the companies that make this software be privately owned and for profit removes points of accountability and introduces incentives to cut corners (increase profits by delivering an inferior product to a captured market). We can’t snap our fingers and prevent this from ever happening again, but we can improve the product by removing bad motivators (profit) from the equation.
"Dave's Garage" a former microsoft software engineer just did a video about what he thinks happened about this. Very comprehensive and very clear.
He also speaks extensively that this was possible because Crowdstrike works in kernel mode.
why would anyone want to watch that scammer?
@@murzilkastepanowich5818 WTF?
@@murzilkastepanowich5818 Sorry, I am not aware about any of that or don't even know what you are talking about. Just found about it yesterday, the video in question seems fine and basically makes some of the same points as this one, but is a bit more detailed.
@@cidercreekranch your wholesome 100 big le epic reddit content creator aint that wholesome 100 eh?
@@murzilkastepanowich5818 take your meds
Actually 8.5 million machines were affected.
But that were machines owned by some of the biggest companies in the world... which lead to bad things.
Loving how social media is making comp sci lecturers get trendy haircuts and dress properly 😂
Never, I say! NEVER! *puts on sandals over socks*
What an amazing Anti-advertisement! Now we all know what CrowdStrike is and how to avoid it like the plague 😂