when a null pointer dereference breaks the internet lol

  • Published 5 Sep 2024
  • but it may not be the devs fault.
    If you're a developer, sign up to my free newsletter Dev Notes 👉 www.devnotesda...
    If you're a student, checkout my Notion template Studious: notionstudent.com
    Don't know why you'd want to follow me on other socials. I don't even post. But here you go.
    🐱‍🚀 GitHub: github.com/for...
    🐦 Twitter: / forrestpknight
    💼 LinkedIn: / forrestpknight
    📸 Instagram: / forrestpknight

COMMENTS • 763

  • @fknight
    @fknight  Місяць тому +313

    UPDATE: New info reveals it was a logic flaw in Channel File 291 that controls named pipe execution, not a null pointer dereference like many of us thought (although the stack trace indicates it was a null pointer issue, so Crowdstrike could be covering). Devs fault 100% (in addition to having systems in place that allow this sort of thing). Updates to Channel Files like these happen multiple times a day.
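
    For illustration of what "handling it" would look like: a minimal C sketch of a guard against exactly this failure mode, assuming (hypothetically) a lookup over entries parsed from a channel/content file. The names are invented; this is not CrowdStrike's actual code.

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical parsed entry from a channel/content file. */
        struct channel_entry {
            uint32_t id;
            uint32_t flags;
        };

        /* Hypothetical lookup: returns NULL when no entry matches, e.g. because
           the file on disk was empty or full of zero bytes. */
        static const struct channel_entry *find_entry(const struct channel_entry *table,
                                                      size_t count, uint32_t id)
        {
            for (size_t i = 0; i < count; i++)
                if (table[i].id == id)
                    return &table[i];
            return NULL;
        }

        int read_entry_flags(const struct channel_entry *table, size_t count,
                             uint32_t id, uint32_t *out_flags)
        {
            const struct channel_entry *e = find_entry(table, count, id);
            if (e == NULL)      /* the guard: never dereference a NULL result */
                return -1;      /* report "no such entry" instead of crashing */
            *out_flags = e->flags;
            return 0;
        }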

    • @kofiz7355
      @kofiz7355 Місяць тому +12

      This should be pinned

    • @ingiford175
      @ingiford175 Місяць тому

      Thanks for the update. Have not used named pipes in a Long time....

    • @anaveragehuman2937
      @anaveragehuman2937 Місяць тому +2

      Source please?

    • @fknight
      @fknight  Місяць тому +16

      @@anaveragehuman2937
      -
      www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

    • @user-in3xs9gn2o
      @user-in3xs9gn2o Місяць тому +46

      Then delete this video.

  • @vilian9185
    @vilian9185 Місяць тому +1687

    Fun fact: the null pointer dereference was also in the Linux CrowdStrike agent, the Linux kernel just handled it like a boss

    • @oleksandrlytvyn532
      @oleksandrlytvyn532 Місяць тому +123

      There seem to be articles saying that some time prior, maybe a month or a few months earlier, CrowdStrike allegedly did the same thing we're seeing now for Windows to Debian 12 or Rocky Linux.
      Potentially because of the smaller blast radius it went unnoticed in the media.
      But I don't know myself if it was true or not, so take it with a grain of salt

    • @Songfugel
      @Songfugel Місяць тому +135

      well, a null pointer dereference is something that should throw an error, not be allowed and silenced without the dev explicitly handling it correctly. The problem here is: how was this ever allowed to be mass-delivered everywhere at once with such a glaring, general-case bug that should have shown up in any sort of testing?
      So Linux handling it quietly by itself might not be the own you think it is

    • @vilian9185
      @vilian9185 Місяць тому +155

      @@Songfugel it threw an error, it just didn't kill itself like Windows did, where the fix was to reboot your machine 15 times and hope that the network comes up before the driver 💀

    • @Songfugel
      @Songfugel Місяць тому +45

      @@vilian9185 But a failure like that should kill the program and not let it continue, and since it was the kernel itself, it should fail to boot past it. Windows had exactly the correct reaction to this very serious error; the problem is that it should have never gotten past the first round of tests

    • @HappyGick
      @HappyGick Місяць тому +103

      ​@@Songfugel No, you should leave it to the developer to handle a crashed driver. The end user using the driver does not care if it crashed or not, only that the program using it works, *and most importantly,* that the machine works. Fail silently (let the user boot), inform whatever's using the driver that it doesn't work and why, and let the developer handle it. There are times and places for loud failures. This is one of the occasions where it's better to silently fail and inform the developer. A crashed driver almost took down society with it.
      Edit because seems like it wasn't clear: no, I'm not saying we should dereference the null pointer. Of course not. I'm saying that we should crash the driver only, and let the system move on without it loaded. Or unload it if it crashed at runtime. If another program tries to use it, it will raise an error and will be able to recover why the driver failed. In enterprise environments it's much better to have the system running vulnerable than not running at all. A vulnerability costs millions on one company. A company-wide crash costs billions. A worldwide crash is incalculable.

  • @capn_shawn
    @capn_shawn Місяць тому +866

    “You cannot hack into a brick”
    -Crowdstrike, 2024

    • @Songfugel
      @Songfugel Місяць тому +4

      @@capn_shawn 😂

    • @jedipadawan7023
      @jedipadawan7023 Місяць тому +10

      Torvalds has always held that security bugs are just bugs and should not be granted special status whereby, given the obsession of some, all functionality is lost in the name of security.
      Crowdstrike just proved Torvalds is correct.

    • @Songfugel
      @Songfugel Місяць тому

      @@jedipadawan7023 He is kinda right here, but he is also often wrong and is just a normal person who just managed to get away with mostly plagiarizing (not sure if that is the right word, like Linus I'm a Finn, and not that great with English) Unix into an extended version as Linux

    • @Acer11818
      @Acer11818 Місяць тому

      but you can break it

    • @666pss
      @666pss Місяць тому +1

      😂😭

  • @whickervision742
    @whickervision742 Місяць тому +493

    But it's still their fault for pushing it out to everything everywhere all at once.

    • @brentsaner
      @brentsaner Місяць тому +44

      And they did so ignoring clients' SLA/update management policies, too! Damages as a *result* of breach of contract? Crowdstrike's *done* for.

    • @marcus141
      @marcus141 Місяць тому +18

      Well in my previous role, I deployed crowdstrike for a major broadcaster, and one common misconception in all of this is that crowdstrike can push updates to customer endpoints without their knowledge or consent. It doesn't work like that. Endpoint management is handled centrally by IT admin, and when crowdstrike releases a new Falcon sensor version, after reviewing, we can choose if we want to use the latest version or not. You can of course configure crowdstrike to auto update the sensors but that would be ludicrous for obvious reasons.

    • @kellymoses8566
      @kellymoses8566 Місяць тому +9

      @@marcus141 It wasn't a new version, it was just a definition file.

    • @SahilP2648
      @SahilP2648 Місяць тому +2

      @@kellymoses8566 maybe I am missing something but if the driver file got updated, wouldn't the affected PCs boot into recovery only when shutdown? So, they could still in theory keep running if not shut down?

    • @maddada
      @maddada Місяць тому +14

      100% agree. They should've updated 5% of users and compared failures to before the update. Not send updates to everyone in one go!
      Especially for such a huge company writing a critical kernel level software.

  • @bluegizmo1983
    @bluegizmo1983 Місяць тому +215

    Crowdstrike is DEFINITELY still at fault. You never ever ever push an update out live to millions of computers without extensive testing and staged rollouts, especially when that update involves code that runs at the kernel level!

    • @vesk4000
      @vesk4000 Місяць тому +4

      Yeah I cannot possibly comprehend how and why this was pushed to everyone so quickly. Also why didn't the clients of crowdstrike say: heyy, do we really have to update everything day 1?

    • @inderwool
      @inderwool Місяць тому +4

      They're a security company and it becomes a necessity that they roll out security patches to everyone at the same time. A staged rollout means you leave the rest of the customers vulnerable to being compromised.

    • @vesk4000
      @vesk4000 Місяць тому +5

      @@inderwool I agree if this was some kind of critical security update, but apparently it wasn't.

    • @ingiford175
      @ingiford175 Місяць тому +2

      Especially code that can execute within the Kernel

    • @batman51
      @batman51 Місяць тому +6

      I am still surprised that everyone apparently just loaded the update. Surely in a big organisation at least, you run it through your test network first. And if you really don't have one, you will know better now.

  • @stevezelaznik5872
    @stevezelaznik5872 Місяць тому +66

    I still don’t understand how this patch didn’t brick the machines they tested it on, the idea that a company worth $70 billion didn’t catch this in CI or QA is mind blowing

    • @simoninkin9090
      @simoninkin9090 Місяць тому +13

      They didn’t run it 😅 tested sections of code, but not the integrated product.

    • @ingiford175
      @ingiford175 Місяць тому

      @@simoninkin9090 Or how they did not stage the patch on a small fraction of machines per hour and then pull it back when BSODs happen

    • @xponen
      @xponen Місяць тому +5

      this company went big because of politics, they are the one who investigated alleged hacking of Democrat email server.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому +4

      @@ingiford175 Right. Rolling out to the world in one fell swoop is really irresponsible. Even given the best QA in the world, mistakes will get by, that's why you stage deployments.

    • @JamesTSmirk87
      @JamesTSmirk87 Місяць тому

      @@stevezelaznik5872 that’s just it. They clearly did not do integration testing.

  • @rickl7604
    @rickl7604 Місяць тому +155

    This is precisely why you actually test the package that is being deployed. If you move release files around, you need to ensure that the checksums of those files match.

    • @rekko_12
      @rekko_12 Місяць тому +55

      And you don't deploy anything on friday

    • @rickl7604
      @rickl7604 Місяць тому +6

      @@rekko_12 Amen.

    • @JamesTSmirk87
      @JamesTSmirk87 Місяць тому +32

      And you don’t deploy to the whole flipping world in one go.

    • @grzegorzdomagala9929
      @grzegorzdomagala9929 Місяць тому +7

      And md5 checksum all files...
      They must use some sort of cryptographic signature securing package integrity. It means they don't test "end product" - they probably tested compilation products, then signed the files and sent it to whole world - and somewhere between end test and release one of the files was corrupted.
      I bet it was something silly - for example not enough disk space :)

    • @simoninkin9090
      @simoninkin9090 Місяць тому +1

      @@grzegorzdomagala9929 exactly my thinking. Just skipped some critical integration tests - environment mismatch or something of the sort.
      However I don’t think it got exactly “corrupted”. The only reason the world got into this trouble was because they packaged the bug within the artifact.

  • @mitchbayersdorfer9381
    @mitchbayersdorfer9381 Місяць тому +281

    Saying the root cause was a "null pointer dereference" is like saying the problem with driving into a telephone pole is that "there was a telephone pole in the way." The root cause was sending an update file that was all null bytes. The fact that the operating system executed that file and reported a null pointer dereference as a result is not the fault of the OS, and is not a root cause.

    • @JamesTSmirk87
      @JamesTSmirk87 Місяць тому +50

      Bingo. And I can’t believe the testing server was apparently the only (apparently single) server in the whole world not affected. I get that we don’t want to make assumptions and point fingers willy nilly, but this one is a bridge way too far.

    • @paulbarclay4114
      @paulbarclay4114 Місяць тому +6

      the problem is centralized control
      that word salad is a tertiary problem

    • @astronemir
      @astronemir Місяць тому +4

      Well actually because of this shitty OS people miss flights etc. Should never run such critical systems in Windows.
      Just leave that for your employees PCs

    • @zebraforceone
      @zebraforceone Місяць тому +2

      ​@astronemir so what alterations would you make on an OS level to avoid this?

    • @Ryan-xq3kl
      @Ryan-xq3kl Місяць тому +15

      The root cause is ACCEPTING null bytes, just check for them, its LITERALLY the programmers fault

  • @samucabitim
    @samucabitim Місяць тому +21

    giving any software unlimited kernel access is just crazy to me

    • @mallninja9805
      @mallninja9805 Місяць тому +5

      MSFT: "Should we do something about the kernel, or develop AI screenshot spyware?"

  • @plaidchuck
    @plaidchuck Місяць тому +239

    Tired of hearing like Y2K was some panic or something that just magically fixed itself or wasn't a big deal. It wasn't a big deal because people spent years before fixing it

    • @HeeroAvaren
      @HeeroAvaren Місяць тому +5

      Yeah buddy we all watched Office Space.

    • @kxjx
      @kxjx Місяць тому +20

      @@HeeroAvaren well I am old enough to have seen it first hand, I don't remember what Office Space said but I do remember all the overtime 😅

    • @successmaker9258
      @successmaker9258 Місяць тому +10

      Welcome to bad reporting by the media, and a general lack of knowledge by the layman of tech

    • @davidhines7592
      @davidhines7592 Місяць тому +8

      there is another one coming in 2038 when old unix systems' 4-byte time integer overflows. Jan 19 2038 ought to be interesting if any of those systems haven't been fixed and are doing something critical.
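
      For the curious, the arithmetic behind that date: a signed 32-bit count of seconds since 1970-01-01 UTC tops out at 2^31 - 1 = 2,147,483,647, which lands on 19 Jan 2038 03:14:07 UTC; the next second wraps negative. A tiny C illustration (only systems still using a 32-bit time_t are affected):

          #include <stdio.h>
          #include <stdint.h>
          #include <time.h>

          int main(void)
          {
              int32_t last = INT32_MAX;        /* 2147483647 seconds after the epoch */
              time_t t = (time_t)last;         /* widen to this platform's time_t */
              printf("32-bit time_t maxes out at: %s", asctime(gmtime(&t)));
              /* prints: Tue Jan 19 03:14:07 2038 */

              /* what a 32-bit counter typically holds one second later */
              int32_t wrapped = (int32_t)((uint32_t)last + 1u);
              printf("one second later it holds %d (back before 1970)\n", wrapped);
              return 0;
          }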

    • @henryvaneyk3769
      @henryvaneyk3769 Місяць тому +21

      I spent many months doing tests and fixing code for Y2K. That nothing happened is a testament to the fact that we did our jobs well.

  • @coltenkrauter
    @coltenkrauter Місяць тому +173

    I have no doubt they will do a thorough investigation as this was such a massive impact with millions and billions of dollars of implications.

    • @astrocoastalprocessor
      @astrocoastalprocessor Місяць тому +3

      🤔 worldwide? probably trillions 🫣
      updated 24h later to add:
      the peanut gallery is correct, the wikipedia entry makes it more clear that some enterprises and markets were unaffected and some were only affected for a short time 🧐 thanks everyone

    • @TehIdiotOne
      @TehIdiotOne Місяць тому +27

      @@astrocoastalprocessor Nah, it was big for sure, but i don't think you get quite how much a trillion is.

    • @jedipadawan7023
      @jedipadawan7023 Місяць тому +13

      I have no doubt Crowdstrike are going to be sued into oblivion.
      I have been reading the comments from employees reporting how their company's legal departments are being consulted.

    • @puppy0cam
      @puppy0cam Місяць тому +3

      ​@@jedipadawan7023 Just because a legal layperson is trying to find out from a lawyer if there is any legal liability doesn't mean that there actually *is* any legal liability. That doesn't mean people won't try to sue them, and that will be costly fighting them off.

    • @JamesTSmirk87
      @JamesTSmirk87 Місяць тому

      The question is will anyone outside ClownStrike ever hear what actually happened?

  • @adwaithbinoy5355
    @adwaithbinoy5355 Місяць тому +56

    As the name says - crowd strike, every device goes on strike

    • @suntzu1409
      @suntzu1409 Місяць тому +3

      DoS like a boss

    • @AmxCsifier
      @AmxCsifier Місяць тому +5

      the name is quite fitting seeing how many people were left stranded in airports

  • @JohnSmall314
    @JohnSmall314 Місяць тому +27

    Let me guess. Maybe Crowdstrike recently laid off a stack of experienced developers who knew what they were doing, but were expensive, and kept the not so experienced developers who didn't know what they were doing, but were cheaper.
    Then on top of that because of the reduced head count, but same workload, then under pressure the developers cut corners to rush product out.
    I'm not saying that is what happened. But I have seen that happen elsewhere, and I'm sure people can come up with loads of examples from their own experiences.

    • @dmknght8946
      @dmknght8946 Місяць тому +9

      Oh funny enough, there's a topic on reddit (18 hours ago) told this: "In 2023, Crowdstrike laid off a couple hundred people, including engineers, devs, and QA testers…under RTO excuse. Aged like milk." But is there any official (or at least trusted) sources?

    • @smoocher
      @smoocher Місяць тому

      Sounds like a lot of companies

    • @Palexite
      @Palexite Місяць тому

      It’s true, but I think it’s vice versa. They’re keeping the “experienced” programmers while throwing away rookies. At least that’s the trend we see with Google and Microsoft.
      They want to pay less to employment as a whole, and the only way to do that without tearing the whole team apart is kicking people out.

    • @ABa-os6wm
      @ABa-os6wm Місяць тому

      Not at all. They skipped the first test and went directly to the cheap inexperienced suckers.

    • @davidjulitz7446
      @davidjulitz7446 Місяць тому

      Not likely. The underlying issue was obviously introduced a long time ago but never caught. So far only valid "param" files were pushed and parsed by the driver. The error itself is likely easy to fix if you accept that it also has to be able to parse invalid files without crashing.

  • @Bregylais
    @Bregylais Місяць тому +118

    Thank you for your insights. Man, I hope CrowdStrike does a thorough post-mortem for this one. That's the least they owe the IT professionals at this point.

    • @mirrikybird
      @mirrikybird Місяць тому +3

      I hope a third party does an investigation

  • @bart2019
    @bart2019 Місяць тому +10

    It did not break "the internet". It broke a lot of companies' office computers, but those are not on the internet. In fact, the internet chugged along just fine.

  • @w1l1
    @w1l1 Місяць тому +93

    most of the backlash from developers -> ego devs who write like 2 lines of crap code a day but are (for whatever reason) extremely vocal

    • @juandesalgado
      @juandesalgado Місяць тому +16

      Narcissism is so prevalent in this profession. As with surgeons, violinists and physicists ;)

    • @Dead_Goat
      @Dead_Goat Місяць тому

      ive been bitching about crowdstrike for a long time.

    • @FritzTheCat_1030
      @FritzTheCat_1030 Місяць тому +8

      @@juandesalgado Speaking as a retired violinist who now works as a software dev, I feel like physicist might be the next career I should look into!

    • @CptMartelo
      @CptMartelo Місяць тому +9

      @@FritzTheCat_1030 As a dev that started as a physicist, what are your tips to learn violin?

    • @juandesalgado
      @juandesalgado Місяць тому

      @@FritzTheCat_1030 lol - I hope you keep playing at home, though!

  • @originalbadboy32
    @originalbadboy32 Місяць тому +36

    First rule of patch management is you don't install patches as soon as they are available.
    If I know that then why some of these massive companies don't is beyond me. It seems that IT management has forgotten the fundamentals.
    Also technically it can be done remotely if it's a virtual machine or remote management is enabled.

    • @rehmanarshad1848
      @rehmanarshad1848 Місяць тому +22

      The problem is that these patches are automated OTA (Over the Air) patches.
      Which was marketed to businesses as there would be less administrative work in installing patches, since these patches come directly from CrowdStrike the trusted vendor. Thus, they wouldn't need to hire as many qualified IT people for cybersecurity tools patch management.
      It was like a SaaS service handled by the vendor that they didn't need to worry about.
      Little did anyone realize that there was no proper isolated testing done before pushing this out to production globally.

    • @rehmanarshad1848
      @rehmanarshad1848 Місяць тому +7

      The lack of testing and slow gradual rollout + Windows OS architectural design flaws
      Combined, it created a single point of failure.
      Didn't help that AzureAD was also down as well so anyone trying to login via Active Directory to remediate the issue and get Bitlocker keys were also screwed. 😅

    • @ChristianWagner888
      @ChristianWagner888 Місяць тому +6

      The update was more like a virus definition data file. The actual scanning engine driver file was not updated. These types of updates are apparently pushed multiple times a day as new “threats” are encountered. It is astonishing to me, that the Falcon driver cannot handle or prevent garbage data being loaded into it.
      Also it’s the poor architecture of Windows that driver crashes bring down the OS. Additionally possibly a bad architectural decision by CS to embed their software so deeply into Windows that the OS will crash if the Falcon driver misbehaves.

    • @Lazy2332
      @Lazy2332 Місяць тому

      yeah, for the physical machines, I hope they have vPro set up; if they don't, I bet they're really wishing they had done it sooner. Lol.

    • @allangibson8494
      @allangibson8494 Місяць тому

      @@Lazy2332Virtual machines proved even more unrecoverable than physical machines - you need a physical keyboard connected to enter safe mode (assuming you actually have the bitlocker keys).

  • @Zulonix
    @Zulonix Місяць тому +2

    As a developer, I always checked if a pointer was null before dereferencing it.

  • @isomeme
    @isomeme Місяць тому +9

    At the most fundamental level, it is obvious that CrowdStrike never tested the actual deployment package. Things can go wrong at any stage in the build pipeline, so you ALWAYS test the actual deployment package before deploying it. This is kindergarten-level software deployment management. No sane and vaguely competent engineer would voluntarily omit this step. No sane and vaguely competent manager would order engineers to omit this step. Yet the step was definitely omitted. I hope we get an honest explanation of how and why this happened.
    Of course, then you get into the question of why they didn't do incremental deployments, which are another ultra-basic deployment best practice. I am beginning to form a mental image of the engineering culture at CrowdStrike, and it's not pretty.

  • @coltenkrauter
    @coltenkrauter Місяць тому +37

    I have not written operating system code, but generally code is supposed to validate data before operating on it. In my opinion, developers are very likely the cause. Even if there is bad data, the developers should write code that can handle that gracefully.
    Also, this video asserted that this kind of issue could slip by the test servers. That sounds ridiculous to me. The test servers should fully simulate real world scenarios when dealing with this kind of security software. They should run driver updates against multiple versions of windows with simulated realistic data.
    But, I would be surprised if a single developer was at fault. Because there should be many other developers reviewing all of the code. I would expect an entire developer team to be at fault.
    It'll be interesting to learn more.

    • @TehIdiotOne
      @TehIdiotOne Місяць тому +10

      I'm just astonished that this got past testing, AND was deployed to everyone at same time. Just screams of flaws in the entire deployment process at crowdstrike.

    • @famboettinger2041
      @famboettinger2041 Місяць тому

      ... and also the management for not giving enough resources for testing. Its always features, features, features!

    • @Ic3q4
      @Ic3q4 Місяць тому

      Bc this clip is not worth it dont waste your brain

  • @kkgt6591
    @kkgt6591 Місяць тому +4

    AI usage is going to make such occurrences common in coming decade.

  • @dondekeeper2943
    @dondekeeper2943 Місяць тому +33

    The internet was not broken. Not sure why people kept saying it did

    • @robertbutsch1802
      @robertbutsch1802 Місяць тому +9

      Right. If it was a network-type problem, the IT folks could have just applied a fix across their network from the comfort of their cubicles and then gone home at 5:00 on Friday. Instead some of them had to run around to individual machines and boot them to safe mode while others had to try to remember where the bitlocker keys were last seen.

    • @oswin4715
      @oswin4715 Місяць тому +2

      Ye just a clickbait title but I guess everyone is just surprised by the scale of this. Most people couldn’t do their job and CrowdStrike probably caused billions in financial loss to these companies, airlines etc

    • @IKEARiot
      @IKEARiot Місяць тому +6

      "tHe SiTuaTiOn tHaT bRoKe tHe ENtiRe InTernEt"
      Instant downvote.

  • @FullFrontalNerdity-e3z
    @FullFrontalNerdity-e3z Місяць тому +3

    What this shows me is that it's a bloody miracle that any computer works at all.

    • @nathanwhite704
      @nathanwhite704 Місяць тому

      Any *Windows computer. The Linux systems that run crowd strike weren’t affected :).

    • @ItsCOMMANDer_
      @ItsCOMMANDer_ Місяць тому

      ​@@nathanwhite704they were in april ;)

  • @glitchy_weasel
    @glitchy_weasel Місяць тому +13

    A guy on Twitter theorized that maybe it was some sort of incomplete write - like when the filesystem records space for a file, but stops before copying any data, leaving a hole of just zeroes. If something like that happened in the distribution server or whatever it's called and didn't manifest during testing, well, kaboom!

    • @TheUnkow
      @TheUnkow Місяць тому

      Would be even "funnier" if it turns out to be a bug in the file system.

    • @sirseven3
      @sirseven3 Місяць тому +2

      ​@@TheUnkowI'm still suspect of windows. I work enterprise IT at a big defense contractor. I see drivers fail ALOT in windows and most of my job now is just updating drivers. I see memory management, inaccessible boot device, nvpcl.sys crashes, all related to drivers that get rolled back/corrupted from windows updates. I'm just not good enough yet to find it and expose it.

    • @TheUnkow
      @TheUnkow Місяць тому

      @@sirseven3 As a developer myself I know it's sometimes the weirdest bugs that cause the issue ... just a slightly incorrect offset and any file or code may become totally useless ... sometimes even hazardous.
      I haven't been using Windows for a while because of being unable to determine what causes some of the issues. I know that using an alternative is not always an option ... but debugging closed source is a really challenging process.
      Just because we get a set of APIs or other functionality from Microsoft to use ... no one guarantees they are free of bugs, security holes, privacy issues or memory leaks. Even if they were 100% OK, on the next update (such as in this case) an issue may be introduced and we will have trouble again.
      Note that Linux and other software aren't fully clear of these issues either; for example just recently there was the regreSSHion bug, which was also introduced in an update and enabled very serious security vulnerabilities.
      Still I would say the transparency of open source makes such issues easier to overcome and harder to introduce.
      The easier life with closed source has its downs, not just ups; we must take precautions against that. Glad to hear some people like yourself are serious about it.

  • @bob_kazamakis
    @bob_kazamakis Місяць тому +24

    Doesn’t macOS fail gracefully when a kext misbehaves? If so, you can still technically blame Windows for not handling that situation well

    • @samyvilar
      @samyvilar Місяць тому +4

      I don’t know about later iterations but from my experience in Big Sur and earlier, kexts can still cause kernel panics, at least when invoking an instruction that raises an uncaught/unhandled CPU exception; in my case I was trying to access a non-existent MSR register on my system. The thing is, whether it’s kexts/drivers/modules on macOS/Windows/Linux doesn’t really matter, because at that point you’re in ring 0 and the code has as much privilege as the kernel. The only safeguards at this level are rudimentary CPU exception handling, hence why kernel panics and BSODs always seemed so CRUDE with just a few lines of text, since at that point everything has halted and the CPU has unconditionally jumped to a single procedure and nothing else seems to be happening …

    • @k.vn.k
      @k.vn.k Місяць тому +1

      @@samyvilar CrowdStrike does not have kernel level permissions on new Macs, because Apple has been pushing people to move away from kernel extensions, so CrowdStrike runs as a system extension instead which is run outside of kernel.
      The system files on Mac are mounted as read-only in a separate partition and you need to manually turn SIP off and reboot in order to be able to even write/modify them.
      Good API designs encourages your developers to adopt more secure practices. CrowdStrike isn't intentionally malicious here, but lax security design in Windows stemming from good old Win32 days allowed such failure to happen.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому +1

      I doubt it because MacOS is like Windows, it does in place upgrades to software.
      Some versions of Linux and ChromeOS employ blue/green or atomic updates that allow for automated rollbacks if a boot failure occurs.

    • @samyvilar
      @samyvilar Місяць тому +1

      @@k.vn.k I was under the impression crowdstrike was windows only, for as long as I can remember Enterprise seemed to shy away from macOS, given Apples exorbitant price on its REQUIRED hardware. macOS Darwin kernel is significantly different from windows and Linux for that matter, crowdstrike may or may not need kernel level privileges, for feature parity across the 2 platforms, but make no mistake anything requiring ring 0 does!

  • @JamesTSmirk87
    @JamesTSmirk87 Місяць тому +32

    So, it wouldn’t show up on the testing server, but it would show up on millions of servers all over the rest of the world? I can’t say that makes sense to me.

    • @BittermanAndy
      @BittermanAndy Місяць тому

      Yeah, this is.......... very dubious.

    • @grokitall
      @grokitall Місяць тому

      we have known how to write software so this does not happen since the moon landings. we have even better tools now. the only way for this to happen is for everyone including microsoft to ignore those lessons.

    • @cyberbiosecurity
      @cyberbiosecurity Місяць тому

      This can not make any sense unless you are delusional. So you're good. The man stated absolute nonsense, like he has no idea what he's talking about.

    • @grokitall
      @grokitall Місяць тому

      @@JamesTSmirk87 of course if you canary release to the test servers first, then to your own machines, and only then to the rest of the world, it would have been caught.

  • @MASTERPPA
    @MASTERPPA Місяць тому +7

    Its called deploy to 1% of customers at a time... Maybe starting on a Monday at 6PM..

  • @bokunochannel84207
    @bokunochannel84207 Місяць тому +58

    the thumbnail says "not the devs fault". wrong, totally the devs fault. previous update broke future update, classic.

    • @soko45
      @soko45 Місяць тому +1

      Pdf weeb

    • @pen_lord8520
      @pen_lord8520 Місяць тому +4

      @@soko45You have a right to cry about a drawn picture on the internet.

    • @gearoidoconnell5729
      @gearoidoconnell5729 Місяць тому +1

      I agree, it's test, test and more testing. They should have rolled the test out to some PCs, not all, to check that it fully worked. You'd think with airlines that part would get the most testing. The code didn't change, just the OS it runs on, and clearly it wasn't tested for that OS.

  • @whickervision742
    @whickervision742 Місяць тому +24

    As I understand it, AssClownStrike has a "secure" conduit to copy anything to the system32 folder. Windows happily runs any driver file there during startup. (Reworded for the pedantic).
    Windows is designed to bugcheck (bluescreen) on any driver problem. Always has.
    Having the ability to send trash over to a billion computers' system32 folders with one command is the real problem.

    • @robertbutsch1802
      @robertbutsch1802 Місяць тому

      Yeah, I’ve heard Windows channel files mentioned. Sounds like a similar process to what MS uses to distribute new Defender signature files.

    • @allangibson8494
      @allangibson8494 Місяць тому +1

      It wasn’t a driver fault - it was a bad file configuration the driver downloaded automatically.

    • @sirseven3
      @sirseven3 Місяць тому

      ​@@allangibson8494but couldn't it have been a bad pull from the WSUS to the clients? The checksum wasn't verified on the client side, but verified before distribution

    • @reapimuhs
      @reapimuhs Місяць тому +1

      @@allangibson8494 that still sounds like the driver's fault for not gracefully handling that bad file configuration

    • @allangibson8494
      @allangibson8494 Місяць тому

      @@reapimuhs Yes. The file error checking seems sadly deficient. A null file check at least would seem to have been warranted.

  • @francismcguire6884
    @francismcguire6884 Місяць тому +2

    Thanks for the detailed explanation of why I am spending the first 4 days of my vacation at the airport. Honestly.

  • @ZY-cr7yg
    @ZY-cr7yg Місяць тому +20

    If the disk corruption occurred before checksum, how come it’s not caught in the CI pipelines. If the corruption happened after the CI pipelines, why don’t they check the checksum before distributing it

    • @grokitall
      @grokitall Місяць тому +2

      but that assumes that they have a decent testing and deployment strategy, despite all the evidence to the contrary.
      to paraphrase terry pratchett, in the book raising steam, you can engineer around stupid, but nothing stops bloody stupid! 😊

  • @raylopez99
    @raylopez99 Місяць тому +3

    Is this man speaking into a cactus in a vacation setting? He's crazy. Subscribed!

  • @bf-696
    @bf-696 Місяць тому +1

    I guess that doing a validation test on a limited scale before pushing to the world just never occurred to anyone at CrowdStrike or MS?

  • @microdesigns2000
    @microdesigns2000 Місяць тому +1

    Pointers from a file, that is nuts. 😂

  • @diogotrindade444
    @diogotrindade444 Місяць тому +2

    All parties need to fix this broken system:
    - Security companies cannot ever force-push without testing.
    - OS vendors (especially MS) need to improve all aspects of this scenario with lots of new, well-documented automated testing/check tools for multiple steps in the process.
    - Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS running if you want to make sure you're online all the time.
    We need better software built for failure, especially for essential companies that cannot stop. If companies do not fix this at all levels it can open a new door for failure.

  • @Murph9000
    @Murph9000 Місяць тому +2

    There is something you can do better for the case of a bug after testing, when you're going to push an update to a massive population of systems. Unless it's an emergency update that needs to be pushed NOW, you do a phased push. Push to 1% of the systems, and wait an hour (or longer); then push to 5%, and wait 5 hours; then push to 10%, and wait 12 hours; finally push it out globally. While you are waiting each time, you monitor support activity closely and/or look for any abnormal telemetry such as high rates of systems reporting errors, going offline, etc.
    You can also split the application between kernel and user space, so that you have a minimal footprint in kernel space and do the more complicated work in user space. In that model, the kernel code can be hardened and shouldn't change on a regular basis; and the high frequency updates are then to the user space code, which is much less likely to take out the entire system due to bad data.
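
    A rough sketch of that phased-push loop in C, just to make the shape concrete — the push/telemetry hooks are invented stubs standing in for whatever a real deployment system exposes:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        /* Stubs standing in for a real deployment system's hooks. */
        static bool push_to_fraction(double fraction)
        {
            printf("pushing update to %.0f%% of hosts\n", fraction * 100.0);
            return true;
        }
        static bool telemetry_healthy(int soak_hours)
        {
            printf("watching crash/offline telemetry for %d hour(s)\n", soak_hours);
            return true;   /* in reality: compare error rates against the pre-update baseline */
        }

        int main(void)
        {
            struct stage { double fraction; int soak_hours; } stages[] = {
                { 0.01,  1 },   /* 1% of the fleet, soak for an hour */
                { 0.05,  5 },   /* then 5%                           */
                { 0.10, 12 },   /* then 10%                          */
                { 1.00,  0 },   /* only then, everyone               */
            };

            for (size_t i = 0; i < sizeof stages / sizeof stages[0]; i++) {
                if (!push_to_fraction(stages[i].fraction) ||
                    !telemetry_healthy(stages[i].soak_hours)) {
                    printf("anomaly detected: halting rollout and rolling back\n");
                    return 1;
                }
            }
            return 0;
        }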

  • @JSRTales
    @JSRTales Місяць тому +6

    probably they might have laid off the qa who could have caught it

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому +1

      And the deployment team who would deploy to say 1% of their customers first to be double dog sure.

  • @henryvaneyk3769
    @henryvaneyk3769 Місяць тому +2

    Dude, it is still the driver developer's fault. What happened to using MD5 or SHA checksums to validate the contents of a critical file? If the driver did the one simple step to do checksum validation, it would have noticed that the contents of the data file is not valid, and could have refrained from loading the file and could then have issued an alert instead of BSODing. It would be a very simple step to also add the checksum and do validation during the CI/CD pipeline and the installation process.
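
    Something like this is all it takes — a plain CRC-32 shown here only for brevity (a security product would verify a cryptographic hash or signature instead), with the expected value recorded when the file was built and tested; the function names are hypothetical:

        #include <stdint.h>
        #include <stddef.h>
        #include <stdio.h>

        /* Bitwise CRC-32 (the common reflected 0xEDB88320 polynomial). */
        static uint32_t crc32_bytes(const uint8_t *data, size_t len)
        {
            uint32_t crc = 0xFFFFFFFFu;
            for (size_t i = 0; i < len; i++) {
                crc ^= data[i];
                for (int b = 0; b < 8; b++)
                    crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
            }
            return ~crc;
        }

        /* Refuse to hand a content file to the parser unless its checksum matches
           the value recorded at build/release time. */
        int load_if_intact(const uint8_t *file, size_t len, uint32_t expected_crc)
        {
            if (file == NULL || len == 0 || crc32_bytes(file, len) != expected_crc) {
                fprintf(stderr, "content file corrupt or truncated, refusing to load\n");
                return -1;
            }
            /* ... only now parse and apply the update ... */
            return 0;
        }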

  • @chronixchaos7081
    @chronixchaos7081 Місяць тому +1

    The first use of ‘Gnarly Event’ to describe a world wide catastrophe. Well done.

  • @manojramesh4598
    @manojramesh4598 Місяць тому +3

    Crowdstrike really had the crowd strike!!!!

  • @ExpensivePizza
    @ExpensivePizza Місяць тому +4

    As a software developer with over 30 years of experience I must say... you couldn't be more wrong. There's no way in the world this wasn't a developers fault. Software developers are responsible for testing the actual thing they're going to ship against the thing they're going to ship it on. If they don't do that, it's on them.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      The devs did no doubt create the problem and wrote code that is prone to failure. The devs must take some blame for sure.
      The dudes in charge of deploying the code around the world are also to blame. Why on earth would you not deploy this to a percentage of your clients first until it is proven to be reliable? Deploying to everyone at the same time is not a devs fault. It is still with Crowdstrike tho, very irresponsible of them.
      I knew something bad would happen like this after I retired lol.

    • @mallninja9805
      @mallninja9805 Місяць тому +1

      @@JeanPierreWhite Surely there's enough failure here to spread some blame around. Devs should check for & handle null pointers. Test suites should find bad channel files. Engineering department should properly fund & staff. Fortune-100 companies should be wary of all deploying the exact same EDR solution. etc etc.

  • @AndrewTa530
    @AndrewTa530 Місяць тому +2

    Friends don't let friends write C++

  • @juanmacias5922
    @juanmacias5922 Місяць тому +29

    3:20 don't get me wrong, it could have even been flipped by a solar flare, but saying something happened after CI/CD and testing still sounds like it should have been implemented better haha Edit: WTF I had never heard of the 2038 bug, makes sense tho, I always found Unix time to be limiting.

    • @sadhappy8860
      @sadhappy8860 Місяць тому +5

      That's just an extra little thing for us all to worry about for 14 years! Good night, lol.

    • @astrocoastalprocessor
      @astrocoastalprocessor Місяць тому

      @@sadhappy8860😱

    • @eltreum1
      @eltreum1 Місяць тому

      Enterprise gear uses hardware ECC RAM with a separate parity chip for error checking and correction to prevent that. Even if it flipped the same bit in 2 chips at the same time perfectly the file created would have failed a CRC integrity check and the build should have failed in pipeline. A failing disk or storage controller in a busy data center is not going to pick on 1 file and would be eating enough data to set off alarms. This was the either the perfect storm of multiple human errors or sabotage.

  • @SimonBlandford
    @SimonBlandford Місяць тому +26

    The idea that security somehow involves installing a remotely controlled agent that can potentially go full "Smith" on critical servers is the problem.

    • @joseoncrack
      @joseoncrack Місяць тому +1

      Yes.

    • @krunkle5136
      @krunkle5136 Місяць тому +1

      Fr. Security should never require critical code that can CRASH the kernel to be continuously deployed so easily.

    • @kxjx
      @kxjx Місяць тому +8

      AV software is clearly very risky. The industry for some reason seems obsessed with it. Customers keep asking for it on their servers, I keep saying no we have other ways to handle this hazard. Why oh why are you letting a vendor push kernel changes to all your domain controllers at the same time? Why are you in a position where you feel you need this kind of software on critical servers? Are you letting your administrator browse the Web from your file server?

    • @successmaker9258
      @successmaker9258 Місяць тому

      AVs are risky, users are riskier. No, riskiest.

    • @reapimuhs
      @reapimuhs Місяць тому

      @@kxjx then what happens if a bad actor manages to exploit your server in some way to get their malware onto its system and running? without an AV on the server to help catch it then surely it would be capable of running loose for far longer than it would have if an AV was present would it not? what other and better ways do you have to "handle this hazard" without something present on the server to try and identify and deal with it?

  • @prcvl
    @prcvl Місяць тому +2

    Why should the null bytes have anything to do with the file? If you deref a nullptr, you crash in cpp

  • @burtonrodman
    @burtonrodman Місяць тому +2

    good explanation. one additional note that on Windows at least, a null ptr deref is basically a special case of an access violation… the first page of the process is marked as unreadable and any access attempt (like the 9c in this case) causes an access violation and any access violation in the first page is assumed to be a null ptr deref.
    i’m really surprised people aren’t talking more about why this went out to millions of computers all at once. why aren’t they doing a phased roll out? i bet they will now 😂
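
    To make the "9c" point concrete, a hypothetical C sketch: a null base pointer plus a small field offset gives an address like 0x9c, which still sits inside that unreadable first page, so the fault gets reported as a null pointer dereference; a guard just has to refuse such addresses. The layout and names here are invented.

        #include <stddef.h>
        #include <stdint.h>

        /* Invented layout: a field living at offset 0x9c from the table base. */
        struct table_header {
            uint8_t  reserved[0x9c];
            uint32_t entry_count;          /* &base->entry_count == base + 0x9c */
        };

        #define FIRST_PAGE_SIZE 0x1000u    /* the page Windows keeps unmapped */

        uint32_t read_entry_count(const struct table_header *base)
        {
            /* If the lookup that produced 'base' failed (e.g. over a zero-filled
               file) and returned NULL, base + 0x9c would land in that guard page. */
            if (base == NULL || (uintptr_t)base < FIRST_PAGE_SIZE)
                return 0;                  /* treat as "no entries" instead of faulting */
            return base->entry_count;
        }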

    • @reapimuhs
      @reapimuhs Місяць тому +2

      other comments seem to suggest this wasn't an actual update but rather a faulty definition file that was downloaded, the real problem is why they were not validating the integrity of these files and gracefully handling corrupted ones.

    • @burtonrodman
      @burtonrodman Місяць тому

      @@reapimuhs that makes sense, but regardless of what they call it, imho even config files are a part of the software and require testing and rollout procedures just as if code had been updated.

  • @SkinnyCow.
    @SkinnyCow. Місяць тому +1

    Microsoft Windows is one patch on top of another patch. There's a reason why Linux and Apple software is preferred by developers.

    • @evacody1249
      @evacody1249 Місяць тому +1

      🙄

    • @evacody1249
      @evacody1249 Місяць тому

      It's also the reason they have fewer apps and programs they can run, because it would mean a whole rewrite of the majority of apps that have no issues.
      Also workarounds such as Wine are not the answer.

    • @ItsCOMMANDer_
      @ItsCOMMANDer_ Місяць тому

      Yeah, because unix is a monopoly, you simply cant get good alternatives on windows (jk)

  • @harisaran1752
    @harisaran1752 Місяць тому +2

    Always rollout slow, why the hurry, rollout 1% if its okay 10% and so on

  • @kellymoses8566
    @kellymoses8566 Місяць тому +1

    Even a simple 32 bit CRC would have detected that the file was corrupt. So incompetent.

  • @mikef8846
    @mikef8846 Місяць тому +1

    Headline should be "level 1 techs save the world."

  • @jmarti1997jm
    @jmarti1997jm Місяць тому

    Crazy how I can understand how one line of assembly code caused everything to just die

  • @SaHaRaSquad
    @SaHaRaSquad Місяць тому +1

    Wouldn't it be funny if Crowdstrike used their own security product (I know, right?) and had bricked their own computers as well?

  • @MeriaDuck
    @MeriaDuck Місяць тому +1

    When parsing input data, especially from a kernel driver, one needs to be VERY defensive.
    Validation should happen and the validation stage should not be able to crash on any input, especially empty or all zero files.
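
    Roughly the kind of pre-parse validation being described, as a hedged C sketch — the header fields and magic value are invented; the point is that empty, truncated or all-zero input gets rejected before any pointer is ever built from it:

        #include <stdint.h>
        #include <stddef.h>
        #include <string.h>

        #define CONTENT_MAGIC 0x43484E4Cu      /* made-up 4-byte magic */

        struct content_header {
            uint32_t magic;
            uint32_t version;
            uint32_t entry_count;
            uint32_t payload_bytes;
        };

        int validate_content_file(const uint8_t *buf, size_t len)
        {
            struct content_header h;

            if (buf == NULL || len < sizeof h)
                return -1;                      /* empty or truncated file */
            memcpy(&h, buf, sizeof h);          /* no unaligned direct casts */

            if (h.magic != CONTENT_MAGIC)
                return -1;                      /* an all-zero file fails right here */
            if (h.entry_count == 0 || h.payload_bytes == 0)
                return -1;                      /* nothing usable to load */
            if ((size_t)h.payload_bytes > len - sizeof h)
                return -1;                      /* header claims more data than exists */
            return 0;                           /* only now hand it to the parser */
        }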

  • @InstaKane
    @InstaKane Місяць тому +1

    No I disagree, they should be testing the updates on Windows, Mac etc. by rolling out the updates to the machines, performing restarts on those machines and then running a full Falcon scan to check that the application behaves as expected. Also an engineer can check if a value is null before dereferencing, so I think this is an engineering/testing issue by CS for sure. But hey, live and learn; I guess their testing processes will be updated to catch these types of bugs going forward.

    • @alxk3995
      @alxk3995 Місяць тому +1

      A security company of that scale should have their testing and update pipeline figured out. Learning basics at that size is just unacceptable.

  • @diogotrindade444
    @diogotrindade444 Місяць тому

    OSs like openSUSE, Fedora Silverblue, macOS, and Chrome OS use automatic rollback mechanisms to revert to a stable state if an update or configuration change causes a system failure, preventing widespread issues.

  • @baruchben-david4196
    @baruchben-david4196 Місяць тому +3

    Even I know yer not spozed to dereference a null pointer... How could it not be the devs?

  • @me99771
    @me99771 Місяць тому +4

    But doesn't this mean that whatever's reading this file isn't checking if the file has valid data? So there is a bug in the code and it was just sitting there for who knows how long? Have they not tested how their code handles files filled with zeros or other invalid data?

  • @tbavister
    @tbavister Місяць тому +1

    Test your final deliverable *as a customer*

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      Clearly Crowdstrike bypassed all the change management that corporations employ. There is no way that all corporations worldwide decided not to do change management last Thursday. Crowdstrike updated customers' production systems bypassing all change management. bad bad bad. They will be sued out of existence.

  • @mrgunman200
    @mrgunman200 Місяць тому +10

    The scale it happened is 100% their fault and preventable period.

    • @krunkle5136
      @krunkle5136 Місяць тому

      Fr. The buck must stop somewhere. I get protecting people from a mob but...

    • @mrgunman200
      @mrgunman200 Місяць тому

      @@krunkle5136 A single dev isn't to blame, the whole company is. Nonetheless, no matter what, there can always be situations like this, which is why it's shocking they don't have rolling updates to minimize such damage. If for example they only rolled it out to places like McDonald's kiosks first, then slowly to more critical clients, they would have known of the issue before it became a huge cluster fuck

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      Crowdstrike will be sued out of existence due to this.
      Microsoft also has some blame, only their OS was affected by a bad Crowdstrike release. Their system is too fragile and has no automated recovery or rollback.

    • @mrgunman200
      @mrgunman200 Місяць тому

      @@JeanPierreWhite Hopefully that will bring some good to Windows for a change of pace

    • @grokitall
      @grokitall Місяць тому

      ​@@JeanPierreWhitethe issue is not that they had no recovery mechanism in place, it is that go into safe mode and fix it yourself does not work with locked down machines.

  • @drygordspellweaver8761
    @drygordspellweaver8761 Місяць тому +1

    A simple zero initialization would have prevented this. ZII (zero is initialization)

    • @tahliamobile
      @tahliamobile Місяць тому

      Even with a basic tech understanding there seem to be way too many obvious errors involved in this incident. ZII, staggered rollout, local testing...
      This is not how a professional software security company operates.

  • @pauljoseph3081
    @pauljoseph3081 Місяць тому +9

    Just imagine the amount of Jira tickets and story points within Cloudstrike right now... Non devs folks can leverage on this and micromanage all devs moving forward lol

  • @josb
    @josb Місяць тому +3

    Maybe, but you don't push an update on a kernel driver to all your clients at the same time. Kernel drivers are serious business; you don't want to mess with them

    • @TheUnkow
      @TheUnkow Місяць тому

      You have testing environments for that.
      If you don't push it to all at the same time you could be sued for giving priority to some customers over others (ergo discrimination, or downright theft) as some security updates may be essential.
      A bug of this caliber simply should not have been allowed on live, it was a most basic and serious mistake for a security company.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      @@TheUnkow BS.
      Most large corporations have change management in place to prevent production software from being updated that doesn't go through all the necessary quality steps. Crowdstrike updated clients systems without their knowledge or permission.
      In addition some customers are OK being a beta customer, those are the ones you target first, then the majority and finally the customers who say they want to be one release behind.
      Deploying all at the same time is clearly irresponsible and highly dangerous as evidenced by this disaster.
      Making an update available for download by everyone is fine, but pushing said release to everyone at the same time is irresponsible and will be the basis for a slew of lawsuits against Crowdstrike.

    • @TheUnkow
      @TheUnkow Місяць тому

      @@JeanPierreWhite If someone screws up, they can always be sued.
      If the option of having beta testers is included in the contract, then that is a kind of software model; those rules were agreed upon and that is fine.
      BUT as a big business I would not want my software provider just deciding that my competitors get some security updates before me, just because they had some kind of additional arrangement or the software provider deemed someone should get them first (even if it was round robin).
      And yes, almost everyone tries to skip steps until a thing like this happens. Skipping saves capital, and because the first priority of companies is capital, not the lives of humans in hospitals who depend on their software, things like this will continue to happen; but they are not right just because they are being done by many of the corporations.

  • @awkerper
    @awkerper Місяць тому +2

    It was the Fisher Price dev tools they used!

  • @HenkvanHoek
    @HenkvanHoek Місяць тому +1

    That is why I like PiKVM on my servers. Although I don't use Crowdstrike or even Windows. But it can happen on Linux or Apple as well.

  • @brentsaner
    @brentsaner Місяць тому +2

    You kind of downplay the actual largest concurrent global IT outage in history, dude.

    • @krunkle5136
      @krunkle5136 Місяць тому +1

      Just tech people typically downplaying issues and avoiding accountability.

  • @TheDa6781
    @TheDa6781 Місяць тому +1

    This guy is on drugs. All they had to do is test it on one PC.

    • @baboon_baboon_baboon
      @baboon_baboon_baboon Місяць тому +1

      He has 0 clue what he’s saying. You can do plenty of null pointer dereference checks similar to typical null checks

  • @henningjust
    @henningjust Місяць тому

    Respect for mentioning the 2038 problem. Gosh, I hope the code I wrote in 1999 isn’t still running by then😂

  • @Gengh13
    @Gengh13 Місяць тому +1

    Of course I have never forgotten to check the pointer, never happened in the past, totally impossible👀.

  • @gaiustacitus4242
    @gaiustacitus4242 Місяць тому

    Crowdstrike was anticipating a major worldwide attack by a never-before-heard-of hack, so the development team decided to put Windows into "Super Safe Mode".

  • @figloalds
    @figloalds Місяць тому

    The billion dollar mistake just went nuclear

  • @stingrae789
    @stingrae789 Місяць тому +1

    Still the dev team's fault.
    Testing wasn't sufficient and environments weren't close to production. Also technically they could have also done a canary rollout which would have meant only a few servers were affected.

  • @williamforsyth6667
    @williamforsyth6667 Місяць тому

    There is no in person repair for most of the cases.
    Servers in data centers usually do not use disks directly but through some storage network technologies.
    They can access the file systems of the affected machines remotely.

  • @mariusj.2192
    @mariusj.2192 Місяць тому

    You COULD prevent that by signing before testing, so the signature guarantees what you're shipping is what you've tested.

  • @MALITH666
    @MALITH666 Місяць тому +10

    I won't point fingers at devs. I'd point at releasing production changes on a working day. Especially with CS being an agent that sits at the kernel level of the OS. This means zero testing was done.

  • @RC-1290
    @RC-1290 Місяць тому +1

    Checksums are a thing, right?

  • @JeremySmithBryan
    @JeremySmithBryan Місяць тому

    I am amazed by the fact that it is 2024 and we're still writing software ( especially operating systems and drivers ) in non-memory safe languages and without formal verification. Crazy. That's what we should be using AI for IMHO.

  • @jamesbutler5570
    @jamesbutler5570 Місяць тому +1

    In c++ you have to check if data is valid!

  • @jrkorman
    @jrkorman Місяць тому +2

    IF? if it had only happened to certain computers, with certain versions of OS, then I'd maybe believe that testing might not have caught it. But with this many computers, all at the same time, CrowdStrike's pre-delivery testing on a deployment box should have broken also! So, deployment testing was not done properly! If at all?

  • @pilauopala843
    @pilauopala843 Місяць тому

    It’s the developers fault. They didn’t check that the struct pointer was NULL before referencing it.

  • @kamilkaya5367
    @kamilkaya5367 Місяць тому

    That's why there are two phases in product stages. One for development phase, one for master or production phase. Even if There is an error after all the merge requests coming together and introduce a bug, You would catch it during the development phase and fix it. You wouldn't release it right away. This fault is not justifiable.

  • @RoterFruchtZwerg
    @RoterFruchtZwerg Місяць тому

    Well if your CI/CD and testing allow for file corruption afterwards then they are just set-up wrong. The update files should be signed and have checksums and you should perform your tests on the packaged update. Any corruption afterwards would result in the update simply not being applied. The fact that the update rolled out shows they either package and sign after testing (which is bad) or don't test properly at all (which is even worse and probably the case).

  • @semibiotic
    @semibiotic Місяць тому

    NULL dereference is always a dev's fault, because it is a lack of simple error handling.

  • @renat1786
    @renat1786 Місяць тому

    Here must be a meme with Bret Hitman Hart pointing at null pointer dereferencing in Wrestlemania game code

  • @LCTesla
    @LCTesla Місяць тому

    A pointer's validity (i.e. not being a null reference) can always be checked, so the dev that wrote that does hold some accountability. But what's more important is that code is run in small execution blocks that never take down the whole system when an exception of any kind occurs.

    • @grokitall
      @grokitall Місяць тому

      rubbish. this is kernel level code, and the wrong type of bug will crash the kernel on any operating system.
      the problem is it should never have been deployed, should have immediately stopped deployment when the machines started crashing, and windows needed a better response to a broken driver than just put your locked down machine into safe mode and fix it yourself.

  • @MrTrak08
    @MrTrak08 Місяць тому +6

    It would have been trivial to test the driver for this kind of issue; sadly people are too complacent, thinking the worst can never happen

  • @benyomovod6904
    @benyomovod6904 Місяць тому +1

    Why no canary test, developer 101

  • @nicnewdigate
    @nicnewdigate Місяць тому +1

    Presumably you don’t understand the difference between the internet and Microsoft windows. And Also absolutely Microsoft’s fault too.

  • @noelgomile3675
    @noelgomile3675 Місяць тому

    This could have been avoided if CrowdStrike used a null safe programming language.

    • @ItsCOMMANDer_
      @ItsCOMMANDer_ Місяць тому

      No it couldn't, it seems like it was read from a file so no memory-safe lang would have helped

  • @mysticknight9711
    @mysticknight9711 Місяць тому

    Need to disagree - a bug introduced after CI/CD (i.e. perhaps in code signing, or code packaging/unpackaging) violates the dictum that “thou shalt test the same bits as delivered to the customer”

  • @niveZz-
    @niveZz- Місяць тому

    basically they immediately tested it on all devices in the world instead of at least one after pushing the update lol
    crowdstrike is quite a fitting name tho

  • @johnmcvicker6728
    @johnmcvicker6728 Місяць тому

    Deploy to limited test servers for something like this, run it a few days. This seems like a deploy to the world and not a staggered deployment.

  • @shashwa7
    @shashwa7 Місяць тому

    Still it could have been easily avoided if they did incremental/batch/controlled rollouts, government, travel and healthcare systems must always receive any new update after few months of testing of public rollouts.

  • @Scott_Stone
    @Scott_Stone Місяць тому

    Should've rolled out that update in chunks. That's the real problem

  • @wengkitt10
    @wengkitt10 Місяць тому +1

    I wish everyone knew what really happened and would stop blaming Microsoft and even the government in those affected countries.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      It's aliens.
      It's always aliens.

    • @grokitall
      @grokitall Місяць тому +1

      but the machines staying down is directly due to microsoft.
      if they had looked past safe mode and implemented something to detect and recover from bad driver updates, then it would have been a simple case of turning the machine off and then on again and letting it recover.

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      @@grokitall Yep. Microsoft have a lot of 'splaining to do.

  • @BanterMaestro2-y9z
    @BanterMaestro2-y9z Місяць тому

    _"How was your day, love? Are you okay? You look frazzled."_
    _"No biggie. Just took down the Internet, that's all."_
    _"Oh I'm sorry! By the way, I burned dinner and the cat puked in your sock drawer."_

  • @Nahash5150
    @Nahash5150 Місяць тому +1

    Those of us who aren't code geeks read:
    Crowdstrike has way too much power.

  • @kasimirdenhertog3516
    @kasimirdenhertog3516 Місяць тому

    Fun fact: Tony Hoare, the guy who invented the null reference in 1965, called it his ‘billion-dollar mistake’ - not far from the truth!

  • @tharnendil
    @tharnendil Місяць тому +1

    It might not be a software dev's fault, but releasing such changes without any kind of staged mechanism (first release to 5%, then 20%, etc.) and observing reports (you can report success after booting the patched OS) is a sign of a lack of good process and not following industry standards

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      The devs are at fault for writing flaky code.
      The deployment manager is at fault for deploying to everyone at once.
      The scale of the disaster is down to deployment methodology or lack thereof.

  • @frederikjacobs552
    @frederikjacobs552 Місяць тому +1

    Look, it's 2024. Both at Microsoft and CrowdStrike you need to assume this can happen and that the impact will be huge. Don't tell me nobody ran a "what if" scenario.
    At best both Microsoft and CrowdStrike could have done way more to allow some sort of fail-safe mode.
    For example: you detect your driver was not started or stopped correctly 2 times in a row after a content update > let's try something different and load a previous version or no content at all > check with the mothership for new instructions.
    Which would still be bad, but only "resolve itself after 2 reboots" bad...
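
    A toy version of that "two strikes and fall back" idea in C — the file name and loader functions are placeholders, and a real driver would persist the counter somewhere more robust, but the control flow is the whole point:

        #include <stdio.h>

        #define STRIKE_FILE "content_boot_strikes.txt"   /* placeholder persistence */

        static int read_strikes(void)
        {
            int n = 0;
            FILE *f = fopen(STRIKE_FILE, "r");
            if (f) { if (fscanf(f, "%d", &n) != 1) n = 0; fclose(f); }
            return n;
        }

        static void write_strikes(int n)
        {
            FILE *f = fopen(STRIKE_FILE, "w");
            if (f) { fprintf(f, "%d\n", n); fclose(f); }
        }

        /* Placeholder loaders. */
        static int load_new_content(void) { puts("loading latest content update");           return 0; }
        static int load_last_good(void)   { puts("falling back to last known-good content"); return 0; }

        int main(void)
        {
            int strikes = read_strikes();

            if (strikes >= 2) {              /* two bad starts in a row: stop trying */
                load_last_good();
                write_strikes(0);
                return 0;
            }

            write_strikes(strikes + 1);      /* assume this start will fail ...           */
            if (load_new_content() == 0)
                write_strikes(0);            /* ... and clear the strike on a clean start */
            return 0;
        }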

    • @JeanPierreWhite
      @JeanPierreWhite Місяць тому

      Yes; This is why corporations should not use Windows in mission critical systems. Its too fragile with no resiliency or automated rollback built in.
      Even my lowly chromebook can revert a bad OS update automatically by switching partitions at boot. Microsoft should have provided some level of resilience after all these decades.

  • @thomasetavard2031
    @thomasetavard2031 Місяць тому

    Excuses for Poor Testing. What a shame. In this case twice over. Didn't test for corrupted or invalid data and didn't test the deployment of the package before release.