Code That MURDERED 6 People | Prime Reacts
- Published 21 Nov 2024
- Recorded live on twitch, GET IN
/ theprimeagen
Reviewed video: • how a simple programmi...
Channel: Low Level Learning | / @lowlevellearning
MY MAIN YT CHANNEL: Has well edited engineering videos
/ theprimeagen
Discord
/ discord
Have something for me to read or react to?: / theprimeagenreact
Hey I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free and if you want you can get a decent amount off (the free tier is the best (better than planetscale or any other))
turso.tech/dee...
In general, I do agree with the whole "too much test is BS", but since I started working with medical software for implantable devices, I got absolutely crazy about testing. Especially system and integration tests that try to guarantee that all the different parts can work together.
I work in the fintech space, and I also got very crazy about testing shit properly. You don't want to nuke the entire company out of laziness. This is no joke.
@@peppybocan See: Knight Capital Gets Hammered Following $440M Flash-Crash Loss
It's a bit different when you know there are lives on the line...
Human error is terrifying.
Not just from a programmer perspective, but also from a user perspective
Yeah, places where bugs will lead to deaths are ABSOLUTELY the right ones to strive for that 100% test coverage of lines, branches and whatnot.... Throw in some integration tests and a separate environment as well
Static analysis, unit testing, code contracts, property testing etc. Everything. Writing tests is boring? Types are annoying? There's the door. GTFO.
„Radiation level too high or too low“
Operator: „50/50 chance, let’s try again!!!“
Also given that 'too high' will be fatal, why is this not its own error? The exception design is just wrong. Any exceptional case that could put someone's health at risk even slightly should be a FULL STOP error that cannot be circumvented.
Ah yeah! That message made me mad! How can whoever wrote the software or the manuals provide the customers with such a stupid error message? I am speechless.
@@bitbraindev Too low would also be fatal in the case of life-saving treatment, but I agree that they should have separate error codes.
Maybe it took integer overflow into account. If it’s an unsigned int and you get a value that’s lower than expected, then either it really was too low, or it was so high it overflowed.
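A tiny Go sketch of the overflow speculation above — purely hypothetical, not the actual Therac-25 code; the type, names and numbers are made up for illustration:

```go
package main

import "fmt"

func main() {
	// Hypothetical sketch of the overflow speculation above, not the actual
	// Therac-25 code: if a dose-related reading is stored in a small unsigned
	// integer, a value far above the maximum wraps around and reads as "low".
	const expected uint8 = 200 // hypothetical expected reading

	measured := 300           // far above what a uint8 can hold
	stored := uint8(measured) // wraps: 300 mod 256 = 44

	if stored < expected {
		// A naive comparison can no longer tell "too low" from "so high it
		// overflowed", which is exactly why one shared error message is dangerous.
		fmt.Printf("reading %d is below %d: too low, or overflowed high\n", stored, expected)
	}
}
```

If something like this were the reason, it would only strengthen the case for separate, unambiguous error codes.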
@@bitbraindev Considering that they had a previous version of this machine that had a hardware safety measure for that condition, it wouldn't have been necessary, except the company removed it. It seems likely that the guy that wrote the software might've written the code with that in mind, even. Even if that's the case, the major fault is removing the safety system, imo. Crazy situation, whatever the case it was. Goes to show how far we come generally speaking, even if we still insist on screwing up every so often.
As a software engineer for over 30 years this kind of thing still makes me feel a little sick.
I think the "software never fails" mindset comes from the idea that computers will always do what you tell it to do.
Unfortunately, sometimes what you tell it to do is still wrong.
The computer does what you tell it to do: 😄
*the computer does what you tell it to do:* 😳
ditto
Unless there is a bug in the processor's ISA itself
Also background radiation induces bit flips.
There was an election where one candidate got an extra 4,096 votes because of that...
Veritasium made a video on that:
ua-cam.com/video/AaZ_RSt0KP8/v-deo.html
I remember this quote I heard before: "Computers only do what you command, not what you meant"
This is not just a software failure. The whole equipment design, UI/UX, the hospital procedures and software together killed the people.
Yup, but it started with the healthcare workers who have no coding experience trusting that the machine works as intended.
There's a name for it in risk management: the Swiss cheese model of failure
@@lowwastehighmelanin ...and there's nothing wrong with that. They should trust that in using the equipment the way they were trained, it will work. They don't need to learn to code- that's what a user interface is for.
this has nothing to do with UI/UX
@@hwstar9416 weird that we studied this case in college in UI/UX course then.
some of this could've been prevented with proper UI/UX design. If the UI indicated that you couldn't change anything while the other thread was running, or made it mandatory to wait 10 seconds before changing the value, this could've been prevented.
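A minimal Go sketch of that UI idea, assuming a made-up Console type and settle time; nothing here comes from the actual machine:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sketch of the UI suggestion above: while the machine is still
// settling after a mode selection, the console refuses edits instead of
// silently accepting input it will never act on.
type Console struct {
	settlingUntil time.Time
}

func (c *Console) SelectMode(mode string, settleTime time.Duration) {
	fmt.Println("mode selected:", mode)
	c.settlingUntil = time.Now().Add(settleTime)
}

func (c *Console) EditMode(mode string) error {
	if time.Now().Before(c.settlingUntil) {
		return errors.New("machine is still settling; abort and re-enter the prescription instead")
	}
	fmt.Println("mode changed to:", mode)
	return nil
}

func main() {
	c := &Console{}
	c.SelectMode("X", 8*time.Second) // stand-in for the ~8 s the magnets need to settle
	if err := c.EditMode("E"); err != nil {
		fmt.Println("edit rejected:", err) // the operator must start over, not sneak a change in
	}
}
```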
The "hobbyist programmer" was only a hobbyist because it wasn't his full-time job, which made him a hobbyist by the standards of that time. A big part of the emphasis on that is because the initial PR startegy of the company was shifting blame to an allegedly incompetent outsider, away from the deficiencies of their own development and QA process. While his exact identity remains unknown, he was in fact the same person who wrote the code for the earlier Therac-6 and Therac-20 machines (also alone and without supervision). There are also claims that he was an electronics engineer in his late 40s at the time of updating the old code for the Therac-25, but I've not been able to find the source I read it from a few years back so take that with a grain of salt, but it definitely seems to me that painting the picture of some nerd working out of his mother's basement might have been deliberate misdirection.
Also, the hospital industry of the time wasn't entirely blameless here either: they were the customers paying 6+ figures for these machines, not the patients receiving treatment from it. As a result of those customer preferences, the primary directive for coding these machines was "Make sure the software can't brick the expensive electronics if the operator screws up ", rather than anything to do with patient safety. And unhandled software crashes bricking the physical electronics was a major concern back in those days, because those assembly programs had basically no safety layer to stop them from giving the circuitry instructions that would melt it. Which led to a lot of boilerplate error messages that were more intended for developer debugging than operator correction. Which meant a lot of alarm fatigue due to most error messages resulting from input mistakes being harmless, and the operator procedure for the vast majority of error messages would have been to ignore the error and proceed. Pressing [I]gnore or some equivalent for an error message was part of routine operation in a lot of electronics back then, not just medical equipment, and up until this disaster, nobody in the hospital industry had a problem with it.
This also makes right to repair a paramount right, so even the operator can know what's wrong and what to fix, instead of a 2-day support delay or operating in an unsafe way due to an intermittent problem the software didn't handle correctly.
Thanks
Heh, typical. Take all the credit, pass on all the blame. I bet he stayed anonymous because they'd have to share the profits with him.
Most people I knew were hobby programmers in the '80s. Only my neighbour was working at an airplane company as a real programmer. But back in the '80s we had 15-year-olds developing their own 3D graphics routines in software, with even raytracing algorithms in 68k asm. So I would rate them over many real engineers of today.
@@fltfathinRight to repair shouldn't be used here. These machines should only be repaired by people who know how to do so properly and validate it.
Boeing did the same DELIBERATELY by not adding a backup sensor on the 737 MAX. This was done to avoid FAA re-testing of the aircraft. As it was not tested, there was nothing put into the flight manuals either (to avoid the FAA spotting it). This was all done because the newer engines were much bigger and had to be moved up the wings, which disrupted the airflow/lift. The software was supposed to compensate for this by taking over the yoke. When the main sensor malfunctioned, there was no backup, which caused the plane to dive with no way for the pilot to pull up. The pilots had NO idea what was happening. This is what caused 2 crashes.
> In 2016, the FAA approved Boeing's request to remove references to a new Maneuvering Characteristics Augmentation System (MCAS) from the flight manual.
Please don't spread misinformation, the FAA was aware and approved not putting it in manuals
I work at a company that makes these kinds of radiation therapy machines. The way the software regulation works these days is that everything from requirements to acceptance criteria to test cases to executing the test cases is checked and double-checked, so that nothing ever is seen by one pair of eyes only. The same applies to any changes in the code. This means that the code base changes slowly and iteratively. Even seemingly small changes may necessitate a huge amount of re-testing. In my team there are almost as many SQEs as there are devs.
And what is your salary for this responsibility? Lower or Higher than a front-end javascript scriptkiddie ? Mostly lower... Same in the Embedded world. The software is so much harder, but the pay is so much lower.
@@HermanWillems it all comes down to the amount of money you generate with your code. A shitty platform made with some JS framework may generate way more money, especially in the short term.
@@HermanWillems hate to break it to you, but
1. plenty of devices that get people killed have JS in them for UI purposes
2. there are plenty of fantastic engineers that also do JS
3. Bjarne Stroustrup, himself, was hired to work on several plane contracts, because developers could not be trusted to follow the try/swallow or try/ignore pattern of coding. Rebooting a plane at Mach 2 is not a thing that can occur. Note that they were not using JS, and still needed oversight from the highest levels.
It's not the language, it's the people, the culture, and the lack of care or oversight to protect the real people on the other side.
The deeply unfortunate piece here is that regulation in a lot of these circumstances is based less on having the best of the best analyzing the code and documentation (on the regulators’ side), but rather how well things are tracked, and how similar this thing is to other things that came before it.
It's why, in the past I’ve made sure teams know:
“This will be used as a suicide device. This will be used as a murder device.
And that's sad, but it can't be helped. What we must never, ever allow, is for it to be an accidental death, because during operation, they had a tiny tremor and accidentally clicked the button twice, or because we didn't clear state and start over, for every mode-switch, or because we wrote the code in a way that the logic in a vacuum, or the system outside of the hardware, couldn't be isolated and tested."
@@SeanJMay I do not think that using a GC language for such applications is a viable choice.
Nevertheless, it is as you said the people, the culture, lack of care, oversight and comprehension of when you code these kinds of systems that a real person will be affected by it.
Spent most of my career at NASA. Our test campaigns are extreme, to the point that the public likes to shit on us for being slow lmao. None of us give a shit, though. Better safe than a murderer
Nowadays, medical testing also tends to be extremely thorough, from hardware to software.
With multiple safety measures for things that likely will never ever need those safety measures.
Or at least I have heard, I honestly am no NASA engineer, and no medical engineer lmao
This. So much, yes such software takes long to develop. But it's part of the job. As you are with NASA. :) You probably know about Ariane 5 rocket....
Nasa has "extreme" test campaigns in software, but then they use made in china stuff to safe 2 dollars on a ring seal and the shuttle explodes!
very compentent decision making in nasa!
Don't worry, your laws, roads, medical insurance system, hospital ownership system, jails, financial laws, etc. unalive tons of people. So even if you unalived a couple you would be way better than the rest of the systems in place.
@@Dogo.R are you mentally ok?
Back when I was in engineering school, we had this one teacher who was both awesome and super strict. He was strict because he had to deal with a court case after two people got killed due to mishandling machinery.
Back when I ...........(the same text, only replace "super" with "extremely super").............because he had seriously hurt himself while working with a machine tool. He didn't want the students to experience the same.
if i recall the story right, the software wasn't even designed for that machine, but it was reused without the dev's input for the later machine
that is even worse and further exonerates the programmer if true
Yes, it reused software from earlier machines that had hardware locks (which had masked the software errors). They trusted the software simply because it had worked previously. Lots of stuff on the wikipedia article- as with most engineering disasters there were numerous causes.
@@johnnycochicken I mean, the video says that the machine used to have hardware locks, and that the dev was there during the time he developed the software and just left. This is a piece of information I was not aware of; I always assumed it was completely the programmer's fault for making a bug, but it seems like he actually did a perfect job, because his code was built for a machine that was completely different from the one that was deployed. I'm kind of surprised this part of the story tends to be eliminated when people share it. I sort of get why: if this is true, it not only exonerates the programmer, it also makes it clear that he made no bugs, worked all on his own, had no credentials, was simply a hobbyist working in assembly, and did a perfect job that would have worked flawlessly had the company not cut corners and removed the hardware locks, and we all know how much people hate "rock star" programmer stories like that... because it further encourages people to do wild shit. In any case, if what the video says is true, then the programmer was never guilty to begin with and it was entirely the company's fault, not only for not testing the device, but for deploying something completely different from what the code was designed for in the first place.
If the purpose of these retellings is to impart some important lesson about code safety and you misrepresent such a major part of the story, you are the irresponsible one.
I believe there is a different takeaway: build redundant testing for the not-unlikely case that negligent people run your code in an environment much different from the one you developed for, and be prepared that they will try to push all the responsibility for the resulting errors onto you.
I was expecting programmers to be more pedantic about details than to tell folk stories.
@@tiranito2834 It still was a race condition bug; however, previous machines were prevented from running with an incorrect hardware configuration by the presence of hardware interlocks. Bug or not, the fault is in the hardware. From my perspective as an electronics engineer, it's absolutely ridiculous that you allow software to configure hardware to do something deadly or just self-destructive. Of course there are some edge cases where it's impossible to avoid. However, this wasn't one of them by far. From what I've seen, a few limit switches and relays would be enough to prevent the device from running if it wasn't configured correctly for the mode used. Even a $40 microwave oven has at least 3 limit switches preventing it from running with the door open.
One of my Comp Sci professors, Clark Turner, was part of the investigation into the Therac-25 incident and I remember him telling us the story about how he and another person found the race condition that led to these people's demise. They wrote a paper about the investigation. Crazy stuff
Link: web.stanford.edu/class/cs240/old/sp2014/readings/therac-25.pdf
in college i took M programming. an esoteric programming language that's similar to C, but written more like ASM. it was a fun class, and i got one of the highest scores. afterwards i was approached by a local medical hardware company called McKesson. turns out the whole class was basically a "last starfighter" type scenario looking for employees. i ended up turning the job down for exactly this reason. my intrusive thoughts keep saying: knowing me, i'd pull some "office space" type shiz and unalive a bunch of people over a decimal point.
I love all the cultural references. This man has good taste.
It's important to decide if you're up for such a task. Good that you made the decision you're comfortable with.
That being said, this is a comment section. You can write kill.
You turned it down.
Who will take the burden instead? Were you hoping that they would change their methodology?
@@raffimolero64 Who will take the burden instead? Not unlikely someone that wouldn't be burdened by such thoughts of accountability.
That's the cost when conscientious objectors don't participate.
@@raffimolero64 yep, there are plenty of ways to write code nowadays, with good programmers who can make a computer inside games or, if equipped, can wire it themselves.
and you expect someone to babysit a machine older than themselves alone without references and full knowledge on how the thing runs?
Might as well build it from scratch
I can't imagine taking the responsibility to deal with code that manages something like this, for real. One time I dealt with a fairly simple database that stores crucial medical data for patients at a hospital and already got a lot of anxiety just thinking: what if I mess up some SQL statement or flow control and cause wrong data for someone's blood type, for example?
Especially when the company is too greedy to provide sufficient capacity or care about testing, and even removes safeguards to save some $$$
Same thoughts here, creating rules for test results, and if a false negative happens, without going into too much detail, it could lead to lives being ruined.
@@JohnDoe-up2qp if that happens, just leave that company and never come back
The important takeaway on this was kind of glossed over: the software with the race condition was written for the device with the hardware interlock, preventing the dangerous dosage. The next version of the device removed the hardware interlock but reused the lion's share of the software. (If I recall correctly, but can't find corroborating documentation, the engineer that made the changes for the machine that removed the interlocks was a DIFFERENT ENGINEER than the one that originally wrote the code, so was likely less familiar with it.)
You'd still be giving the wrong type of radiation, right? Not great either.
Edit: Oh, looks like the turntable position overflow wasn't mentioned. Well, that's another problem that was found with the Therac-25, and fixed, after which AECL claimed some stupid 5 9s (99.999%) safety increase (I don't remember the exact number of 9s but it was something like that), but that turned out to not help much either; in fact it just led to further radiation-related injuries since people were made to believe it's safe to use the machine now. But yeah, the point still stands: no proper code review is asking for trouble.
The Therac-25 having NO hardware interlocks was a huge jump that did not make sense. Yeah, software is great and all, but where someone's life is on the line, it is never bad to have more failsafes. At worst, it's more of a headache for the operators maybe, but it would at least not kill the patients. The entirety of the program was made by a hobbyist, so things like the turntable position checker (8-bit) overflowing to read 0 and the treatment proceeding because the software deems the turntable to be at the right position is a mistake any programmer can make, no matter how skilled and experienced, and it wouldn't even have been caught for a while because it was all written by a single person, with no one else cross-checking the code. It is ALWAYS better to have hardware failsafes.
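A hedged Go sketch of that 8-bit wrap-around; the published analysis of the incident calls the flag Class3, and everything else here is made up for illustration:

```go
package main

import "fmt"

func main() {
	// Sketch of the 8-bit overflow described above: the flag was incremented
	// instead of being set to a fixed non-zero value, so every 256th pass it
	// wrapped to 0, which the set-up test read as "no position check needed".
	var class3 uint8

	for pass := 1; pass <= 512; pass++ {
		class3++ // intended meaning: non-zero = "verify the turntable position"
		if class3 == 0 {
			fmt.Printf("pass %d: flag wrapped to 0, position check silently skipped\n", pass)
		}
	}
}
```

Setting the flag to a constant instead of incrementing it (or using a wider type) removes the wrap-around, but as the comment says, a hardware interlock would have caught it regardless.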
Kyle Hill also has an awesome video about it, which goes more in depth. I would highly recommend it if you are interested
But hardware costs money and we can't stop making bigger and bigger margins for the shareholders.
SPARK is a mathematically safe subset of Ada that was made to rule out such software errors.
How would hardware lock help, if operator can only check the settings in the buggy UI? UI says - good to go, operator removes the lock and kills patient.
@@vasiliigulevich9202 by this, I mean hardware interlocks that are completely independent of the UI/software: interlocks that make it physically impossible for something like this to happen, interlocks that cannot be disabled by the software
@@gljames24 Till the cost savings kill people and destroy ur company's rep, yeah
I've been a software developer since the 1980s and had to deal with assembly language a lot. From today's perspective, programming back then was both easier and more difficult at the same time. There were fewer layers of abstraction and the tasks were less complex, but the tools were also less sophisticated and programmers often needed to talk to the machine directly. Compilers/assemblers were slow; imagine two hours of compile time. For this reason we often applied little changes directly to the object code and marked those changes in the printed (and archived) listing. Sometimes someone forgot to put a change note in the listing…
A bit younger, but I remember the early internet wild west days before proper version control and build pipelines. People sometimes hacked stuff directly in prod out of convenience/laziness and then got their changes overwritten when things were "deployed" from QA. Nowadays changes are supposed to go through a release process, but hotfixes still sometimes override procedures (I was swapping library files live this week when AWS deprecated old TLS and broke part of a sales system; git was the last place where they landed)
Too bad this was before Tom's time. JDSL would have prevented this. There would have been no way to change anything in the 8 seconds, because the UI would have been frozen waiting for a Subversion checkout.
Or just print "moving arm" and wait in a loop for a value from a sensor that all parts where were they were supposed to be, and if the loop takes more than x seconds make it fail.
He's a *genius*
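A minimal Go sketch of that poll-the-sensor-with-a-timeout suggestion a couple of comments up; the sensor function and timings are made-up stand-ins, not anything from the real machine:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// After commanding the turntable to move, poll a hardware sensor and fail hard
// if it does not confirm in time. sensorConfirmsPosition is a stand-in for
// reading a real limit switch.
func waitForTurntable(sensorConfirmsPosition func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if sensorConfirmsPosition() {
			return nil // the hardware agrees with the commanded position
		}
		time.Sleep(100 * time.Millisecond)
	}
	return errors.New("turntable did not reach commanded position: refuse to fire the beam")
}

func main() {
	fmt.Println("moving arm")
	// With a sensor stub that never confirms, the call must fail instead of firing.
	err := waitForTurntable(func() bool { return false }, 2*time.Second)
	fmt.Println(err)
}
```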
This is a first-year topic in almost all Computer Science degrees. It's a really horrible way to make you realise that your code may kill - even if accidentally.
There's still plenty of stuff controlled by software that can injure or kill people. That's obvious. What should scare you is that some of it is written by a single person with a boss upset about deadlines. I'm not talking about the 80s. I'm talking about the 2000s, because I have been that dev.
Not only software, but also the wrong type of sensors for the software. Like Tesla using cameras only to detect depth etc. in a picture, instead of a combination of sensors like ultrasound, radar etc. But I would say these "agile" software releases are also bad, cause when I tell them the software is not ready, they just tell you to ship it anyway and fix the bugs later.
@@AndrewTSq It's especially terrible with Teslas only using cameras because of Musk's ego, inserting himself into design decisions so he can claim to be an active engineer so his fanboys eat up his Tony Stark image
Ah, the Therac case, a staple of SW engineering classes since God knows how long. Now being covered by YT content creators. It's an Oldie but a Goldie.
This is what those weird people at your school with incomprehensible ramblings about formal semantics of programs were worried about.
I just got a job writing testing tools for aerospace systems, so this all is top-of-mind.
this is the first horror content on prime's channel
Low Level Learning is great but the Well There's Your Problem podcast covered this topic in much more detail and they even had a medical professional as a guest
If you can't imagine this kind of cowboy coding with a machine outputting X-rays, check the story around the demon core or some of the other criticality incidents. Early nuclear handling was wild
Just how many scientists made the exact same mistake...
It’s one of those things. I love writing code that can’t hurt anyone. But someone has to do it…
Big ups to those who are passionate about writing code for stuff that matters!
even a puny little frontend developer can write code that hurts people. My eyes and my aesthetic feelings are hurt almost every time I browse the web. Not to mention my time wasted on navigating badly designed UIs with too many steps to do simple things.
All developers must be regulated and treated the same way as doctors and civil engineers.
@@vitalyl1327 Eh, frontend devs have only a minor possibility of hurting. Whereas those working on medical (and other life/death) devices have a highly probable chance of hurting without regulation. It doesn't really make sense to hold all software up to that standard unless we had something like an overabundance of regulators (maybe AI could get us there someday).
Not to say that we shouldn't be looking to improve things across the board - but your stance doesn't seem reasonable or productive, imo.
Oh, and I'll just add that any developer who is working on something with a level of risk that is anywhere near what a civil engineer or doctor (or other) has should 100% be regulated up to the same standards as them. So my apologies if that is what your original point was.
@@Legac3e yep, and this is exactly what is not happening in the industry. And if I am to choose between regulating the crap out of everyone and not regulating at all, since selective regulation can be very hard to apply, I am for the former.
@@vitalyl1327 That is fair. Of the two extremes, I'd be for regulating everything, too. And at minimum increasing our current standards of regulation would likely be a positive overall, even if it isn't applied to everyone (yet?).
5:22 Ok so the error that says the patient got too low or TOO HIGH a dose of radiation needed to be decoded by a freaking technician, like it's a McDonald's ice cream machine? Instead of, you know, having that information at least in the manual so the user can know on the spot that the patient might have a problem.
Yeah I don't get it why the errors are so obscure, this is just a medical radiation machine and not something really complex like an ice cream machine.
Yeah why is the death ray interface designed like a sony playstation?
@@anlumo1 I'm not saying the McDonald's ice cream machine is complicated, but its error codes are purposely confusing to earn money on tech support.
@@crusaderanimation6967 yeah I know, it was sarcasm.
@@anlumo1 touché
Hardware interlocks are programmers best friends. They genuinely help me sleep at night
Software interlocks are hardware designers' best friends.
the ultimate interlock is Erlang or Node.js where zero memory is shared between threads. I am sure this program written in assembly had some concurrency memory issue race condition bullshit, which won't happen if you run in an environment that's universally strictly thread-safe. Only hard-real-time systems (like control systems for airplanes) need the performance level of assembly. This laser machine needed soft-real-time at best.
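A hedged Go sketch of that "don't share memory between threads" idea, using channels rather than Erlang processes or Node workers; the message types are invented for illustration:

```go
package main

import "fmt"

// The beam mode is owned by one goroutine and can only be read or changed
// through messages, so there is no shared variable for a keyboard task and a
// setup task to race on.
type setMode struct{ mode string }
type getMode struct{ reply chan string }

func modeOwner(requests chan interface{}) {
	mode := "unset"
	for req := range requests {
		switch r := req.(type) {
		case setMode:
			mode = r.mode
		case getMode:
			r.reply <- mode
		}
	}
}

func main() {
	requests := make(chan interface{})
	go modeOwner(requests)

	requests <- setMode{"X"}
	requests <- setMode{"E"} // the operator's edit is serialized behind the first write

	reply := make(chan string)
	requests <- getMode{reply}
	fmt.Println("beam will fire in mode:", <-reply) // always the last accepted message
}
```

Erlang enforces this at the VM level and Node gets it by having one thread per isolate; in Go it's only a convention, so discipline (or the race detector) still matters.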
I still don't understand how such a failure of a machine was made.
There should be a feedback switch or something that indicates the raw state, like lamp on = E mode, with the intensity shown on a dial or somewhere.
Also, if it can do dual mode, you should be able to limit the output of the beam with a hardware switch that changes the DAC input to the safe amount.
It is wild to me that the software in cars isn't openly available. You just know thousands of volunteer hours would go completely unprompted into hunting for weaknesses in those things. "What is the failure mode on the brake pad thermometers and the tire pressure meters, and how does the code that takes in brake pedal position and transforms it into actual braking react to said failure mode, on a 2018 Nissan Qashqai?" And millions of similar little questions, many of which would be answered, because someone became curious and just went and found out. And the only thing that would happen is we would all be a bit safer.
Usually, these kinds of machines are made with special languages designed for "mission critical" work. It's a really incredible field; SPARK, a variant of Ada, is one of these languages. Rust has a toolchain like this too, called Ferrocene. Imagine having to build something like this in C or goddamn assembly.
You have MISRA C. But... Rust already covers a lot of MISRA C rules. If you would make a strict version of Rust it would be very well suited for such systems.
@@HermanWillems oh, didn't know about MISRA. I agree with you, a strict Rust has great potential for this niche. Actually, AdaCore, the company behind Ada and SPARK, is a major sponsor of the Rust Foundation. Given that this company is funded by the US Department of Defense, this is quite interesting.
It's interesting how this story exploded recently. I've heard this story like 5 separate times in my university lectures 10 years ago.
In the 80's some people did think it would just work, and remember it was the operator altering a previous selection after the system started adjusting. So it would never have been a primary test; they would have tested likely changes, not X to E, as that would have been weird: the operator would never select the wrong value, and then change it after all the other figures were entered.
Don't start moving until the last punch has been verified.
I started programming in 1981 as a student and financed my studies by programming during the semester breaks. I mostly programmed in Z80 assembler language. Using assembler was quite normal back then. However, the mindset back then, either for me or my colleagues, was not that software always works if it has worked once before. Testing was necessary even then. But there were no unit tests at that time. Therefore, I often worked with the debugger.
I read this story somewhere else years ago. They didn't have the developer redesign the logic. I seem to remember something about technicians removing keys from the keyboard to prevent the operators from altering the prescription once the process started. Deleting the keybinding by removing the physical key itself on the keyboard! The operator had no way to change the X / E parameter once the form is submitted to the next stage. Only an abort function keybinding which would reset the machine to baseline and start the whole thing over from the beginning. Eventually the machines were recalled / replaced. It really was the Wild West in the early 80's. Software Engineering was completely new. Most programmers were self-taught without formal college engineering programs. There were no regulations nor industry standards nor oversight, etc. You had very limited memory, storage, and processing power, every byte counted. Ultimately the manufacturer was liable. This isn't exactly a bug and it wasn't entirely operator error. The flaw was not handling the unexpected input properly.
I'm 5 minutes in and already horrified. This is why I'm extra cautious with my code quality, even if the stakes aren't this high. It was such a long time ago (still in high school) when I read Alan Cooper's book, "The Inmates Are Running the Asylum", and it had a number of ways code could screw things up... from mildly annoying to downright dangerous. Each chapter started with a question, "What do you get when you cross a computer and a *Foo?*" for any foo. The answer was always a computer because of how screwing things up in the code would mess up the other thing. You can't reboot a plane, or laser gun, a car... the code shouldn't require that when mixed with these things.
" You can't reboot a car " I literally had to reboot my car in the middle of a onramp stop because of an ECU problem. That day I was thinking, what if that thing happened when I was on the road. I thank the engineering for my brakes being totally manual and having no electronics.
@@monad_tcp ooh, that sucks. (PS: I should have qualified that statement. Haha. You can't safely reboot a car, plane, etc when in motion.)
From the book: _*"What Do You Get When You Cross a Computer with a Car?*
A computer! Porsche’s beautiful high-tech sports car, the Boxster, has seven computers in it to help manage its complex systems. One of them is dedicated to managing the engine. It has special procedures built into it to deal with abnormal situations. Unfortunately, these sometimes backfire. In some early models, if the fuel level in the gas tank got very low-only a gallon or so remaining-the centrifugal force of a sharp turn could cause the fuel to collect in the side of the tank, allowing air to enter the fuel lines. The computer sensed this as a dramatic change in the incoming fuel mixture and interpreted it as a catastrophic failure of the injection system. To prevent damage, the computer would shut down the ignition and stop the car. Also to prevent damage, the computer wouldn’t let the driver restart the engine until the car had been towed to a shop and serviced."_
@@FabulousFadz also, to prevent damage the computer would not let you apply the brakes because it was fly by wire.
Those are my nightmares with cars.
glad you gave the example to the people in chat. this is as serious as you're presenting it
Here is why writing good code is important kids.
No, you need to test rigorously, good or bad is decided after testing
@@IAmOxidised7525 Oh trust me, you can have shit-tier code that is difficult to review/modify/etc. that passes tests, maybe even for the wrong reasons because some shit happened to line up in memory just right.
Good code is a separate thing from tested code.
You need both good code and good testing to have a good product.
@@IAmOxidised7525 Testing is not sufficient, you need formal verification.
@@IAmOxidised7525 The approach the companies take is that they make sure the code is good and then they test it exhaustively.
@@IAmOxidised7525 Have you also written tests for your tests? By separate people?
There's a much better and detailed video on that by Kyle Hill, I really recommend watching it too. It wasn't just this bug alone, and it wasn't just about the code (although you pretty much figured this out at this point). It's a must learn topic in computer ethics class.
Could you please send a link to that video you mentioned?
@@gilbertovampre9494 just search for Kyle Hill Therac 25
@@gilbertovampre9494 ua-cam.com/video/Ap0orGCiou8/v-deo.html I guess this is this one.
@@gilbertovampre9494
links are deleted. Type the name:
"History's worst software error"
@@gilbertovampre9494 ua-cam.com/video/Ap0orGCiou8/v-deo.html
This was a concurrency bug. No amount of traditional testing will help with that. You need model checking and/or formal verification. Alternatively no tasks, just a single polling loop and even then rigorous measurement of timings is necessary.
This. It seems like those mathematical methods are completely forgotten during discussions of current software development, even though the foundations have been around for decades
@@nackha1 Yeah, totally agree. There is some adoption by industry (distributed systems mostly), but way too little in my opinion.
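A minimal, hypothetical Go sketch of the kind of race this thread describes (not the actual Therac-25 assembly); running it under the Go race detector (go run -race) flags the unsynchronized access, which is the kind of tooling-plus-model-checking support the comments above are asking for:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	mode := "X" // shared, unsynchronized state

	done := make(chan struct{})
	go func() { // setup task: samples the mode once, then spends a while configuring hardware
		sampled := mode
		time.Sleep(200 * time.Millisecond) // stand-in for the ~8 s settling window
		fmt.Println("hardware configured for:", sampled)
		close(done)
	}()

	time.Sleep(50 * time.Millisecond)
	mode = "E" // keyboard task: operator edits the prescription mid-setup
	<-done
	fmt.Println("screen now shows:", mode) // screen and hardware now disagree
}
```

An ordinary unit test of either task in isolation passes every time; only a race detector, a model checker, or a design that removes the shared variable makes this class of bug visible.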
It’s quite scary to think somebody like us programmed such machines. I do freak out thinking about this. 😮
Not that scary, at least nobody programs those machines in Javascript yet so they are at least somewhat reliable even when not properly tested
Software does EXACTLY what you tell it to do. And that is also the problem as you might have missed edge cases or not handling certain extreme values. This is why fuzzy test is good. Writing unit test is good, but it's not only about coverage, but testing all kind of input values to make sure you handle that properly. It's easy to write code that works and have unit tests for that, but the problems occur when input is something you didn't think of, and when it fails and you don't handle cases you didn't think of. This is also by the way why I like the Go's approach where every function follows the flow of return on failure, and if you reach the bottom of the function, all above is OK. But this still requires you to unit test with data beyond what you initially were expecting.
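A small Go illustration of that return-on-failure flow, with made-up prescription fields and limits (nothing here is from a real device):

```go
package main

import (
	"errors"
	"fmt"
)

// Each check bails out immediately, so reaching the end of the function means
// every guard above has passed.
type prescription struct {
	mode string // "X" or "E"
	dose int    // hypothetical units
}

func validate(p prescription) error {
	if p.mode != "X" && p.mode != "E" {
		return errors.New("unknown beam mode")
	}
	if p.dose <= 0 {
		return errors.New("dose must be positive")
	}
	if p.dose > 200 {
		return errors.New("dose above configured limit")
	}
	return nil // all guards passed
}

func main() {
	fmt.Println(validate(prescription{mode: "E", dose: 180}))   // <nil>
	fmt.Println(validate(prescription{mode: "X", dose: -5}))    // dose must be positive
	fmt.Println(validate(prescription{mode: "Q", dose: 10000})) // unknown beam mode
}
```

Fuzzing the inputs to a validator like this is exactly the "values you didn't think of" testing the comment describes.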
When I read the title, I read the "Prime Reacts" as "Crime Reacts" because of word "killed".
Me too
When I see the moustache I don’t even read that part of the title ;)
It's not that we believed "software will never fail", but rather "computers don't make mistakes, people do". That hasn't changed. I've been coding since 1992 btw.
And the missing unsaid part of "computers don't make mistakes, people do": "and so the people don't need to be accountable as a discipline the same way engineers or doctors are regulated"
@@TheNewton what are you even saying? Doctors have one of the highest error rates of any profession
Prime: sees a picture
Prime: "is that a chat GIBIDY image ?"
Having programmed PDP-11s using RTX-11 real time operating system in assembly I can understand how those errors can occur. That being said, race conditions etc. are typical errors one should watch and test for. That's when code reviews (yes, we had them back then) and exhaustive testing come into play.
omg. Saying software doesn't fail is like saying numbers don't lie. If you've ever worked with either you know this to not be the case. I suppose we live and learn. Sad story, but a good one. Thanks PrimeTime. ;)
Ah, the joys of sidestepping rhetoric and strawmen
"Computers don't make mistakes, people do"
"Guns don't kill , people do"
"Chemical don't pollute the waters, people do"
"numbers don't lie, people do"
Frame it as an argument about the former's inability to have involvement, to avoid discussing accountability for the latter, and regulation for both.
Also happening in self-driving cars, even though they are using the best practices in writing software.
Not even self-driving - see the Toyota breaking bug.
@@vitalyl1327 Toyota has more; Toyota had spaghetti code that killed many people. By suddenly accelerating!!! Without you pressing the gas pedal, and it killed people.
I was writing a version extraction function a bit earlier today, something that takes the string printed by `program --version` and extracts the version number.
I thought "oh it's literally a simple regex and five lines in total so why bother with a test". And if it was my personal project that would have been it. But then I thought "okay this is a public repo, I'll be a bit more responsible" and went ahead and wrote a simple test.
Lo and behold, I incorrectly used a `+` modifier on a capture group instead of a `*`, so now a naked major version is not matching 🤦. Needless to say I'm glad I wrote that test.
I guess the lesson here is to just accept the fact that you will always be a shitty programmer, no shame in that 😅.
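A hypothetical Go reconstruction of that `+` vs `*` bug (not the commenter's actual regex or language), plus the kind of one-line check that catches it:

```go
package main

import (
	"fmt"
	"regexp"
)

// With `+` the dotted suffix is required, so a bare major version like "7"
// fails to match; `*` makes that suffix optional.
var (
	buggy = regexp.MustCompile(`(\d+)((?:\.\d+)+)`) // demands at least one ".N"
	fixed = regexp.MustCompile(`(\d+)((?:\.\d+)*)`) // accepts a naked major version
)

func extract(re *regexp.Regexp, output string) string {
	m := re.FindStringSubmatch(output)
	if m == nil {
		return "<no match>"
	}
	return m[1] + m[2]
}

func main() {
	fmt.Println(extract(buggy, "someprog version 7")) // <no match>  (the bug)
	fmt.Println(extract(fixed, "someprog version 7")) // 7
	fmt.Println(extract(fixed, "someprog 7.4.2"))     // 7.4.2
}
```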
I'm reminded of the saying "I have a problem so I will try regex, now I have 2 problems"
I write tests to ENSURE that the code works the way I think it does, I write my tests to challenge my code and make sure that it's not doing something other than I intended. I work in the healthcare industry, my code gets hammered on by the public more than 10,000 times a day.
It drives me nuts when I learn that the code that I've been logging out every single step of the way has not actually been working the way I intended.
It's forced me to hard-type everything I possibly can
i told my dad this yesterday that behind every app, website, program, computer there is a person who chose (either purposely or accidentally) it to look and operate that way
and that it all doesn't just all come out of the ether
Note to self: if your software involves the safety of human lives, make damn sure that you implement testing and diligently do it properly, even if your managers are already harping on you about some stupid deadline.
7:40 better than the alternative, which is imagining those programs were written by machines.
that's why formal verification tools like TLA+ were created
We covered this in undergrad comp sci.
This was around 1982, so home computers were barely a thing at this point (think ZX Spectrums and other 8-bit machines) and the memory constraints were really tight. So it doesn't surprise me it was written in assembler, as it's not just the machine you're programming for that has tight constraints but also the machine doing the programming. The big flaw was only having one guy write and test the code, but it was a very specialist thing back then
This does remind me of the plastic surgery machine in Logan's Run that goes haywire, I think it was the "escapulator"
I read a technical paper about this incident (there were, in fact, more glitches in the Therac-25 than this one) and the use of assembly as the programming language was not seen as a contributing factor. The real flaws were poor troubleshooting, incompetent risk analysis, failure to act in time and the fact that the Therac-25 was never tested as fully assembled by AECL. Marietta hospital physicist Tim Still deserves a lot of recognition for figuring out how to repro the bug.
Honcho: "So what Dev Team are we using for the software?"
.............
Emp: "Craig........."
I watched the original video a week before having a CT scan. Not the same type of machine but I was crapping myself thinking about it!
I watched a different vid about this.
It had pictures of the ppl before they died...
It was like holes were rotting through them where the radiation had gone through :/
I programmed in assembler in the 1980s alongside engineers building prototype hardware. Of course either could fail, and part of the skillset was working out which was the issue for a given bug. None of it was safety critical (sound mixing gear) but all of it was still built and tested BEFORE going into the wild. This story was no more acceptable then than it would be today.
A finger chopping machine sent me here 😂
dude i was feeling sad at first, but that "Dividing by Zero shit" caught me off guard 🤣
People being too confident in software is something still present today. Just this year, I found out that a company that offers tax calculation and payment services has its accountants being so confident in their in-house software that they don't do their job properly, resulting in wrong taxes being presented due to faulty data.
They should detect edge cases, but if the software doesn't ask for it, they won't ask for it. The cherry on top is that the software doesn't allow for corrections and breaks spectacularly if you try to make one
ftfy: c̶o̶n̶f̶i̶d̶e̶n̶t̶ not legally regulated or accountable.
Even though actual accounts are regulated, it's bizarre but inevitable to get changed.
@@TheNewton they are legally accountable, after all they have to put their signatures and their professional license to get those tax declarations accepted.
Also, the government here likes to wait to make the fines bigger, so it takes time before that company starts getting legal action coming to them and it won't be cheap
I remember the days we laughed about billing software that sent final demands for 0.00
I worked at an EHR company that worked in a proprietary programming language. It wasn't until around 2019 or so that someone (as a personal project) created a unit testing framework. The company had been in business for decades with no unit tests whatsoever. Fortunately, I worked in an area that was less exposed to things that could risk patient safety, but companywide there were plenty of bugs each quarter that either could or did seriously harm patients. Certainly didn't help retain talented folks either that compensation was well below market so anyone with any ambition generally jumped ship fairly quickly.
I got an FMEA ad after watching this 😂
Failure Modes and Effects Analysis (FMEA) is a systematic, proactive method for evaluating a process to identify where and how it might fail and to assess the relative impact of different failures, in order to identify the parts of the process that are most in need of change.
My man is so right. It's like all my friends who study medicine; having grown up with them I'd be scared for my life if they were my doctor. Just like if I become a programmer, anything I produce will be so stupidly bad that anyone who relies on my code should be fired from their tech job.
I think in the 80s, the difference was not "software never fails", but rather "the only guy who can solve this on our hardware writes the code in the basement"
Tests were also, in my opinion, rarely part of general software development until the 90s.
I am amazed it had "tasks"
I heard this during one MeetUp session where some SW related to DevOps was being introduced: "The log files will record all errors except for irreproducible bugs, because if a bug cannot be reproduced, there is nothing we can do about it". So it "taught" me that if you ignored the worst kind of bugs, they did not exist (until someone got hurt or killed). Needless to say, the code of this SW was neither functional nor immutable, which is not a silver bullet by itself, but FP can help quite a lot to limit bug occurrence.
For me there should be a big red sign saying: "If you messed something up, don't try to fix it. Turn everything off and start again", because this is as much an error from the manufacturer as it is from the user.
Edit: I said the operator when actually I was referring to the user, the hospital in this case. They should train the operator on a default case: basically, when an error happens, something outside the plan (like writing something wrong in the input), no matter how small it seems, always turn the machine off and restart the whole thing.
Only if the manufacturer has communicated that rule to the operator. Otherwise it's all on the manufacturer.
nah the operator shouldn't be blamed at all in this instance. they were trained how to use it, and they used it exactly according to that training. I can't even blame the person who wrote the code, because it should have gone through testing.
This is 100% on the manufacturers for removing the hardware locks and not properly testing the machine before sending it out.
This is just a workaround for bad or nonexistent testing. This approach only works because the default "happy path" behavior is usually tested by the programmer themselves. Many consecutive times. It must be daunting to start everything over from scratch all the time.
Like if you have misspelled the zip code for the patient, which would then reset the form to blank and start over. I don't think even nurses can handle this long term (and they are experts at repetitive tasks)
This reminds me of the programmer who was found dead in the shower holding a bottle of shampoo.
The autopsy revealed he died from starvation and exhaustion, they could not figure out why this had happened until
they read the instructions on the shampoo bottle "Wash, Rinse, Repeat".
This is horrifying...
I have an idea for fixing infinite loops; it's called probabilistic coding. You can change the formal semantics of Lisp/eval and Haskell's lambda calculus eval to change all expressions from being E to being (E, 0.95), where 0.95 is a probability. You can then run a program 1000 times and check whether some constraint holds to see if the code is fixed or not.
you have to think about 'software responsibility' like you do human responsibility (because that's actually what it is). If the software is employed in a way that the people who commissioned and designed it knew that faults had a high potential of being directly involved in the cause of an injury or death, then they are legally obligated almost everywhere to take extraordinary precautions to prevent such a thing; not doing so is criminal negligence, like drunk driving or shooting someone on a movie set with a real fake gun.
Any software developer or engineer that's required to be solely responsible for any important part of a safety critical system should just refuse the job. The higher-ups that asked should be responsible as well if anyone should accept.
Whoever said “b for beam - you know that’s a good program” in the twitch chat made my day 😂
it was pretty common in the 80s to have a 2-3 person studio, and most code was written in assembly
welp good thing I'm just a frontend dev styling buttons, hopefully it won't kill anyone
The really scary thing is that even if there were hardware locks, several professional engineers coding in a high-level language, more extensive software testing AND more extensive testing with users/operators, there might STILL have been an "unpredictable" software error that resulted in deaths. Testing reduces the chance of errors (drastically) but doesn't guarantee that all possible errors are accounted for. Sometimes engineering is about preventing mistakes, and sometimes it's about learning from mistakes.
Not long ago, 2 planes of the 737 MAX model crashed due to software relying on a single sensor, which in one case was installed but not calibrated; worth taking a look at the case.
@@dioneto6855 Seemingly another greed move: if I remember correctly Boeing had built redundant sensors for military planes with the MCAS.
There is something called "software correctness proofs"
There are a few things that will also get you. The computer you are shown is just a terminal. It is connected to a PDP-11 multiprocessing computer, which is what controls the Therac-25.
It was due to the race condition which occurred from multiprocessing that the error was allowed to happen. If the process had been limited to a single thread it would likely have been caught.
They probably did test the software, but with a dummy system rather than the actual Therac-25. In this case it probably did something like light up a certain lamp if the signal did what it was supposed to. The problem is it most likely didn't replicate the Therac-25 behavior entirely, such as that 8-second block of time where it doesn't see input changes. That would most likely have been done for several reasons: cost of using the actual machine, and legal requirements around the operator of the machine and their licensing. (Imagine trying to find a programmer who also happens to be a trained radiologist.) I would bet this issue really starts in a politician's lap. Some idiot created a law without regard for how it would affect the development process. I've seen that too many times.
It's worth pointing out that it's incidents like this one that inform the coding standards we use in industry today. 4 decades of learning from mistakes makes errors like this feel alien
And...This is one of the reasons why I'll probably never get LASIK (laser eye surgery). As a programmer, I can't afford to take such risks with my biggest asset, my eyes. I know they test these things, but all you need is ONE tiny screw-up either on the software, hardware, or staff level and yikes. Sure you could sue the company for money, but some things are just too priceless to lose.
I think I'll stick with my regular prescription glasses unless I absolutely need surgery.
Another of the times that it failed, it wasn't because of this; it was because of another bug related to buffer overflow, a classic that killed people
Heard this story before. They had one guy coding the whole thing with no oversight whatsoever then denied the problem existed. Utterly disgusting and they faced no consequences.
And 5 patients dying and not banning the goddamn thing for a year until it's tested into the ground. FFS
Me and LLL grew up programming together and he was one of my best friends. I'm so proud of his channel.
I used to work as a programmer within medtech, more specifically visualization of MRI, CT, PET, NMR, etc.
Back in the days, a small error in a printer routine could start a chain of delays, which in effect postponed emergency treatment, thus potentially costing lives. :/
In my testing and software analysis course in my master's program, therac and some European rocket explosions were mentioned when talking about the importance of testing.
This gives UX a completely new meaning. Industries where you can harm someone just by making a seemingly small or easy to overlook error are really scary. I find it crazy that we have so little legislation about software developers despite the huge role software has in modern devices.
Medical device software is heavily regulated and monitored by the FDA.
Uhm, developing software for devices that have the possibility to kill someone is a TOTALLY different world from script kiddies writing front-end Javascript. It's just totally different.
Even more legislation in the medical field scares me even more
Having some experience with medical UI for embedded systems and HIS. They are mostly confusing, error prone and often buggy. I attribute it to the fact that at the end of the day, the medical software industry is still a niche market.
@@thibautconstant3942 I attribute it to the fact that at the end of the day, it's still software.
I've done a little bit of assembly programming and you very quickly learn that you're basically walking on a tightrope. Make one wrong move and there's no safety net to catch your fall, it's game over. Just the idea of running it in production with no risk mitigation or testing makes you go numb with fear.
We still use an iSeries machine, with code dating back to late 80s, early 90s. Some who wrote code were just working closely with the machines and picked it up. Maybe started as testers and learned more and more on the job. Also, it barely has source control never mind a testing suite!
This is a story about a hydro power company in northern BC. Bills were sent out on computer punch cards in the late 70s to mid 80s. There was an error and a person was over-billed. They contested that the bill was incorrect and received the response "The computers are never incorrect!". They sent this back with a request to have it notarized and returned, saying that if it was, they would pay out the balance. A few months later, after paying the bill normally so as to not raise suspicion immediately, the person began altering the punch card portion at the top right so that the bill showed a negative balance, and paid 0. The person did this for about 3 months until finally the hydro power company determined there was a billing error and that they didn't really owe a credit. The person responded with the notarized copy of the previous letter. This ended up in claims court with the judge ruling in favour of the full credit balance being paid to the client, as per the company's own notarized letter: "The computers are never incorrect!"
I mean at this point there was way more than a single point of failure. And yeah, I think they had hardware locks originally because it was standard for the industry, and the belief in software is what caused its cost cutting removal.
The face when he realized "Big radiation emitting machine controlled by software : Tested in production". As for the rest, unpopular opinion, but this is what you sign up for when you use async code for things that don't NEED to be async. Sure, you can make the async one work properly with extra care, but the question is WHY? Why deal with the extra complexity when you really don't have to...
"Computer Error" is the most hard-working and dangerous thought terminating cliché in every modern industry.
Hard-working in that those short phrases do a mind-bogglingly good job at preventing software development as a whole from having to grow up and become a regulated profession, even as it borrows the aesthetics of regulated titles and shuffles responsibility elsewhere.
Dangerous in that it's not really a tracked stat as a cause of morbidity; instead, again, we shuffle responsibility elsewhere.
And that when events like this happen it's generally the business process being sued, not the code or its programmers.
Amazing.
I've seen software errors related specifically to a user changing an input parameter after having set it in the first place just as the nurse here did when changing her X to an E. Specifically I've seen cases where even though the visual interface shows the input change (in this case it changes from the X to the E) but the actual change in software never occurs. That's approximately the case in this scenario, it's just one of the many things that can go wrong in these kinds of systems. It's also one of the things that bug testing (which wasn't really even done here) doesn't necessarily even catch since you can't really check for things you don't think to check for, and it's the kind of simple little error that somebody could very easily overlook and just not ever check to see if it worked properly when doing it that way. I've even seen situations where the change from, for example, X to E actually works as intended but if you did it the other way around, E to X, then the UI updates but the actual change doesn't take place in software.....so again, it's one of those things that might not even necessarily be caught with testing unless the person specifically thought to check all possible iterations of input and input changes to make sure it worked in every possible configuration and with every potential change.....
So many things have gone wrong in the hospital in the story, so much negligence on every corner. If it was not a faulty radiation code it would have been mold or collapsing ceiling.
The sheer number of race conditions in production at this very moment is probably astoundingly high. The safeguard for that is software paradigms (Dart, Erlang, Node.js) where the application code does not share memory (between threads). Of course, even with this protective mechanism at the application level, different threads that otherwise can't share memory, can still write to the same file at the same time. So I/O is not protected (barring a hardware solution, like this video suggests - hardware interlocking).
Real talk: inverse radiotherapy planning (the "math" behind RT planning) is literally my research topic, and nowadays we have SO MANY different security procedures that I'd never guess a mistake like this was even possible
I saw a commenter ranting about "who would have thought about unit testing in the eighties?" They literally did unit testing from the birth of programming. Loads of unit testing in the sixties. TDD was not the birth of testing, but of a development practice. Testing has been there all along.
What scares me is there’s a dangerous machine out there somewhere that imports NPM leftpad
Imagine the complexities... Coding a UX like that in Assembly Language from scratch. There's unlikely to be handy libraries, etc. It's the late 70's early 80's when the code was written. He's flipping bits and performing register manipulation, etc., etc., etc. Outsourced to an independent hobbyist. The correct 80's terminology is computer enthusiast. He must have had some skill because it's NOT easy to code in assembly. But none of the rigorous testing methodologies we take for granted existed at the time. Events like this are why those methodologies came about. This is a case of the developer being given instruction on how the machine worked and how the UX needed to function. The developer was too close to the system and whatever tests he performed (unlikely to have physical access to the actual machine) were various normal operations. Verifying the output being sent to the hardware without the actual hardware. I mean this guy wasn't going to setup a large X-Ray / Radiation beam machine in his living room. Clearly he didn't test what happens when you alter fields in the UX after starting the process. Ideally, the system should have not allowed the change and forced the operator to cancel the entire set of sequences and reset the machine to baseline. Then redo the prescription input from the beginning.
It's very very easy to understand 80's devs. Business: "Just write a proof of concept so we knew what we need and then somebody with engineering experience will go over it."
17:09 my impression of what happened re: the mindset around the 80s is that it was the result of an overcorrection of sorts. When software systems initially entered various industries people were wary and it was a hard sell, so vendors sold harder and convinced management types with things like "software can be proven correct" (in the mathematical sense, which technically might be true for some parts of the system but anyway won't happen in the commercial world); some went a bit too far and we got whiplash as a result. All things considered I think we've been pretty lucky with how few fatalities there have been due to software bugs (iirc even that Toyota gas pedal thing turned out to be some combination of bad floor mats and human error)
With the benefit of hindsight, it’s easy to point fingers. In reality it’s these events that actually formed our current perception on testing.
Before "It works on my machine" we apparently had "It works on my patient"