Six Little Lines of Fail - Jimmy Bogard

NDC Conferences

Published 18 Oct 2024
Science & Technology

COMMENTS • 71

@JeremyAndersonBoise 3 years ago ⁺⁴⁸
I’m not a C# programmer, but I knew exactly what this talk would be like as soon as I saw those six little lines. Yeesh. Surprised this isn’t common knowledge, and yet I have written similar junk. The Gregor Hohpe paper is awesome!
@technoturnovers7072 2 years ago ⁺⁵
I knew I was in trouble as soon as I saw the 6 lines, and didn't see anything wrong with them. Like, oh shit, what new danger that I wasn't even aware existed categorically am I gonna learn about now?
@technoturnovers7072 2 years ago ⁺²
NO TRY CATCH, BRUHHHHH
@Fafix666 1 year ago
Yeah, I'm pretty proud of myself to have come to the same solution as Jimmy... In the first 3 minutes or so. Still, a very insightful talk that I enjoyed every minute of. Shame it's not common knowledge tho' :(
@NXTangl 2 years ago ⁺¹⁶
Maxim #70: Failure is not an option. It is mandatory. The option is whether you let failure be the last thing you do.
@nothingisreal6345 2 years ago ⁺¹⁵
If you have State you need a State machine to handle it. Every state has possible transitions to either the next "successful" state or an error state. Retry simply detects error states and re-tries for a finite number of attempts OR until a non-recoverable error occurs. Some States are considered as terminal - either successful or failure. Transitions between states can either be automatic or manual (involve support). It is important to log every state transition (time, trigger, outcome...)
@Gersberms 1 year ago ⁺³
You are right about that. I haven't seen any really good talks or practical explanations about state machines though; most seem to show you the diagram and explain the purpose. Implementation is left as an exercise for the reader. I've found that if you really need to keep track of a state, that's where you want to create an object or a struct. Likewise, if you really need an object or a struct, it may be a state machine and it's good to think about that state explicitly.
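A minimal sketch of the kind of state machine described in this thread, assuming a simplified order flow. The state names, `TransientError`, and the `run` helper are all hypothetical and are not taken from the talk. It shows finite retries for recoverable errors, a terminal failure state that hands over to support, and a log entry for every transition.

```python
from enum import Enum, auto


class OrderState(Enum):
    PENDING = auto()
    PAID = auto()
    FULFILLED = auto()      # terminal: success
    NEEDS_SUPPORT = auto()  # terminal: failure, handed to a human


class TransientError(Exception):
    """A failure worth retrying (timeout, dropped connection)."""


# Each non-terminal state maps to the state reached when its step succeeds.
NEXT_STATE = {
    OrderState.PENDING: OrderState.PAID,
    OrderState.PAID: OrderState.FULFILLED,
}
TERMINAL = {OrderState.FULFILLED, OrderState.NEEDS_SUPPORT}


def run(steps, max_attempts=3):
    """Drive an order to a terminal state, logging every transition.

    `steps` maps each non-terminal state to a callable that performs its work.
    """
    state = OrderState.PENDING
    log = []
    while state not in TERMINAL:
        target = NEXT_STATE[state]
        for attempt in range(1, max_attempts + 1):
            try:
                steps[state]()
            except TransientError:
                log.append((state.name, state.name, f"retry {attempt}"))
                continue  # try again, up to max_attempts
            except Exception:
                target = OrderState.NEEDS_SUPPORT  # non-recoverable error
            break
        else:
            target = OrderState.NEEDS_SUPPORT  # retries exhausted
        log.append((state.name, target.name, "transition"))
        state = target
    return state, log
```

In a real system the current state and the log would be persisted after every transition so that a crashed process can resume; here they live in memory only.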
@simonisenberg4516 2 years ago ⁺⁵
I came here for the promise of a juicy dev anecdote from the title but I stayed for an interesting look behind the scenes of e-commerce and the horrors of interdependent remote systems.
@andersborum9267 2 years ago ⁺⁸
It's a great talk for the junior or intermediate developer; however, it basically comes down to the question of either controlling (i.e. trusting) or not controlling components in your architecture. Jimmy does a great job in explaining the original problem of tight coupling and introduces an architecture that's more resilient to non-responding components like SendGrid or Stripe.
Ask yourself at position 54:35 what happens if the request to SendGrid actually completes but your connection to the message queue is dropped (or is in a faulty state because of transient errors). Unless you're using an idempotent email service, the retry strategy of the process manager resends the event and the SendGrid handler is invoked over and over. At some point you need to assume that some components of your architecture are always up, and e.g. temporarily prevent orders from being placed if they are experiencing transient errors.
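For the redelivery scenario raised above, one common mitigation is to make the handler itself skip messages it has already processed. A minimal sketch, assuming every queue message carries a stable ID; `EmailHandler`, `send_email`, and the in-memory set are hypothetical stand-ins, not SendGrid's API or the talk's code.

```python
class EmailHandler:
    """Skips messages whose ID has already been handled."""

    def __init__(self, send_email):
        self._send_email = send_email  # hypothetical callable wrapping the email provider
        self._processed = set()        # stand-in for a durable store of handled message IDs

    def handle(self, message_id, recipient, body):
        """Return True if an email was sent, False if this was a redelivery."""
        if message_id in self._processed:
            return False
        self._send_email(recipient, body)
        self._processed.add(message_id)
        return True
```

This only narrows the window: if the process dies after the send but before the ID is recorded, a duplicate is still possible, which is the commenter's point about needing an idempotent email service or some component you assume stays up.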
@ApacheGamingUK 2 years ago ⁺¹⁰
I still think developers kowtow to clients too much. Whenever I watch these talks, it's always described as some "Upstairs, Downstairs" type relationship where you have to doff your cap and tug your forelock, and be thankful for whatever crumbs are left on the plates after the banquets.
Don't ask clients what they would like in an open ended way. Give them an exhaustive list of options. You can either have "0 to 1" emails, or "1 or more" emails. Those are your only options. And the more technophobic they are, the more assertive your lead/liaison needs to be. For every list of exhaustive options, remove the most damaging to the codebase (not the most difficult to implement). If you can see five moves ahead, and foresee the inevitable problems, then remove it as an option now, rather than in six moves time. And if your client insists on the worst possible option then you have every right to call them up on it, and even refuse to actualise the harmful decisions they make.
I've been on both sides of this; where as a developer, I've point blank refused to implement terrible design decisions. And I've been a client of developers who were deferential to the point of embarrassment. "Here are all the options under the sun, so please sir, what is thy bidding?", while staring so hard at the terrible option that it burns a hole in the paper, visibly sweating, and praying to every God and Goddess ever made up that I don't pick that one.
So, make exhaustive lists, with harmful options (anti-patterns, principle breaking, unethical, etc.) removed. Talk about code workflow, especially TDD, in terms of PDSA, rather than some arbitrary traffic light analogy. And do an ALISON course in Business Management, Business Enterprise Skills, and Operations Management, so that you can more easily converse in the language they would know, rather than always thinking in code. Clients are just human beings, and it's a good thing to pull them up on their bullshit when it's going to bite them on the arse in a few months' time, or if they only ever go for the short term fix, or if they only ever focus on immediate profit margins. Talk to them about the CapEx and OpEx of potential routes, and the ROI of not choosing the path of least resistance at every step. If you are the liaison to the client, then you're not just a developer, you're also a Technical Consultant, and you cannot be afraid to actually give that consultancy when needed.
@AyCe 2 years ago
Exactly! You are the IT expert, not the customer, that is why they pay you. Don't offer a choice if there isn't a real alternative. Don't ask them questions like "0/1 or 1/more emails?", when you can infer from the business requirements what solution would be best. Even some programmers don't understand why that is a choice - why ask non-programmers. I wish I'd known all of this years ago...
@pm71241 2 years ago ⁺²⁴
Yeah ... this is one of the reasons I'm no longer a fan of exceptions as error-handling.
(That, and the fact that they are soooo often abused for flow control.)
@NXTangl 2 years ago ⁺³
The problem with not using exceptions is that error codes can be ignored. However, the problem with using exceptions is that ignoring them sometimes works...but often doesn't.
@DryBones111 2 years ago
@@radbarij Purposely dumping an "error as a value" is no different to the negligence of an empty try-catch (i.e. the ignore option). The reason why exceptions are worse is because the 6 lines of fail at the start indicate nowhere that they are exception throwing calls, they're opaque. At least an "error as a value" wouldn't compile unless you picked a strategy to deal with it.
@ClearerThanMud 2 years ago ⁺⁵
I kept expecting this talk to evolve into an introduction to Event Sourcing and Kafka.
@JoonhwanLee 4 years ago ⁺¹⁶
You know.. actually this kind of video is very helpful to software system design newbies. Thanks. Los Techeeeeee!
@Spookieham 2 years ago ⁺⁵
WS-ReliableMessaging - what a pile of manure. Our customer team spent weeks going down that rabbit hole in Java only to find the author of the main open source library hadn't bothered to implement parts of the protocol. The only saving grace was the team from the vendor had exactly the SAME problem so we managed to sort out a better method between the two teams. Some idiot had specified it in the contract without having any idea what it did.
@trejkaz 1 year ago ⁺⁴
As a note on naming, the word "manager" should never be used. When you're about to name something "manager", you should take a step back and ask yourself, what does this thing actually do? And then, rename it to something more appropriate. The issue at hand is that "manage" basically means "do", which does not add any information to clue someone in to what the thing actually does.
If the thing "manages" a collection of things which are persisted to and retrieved from some kind of storage, what you have is a "Store" or "Repository". If it only does one or the other, perhaps it is a "Reader" or "Source", or a "Writer" or "Sink".
If the thing "manages" creating objects, what you have is a "Factory". If it can accept information in pieces to build the object, it is a "Builder".
If the thing "manages" incoming events, what you have is an "Event Handler", or perhaps just "Handler".
If the thing "manages" a collection of other things which do other things, what you have is a "Coordinator".
If the thing "manages" child processes by restarting them automatically, what you have is a "Guardian" or "Angel". (I have heard other names for this one but these are fairly common.)
The list goes on.
The only times I have encountered things called "manager" where this was _not_ true, were cases where a single thing was performing multiple roles. In these situations, the thing should be divided so that each part can be properly named.
The true name of the thing in _this_ video was, as you may have guessed, "Order Processor".
Just to really drill this one in:
A. Consider that you're in an unfamiliar codebase and you run across a thing called "Process Manager" - what does it do? (Manage processes?)
B. Consider that you're in an unfamiliar codebase and you run across a thing called "Order Processor" - what does it do?
I like to bring this one up in meetings where "managers" are present to remind them that their job title implies that they don't do anything in particular.
@sfulibarri 8 months ago
Please, this asinine take on naming, as if it's a defining factor in the success of any development effort, is so tired. Like sure, avoid being needlessly ambiguous but jfc, just pick a name and get on with the shit that actually matters. And if you do run into something poorly named in the wild you can, and this may shock you, just read the code. Yeah, I'm actually serious, if the name doesn't tell you precisely what it's doing, you can just look and see what it's doing, crazy right? Sure this sometimes requires expending 'effort' by 'thinking' but after a while you get used to it and you may be surprised at the results.
'I like to bring this one up in meetings where "managers" are present to remind them that their job title implies that they don't do anything in particular.' is particularly funny to me because it's clear from your comment that the only thing you actually do at work is engage in your own special brand of bikeshedding where you jack off to UML diagrams. If I were your manager and I saw you holding up PRs over something as pointless as 'OrderManager' vs 'OrderProcessor' I would fire you on the spot.
@z_prospective160 1 year ago ⁺¹
Such an awesome presentation. I think every developer at some point will face the same issue.. It is good to explore and understand each of the options.
@danielrhouck 2 years ago ⁺⁸
3:56: No. Failure is *not* an option with distributed systems. It is mandatory.
44:44 I would *much* rather get the error message right away than a random call hours to days later.
@JonWoo 1 year ago
This is what all conference talks should be about. Common real life software use cases and best ways to handle them.
@EQuivalentTube2 2 years ago ⁺¹
There's one more option, though - "halt and catch fire". If you fail at some crucial step, just cancel the whole show altogether in the noisiest way possible.
@johnangelico667 2 years ago ⁺³
Nope, nope, nope! Option 4 is the closest with a two-stage commit but still doesn't protect against what Bill is worried about.
Step 0 before Step 1 should be Post Order to DB as Open, Unpaid, Unreported, Unissued, and UnShipped. That is, atomise the workflow sequence. Then proceed with each step, and Update the order for each success but block the workflow for any failure. Unless the flags change to Paid, Reported, Issued to Downstream Store and Fulfilled, then the Open flag remains. The Order is not Closed until all prior flags have been successful. Thus the DB is always in a consistent state, and its state can be queried. I didn't see anything about decoupling the workflow components until 43min into the lecture.
@Tschackie 1 year ago
That's pretty much the solution I thought of as well. Error handling becomes very easy in this case - just leave the order as is, and a human can figure it out later. Build some sort of daily reporting of stuck orders to make someone aware of issues, and you're done. If orders keep getting stuck a lot, then you can think about adding automatic systems to deal with common cases - along the lines of agile, KISS, YAGNI.
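The flag-per-step approach from these two comments can be sketched roughly as follows. This is a minimal illustration under assumptions: the `Order` fields mirror the flags named above, while `process`, `stuck_orders`, and the one-day threshold are hypothetical choices, and a list in memory stands in for the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Order:
    order_id: int
    created_at: datetime
    paid: bool = False
    reported: bool = False
    issued: bool = False
    shipped: bool = False

    @property
    def is_open(self):
        # The order closes only when every step has succeeded.
        return not (self.paid and self.reported and self.issued and self.shipped)


def process(order, steps):
    """Run each (flag_name, action) step in order; stop at the first failure."""
    for flag, action in steps:
        if getattr(order, flag):
            continue  # already done on a previous run
        try:
            action(order)
        except Exception:
            return False  # leave the order as is for a later retry or a human
        setattr(order, flag, True)
    return True


def stuck_orders(orders, now, older_than=timedelta(days=1)):
    """Orders still open after the threshold, for a daily report."""
    return [o for o in orders if o.is_open and now - o.created_at > older_than]
```

Because completed flags are skipped, calling `process` again on a stuck order resumes at the failed step instead of repeating the earlier ones.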
@premchandrasingh 5 years ago ⁺⁹
Nice talk. It's a great source of information for handling similarly complex problems :)
@Gersberms 1 year ago
This is a good talk! I've learned several things about C# and the main talking points are very practical and applicable to many situations in IT.
@marcotroster8247 2 years ago ⁺¹³
You didn't mention those stupid guys trying to parallelize everything because they think it's important to improve computational performance. And then they finally realize that they've created hundreds of error case permutations because the things are actually running independently 😂
But yeah, most things usually boil down to some kind of producer/consumer pattern for decoupling and scalability, combined with some token-based or event-driven strategy for processing 😂
Btw, I like your take on "what we actually did because it was the last week of our project". Very refreshing insight 😂
@trejkaz 1 year ago
My favourite part of parallelising everything is when the process is completing in 1/2 the time, so the dev who parallelised it is like, "woot, it's now 2 times faster!", and ignores that it's also now taking 4 times more processors to go 2 times faster, thus consuming double the energy per unit of useful work.
@marcotroster8247 1 year ago
@@trejkaz Yeah ... Who actually cares about the environment?! We've got cheap compute resources ... 🤔🙃🤯
@kdakan 1 year ago
There is real valuable info in the first half of this talk, but later it goes into weird territory. You don't need to reimplement a queue by storing orderids in a db and looping over it to send messages. You store it in a db and send the message, and the consumer decides to update the db to mark it done, to make sure you have a traceid in the db where your transaction commit/rollback mechanism is. And the last 10 minutes where you implement a process manager using saga pattern overcomplicates things. This could as well be hand coded and more self explanatory that way imho.
@sfulibarri 8 months ago
"storing orderids in a db and looping over it" was a solution he worked on earlier in his career, before message brokers like RabbitMQ were readily available; it was just how they achieved decoupled processing absent specialized tooling.
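The "store order IDs in a db and loop over them" approach debated in this thread looks roughly like the sketch below. It assumes an in-memory SQLite database and hypothetical table and function names (`orders`, `outbox`, `place_order`, `dispatch_pending`); it is not the code from the talk.

```python
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " order_id INTEGER, sent INTEGER DEFAULT 0)"
    )


def place_order(conn, order_id, total):
    # One local transaction: either both rows are written or neither is.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute("INSERT INTO outbox (order_id) VALUES (?)", (order_id,))


def dispatch_pending(conn, publish):
    """Publish every unsent outbox row; mark a row sent only after publish returns."""
    rows = conn.execute(
        "SELECT id, order_id FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for outbox_id, order_id in rows:
        try:
            publish(order_id)
        except Exception:
            continue  # row stays unsent and is picked up on the next pass
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (outbox_id,))
```

A crash between `publish` and the `UPDATE` causes the same order to be published again on the next pass, so delivery is at-least-once and the consumer has to tolerate duplicates.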
@JohnDlugosz 1 year ago ⁺¹
re 4 options: what about #5 Do Something Else ?
You may have alternative services arranged for this very reason. You might be load-balancing between different servers, and can switch to the other one.
At the in-person merchant, if the machine goes down, they break out the legacy paper forms. Your equivalent would be to log the payment information using a dumb system (e.g. append to a text file) to be followed-up by a salesperson. Or, automatically switch to "payment due" rather than pre-paid. I think this happened to me on something I ordered recently: I did not see any charge on the credit card, but a few days after receiving the package I got a bill in the mail.
@zombieregime 1 year ago
"not if it fails, think about when it fails"
This is EXACTLY what makes me worry about all this AI self-whatevering-machine design out there. What happens when a car doesn't have the camera/sensor resolution to see a person in the road? Or sees a road sign or traffic lights on a truck in the other lane? Or the moon low in the sky behind some smog appearing yellow? What happens when some company rushes a self driving taxi to market and never considers a "Ride over. Dismiss taxi" button and the taxi drives off with their luggage....or baby? There are tons of videos of vehicles messing up. Sometimes it just makes an annoying ding. Sometimes people are hurt. Sometimes people are killed. They die because a dev didn't consider "when it fails." Yet for some reason, people believe and (insanely) trust the marketing hype, not once considering that that department of a company is expressly for tricking you into buying their product. LPT - at no point is a company ever obligated to operate in your best interest. For the love of pasta, people, put some thought into these products that are running code copied from stack and git that you can run on a raspberry pi before trusting your safety to them. Or the safety of your children. Or your neighbors and their children. Company execs, DO THE HARD THING!!! MAKE YOUR DEVS CONSIDER FAILURE MODES!!! If they brush it off, take a derisive attitude towards the type of person that 'would do that', or act cavalier in any way when designing the control system of a potential death machine with your company's name on it .....FIRE THEM ON THE SPOT!!!! You don't want that kind of publicity.
@twynb 1 year ago
yeah, this is why i'm not gonna trust self-driving vehicles anytime soon. roads have wild amounts of different stuff going on and there's so many edge cases that there is no way for any developer to catch them all.
and if one of those edge cases does occur, in the worst case, that's a metric tonne's worth of death box driving at over 100km/h that's unsure what it's supposed to do, which.. no, thanks, i'd rather not.
@MrNickP 2 years ago ⁺³
This seems like you are persisting the credit card info, which would make you non-PCI-compliant.
@emjizone 2 years ago
3:55 As I say now: failure isn't even an option; it's a feature! :p
@7th_CAV_Trooper 2 years ago
Junior devs write enough code to make it work. Tech leads write enough code so it can't fail.
@IFraid 3 years ago ⁺¹
Has the author heard of 3-D Secure? That will make things much harder.
@kelvinyonger8885 2 years ago
Let me guess, exception safety? (since payment and fulfillment are separate).
@robertkelleher1850 2 years ago ⁺²
Great talk. Learned a lot.
One note. "ProcessManager" really?? ugh. How about OrderProcessor? Haha, well, if I listen long enough he says that at 56:03
@sakcee 5 years ago ⁺¹
Thank you!
@wowDepressive 3 years ago
Brilliant!
@thewhitefalcon8539 1 year ago
If the Stripe refund fails the customer calls customer service, but it hopefully happens a lot less.
@nintendoeats 2 years ago ⁺³
In which we spend an hour trying to solve the two generals problem.
@nintendoeats 2 years ago
Also... Eye-Dem-Po-Tent, not Item-Potent.
3 years ago ⁺²
finally everything fails 😂
@emmanueladebiyi2109 4 years ago
Awesome
@PetrGladkikh 4 years ago ⁺²
Two-phase commit protocol - there is a solution for it. Alas we have to cut corners in need for performance.
@familytamelo8140 4 years ago ⁺³
2PC has good chances to block. I.e. when the coordinator goes down. This kills availability. So it's bad not only because of the extra interaction overhead.
@a0flj0 4 years ago ⁺³
@@familytamelo8140 You can deal with this by using a distributed coordinator. Still, for most situations, the cost is prohibitively high.
A long long time ago, there was a thing called grid computing. It was more or less cloud on reliable hardware. When you can rely on hardware, software becomes a lot simpler, and you can handle the cluster or data center operating system the same way you'd handle a single machine operating system.
But in fact, it turned out to be a lie: reliable hardware isn't. That's why we are now stuck with the cloud: huge steaming (as in evaporating a lot of water for cooling) piles of cheap, fail-happy machines. (Really, I once knew a team working on something that was popping machines like popcorn - a machine assigned to that app had an average lifespan of a few months before the pixie dust vanished.)
I think there's similar thinking with transactions. ACID relational databases are an extremely expensive thing, made possible only by advances in hardware. Once your data structures span the network, that price is no longer worth paying. You just have to learn to live with it. But it isn't that difficult, no more difficult than switching from grid computing to clouds - that one only took a few decades :-D
@familytamelo8140 4 years ago ⁺²
@@a0flj0 Consider what happens when (not 'if') DTC dies during the 'commit' phase when, say, it has sent commit signal to only some servers (and they did commit) but not yet to the others, before it went down. Now the whole transaction is in inconsistent state (as well as the participating resources, i.e. servers). Should we now allow to read/write to those resources (servers) before the DTC comes up and repairs the transaction state? If yes, we'll be serving inconsistent data which defeats the purpose of all that dance with DTC in the first place, if no we'll hurt availability which is absolute no-go for the majority of modern distributed systems.
Corollary: DTC is not an option in practice (cloud scenarios, highly distributed systems). One should only consider using it in an on-prem private data center where the network is more or less reliable (to cut down on DTC failure scenarios) which is also a known fallacy:)
Plus, you correctly pointed out the overhead for all that communication. In most cases, it's unacceptable too due to the latency reqs.
@a0flj0 4 years ago ⁺¹
@@familytamelo8140 There are plenty of good descriptions of distributed two phase commit mechanisms all over the web. Reliability is not the issue. Cost/latency issues, however, make such an approach inconvenient for most uses.
@familytamelo8140 4 years ago ⁺¹
@@a0flj0 there is only one 2PC out there and it's inherently prone to the issue with failing coordinator as I noted above. The more reliable distributed consensus protocols that I think you're referring to are, for example, Paxos, Zab, Raft. But they're not 2PC even by the wildest stretch of imagination :)
P.S. While a consensus protocol is running, the availability is affected too (not as badly as when the coordinator is down, of course). So latency is not the only issue. As with every architectural decision, there are a lot of things to consider. In some cases one may find it not an issue, while in others - just the opposite.
@hkravch 2 years ago ⁺¹
Using cardid is a bad idea. When a user places another order they will use the same card.
@fmaximus 2 years ago ⁺⁸
It is using cart ID, not card ID. I guess it is a uniquely generated ID when you start adding things to your basket.
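The distinction matters because a per-cart ID can serve as an idempotency key: a retry of the same checkout reuses the key, while a new order gets a new one even if the card is the same. A minimal sketch, assuming the payment provider accepts such a key; `FakePaymentGateway` is a hypothetical stand-in, not Stripe's client library.

```python
import uuid


class FakePaymentGateway:
    """Stand-in for a payment provider that honours an idempotency key."""

    def __init__(self):
        self._charges = {}

    def charge(self, idempotency_key, amount_cents):
        # Repeating a key returns the original charge instead of creating a new one.
        if idempotency_key not in self._charges:
            self._charges[idempotency_key] = {
                "id": len(self._charges) + 1,
                "amount": amount_cents,
            }
        return self._charges[idempotency_key]


def new_cart_id():
    """Generated once, when the shopper starts filling a basket."""
    return str(uuid.uuid4())
```

Usage: call `gateway.charge(cart_id, 500)` as many times as the retry logic needs for one cart; only the first call creates a charge, and the next basket gets a fresh `new_cart_id()`.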
@szirsp 2 years ago ⁺⁶
6:30 Just drop the DB transaction and insert the order with a "pending payment" status. Done. :)
Then update status on each successful step (status can be a bit field if not all actions are mandatory, require sequential execution).
Have a scheduled task (cron job)/background process that checks incomplete orders (that are older than a minute/hour/day/month...;) or just have a human check them (and contact the customer to solve the issue - like wrong card number...).
Did payment fail? Notify the user in the function call: "sorry for your inconvenience...will get back to you" then check in background if payment really failed or just the notification from successful payment wasn't received (checking your bank account for incoming payments with that order id is an option...) Ask the customer if they still want it...
What is the expected failure rate anyway? How often does this happen? Can this be handled with a single part time employee or does this require a separate department with supercomputers and over engineered solutions?
(Low failure rate can still mean high volume, but is it a technical issue that requires retries/SW, or a human issue that requires people, customer support? Because the solutions probably will be different. High failure rate on the other hand could be solved by having redundant payment providers or just switching provider...)
Notification email is mostly a don't-care; orders should be available on the site for download. (If it's important, send it again or use a different service. Sending an email doesn't mean they are going to receive it anyway.) If generating the PDF invoice fails, then an admin should be notified in the back end, not in the order function call (it should be exceptionally rare, or something is wrong with the system).
I don't know why they need an MQ, and not just a database trigger (create an order to process...), but it should be easy to re-send it in the background if status is not updated and no ACK is received. (Just because an MQ message is sent doesn't mean it is processed on the other side; some kind of "order shipped" status should get back eventually in the background anyway, so why not have some acknowledgement and tracking status updates in there too? Someone/something probably should check on orders that are not shipped and haven't received a status update in a while...)
How is this talk an hour long?
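The "status can be a bit field" idea in this comment could look something like the sketch below. The step names, the choice of which steps are mandatory, and the `incomplete_orders` helper are assumptions made for illustration only.

```python
from datetime import datetime, timedelta
from enum import IntFlag


class Step(IntFlag):
    PAID = 1
    EMAILED = 2
    INVOICED = 4
    QUEUED = 8


# Assumption: only payment and queueing are mandatory; email and invoice are optional.
REQUIRED = Step.PAID | Step.QUEUED


def is_complete(status):
    """True when every mandatory bit is set, regardless of the optional ones."""
    return (status & REQUIRED) == REQUIRED


def incomplete_orders(orders, now, min_age):
    """What a scheduled job would flag.

    `orders` is an iterable of (order_id, status, created_at) tuples.
    """
    return [
        order_id
        for order_id, status, created_at in orders
        if not is_complete(status) and now - created_at >= min_age
    ]
```

Each successful step sets its bit with `status |= Step.PAID` and so on, so steps that do not depend on each other can complete in any order.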
@livingroom1273 2 years ago ⁺²
Further: Stripe has events, so you can respond to a successful payment afterwards and only then send the queue & email..
@myusernameisrighther 2 years ago ⁺³
Because not everyone has your level of base understanding of the functionality being discussed. You make a lot of assumptions, and don’t explain well your reasoning. It takes time to explain these intricacies. If you already know all the answers, why are you wasting your hour?
@hijarian 2 years ago ⁺⁸
You seriously didn't understand that the point of this talk was not the specific technical decisions but the concept that every line of code in distributed systems can fail, and you have to prepare for it?
@berkes 2 years ago ⁺²
I'm not sure who I consider a more ignorant programmer: the one that doesn't yet realize every line in a distributed system can fail, or the one that smugly assumes his convoluted workaround for this specific particular case is somehow a generic solution to all distribution problems. Or, actually, I am sure.
@szirsp 2 years ago
@@hijarian Are you sure that was the point? Because he sure went over a lot of technical details about this specific problem, but ok, my bad, it was an honest mistake thinking it was an advanced topic and not a distributed systems 101 "Hey guys you know calls can fail, right?"
And even malloc can fail, it's not just distributed systems...
@NuncNuncNuncNunc 2 years ago
Along similar lines, it took me too long to realize that Promise.all in JavaScript is evil and that linters should by default detect its use as an error.
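The comment does not spell out the objection. Assuming it is that `Promise.all` rejects as soon as any input rejects, so the outcomes of the other calls are lost, the same hazard can be shown with Python's `asyncio.gather`, which behaves similarly by default. The sketch below is an analogy in Python, not JavaScript; the `step` coroutine and its names are hypothetical. Passing `return_exceptions=True` keeps every outcome so each failure can be handled individually.

```python
import asyncio


async def step(name, fail=False):
    """Hypothetical remote call that either succeeds or raises."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"


async def run_all():
    # return_exceptions=True collects exceptions as results instead of
    # propagating the first one and hiding what happened to the rest.
    results = await asyncio.gather(
        step("payment"),
        step("email", fail=True),
        step("queue"),
        return_exceptions=True,
    )
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return succeeded, failed
```

In JavaScript the closest counterpart is `Promise.allSettled`, which likewise reports every outcome rather than only the first rejection.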
@runejensen3978 1 year ago
NO NO NO. You should NOT use authorize/capture logic as a part of error handling. Authorize/capture is a means of following the law, under which you are only allowed to withdraw money from a customer's account once the goods have been shipped (it even says so on the page at 27:03).
@yorailevi6747 2 years ago ⁺¹
This talk shouldn't exist. Someone seriously failed. I still hope that most developers are better than this. This is a shame.
