Use coupon code engineered at nordpass.com/engineered to get a free 3-month trial of NordPass Business, no credit card required.
🚀 Get promoted in 2025 by taking my FREE 5-Day Promotion Accelerator Challenge - geni.us/9P7CAM
💥 Continue the conversation on my Discord server with like-minded ambitious tech professionals. #accountability is *chef's kiss* and #wins is motivating - discord.gg/HFVMbQgRJJ
📈Transform your tech career with my free weekly newsletter - alifeengineered.substack.com/
Actually, this is so encouraging: executing the equivalent of an rm -rf command and still being able to reach principal level.
I would go so far as to say that he didn’t get promoted in spite of this; it probably helped his career.
1. It highlighted the importance of the system he created.
2. He figured out a way to recover and save face while under immense pressure.
3. He did some cross-org coordination that shows leadership qualities.
4. Everyone probably knew who he was after that, and getting the spotlight helps you get promoted.
that passive aggressive interaction with s3... definitely came from experience
Folks, this is how Steve would answer in an interview. It is really valuable to see which failure you pick and how you describe it in a clear manner. Thanks, Steve!
That interaction with the services team is 💯. Make no assumptions, treat everyone like AI 😂
The learning from the mistakes part is important. Once we had a principal level engineer with a prod breakage rate that was measurably 10x higher than anyone else in the company. The company took the blameless culture too far & each time the answer was what can we do to prevent this from happening rather than addressing the elephant in the room. We ended up spending significant resources babyproofing everything for one engineer rather than surgically operating on the root cause. "Lightsaber night is cancelled. Thanks Todd!": If you aren't willing to act on gross recklessness, then the organization will build layer on top of layer of bureaucracy that punishes everyone. We extended procedural due diligence by two weeks or more to release changes for the entire organization to blamelessly prevent one engineer from breaking prod nonstop.
I love a great SEV story. Thanks for sharing!
Me watching this right before I go on-call tomorrow, having dealt with a customer impacting issue last Friday :D.
PS: I am part of the Media Ingestion and Processing team in Prime Video.
I had a feeling right when you mentioned the chunking logic that this was going to be a case of a script gone rogue due to a character. Everyone loves Little Bobby Tables, after all :)
Seriously though - I've only seen SEV2 in my time so far. I can't imagine being at the center of a SEV1.
This is very valuable. Thank you for sharing!
Great storytelling, explanations, and video!
Thank you, Steve for openly sharing these experiences. I enjoyed the video. Well crafted.
Now this is what we call great content 🙌
Subbed for the thumbnail meme, stayed for the knowledge.
The swiss cheese analogy is widely used in aviation to explain that accidents are never the result of a single error.
Thanks for sharing! Yes, today's world offers a lot more possibilities to prevent such issues. Our infrastructure, for example, is fully event-driven, and even when something breaks, we still have the dead letter queue. Great time to be alive! :)
Great insights! Thanks for sharing your experiences so that we can all avoid making the same mistakes.
Awesome video as always man!
The interaction you describe between you and the S3 department sounds just a tad better than every interaction with AWS Business Support.
3 years as a software engineer and the worst thing I have done was a CSS styling bug that hid an add to cart button on mobile viewports.
played around with the z-index :D
Great video and channel Steve. Thanks so much. You have a wonderful gift for communication.
The 2nd disaster shows why SDETs are an important part of the application.
Great video!
Great video!!
I really don't understand how the 2nd issue made it to the production environment. Boundary-value analysis is testing 101, virtually any testing book covers it circa chapter 1.
That first example must have left the deepest pit in your stomach! I would have been fighting back tears personally.
That seems like a documentation problem on the S3 side.
By any chance, did you work with Ethan Evans? He shared a similar story in another podcast.
Yes I did. Same event.
9:33 you did not hesitate even a little bit before entering that quantity? 😂
awesome vid. as a mid level engineer, i echo what's in the video. and hopefully i'll not be at the end of a sev1.
When you cause a sev 1 and need to do a COE, you'd better CYA.
Would it not have been possible to provide X supported workflows to the distribution companies so the script could include all the possible combinations?
How can teams justify the importance of having a testing environment? The work isn't available to our customers. Test engineers are often treated as 2nd class citizens in terms of career paths, salary and visibility.
But this bigger-than-5G file case and this delete code path never got tested?
So could you please explain what you did to solve the last incident? I just want to understand what you guys did to fix it.