Dev Deletes Entire Production Database, Chaos Ensues

  • Published May 16, 2024
  • If you're tasked with deleting a database, make sure you delete the right one.
    Sources:
    about.gitlab.com/blog/2017/02...
    about.gitlab.com/blog/2017/02...
    Notes:
    1:05 - The middle bullet point about the account that had 47,000 IPs was never mentioned in the postmortem (there was an initial report the day of and a more detailed postmortem a bit over a week after that). Perhaps that was a red herring which they figured out later on didn't really matter.
    3:07 - I made the error say too many open connections since it's easier to understand than semaphores
    3:39 - This part was confusing, since the postmortem and the initial report conflicted. The postmortem said the engineers believed pg_basebackup was failing because previous attempts created some files in the data directory, but the initial report said the theory was because the data directory existed (despite being empty). So for some reason the engineers really wanted to delete the data directory, but for what reason who knows.
    4:37 - They probably didn't check for backups in this order. I'm sure team-member-1 immediately called out he had taken a backup 6 hours earlier, and then they just had to verify the other backups in case there was a better one.
    6:21 - Being reported by a troll will not automatically remove a user, but flag it for manual review. It was then incorrectly deleted after review.
    Chapters:
    0:00 Seconds before disaster
    0:16 Part 1: Database issues
    2:21 Part 2: The rm -rf moment
    4:32 Part 3: Restore from backup
    6:13 Part 4: Post incident discoveries
    7:27 Lessons learned
    9:46 The fate of team-member-1
    10:11 ???
    Music:
    - Thriller Trailer Teaser Tense by Cold Cinema • Thriller Trailer Tease...
    - Finding the Balance by Kevin MacLeod
    - Eyes Gone Wrong by Kevin MacLeod
    - Desert City by Kevin MacLeod
    - Jane Street by TrackTribe
  • Science & Technology

COMMENTS • 2.5K

  • @VestigialHead 1 year ago +19184

    Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.

    • @1996Pinocchio 1 year ago +1892

      The legendary Onosecond.

    • @NS-sd3mn 1 year ago +336

      ​@@1996Pinocchio I see that you see tom scott

    • @youngstellarobjects 1 year ago +805

      The stress should really be minimal if you have a backup and restore procedure that actually works and you know how it works. Mistakes happen. The problem wasn't the delete command, it was the nonexistent backups and documentation.
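A cheap first check on "a backup that actually works" is that the newest backup file exists, is not empty, and is recent. The sketch below is a minimal Python illustration under assumptions: the function name `check_latest_backup`, the flat backup directory, and the thresholds are all hypothetical. It cannot prove a backup is restorable; only a test restore can do that.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_hours=24.0, min_bytes=1):
    """Return a list of problems with the newest file in backup_dir.

    An empty list means no problem was detected by these checks.
    The thresholds are illustrative defaults, not recommendations.
    """
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return ["no backup files found"]

    # Pick the most recently modified file as "the latest backup".
    newest = max(files, key=lambda p: p.stat().st_mtime)
    stat = newest.stat()

    problems = []
    if stat.st_size < min_bytes:
        problems.append(f"{newest.name} is only {stat.st_size} bytes")
    age_hours = (time.time() - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{newest.name} is {age_hours:.1f} hours old")
    return problems
```

Run on a schedule, a non-empty result is something to alert on, which catches silently failing backup jobs before the day they are needed.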

    • @LeoVital 1 year ago +571

      @@youngstellarobjects Nah, still stressful. Most companies aren't making a backup on every write that happens to a DB, so whoever deletes a DB knows that they've just made an oopsie that will cause a lot of headache for multiple people. And probably cost a lot of money for the company as well.

    • @pqsk 1 year ago +43

      As long as you have a backup there's no problem. I've done this before, but if there's no backup you prolly die of stress 😅😅😅

  • @Misanthrope84 1 year ago +21024

    "You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.

    • @urbexingTss 1 year ago +366

      that indeed is wise

    • @shahriar0247 1 year ago +34

      Loll

    • @blue5659 1 year ago +564

      A professional costs you in bold, italic and underline. An amateur mostly costs you in fine print.

    • @-na-nomad6247 1 year ago +990

      The person here is not an amateur, anyone can get brain farts especially when working an unexpected overnight, you should try it sometime, you'll start seeing ducks and rabbits in the shell.

    • @Misanthrope84 1 year ago +207

      @@-na-nomad6247 I'm a veteran in the Devops field. This comedy of mistakes could have never happened to me since I'm following a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their incompetence in understanding the commands they were running.

  • @Chris_Cross 11 months ago +2725

    The fact they live streamed while trying to restore the data is a truly epic move.

    • @xpusostomos 6 months ago +53

      Hope it was monetized

    • @godjhaka7376 5 months ago +33

      @@xpusostomos that's why they live stream and post anyway. Not to educate but rather to make money

    • @Elesario 3 months ago +9

      Sounds like they had the spare bandwidth ;P

    • @joseaca1010 3 months ago +4

      Programmer vtuber when?

    • @kv4648 2 months ago

      @@joseaca1010 already have one: vedal

  • @Webmage101 10 months ago +3446

    I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.

    • @alex_zetsu 9 months ago +270

      Actually at the time of the video, what they addressed was the fact that deleting an account could cause problems with the server, it seems they didn't actually stop trolls from deleting an employee's account. I'd have thought employee accounts would be protected. The trolls didn't even get admin powers through privilege escalation, they just reported the target.

    • @Milenakos 9 months ago +7

      read the video description

    • @DevinDTV 9 months ago +111

      @@Milenakos every company says they do a manual review, but none of them actually do

    • @Milenakos 9 months ago +2

      ​@@DevinDTV source??? (edit: i was mostly complaining about you just saying they are lying out of thin air)

    • @Therealpro2 9 months ago +48

      ​@@Milenakos source????????????????????????????????????????????

  • @SIMULATAN 1 year ago +15002

    So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session?
    Damn that makes me feel way better about my mistakes lol

    • @shahriar0247 1 year ago +573

      i would highly suggest people use customized shells, i use oh my zsh, i customize my themes to show git info, hostname (sometimes) and a lot more, not because i wanna know which ssh session im in, but i like the design :)

    • @syedmohammadsannan964 1 year ago +262

      Dude IKR! No one engineer should have that much power to shutdown an entire company's operation for even a second.

    • @0xCAFEF00D 1 year ago +307

      @@syedmohammadsannan964 No, someone has to have that.
      The general problem is that there are no safety nets. I don't mean to suggest this is a good solution, because safe-rm is just jank. But using safe-rm would most likely have saved this situation. If you replace rm through a symlink to safe-rm you can configure a blacklist on production that doesn't allow for deleting the database or other critical data.
      I find many things about safe-rm to be unsafe. It doesn't protect if you cd into a directory and then do rm -rf *. A better program should simply evaluate the path it's trying to delete and disallow it if the blacklist covers it.
      It also doesn't allow for custom messages through its blacklist. What you want is for a bad rm -rf to send a warning to the user. Otherwise there's no way of guaranteeing they don't just start avoiding the issues.
      For example, most likely you're not going to leave your backup unprotected by the blacklist just to create differences between production and backup. So a developer in this situation would expect to run into issues deleting postgres db on either server. It doesn't tell the user anything really. If you instead configure messages you can call attention to the hostname.
      The goal is just to induce further friction for dangerous actions. rm has always been so risky because it's so easy.
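The path-evaluating wrapper described in the comment above could look roughly like the following. This is a minimal Python sketch, not safe-rm itself: the protected path `/srv/db/data`, the names `is_protected` and `guarded_rmtree`, and the message wording are all assumptions made for illustration. It resolves the target first, so a relative path typed after cd-ing into a protected directory is still caught, and the refusal message names the host.

```python
import shutil
import socket
from pathlib import Path

# Hypothetical protected paths; a real list would come from site configuration.
PROTECTED = [Path("/srv/db/data")]


class ProtectedPathError(Exception):
    """Raised when a delete is refused because the path is protected."""


def is_protected(target, protected=PROTECTED):
    """True if target is a protected path, lies inside one, or contains one."""
    resolved = Path(target).resolve()
    for entry in protected:
        entry = Path(entry).resolve()
        if (
            resolved == entry
            or entry in resolved.parents   # target is inside a protected dir
            or resolved in entry.parents   # target is a parent of a protected dir
        ):
            return True
    return False


def guarded_rmtree(target, protected=PROTECTED):
    """Delete a directory tree unless the protected list covers it."""
    if is_protected(target, protected):
        # Custom message that calls attention to the hostname.
        raise ProtectedPathError(
            f"Refusing to delete {Path(target).resolve()} on host "
            f"{socket.gethostname()}: path is protected."
        )
    shutil.rmtree(target)
```

As the comment says, the point is friction rather than a guarantee: anyone with shell access can still bypass a wrapper like this.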

    • @Darkk6969 1 year ago +85

      @@0xCAFEF00D I always check the hostname of the server and triple check the directory before using the rm -rf command. If in doubt I use the mv command to a different directory as backup. If everything works ok then I go in there and delete the old directory.
      Same thing happened to Pixar's movie Toy Story they were working on. Some storage admin used rm -rf on a directory by mistake and practically wiped out the movie. Lucky someone had a copy of the data on a laptop that was offsite at the time. They were able to rebuild the movie from that data.

    • @BuyHighSellLo 1 year ago +68

      @@0xCAFEF00D no, NO single employee should have enough privilege to bring down anything business sensitive. except if you’re the CTO maybe. These operations all should require a flag or check from someone else first. Just like how one person usually shouldnt be able to push any code by themselves. They need 1 or more checks before that.

  • @rosscads 1 year ago +6443

    Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰

    • @TheDaern 1 year ago +698

      ​@@L2002 Because of this? They were open and honest about their screwups which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually and it's how you handle this that marks out the good ones from the bad ones.
      Also, a company who almost lost a production DB because of failed backups is unlikely to do it again ;-)

    • @MunyuShizumi 1 year ago +327

      @@L2002 Ah, yes, because Microsoft never has outages, data loss, or data leak incide- oh wait..

    • @sinnlos229 1 year ago +70

      @@L2002 Care to elaborate? Cause everyone else here, including me, disagrees.

    • @titan5064 1 year ago +101

      Don't feed the troll, clearly not someone who's ever worked with computers on a proper level

    • @realpillboxer 1 year ago

      @@titan5064 exactly. Their handle is "L" -- they are a literal walking loss (loser).

  • @dragonfire4869 11 months ago +2833

    This reminds me of Toy Story, and how like a month before release the entire animation was accidentally deleted, causing absolute panic and hell at Disney. Luckily, one employee had the whole thing on a hard drive that they were taking home to work on. Her initials are on one of the number plates of one of the cars in the film.
    Always make a backup.
    Edit: She was a project manager who had to work from home, and the numberplate was actually "Rm Rf" in reference to the notorious line of code that did it.

    • @mrsharpie7899 11 months ago +102

      I don't remember if it was the day-saving employee's initials, or RM-RF that was on the license plate

    • @alimanski7941 10 months ago +280

      It was Toy Story 2, and the easter egg was in Toy Story 4, where the license plate had "rm rf" in it

    • @ScruffyNZ. 10 months ago +122

      they fired that person recently

    • @atulyadav3197 10 months ago +17

      @@ScruffyNZ. Yes, I heard this too

    • @GoatzombieBubba 10 months ago +157

      @@ScruffyNZ. That person should be happy to not work for a woke company like Disney.

  • @gosnooky 1 year ago +2370

    Imagine for a moment, that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.

    • @omniphage9391 10 months ago +140

      In my first job, I got a 2 am call: within my first two weeks at the company, I had accidentally left a process in prod shut down after maintenance, which kept intensive care patient data from making it into connected systems.
      Looking back, the entire company was set up super amateurish, yet they operate in several hospitals in my country.

    • @PixelSlayer247 9 months ago +56

      Having exited my game without being sure I saved my progress before, this is very relatable.

    • @thephlophers 9 months ago +40

      the onosecond

    • @stacilynn604 9 months ago +10

      like hitting a car in a parking lot 😵

    • @ashesagainst7236 8 months ago +47

      At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server but they determined it wasn't important enough.
      The company went belly-up a few years later because of a ransomware attack they couldn't recover from.

  • @ludoviclagouardette7020 1 year ago +3833

    The rule I apply for backups is that no one should connect to both a backup server and a primary at the same time, two people should be working together. The employee that was logged on both DBs should have been really two physically separated employees

    • @act.13.41 1 year ago +193

      That is an excellent rule.

    • @refuzion1314 1 year ago +142

      Yes, but, in the case that there is only one employee available and he has to connect to both he should either have different color schemes for the different servers OR do it all in one shell window and disconnect / connect to the server they have to edit that way it is a lot harder to execute commands on the wrong server.

    • @thoriumbr 1 year ago +35

      I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info before getting back to the test/dev server window.

    • @thoriumbr 1 year ago +117

      @@refuzion1314 Different color schemes look good but don't work during an outage, when you are stressed, exhausted, or anything distracts you. Sounds nice, but the mental load during a crisis is too large to pay attention to that.

    • @onemprod 1 year ago +19

      I can't tell you enough how easy it is to accidentally overwrite the wrong file. While I was working on something on a test machine with a usb stick plugged in to save the current progress, I saved the script, thought I saved it in the local directory and copied the unmodified script to my just saved usb stick version...

  • @jarrod752 1 year ago +4031

    _Luckily team 1 took a snapshot 6 hours before..._
    This happened to me. I copied a client's database to my development environment about 2 hours before they accidentally wiped it.
    They called our company explaining what happened and it got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.

    • @abelkibebe577 1 year ago +81

      You are a Legend :)

    • @mipmipmipmipmip 1 year ago +290

      I think this was how most of Toy Story was saved. It's also bad security practice :)

    • @ilyasziani5504 1 year ago +8

      @@mipmipmipmipmip Why is it bad security practice?

    • @amyx231 1 year ago +31

      And now you routinely copy the client database every 24 hours?

    • @jarrod752 1 year ago +147

      @@amyx231 Actually, due to the nature of my current work, I have a script I run on demand approx every few days as needed that takes a snapshot. I usually get around to deleting everything that's more than a month old about twice a year or when my dev server starts btching about space.

  • @Nick77ab2 1 year ago +3571

    This is why problems like this are actually sometimes good. Of course extremely stressful, but they found sooo many issues and fixed them all. Amazing.

    • @federicocaputo9966 11 months ago +86

      you are assuming they fixed them all
      Until it breaks again.

    • @JeyC_ 11 months ago +116

      @@federicocaputo9966 at least next time they'll have the experience to know what to do and what not to do

    • @brett2258 11 months ago +21

      That's a really good positive approach right there!

    • @djweavergamesmaster 10 months ago +14

      reminds me of that one ProZD skit, where the villain fixes everything

    • @mikabakker1 9 months ago +3

      @@federicocaputo9966 that is life

  • @CryShana 11 months ago +215

    When I was still a junior developer at some startup company, I was working on a specific PHP online store. Every time we would upgrade the site, we would first do it on Staging, then copy it over to Production. The whole process was kinda annoying as there was no streamlined upgrade flow yet and no documentation anywhere - it was a relatively new project we took over. I have upgraded it before so I knew what to do, and I just did the thing I always did.
    I was close to finishing it up and we had an office meeting coming up soon and lunch afterwards, so I wanted to be done with this before that - so I rushed a bit. And when I was copying files to Production, I overlooked something - I had also copied the staging config file (that contained database access info etc) to the production location and overwrote the production config file.
    After the copying had finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I also tried refreshing the production site, just to see if it works. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all, probably just localhost or something, RIGHT?? However after re-refreshing it and confirming I had actually broken production, panic set in.
    Instead of informing anyone, I quietly moved closer to my computer, completely quiet, and started looking at what is wrong - with 100% focus, I don't think I was ever as focused as I was then - I didn't have time to inform anyone, it would only cause unnecessary delays. I had to restore this site ASAP.
    I remember sweating... the meeting was starting and I remember colleagues asking me "if I am coming" - and I just blurted "ye ye, just checking some things..." completely "calmly" as I was PANICKING to fix the site as soon as possible. Luckily I quickly found the source of the mistake within a minute and had to find a backup config file - and then after recovering the config file, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.
    No one actually noticed what I had done - and I just joined the meeting as if nothing had happened - even though I was sweating and breathing quickly to calm myself down, I hid it pretty well.
    And this was a long time ago - and still to this day, I still remember that panic very well. Now I always make sure I have quick recovery options available at all times in case something goes wrong - and if possible always automate the upgrade process to minimize human errors

    • @valdimer11 2 months ago +11

      Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "in the zone". It's only ever happened to me twice but I will NEVER forget them.

  • @maxcohn3228 1 year ago +4416

    Something my first boss taught me (when I broke something big in production in my first few weeks) is that post mortems are to identify problems in a system and how to prevent them, avoiding blame to individuals.
    This is huge. Making sure to identify why it was even possible for something like this to happen and how to prevent it in the future is a great way to handle a post mortem like this. Good on the GitLab team.

    • @lhpl 1 year ago +229

      Good boss. Bad ones often like when things are done fast and "efficient". And when this then establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day, a human error occurs. Typically, such a boss will then blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and you technically are.)

    • @FireWyvern870 1 year ago +34

      Yeah, things like this are the problem of the system, not fault of the operators

    • @honkhonk8009 1 year ago +67

      You only fire people for their character, not cus of the inevitable fuckup.
      Also you basically sunk money into training this dude after that fuckup, so sacking him right after you inevitably paid to get him that experience, is counterproductive.

    • @gownerjones1450 1 year ago +36

      Also very cool that they did it completely in public even with livestreams. This will hopefully help other companies avoid mistakes like that.

    • @FlabbyTabby 1 year ago +10

      Depends. Many times, it's used as an opportunity to kick out people they consider undesirable, even if they're great employees.

  • @randomgeocacher 1 year ago +1886

    A helpful hack is to set production terminal to red and test terminal to blue or something like that. Just a small helper to avoid human f’ups if you need to run manual commands in sensitive systems.

    • @tacokoneko 1 year ago +64

      i second this I also use colors to differentiate multiple environments

    • @vaisakhkm783 1 year ago +17

      it was as easy as changing the prompt color... but it makes a huge difference

    • @Wampa842 1 year ago +60

      I use colored bash prompts to differentiate machine roles - my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on red background. It's very hard to miss.
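A role-to-color mapping like the one described above can be generated rather than hand-edited on each machine. The sketch below is a minimal Python illustration: the role names, the `ROLE_COLOURS` table, and the `bash_prompt` function are hypothetical, and the colors only loosely follow the comment (basic ANSI has no orange, so yellow stands in for backups). It returns a bash `PS1` string using the standard prompt escapes `\u`, `\h`, `\w` and `\$`.

```python
# Hypothetical role names and ANSI colour codes, loosely following the comment above.
ROLE_COLOURS = {
    "workstation": "32",      # green text
    "testing": "34",          # blue text
    "backup": "33",           # yellow text (stand-in for orange)
    "production": "1;33;41",  # bold yellow text on a red background
}


def bash_prompt(role):
    """Return a bash PS1 value for the given role.

    Raises KeyError for an unknown role, so a typo cannot silently
    produce an uncoloured prompt.
    """
    colour = ROLE_COLOURS[role]
    # \[ and \] mark non-printing sequences so bash computes line width correctly.
    return rf"\[\e[{colour}m\][{role}] \u@\h:\w\$\[\e[0m\] "
```

The returned string would be assigned to `PS1` in the shell startup file of each machine. As other replies note, color helps most as a passive cue and is easy to overlook under stress, which is why the role name is also printed in the prompt.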

    • @darrionwhitfield46 1 year ago +6

      I use oh-my-posh with different themes

    • @iUUkk 1 year ago +6

      Both database servers were actually used in production.

  • @mxbx307 11 months ago +514

    There is an awful lot that could be learned from this.
    1) You should "soft delete" i.e. use mv to either rename the data e.g. renaming MyData to something like MyData_old or MyData_backup, or just mv it out of the way so you can restore it later if needed. Don't just rm -rf it from orbit
    2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script and you just run the script, so that the pre-agreed actions are all that gets done. Do not go off piste, do not just SSH into prod boxes and start flinging arbitrary commands around
    3) Change Control - as above
    4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or - better still - get a buddy to log onto Server A from their end and you get on Server B from yours. Total separation
    5) Do not ever just su into root. You use sudo, or some kind of carefully managed solution such as CyberArk to get the root creds when needed
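Point 1 of the list above (rename instead of delete) can be sketched as follows. This is a minimal Python illustration under assumptions: the function name `soft_delete`, the `_old` suffix, and the timestamp format are hypothetical choices, not an established tool. The data stays on the same filesystem under a new name until someone deliberately removes it.

```python
import os
import time
from pathlib import Path


def soft_delete(target, suffix="_old"):
    """Rename target out of the way instead of deleting it; return the new path."""
    src = Path(target)
    if not src.exists():
        raise FileNotFoundError(src)

    # A timestamp keeps repeated soft deletes from colliding with each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}{suffix}_{stamp}")
    if dest.exists():
        raise FileExistsError(dest)

    os.rename(src, dest)
    return dest
```

For example, `soft_delete("MyData")` would leave a sibling directory named like `MyData_old_<timestamp>`. The renamed copy still needs an owner and a cleanup date, since it occupies the same disk space as the original.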

    • @magicmulder 7 months ago +38

      Also for (2), never try to "improve" anything during the actual action.
      I once prepared a massive Oracle migration that I had timed to take about 3 hours. Preparation was three weeks.
      As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it's gonna save some time". Yeah, made the whole thing slow down to a crawl, so it ended up taking 6 hours. Boss was furious.
      So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.

    • @lashlarue7924 7 months ago +1

      100%, upvoted.

    • @xpusostomos 6 months ago +2

      I religiously never delete anything

    • @thedemolitionsexpertsledge5552 6 months ago +1

      I have no idea what any of this means but I feel like this is bad

    • @alvinbontuyan8083 3 months ago

      Fucking up catastrophically with Bash commands is a canon event. It is religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive

  • @JeffThePoustman 11 months ago +215

    Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.

    • @ic6406 3 months ago +3

      Yeah, I guess it was the most stressful moment in his life after realizing what you've done. I think he had a huge blackout

  • @helmchen1239 1 year ago +1411

    I once accidentally ran a chmod -R 0777 /var because i've missed a dot before the slash (in a web project with a /var folder), which (as i've now learned) may make a unix system totally unresponsive. I can very well understand how it feels, the moment you realize what you have just done. That did cost us a few hundred euros and kept 2 technicians busy for an afternoon on the weekend. Lessons learned, today we can laugh about it.

    • @Darkk6969 1 year ago +143

      Ya, Unix / Linux will do what you tell it to do without any warnings. Pretty sure you sat there and wondered why that command is taking a long time to finish before you realize your mistake. Right then there it's the "Oh Shit" moment. 😀 Lucky for me though I use VMs so can always revert to previous snapshots.

    • @desoroxxx 1 year ago +162

      the onosecond

    • @parlor3115 1 year ago +6

      @@Darkk6969 What if you ran it on the host?

    • @FurriousFox 1 year ago +50

      @@parlor3115 he doesn't, Noah only runs things in virtualized environments, making snapshots every minute

    • @aarondewindt 1 year ago +4

      Why does it make it unresponsive? I accidentally chmod 0777 the entire "/" once and well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors. So it just cost me time. Still, I never figured out why opening up the permissions would lock everything up.

  • @LordHonkInc 1 year ago +681

    "rm -rf" is one of those commands I have huge respect for cause it reminds me of looking down the barrel of a gun (or any similar example of your choosing): Best case, you do it a) seldom, b) after a lot of strict and practiced checks, and c) if there's no alternative; unfortunately, the worst case is when you _think_ you're in that best case scenario.

    • @givenfool6169 1 year ago +44

      I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky and hadn't used sudo in that terminal at the time. I got caught on a sudo check before it ran anything absolutely hell inducing. Just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple of commands after that was a "rm -rf ./" and it had already cd'd into root.

    • @henningerhenningstone691 1 year ago +39

      @@givenfool6169 Lmao it had never once occurred to me what havoc it could wreak if you accidentally source the bash history, since it had never occurred to me that that's even possible (because why the hell would you?!). But of course it is, what an eye opener!

    • @givenfool6169 1 year ago +16

      @@henningerhenningstone691 Yeah, I was trying to source my updated .bashrc but my auto-tab is setup to cycle through anything that starts with whatevers been typed (even ignores case) so I tabbed and hit enter. Big mistake. I guess this is why the default auto-tab requires you to type out the rest of the file if there are multiple potential completions.

    • @Shadowserpant00 1 year ago +6

      @@henningerhenningstone691 bro idk wtf you're talking about and it's scaring me

    • @oliverford5367 1 year ago +1

      Do ll first, make sure you really want to delete that directory, then press up and change ll to rm

  • @TheDrTrouble 1 year ago +511

    The best practice is to rename the directory or file to something else. Idk how the developers are so calm when using deletion commands

    • @setasan 10 months ago

      Well, when you live in a poor country, being underpaid by a fucking contractor company, with a overloaded team. shit hapnz

    • @schwingedeshaehers 9 months ago +6

      I "deleted" one of my programs with the cp command (I wanted to copy the config and the main file into a subdirectory, but forgot to enter the directory name after them, so it wrote the config over the main file)
      (I could get an older version of the file from the SD card by manually reading the raw content and finding a region that still held it, since a save doesn't overwrite in place but takes a new location)

    • @Funnywargamesman 9 months ago +42

      On a home system? Absolutely. In a working environment? Doubtful. Maybe with a small company it would be acceptable, but creating an orphan database that may or may not contain sensitive information with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if that contains financial, medical, or government records.

    • @AndrewARitz 9 months ago +69

      @@Funnywargamesman you don't create it to keep it around forever, you create it as a failsafe for when you are doing potentially dangerous stuff, like deleting a whole database.

    • @Funnywargamesman 9 months ago

      @@AndrewARitz I cannot tell you how many times "temporary" things become permanent on purpose, let alone the times people have said they are going to do something, like deleting a temp database they copied locally because their permissions didn't let them use it remotely, and then proceeded to forget to delete it. This will be especially true with the most sensitive databases, "because it's more important, so we should make a copy first, right?"
      Security is everyone's job and if you do (typically) irresponsible things like copying databases, "as a failsafe," chances are you are going to form a habit that means you will do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence you need to remember 50% of people are dumber than you and some of them get REAL dumb. If you set policy to say that it would be allowed, then THEY will do it.
      This is exactly why I said that home environments and really tiny companies could be different, there it could/would be fine. Chances are, if you don't know the names of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it, it's my opinion.

  • @Dairunt1
    @Dairunt1 11 months ago +139

    One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client. I managed to get the project running on a second test environment, but that really taught me the importance of backups and of telling the rest of the staff about a problem ASAP.

  • @matthias916
    @matthias916 1 year ago +891

    I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later, but it felt so bad; I can't imagine what deleting an entire database would feel like

    • @marco56702
      @marco56702 11 months ago +46

      terrible, sending the queries makes you shiver

    • @varunkhadse5869
      @varunkhadse5869 8 months ago +4

      ig the panic was at the next level coz both dbs were deleted.

    • @Rncko
      @Rncko 7 months ago +5

      It feels like setting a torch to a sea of bank notes... that belong to the company.
      (and the company is just about to pay out year-end bonuses)

    • @Atulnavadiya
      @Atulnavadiya 6 months ago +4

      I have had good hands-on experience with SQL databases at my company, but I'd still check my query at least 10 times before executing it. We had more than 10 years of clients' data saved in the database.

    • @TrevoltIV
      @TrevoltIV 6 months ago

      @@marco56702Right, I’m always quadruple checking every query to make sure my retarded ass didn’t type delete * or something

  • @MechMK1
    @MechMK1 1 year ago +607

    For this reason, all our servers have color-coded prompts. Dev/Testing servers are green. Staging is yellow. Prod is bright red. When you enter a shell, you immediately see if you are on a server that is "safe" to mess around with, or not.
    The advantage to doing this in addition to naming your server something like "am03pddb", is that you don't have to consciously read anything. Doesn't matter if you accidentally SSH into the wrong server. If you meant to SSH into a "safe" server, then the bright red prompt will alert you that you are on prod. And if you meant to SSH into a prod server, then you better take the time to read which server it actually is.

    • @tacokoneko
      @tacokoneko 1 year ago +13

      i agree except there are only so many colors, so if manually controlling a lot of different machines (something that could maybe be avoided depending on what the servers do) i believe it's important to use unique memorable hostnames. the two servers in this story had hostnames 1 character apart and the same length, unless the names were all changed for the artwork

    • @seedmole
      @seedmole 1 year ago +9

      @@tacokoneko Yeah like imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily misleadingly "confirm" that you're on the right one when you're not.

    • @makuru_dd3662
      @makuru_dd3662 1 year ago +5

      Also, don't ever work on the live database, a lesson I have learned the hard way many times on my own.

    • @MunyuShizumi
      @MunyuShizumi 1 year ago +14

      @@makuru_dd3662 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.

    • @makuru_dd3662
      @makuru_dd3662 1 year ago +1

      @@MunyuShizumi you make a backup or anything, yes you need to maintain it but not by making massive untested changes.
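
The colour-coded prompt idea from this thread could be sketched roughly as follows. This is only an illustration: the hostnames, the `HOST_ENVIRONMENTS` inventory and the colour choices are assumptions, not anything the commenters shared. The function builds a bash `PS1` string whose colour depends on the environment, and unknown hosts are treated as prod so that a missing entry fails safe.

```python
import socket

# ANSI colour codes: green for dev, yellow for staging, bold red for prod.
ENV_COLOURS = {"dev": "32", "staging": "33", "prod": "1;31"}

# Hypothetical inventory mapping hostnames to environments.
HOST_ENVIRONMENTS = {
    "db1.dev.example.com": "dev",
    "db1.staging.example.com": "staging",
    "db1.prod.example.com": "prod",
}


def build_ps1(hostname: str) -> str:
    """Return a bash PS1 string coloured by the host's environment."""
    # Hosts missing from the inventory are assumed to be prod (fail safe).
    env = HOST_ENVIRONMENTS.get(hostname, "prod")
    colour = ENV_COLOURS[env]
    # \[ and \] tell bash that the enclosed bytes take no width on screen.
    return rf"\[\e[{colour}m\][{env}] \u@\h:\w\$\[\e[0m\] "


if __name__ == "__main__":
    print(build_ps1(socket.getfqdn()))
```

One possible use is to set the prompt from a shell profile with something like `PS1="$(python3 prompt.py)"`, where `prompt.py` is a hypothetical file name for the script above.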

  • @ErikPelyukhno
    @ErikPelyukhno 9 months ago +24

    Your editing is phenomenal. What an insane series of events 😂 Glad GitLab was able to get back up and running. All that public documentation was refreshing to see, since it shows they were being transparent about their continued mistakes and their recovery process.

  • @christopherg2347
    @christopherg2347 1 year ago +106

    If you are working with multiple shells, VMs, remote sessions or the like - make sure they are color coded based on the machine you are running against!
    It can be as simple as picking a different color scheme in windows. But it is just too easy to mess up when all the visual difference is a single number, somewhere in the header.

    • @neekfenwick
      @neekfenwick 3 months ago +1

      Yep, I came here to say this. For any serious system I connect to, I use different params for my session, in my case I like old fashioned xterm, something like: alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'"
      It's very useful to see the green, red, blue, etc. colouring and be sure which system you're talking to.

    • @Kalmaro4152
      @Kalmaro4152 2 months ago +2

      It's very nice that Linux shells actually support setting session colors

  • @GanerRL
    @GanerRL 1 year ago +1100

    imagine messing with some employee by flagging them and managing to bring down the entire site by proxy

    • @batorerdyniev9805
      @batorerdyniev9805 1 year ago +2

      What

    • @hypenheimer
      @hypenheimer 1 year ago +3

      Bot

    • @GanerRL
      @GanerRL 1 year ago +53

      @@hypenheimer beep boop

    • @Jacob-ABCXYZ
      @Jacob-ABCXYZ 1 year ago +3

      How to take down a site, the stealthy way

    • @kulled
      @kulled 1 year ago +6

      @@hypenheimer nah. it was probably a minecraft shorts bot account before he bought it though.

  • @build-things
    @build-things 1 year ago +811

    As an engineer for a large company you got me in the feels talking about asking for help or posting a pr and then seeing all the mistakes you made😊

    • @stingrae789
      @stingrae789 1 year ago +19

      In my previous position I worked closely with one guy and we used to joke about how we were using each other as a rubber duck :D.

    • @EChan-eu2co
      @EChan-eu2co 1 year ago +4

      The buzzword is SRE and postmortems are supposed to be blameless now...

    • @jillfizzard1018
      @jillfizzard1018 1 year ago +2

      This is why you first mark the PR as a draft and read over the changes one more time before marking it as ready.

    • @mortache
      @mortache 1 year ago +2

      @@stingrae789 Damn I didn't know this thing has a name! I legit have done this before while discussing weird math problems

  • @ChosenOne-wz6km
    @ChosenOne-wz6km 1 year ago +4

    This video is awesome! The step by step analysis of what occurred during the outage coupled with the story telling format helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!

  • @CarrotCastle
    @CarrotCastle 1 year ago +20

    One of my first jobs in IT was working as a big data admin and this video allows me to re-live the spicy moments of that job but with none of the responsibility attached

  • @HazySkies
    @HazySkies 1 year ago +395

    "Slams Ctrl+C harder than he ever had before"
    As a relatively new linux user, I felt that one.

    • @ss-to7ii
      @ss-to7ii 8 months ago +2

      As a new Linux user use the "-i" flag for "interactive" when using rm and a couple other commands.

    • @KR-tk8fe
      @KR-tk8fe 5 months ago

      As a windows user, I was very confused

    • @LC-uh8if
      @LC-uh8if 4 months ago +3

      @@KR-tk8fe CTRL+C. On most Unix/Linux-based CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (interrupt) to the foreground process (the active program), which usually causes the program to terminate, though a program can be written to handle it differently. It's basically the Oh Shit or This Is Taking Too Long button.

    • @MrCmon113
      @MrCmon113 1 month ago +2

      @@LC-uh8if Isn't that the same in Windows terminals? 🤔

  • @xmorse
    @xmorse 1 year ago +246

    The real problem here is that you can delete any user data by simply mass reporting him

    • @technicolourmyles
      @technicolourmyles 1 year ago +44

      I'm seeing a lot of serious problems here... I guess this is why I never heard of GitLab before.

    • @PatalJunior
      @PatalJunior 1 year ago +6

      I highly doubt it is instantly deleted; probably someone made the decision to delete it (it could just be an account spamming a bunch of mess onto repositories, and that isn't good either).

    • @FighteroftheNightman
      @FighteroftheNightman 11 months ago +44

      @@technicolourmyles they're literally the 2nd largest enterprise git solution provider in the world.

    • @nonamepasserbya6658
      @nonamepasserbya6658 10 months ago

      When in doubt, it's probably 4chan
      That low hanging fruit aside, not a good thing if someone can just do that with a bot acc. Maybe grant employees a special anti report protection can help until they find a more permanent solution against those trolls

    • @Webmage101
      @Webmage101 10 months ago +4

      @@PatalJunior 6:21 literally says they fucked up by not making it check the details before deletion

  • @danusminimus9557
    @danusminimus9557 11 months ago

    Seen your video history and the evolution of your videos - this format is amazing and you're really good at it :D

  • @sortebill
    @sortebill 10 months ago +3

    your content is really good, please keep up making these mini documentaries about tech failures!

  • @jhyland87
    @jhyland87 1 year ago +59

    At a few places I worked as a Linux admin or engineer, the shell prompts (PS1) were color coded. Green was dev, yellow was QA and red meant you're in prod. Worked like a charm.

    • @blackbot7113
      @blackbot7113 8 months ago +4

      Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.

    • @jhyland87
      @jhyland87 8 months ago

      @@blackbot7113 Yeah, it's a very wise thing to do imo. Currently, I work at a bank, and I recommended we have the header in the UI of the colleague and customer portal be different colors for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at and got a reply along the lines of "How about we just pay attention to the server and page we're on?"
      It's crazy because it's such an easy change to implement and almost entirely prevents anyone making such silly (yet catastrophic) mistakes.
      Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/deployed, so any of those settings get wiped out entirely.
      For it to be permanent, it needs to be added in the Dockerfile.

  • @daigennki
    @daigennki 1 year ago +122

    Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.

  • @minsiam
    @minsiam 8 months ago +6

    When I was just starting at a company, I accidentally deleted all the ticket intervals from the database, causing all the tickets to close immediately and sending massive spam to the admins. I was really terrified and didn't know what to do; we didn't have any backup either. I apologized as much as I could and didn't make another mistake like this again for years. Sometimes mistakes make you work harder and be more careful in life.

  • @rishavmasih9450
    @rishavmasih9450 1 year ago +3

    Oh God my heart started sinking when you said he noticed the shell he was running the command in.

  • @karmatraining
    @karmatraining 1 year ago +66

    An old best practice that so many people these days seem to forget or never have heard about is that every week, you try to pull a random file from your backup system, whatever that is. (Or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this - so many people think they set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
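
A weekly spot check like the one described above might look something like this in Python. It is a sketch under loud assumptions: the backup is a plain directory tree mirroring the source, and the chosen file has not legitimately changed since the backup ran. Real backup systems (and database dumps in particular) would need an actual restore step instead of a file comparison.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def spot_check_backup(source_dir: Path, backup_dir: Path, rng=random):
    """Pick one random file from the backup and compare it with the source.

    Returns (chosen_backup_file, matches_source).
    """
    candidates = sorted(p for p in backup_dir.rglob("*") if p.is_file())
    if not candidates:
        # An empty backup is exactly the surprise this check exists to catch.
        raise RuntimeError(f"backup at {backup_dir} contains no files")
    chosen = rng.choice(candidates)
    original = source_dir / chosen.relative_to(backup_dir)
    matches = original.is_file() and sha256_of(original) == sha256_of(chosen)
    return chosen, matches
```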

  • @wojtekpolska1013
    @wojtekpolska1013 1 year ago +249

    respect for not firing the guy, it was obviously just a small mistake, and it wasn't his fault that the backups didn't work. it shouldn't be possible for 1 command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p

    • @yerpderp6800
      @yerpderp6800 1 year ago +124

      If they fired him they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure the old employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson but a lesson nonetheless.

    • @tuxie93
      @tuxie93 1 year ago +65

      Yep and he'll train new employees making super sure to emphasize triple checking before deleting from prod.

    • @D00000T
      @D00000T 9 months ago +4

      That’s Unix systems for you. Their open nature makes them super useful for a lot of things but it’s also so easy to break them.
      Plus that old trick of telling new linux users that sudo rm -rf is a cool easter egg command wouldn’t be the same with more safeties and preventions.

    • @BitTheByte
      @BitTheByte 8 months ago +3

      What if I want to delete everything? I don’t want a baby proofed OS. I want an OS that does what I want. Even if I want to burn it all

    • @wojtekpolska1013
      @wojtekpolska1013 8 months ago +8

      @@BitTheByte why buy a computer at that point lol

  • @theultimatetrashman887
    @theultimatetrashman887 1 year ago +35

    the realization of what you're doing before it finishes is so cruel and happens so often, that's why when you're doing a job you always do it slow but correctly

  • @WackoMcGoose
    @WackoMcGoose 1 year ago +338

    As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) are enough to save you from porking a prod database.

    • @Skyline_NTR
      @Skyline_NTR 1 year ago +4

      This interests me. Got any resources/links to set that up (dangerous commands temporarily allowed by time-limited permissions via LDAP)?

    • @WackoMcGoose
      @WackoMcGoose 1 year ago +14

      @@Skyline_NTR Afraid not, it was several pay grades above me both in job role and in coding knowledge, and I lost access to the company slack back in december so I can't really ask anyone...

    • @ProgrammingP123
      @ProgrammingP123 10 months ago +1

      @@WackoMcGoose Ahh were you laid off also??? I was lol

    • @WackoMcGoose
      @WackoMcGoose 10 months ago +4

      @@ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze a month later so I had no hope of transferring...

  • @DomskiPlays
    @DomskiPlays 1 year ago +279

    Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema is safe in case of someone accidentally deleting everything and they told me everything is backed up daily. Kinda scared that I don't know how or where this is happening except for a job.

    • @indyalx
      @indyalx 1 year ago +59

      I checked my database backup script a couple days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately. Then went and fixed the issue and made sure it would notify if there was no backup in 6 hours.
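
A "notify if there was no backup in 6 hours" check can be quite small. The sketch below assumes backups land as files in a single directory and uses file modification times as the freshness signal; both are assumptions, and the alerting itself (email, chat, pager) is left out.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 6 * 60 * 60  # the six-hour threshold from the comment


def newest_backup_age(backup_dir: Path, now=None) -> float:
    """Return the age in seconds of the most recent file in backup_dir."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return float("inf")  # no backup at all counts as infinitely stale
    return now - max(mtimes)


def backup_is_stale(backup_dir: Path, now=None) -> bool:
    """True if the newest backup is older than MAX_AGE_SECONDS."""
    return newest_backup_age(backup_dir, now) > MAX_AGE_SECONDS
```

Run from a scheduler, a `True` result would be the trigger for whatever notification channel the team actually watches.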

    • @CMDRSweeper
      @CMDRSweeper 1 year ago +46

      The next question is... "Have you tested the backups?"
      If they can't say for sure WHEN they were tested... Be very afraid...

    • @indyalx
      @indyalx 1 year ago +8

      @@CMDRSweeper we load the prod backup into staging nightly

    • @forbiddenera
      @forbiddenera 1 year ago +2

      6 hour full backups, mirroring/replicas, multiple servers and daily volume backups..

    • @robertbeisert3315
      @robertbeisert3315 1 year ago +4

      "Trust me, bro" only works in Dev. Every other environment needs regular verification.

  • @tatsuuuuuu
    @tatsuuuuuu 7 months ago +3

    Linux actually can, in certain circumstances, "undo" this wild kind of situation. Having ZFS as the file system will allow you to revert to a previous snapshot of the filesystem. It's like versioning, but for the entire file system. Of course it takes up quite a bit of space, so it's not done that often; software installs are automated snapshot points, for instance. But you can trigger one manually when you think you're about to do something you're unsure about. (Since the selection of saved states is at GRUB, yes, an unbootable system is still recoverable if you still have GRUB.)

  • @jfbeam
    @jfbeam 11 months ago +116

    The #1 thing I learned WAY EARLY on in my IT career (three decades): Never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (and on a live db server, that alone will be enough of a mess to clean up.) As for backups, if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in.
    (The fact they're actively hiding ("lying") about this fiasco should be criminal.)

    • @kurenaigames5357
      @kurenaigames5357 11 months ago +14

      yea renaming is the key. first rename, then set up everything and then delete the renamed folder like a few months later.
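
The "rename, don't delete" habit from this thread can be wrapped in a tiny helper. This is a hypothetical sketch: the `.removed-<timestamp>` suffix is an invented convention, and a rename is only cheap and instant when source and target sit on the same filesystem.

```python
import time
from pathlib import Path


def set_aside(directory: Path) -> Path:
    """Rename a directory out of the way instead of deleting it."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = directory.with_name(f"{directory.name}.removed-{stamp}")
    if target.exists():
        raise FileExistsError(target)
    directory.rename(target)
    return target


def restore(set_aside_path: Path, original: Path) -> None:
    """Undo set_aside by renaming the directory back to its original path."""
    if original.exists():
        raise FileExistsError(original)
    set_aside_path.rename(original)
```

The actual deletion then becomes a separate, later step, taken only once the replacement has proven itself.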

  • @matthewstott3493
    @matthewstott3493 1 year ago +130

    Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious but if you don't test you will experience catastrophic consequences.

    • @-TheBugLord
      @-TheBugLord 1 year ago +3

      Exactly. Just like a dam, if there is a weak-point at the bottom, it all may come crumbling down.
      There needs to be a lot of redundancy when it comes to backups. Especially when it comes to a big server. An engineer accidentally removing a database should not have that catastrophic of consequences.

    • @esa4573
      @esa4573 1 year ago +2

      Yeah, the general rule is/should exist for having to be ready for stuff like that. If your fuckup is non-recoverable or a massive pain, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.

  • @jeromesimms
    @jeromesimms 1 year ago +24

    Wow! This was great and so interesting. I'm so glad I found this channel. I would love to hear more in depth analysis of software engineering fails

  • @swaggy3987
    @swaggy3987 6 months ago +4

    What's far more impressive about this whole situation is how calm the engineers were in handling the situation. That to me is far more valuable than having engineers that are too gun-shy to make prod db changes at 12AM and panic when something goes wrong.

  • @HippieInHeart
    @HippieInHeart 9 months ago +2

    "should be safe" famous last words lmao

  • @hchris96
    @hchris96 1 year ago +18

    I didn’t realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future
    I am gonna share this with some of my coworkers

  • @SteveAcomb
    @SteveAcomb 1 year ago +106

    Great video! Well produced content about software engineering war/horror stories are exactly what I’ve been looking for, keep it up!

  • @MichaelJordan-hi4ed
    @MichaelJordan-hi4ed 10 months ago +1

    This genuinely made my day.

  • @CryptbloomEnjoyer
    @CryptbloomEnjoyer 7 months ago +1

    I know the exact feeling of terror the moment you realize the command you just ran is about to cause havoc

  • @Simone-uu8ne
    @Simone-uu8ne 1 year ago +127

    all things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but that made many other companies take a hard look at their own fault management. For example, my uni professor told us about this incident and we could comprehend the importance of backups and testing

    • @gblargg
      @gblargg 1 year ago +16

      I think the biggest issue was losing 6 hours of commits and comments.

    • @kookie-py
      @kookie-py 1 year ago +8

      @@gblargg people will cope

    • @gblargg
      @gblargg 1 year ago +24

      @@kookie-py Agreed, virtually all of them will have the commits locally as well. Just noting that the data loss is a bigger deal than mere downtime.

    • @kookie-py
      @kookie-py 1 year ago

      @@gblargg right

    • @_Titanium_
      @_Titanium_ 1 year ago +6

      This is why programming in general is great, nobody dies if you fuck up. (Obvious exceptions, medical, aviation etc)

  • @TonytheCapeGuy
    @TonytheCapeGuy 1 year ago +44

    I can just imagine the relief that team felt when they found SOMETHING that they could use to restore files.

  • @AndreGreeff
    @AndreGreeff 11 months ago

    I must say, I heard many stories about this.. but that was a very nice summary of the nitty-gritty details, thank you. (:

  • @derpnerpwerp
    @derpnerpwerp 11 months ago +8

    This reminds me of all the times I have been in the wrong ssh session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is you start to just ignore them after a while. It's also kinda dangerous when stuff that is manual and potentially damaging becomes fairly routine

  • @streetchronicles5693
    @streetchronicles5693 1 year ago +64

    Yesterday I was added to a support team because we are getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing from this story.

  • @edc2186
    @edc2186 1 year ago +79

    As a dev for a large company who has been on a number of late night calls, I literally gasped at this. But good on the team to work through the issue, and good on management to keep these guys around

  • @glennog
    @glennog 8 months ago +1

    Been there, done that, only in my case it was taking down the main network interface on a Solaris YP server used by an entire site of Solaris servers and workstations. The entire site ground to a halt in an instant. I didn't have access to the DC to get local access, either, so I had to make a red-faced confession to my boss for him to make the 2 mile drive to the secure DC.

  • @spacemanmat
    @spacemanmat 9 months ago +7

    Two things to remember:
    1. Always back up before you start a change, even if you have an automated backup system.
    2. Audit your recovery procedures.

  • @hummel6364
    @hummel6364 1 year ago +32

    In my vocational school I had a subject simply called "Databases", and our teacher there once told us a story about how one of his co-workers lost his job.
    In essence he did everything right: he created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script. When the main database failed a year or two later, they tried restoring from this backup only to find that there was none.
    It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.

    • @thewhitefalcon8539
      @thewhitefalcon8539 1 year ago +2

      I guess that could have been solved by testing the backups. Install the database software on a spare server or just your own workstation, and then restore the backup onto it

    • @hummel6364
      @hummel6364 1 year ago +6

      @@thewhitefalcon8539 well the backup ran properly for years, he just never thought that the UUID might change

    • @thewhitefalcon8539
      @thewhitefalcon8539 1 year ago +1

      @@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.

    • @yerpderp6800
      @yerpderp6800 1 year ago +6

      @@hummel6364 yeah he kind of deserves to be fired...feel like it should be common sense the hdd could fail, no good excuse to not expect that. You should almost never hardcode stuff, not sure why they thought it was okay to hardcode the uuid of a drive that would one day fail.

    • @hummel6364
      @hummel6364 1 year ago +1

      @@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.
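
One fail-safe that would have surfaced the problem in this story is to make the backup script refuse to continue when its target device is missing, rather than quietly doing nothing. A minimal sketch, assuming a Linux system where udev exposes devices under `/dev/disk/by-uuid`; the exception name and the `by_uuid_dir` parameter are made up for illustration and testing.

```python
from pathlib import Path


class BackupTargetMissing(RuntimeError):
    """Raised when the expected backup device is not present."""


def require_device(uuid: str, by_uuid_dir: Path = Path("/dev/disk/by-uuid")) -> Path:
    """Return the device path for uuid, or raise instead of silently skipping."""
    device = by_uuid_dir / uuid
    if not device.exists():
        raise BackupTargetMissing(
            f"no block device with UUID {uuid}; refusing to report success"
        )
    return device
```

A hard failure here still needs someone or something to notice it, which is where monitoring comes in.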

  • @daryl9915
    @daryl9915 1 year ago +35

    A couple of jobs ago, I had a colleague who managed to do worse than this.
    I think they were playing about with learning Terraform and managed to delete the entire account. Prod servers, databases, the dev/qa servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap

    • @lashlarue7924
      @lashlarue7924 7 months ago

      😱😱😱😱😱😱😱😱😱😱😱😱😱😱

  • @hasanpatel9029
    @hasanpatel9029 9 months ago

    The GG part In what to do if you delete your production DB always gets me, nice content.

  • @cc3
    @cc3 8 months ago +2

    I deleted the main site from our backend in my first month as a full stack developer. Fortunately I figured out how to rebuild the Apache server and clone the repository, but I definitely worked well past my hours that day and the stress was crazy

  • @markh3684
    @markh3684 1 year ago +23

    Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts. The backup process familiarity, backups not going to S3, Postgres version mismatches, insufficient WAL space, alert email failures, diligence on abuse deletes... These were all things that could have been and should have been caught way before the actual incident.

  • @bennythetiger6052
    @bennythetiger6052 1 year ago +345

    This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! This is very insightful as to what things can take place in these types of environments, as well as what measures can prevent major failures like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂

    • @blazi_0
      @blazi_0 1 year ago +7

      Bro, let's also not forget the damage that had already been done: the server was down for like 18 hours, and thousands of PRs, comments, issues and projects were all deleted permanently. This should be a bigger deal

    • @mrsharpie7899
      @mrsharpie7899 11 months ago

      I'd love to see the USCSB do an animation on this incident lmao

  • @jamesrosemary2932
    @jamesrosemary2932 11 months ago +4

    A long time ago we implemented a policy that absolutely nobody operates the production console alone.
    There always has to be someone else looking over your shoulder to point out oversights like the one in the video.

  • @shashankh7768
    @shashankh7768 7 months ago

    The story telling/edit is unmatched. Hands down best docu/short movie on youtube😂!

  • @CharlesChacon
    @CharlesChacon 1 year ago +19

    I’m pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I’ve ever followed and ended up motivating me to learn a ton about databases, cloud practices, devops, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for transparency

  • @Socsob
    @Socsob 1 year ago +8

    This is so cool to know the inner workings of a team like this

  • @SurfsUpSeth
    @SurfsUpSeth 8 months ago

    You can’t prevent mistakes but you can sure prepare for them!

  • @dany2685
    @dany2685 7 months ago

    I am working as a bank programmer and we have two important servers in production. One is in sync with the main one. If the main one is broken or something does not work properly we change it to the other one. Also they have many ways to backup like multiple storage units and maximum security of who has access to data. We had some issues on testing platform where a guy accidentally deleted the database but we had backup in less than 30 min made by our sys admin guy. We did not ever have any tragic issue on production.

  • @malborboss
    @malborboss 1 year ago +3

    We need more videos like this one. This was amazingly interesting

  • @iTsBadboyJay
    @iTsBadboyJay 1 year ago +11

    absolute nightmare. loved every min of this

  • @beatrizdominguez9149
    @beatrizdominguez9149 9 months ago

    This is really good! What editing software do you use?

  • @eswarnichtsmehrfrei
    @eswarnichtsmehrfrei 2 months ago +3

    All my backup jobs have to report to an uptime service.
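
Reporting to an uptime service is often done as a "dead man's switch": the job checks in only after it succeeds, and the service alerts when check-ins stop. The sketch below is an assumption-laden illustration. `HEARTBEAT_URL` is a placeholder (a real monitoring service issues its own URL), and `run_with_heartbeat` is a hypothetical wrapper, not any particular vendor's API.

```python
import urllib.request

# Hypothetical check-in URL; a real monitoring service issues its own.
HEARTBEAT_URL = "https://uptime.example.com/ping/nightly-db-backup"


def ping(url: str) -> None:
    """Send a GET request to the monitoring endpoint."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def run_with_heartbeat(job, notify=ping, url=HEARTBEAT_URL):
    """Run job() and check in only if it returns without raising."""
    result = job()  # any exception propagates, so no check-in is sent
    notify(url)
    return result
```

The design choice is that silence means failure: a crashed job, a dead cron daemon and a wiped server all look the same to the monitor, and all of them raise an alert.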

  • @eboubaker3722
    @eboubaker3722 1 year ago +25

    Wow the amount of stuff i learned here is huge, please make more reviews like these i subscribed and turned on notifications please don't disappoint me

  • @bmo3778
    @bmo3778 1 year ago +16

    I barely understand anything here, but all I can say is massive thanks to the team who have worked hard, advancing our computer tech to the current state we have!

  • @thetophattedanon
    @thetophattedanon 9 months ago +1

    I do not know how I got here, I don't get most of the video, But I am absolutely lovin' It as It's bloody entertaining.

  • @atribhattacharyya2631
    @atribhattacharyya2631 9 months ago

    The most horrific event of a developer..

  • @jsvanderburgh
    @jsvanderburgh 1 year ago +11

    Great video, nice editing, and just very entertaining overall!

  • @justdoityourself7134
    @justdoityourself7134 1 year ago +67

    Having a live screenshare with team members watching might seem a little wasteful. But for critical procedures like this, it is well worth the added cost.

    • @Navak_
      @Navak_ 9 months ago

      Most people don't see the importance of such extreme level of caution until it's too late. It's like handling a firearm.

  • @Ziggyzaggy300
    @Ziggyzaggy300 8 months ago +3

    Me who understands less than 50% of the words: hmnn yes interesting wow

  • @MPSmaruj
    @MPSmaruj 9 months ago +2

    Also one thing I used to scoff at when I was a newbie was assigning names as aliases to your servers. Like: actual words instead of numbers. It seemed a little asinine to me at first but even in this scenario: it's much easier to confuse db1 and db2 than, eg.: amelie and betrand.

  • @hououinkyouma2426
    @hououinkyouma2426 1 year ago +24

    Can't wait for part 2

    • @kevinfaang
      @kevinfaang 1 year ago +14

      Could just be missing the sarcasm but if you're referring to the ending Google bard isn't exactly the best at being factually accurate...

    • @Xanhast
      @Xanhast 1 year ago +1

      @kevinfaang maybe he's being ominous :o

  • @blank001
    @blank001 1 year ago +6

    One strict rule I always follow when connecting to prod servers via ssh or a DB UI client (pgAdmin) is to use a different background color for each environment:
    Red for prod
    Green for staging
    Black for test and local
    Plus double-checking every command.
    You can never be sure enough.
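
    The colour scheme above can be sketched in Python as a small helper that maps a hostname to an ANSI background colour and asks for the hostname to be retyped before anything destructive runs. This is only an illustration under assumptions: the hostname convention (an environment label such as `prod` appearing as one dot-separated part) and all function names are hypothetical, not something the commenter described.

```python
# ANSI background colours, following the scheme in the comment above.
ENV_BACKGROUND = {
    "prod": "\033[41m",     # red
    "staging": "\033[42m",  # green
    "test": "\033[40m",     # black
    "local": "\033[40m",    # black
}
RESET = "\033[0m"


def environment_of(hostname):
    """Guess the environment from a hostname such as 'db1.prod.example.com'.

    Assumed convention: the environment appears as one dot-separated label.
    Unknown hosts are treated as prod, the most cautious choice.
    """
    labels = hostname.lower().split(".")
    for env in ("prod", "staging", "test", "local"):
        if env in labels:
            return env
    return "prod"


def banner(hostname):
    """Return a coloured banner string to print before opening a session."""
    env = environment_of(hostname)
    return f"{ENV_BACKGROUND[env]} {env.upper()} {hostname} {RESET}"


def confirm_destructive(hostname, typed):
    """Allow a destructive command only if the operator retyped the exact hostname."""
    return typed.strip() == hostname
```

    Treating unrecognised hosts as prod is a deliberate design choice in this sketch: a mislabelled machine then gets the red banner rather than a reassuring one.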

  • @_tsu_
    @_tsu_ 1 year ago

    This is a masterpiece of tech YouTube. Fun to watch but also educational.

  • @jim2lane
    @jim2lane 9 months ago +3

    OMG, we have all been there haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. And then I realized that I had done the exact opposite, and just deleted three complete days of coding - and would now have to recreate them from scratch 😒😭

  • @Rametesaima
    @Rametesaima 1 year ago +8

    I've always been paranoid when working in Prod. Always make it a point to have at least the Ops Lead on a screen-sharing session where I show what I'm doing while requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy ohh boy does it make me feel safer.

    • @isaiahsmith6016
      @isaiahsmith6016 11 months ago

      It may be slow but look at it this way. You're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.

  • @torreip3012
    @torreip3012 1 year ago +8

    Thanks for the content, I really love tech horror stories! Hope to see more (the last 2 were really good too ^^)

  • @Enteropy23
    @Enteropy23 9 months ago +3

    "my bad guys, I misclicked"

  • @robbybankston4238
    @robbybankston4238 1 month ago +2

    I'm glad they didn't fire the engineer. It goes to show the difference in mindset at organizations that treat an incident as a learning experience (albeit an expensive one). Many corporations would have fired the engineer without hesitation as soon as the issue was resolved. Thanks to those orgs that care about their team members and are more concerned with lessons learned.

  • @Blackmetalstudios
    @Blackmetalstudios 1 year ago +3

    Absolutely goated move by team member 1 backing up the entire db1 database 6 hours prior

  • @seedmole
    @seedmole 1 year ago +5

    A nightmare indeed. I've been working on a DAW replacement in the Pure Data environment, and while it has been amazing, my file management system for it has been a bit lacking. When creating a new audio file, I do not have a method to check if the filename is already in use, and the numeric incrementation I use doesn't cause a prompt to save the overall patch, so in more tunnelvisiony moments, when a perfect storm of conditions meets, I have accidentally deleted prior recordings. It's devastating, especially considering I don't even have a way to be certain about what recording was deleted. I'm starting to think my solution is to start inserting a random string at the end of each filename. I used track length at one point before, but it wasn't reliable because the default settings will result in the same length.
    Anyway, good lesson in how precious we must be when it comes to performing destructive operations on data.

    • @jowbloe4700
      @jowbloe4700 1 year ago +3

      Why not use a DateTime stamp in the filename?
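
    The timestamp suggestion in this thread can be sketched as follows. This is a Python illustration of the idea, not a Pure Data patch, and the function names and the `.wav` default are assumptions. It combines a timestamped filename with exclusive-create mode, so a name collision raises an error instead of silently replacing an earlier recording.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(directory, stem, suffix=".wav", now=None):
    """Build a filename like 'take_20240516-123045-123456.wav'.

    Microseconds are included so that two files created within the same
    second still get different names.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S-%f")
    return Path(directory) / f"{stem}_{stamp}{suffix}"


def write_new_file(path, data):
    """Write bytes to a new file, refusing to overwrite an existing one.

    Mode 'xb' makes open() raise FileExistsError if the path already exists.
    """
    with open(path, "xb") as handle:
        handle.write(data)
```

    The exclusive-create check is the part that actually prevents data loss. The timestamp mainly makes collisions unlikely and keeps recordings sortable by creation time.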

  • @Factory400
    @Factory400 8 months ago

    The next day... the backup plan was audited and improved.

  • @stevencoetzee1597
    @stevencoetzee1597 11 months ago +1

    By far the most suspense I have felt during a dev story

  • @jonix24mejor
    @jonix24mejor 1 year ago +8

    And yes... this is exactly the reason why I didn't study programming / engineering in college, and instead opted for graphic design / communication.
    If I write or design something wrong and it gets published, well, at worst the publication stays up as a reminder of my mistake. In programming, all it takes is one slip of the finger, misremembering something, or a simple distraction, and you can absolutely wipe an entire company's network infrastructure out of existence.

  • @thedanyesful
    @thedanyesful 1 year ago +1

    Great video! Very entertaining and a great breakdown.

  • @hereandnow3156
    @hereandnow3156 9 months ago

    "They assumed the other backup procedures were sufficient."
    Your other backup procedures are _never_ sufficient.

  • @Dobaspl
    @Dobaspl 3 months ago +1

    Even before I started working in one company, one IT specialist deleted the directories of the new CC-supporting system. This was shortly after its implementation into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically stayed at work, recreating the environment almost from scratch. :D

  • @MrB10N1CLE
    @MrB10N1CLE 1 year ago +4

    3:52 it was at this moment when the viewers collectively scream, transcending space-time and raising a cosmic choir of dread and regret.