Resiliency vs. Redundancy in the Data Center - 1177

  • Published 4 Oct 2024
  • You do not really want redundancy, you want resiliency; but you use redundancy to build resiliency.
    All servers, storage, data centers, power, network, cooling and so on need to be resilient, so things can keep going if something goes wrong.
    Check out 𝐌𝐲 𝐏𝐥𝐚𝐲𝐇𝐨𝐮𝐬𝐞 𝐥𝐢𝐭𝐭𝐥𝐞 𝐬𝐡𝐨𝐩 : www.myplayhous...
    " Be aware that the shipping prices are worst case until it knows where to ship to!! "
    Twitter : / mortenhjorth
    Facebook : / mortensplayhouse
    [Affiliate Links]
    Bargain Hardware : www.bargainhar...
    Using the 𝘾𝙤𝙪𝙥𝙤𝙣 𝘾𝙤𝙙𝙚 : 𝙢𝙮𝙥𝙡𝙖𝙮𝙝𝙤𝙪𝙨𝙚 at checkout will give you a 5% discount.
    ___________________________________________________________________________________________________
    / myplayhouse
    For $3 a month, you get an extra weekly "What's UP" update video, just for my Patrons. The support I receive on Patreon is all used on stuff to make interesting videos on YouTube.
    My PlayHouse is a channel where I show what I am working on. I have this house, it is 168 square meters / 1808.3 ft², and it is full of half-finished projects.
    I love working with heating, insulation, servers, computers, data centers, green power, alternative energy, solar, wind and more. It all costs, but I'm trying to get the most out of my money and my time.

COMMENTS • 85

  • @jasonricci6094 · 2 years ago · +6

    Great explanation! One of the first questions I ask leadership when they want a new system is what the cost to the business will be if the system goes down, not just financially but in reputation as well. We then build the plan for how redundant we make the systems based on that cost. One big, often overlooked single point of failure I see is the path both power and network take coming into the facility: not only the physical path the cables take, but where they are fed from, especially when it comes to network. Different providers that do not rely on the same backhaul transport are tough to find sometimes, but they are key to resiliency. (A rough cost sketch follows this thread.)

    • @MyPlayHouse · 2 years ago · +2

      Yes, downtime can be extremely expensive.. it's not always possible to get two separate power or network providers here.

    • @guywhoknows · 2 years ago

      GSM, DSL and fibre: triple-redundant routes. This is why load monitoring and balancing is essential.

    • @skorpionas1 · 2 years ago

      @@MyPlayHouse
      Isn't a data center like Azure simpler?
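
To make the first comment in this thread concrete, here is a small, purely illustrative back-of-the-envelope sketch of weighing expected downtime cost against the cost of extra redundancy; every number is invented.

```python
# Illustrative only: compare expected yearly downtime loss with the yearly cost
# of a redundant path. All numbers are made up; plug in your own estimates.

def expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from downtime (financial only; reputation damage is extra)."""
    return outages_per_year * hours_per_outage * cost_per_hour

single_path = expected_downtime_cost(outages_per_year=2,   hours_per_outage=6, cost_per_hour=5_000)
dual_path   = expected_downtime_cost(outages_per_year=0.1, hours_per_outage=6, cost_per_hour=5_000)
redundancy_cost_per_year = 20_000   # second feed, second provider, extra hardware...

print(f"single path: expected loss {single_path:,.0f} per year")
print(f"dual path:   expected loss {dual_path:,.0f} per year, plus {redundancy_cost_per_year:,} for redundancy")
# Redundancy pays for itself when (single_path - dual_path) > redundancy_cost_per_year.
```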

  • @SirHackaL0t. · 2 years ago · +3

    I worked for a company that had 4 electric risers going up the building. One went offline, which was fine, but certain Sun servers were single-PSU and they were plugged into another riser. The second riser had had its fuse downrated to 80%, and when that one popped we lost most of the building. Only the data centre for Europe.
    We then moved some kit to a third-party data centre whilst we uprated the power in our building. Whilst in the third-party data centre there was a power cut. No problem, as they had battery backups and 3 generators. 1 generator was down for maintenance, which was fine. A second generator failed to open its radiator louvres, which wasn't noticed, and the generator shut down as it overheated.
    Needless to say the whole data centre went dark.
    Doesn't matter what you try to do, it'll still not be enough. :)

    • @MyPlayHouse · 2 years ago · +3

      We have a generator with an 800-litre fuel tank,, when that was getting low, fuel was to be pumped from a larger tank,,,, guess whether the pump was on mains power :-)
      (we caught it before we needed it)

    • @SirHackaL0t. · 2 years ago · +3

      @@MyPlayHouse I used to do IT for a chap who did electrics for companies in the UK. He had a customer who had a header tank for their generator at the top of the building with the main tank underground. When some battery cells died and they had a power cut they found out that it then set off the fire alarm.
      Guess what the fire alarm did. Yup, it emptied the header tank back to the underground tank.
      They had to find the faulty cells, bypass them, then carry buckets of diesel up the building to the header tank so they could get the power back on.

  • @kganesan9243 · 1 year ago · +1

    Excellent video Morten, thanks a lot for making the hard topic very simple to understand. Keep making such inspiring videos 👍

    • @MyPlayHouse · 1 year ago

      Hi K Ganesan
      Thank You very much! glad you liked the video :-)
      Thank you for watching! :-)

  • @gillekes · 2 years ago · +3

    We actually had an internet problem a few weeks ago here in Portugal at our office. The fibre cable was cut and it took the ISP 5 days to fix it. We had no backup plan for internet, and as we run our servers locally there was no remote access for the remote workers… We now have an LTE connection as a backup.

    • @MyPlayHouse · 2 years ago · +2

      The need for internet,, is very clear,, and a company can die after just a few days without it.

  • @electrocyper · 2 years ago · +3

    Great explanation, Morten! I really like the excavator drawing 😁. Thinking of a t-shirt with the My Playhouse logo and a hand-drawn excavator on it 😁 Greetings!

    • @MyPlayHouse · 2 years ago · +2

      Hi Pavel Atanasov
      Thank You very much! Excavator when you need to find fiber or power cables! - glad you liked the video :-)
      Thank you for watching! :-)

  • @catmantech · 2 years ago · +2

    Great video Morten, very informative.
    From the users' point of view, this is the behind-the-scenes stuff they don't realise exists; for us tech guys, it's the stuff we are all too well aware of.

    • @MyPlayHouse · 2 years ago · +1

      Yes,, this is way overkill for home users :-)

  • @JohnnieHougaardNielsen · 2 years ago · +2

    Another type of resiliency is to guard against the same security or maintenance flaws hitting everything at the same time. Taken to the extreme, this could mean having different software setups, on the assumption that a zero-day would only allow the bad guys to take out one part. And of course avoiding the errors that hit even the major cloud vendors when network maintenance causes outages....

    • @MyPlayHouse · 2 years ago

      Hi Johnnie Hougaard Nielsen
      Thank You very much! glad you liked the video :-)
      Thank you for watching! :-)

  • @Eo_Tunun · 2 years ago · +2

    That firewall in your face you mentioned only becomes a firewall when you snort cayenne pepper! ^^)
    My home solution is very simple, actually: a file server with 2x2TB in a RAID 1, and a 2TB USB drive as safe long-term storage that only gets plugged in every once in a while, when there actually is a change in what needs to be stored. The USB drive is secured in cotton balls and tinfoil with barbed wire against stupid people, in a safe hideaway in a deep dungeon.

    • @MyPlayHouse · 2 years ago · +1

      This was not really for our home use,, more for business.

    • @Eo_Tunun · 2 years ago

      @@MyPlayHouse Ooooh, err… I guess I over-interpreted that "play" part, right? 😇

  • @rugglez · 2 years ago · +1

    Some really useful info Morten. Thank you.

    • @MyPlayHouse · 2 years ago

      Hi Rugglez
      Thank You very much! glad you liked the video :-)
      Thank you for watching! :-)

  • @sanjaybandi9565 · 2 years ago · +1

    Morten, you really use redundancy to share knowledge with us... keep going with better stuff... THX

    • @MyPlayHouse · 2 years ago

      Hi SANJAY BANDI
      Thank You very much! glad you liked the video :-)
      Thank you for watching! :-)

  • @relaxingnature2617 · 2 years ago · +1

    Great opening music

    • @MyPlayHouse · 2 years ago

      Yarh,, I had the band sitting around anyway,, :-) same music for now 1177 videos..

  • @Pit_stains · 2 years ago · +1

    Not 2 different storage systems. Typically a single storage system, but the storage system has multiple nodes so it can fail over.

    • @MyPlayHouse · 2 years ago · +1

      Yes,, that is the normal thing to do,, but with two data centers come two storage systems.

    • @Pit_stains · 2 years ago

      @@MyPlayHouse True. I made this comment at 17:00 when a 2nd datacenter wasn't brought up yet.

  • @MeowHomelab · 1 year ago · +1

    My homelab is resilient too, basically just like your datacenter at a very small scale. It has two PDUs, then two UPSes, then two electrical panels (fuse box, safety relays, etc.), then two transformers, then two different electrical companies, and of course two different high-voltage substations. It also has two network cards, then two switches, then two main switches, then two ISPs plus one LTE mobile operator as a backup. I use Proxmox HA for the servers, only 20 servers for now (10 are used normally, the other 10 are for redundancy through Proxmox HA). My homelab is inside my playhouse, an ex-factory, so I have a very good electrical supply: it is fed by two generators, which are not really redundant (one generator can only support the homelab load, not the rest of the playhouse stuff), but I can run each server from both generators in balance (basically in parallel, with two PSUs supplied from the two generators in balanced mode). It is also supplied by solar panels. The two generators and the homelab are on the second floor, so I can still run the homelab when there is flooding, and the generators are supplied by a fuel tank (good for about three days, excluding 4 hours per day covered by the solar panels). The second floor is also protected by a suction pump (to keep water out during flooding), and I have two insurance companies to back it all up. It is very resilient, but it is still just one playhouse with one homelab inside, not two homelabs 😅, and it is still not resilient against earthquakes, arson, nuclear war, etc. 🤣

  • @VioletDragonsProjects · 2 years ago · +2

    Better to have 2 UPSes so you can change batteries on the fly, although a decent UPS allows you to change batteries while it's on; with two you can also do power load balancing so one UPS is not fully loaded. I need your advice Morten, maybe you could contact me: I'm looking at installing solar panels so I don't have to pay the full price from the grid, so maybe you could point me in the right direction 😁

    • @MyPlayHouse · 2 years ago

      If you have two separate power systems,, it is difficult to suddenly power them both from one UPS and generator.

  • @lpseem3770 · 2 years ago · +2

    Remember to hire at least two electricians and sysadmins.

    • @MyPlayHouse · 2 years ago

      Hi lp seem
      Thank You very much! glad you liked the video :-)
      Thank you for watching! :-)

    • @guywhoknows · 2 years ago

      Not if they're friends, or related, or work in the same office... Cough cough.

  • @HomelabExtreme · 2 years ago · +2

    Personally, I have 3 copies of data, 2 UPSes in each rack, with 2 PDUs, 1-2 ATSes, 2 ToR switches, and 2-4 servers in a cluster.
    I don't see the value in doubling PCIe cards like ethernet and storage, because they fail so rarely that they are pretty much on a level with the motherboard in terms of risk.
    The server PSUs also fail rarely, but I use both, because otherwise a UPS test, which is fairly frequent, could take out a whole string of servers - and yes, you could rely on an ATS, but those are SPOFs too.
    My servers run on A+B power directly from the PDUs/UPSes, but the storage system, which is more critical, runs on 2 ATSes, which means that even if a UPS fails, all storage components still receive power on both power supplies.
    It is wired such that the normal state of ATS A is connected to power A, and the normal state of ATS B is connected to power B, so by default all storage systems run on A+B power (also in case of ATS failure), but in case of power/UPS failure, everything runs on all-A or all-B power. (There is a small sketch of this after the thread.)

    • @MyPlayHouse · 2 years ago

      There are limits to what you do in your home setup,, and every extra watt is like 23kr a year :-/

    • @HomelabExtreme · 2 years ago

      @@MyPlayHouse but that IS my home setup 😁
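
As a purely illustrative sketch of the A/B wiring described at the top of this thread (not the commenter's actual setup): each ATS prefers its own feed and only switches when that feed is lost, so the storage normally sees A+B power, and only a double failure takes a supply away.

```python
# Sketch of the A+B power scheme described above (illustrative, not a real controller).

def ats_output(preferred_feed, other_feed, feeds_up):
    """An ATS passes its preferred feed through and only switches if that feed is down."""
    if feeds_up[preferred_feed]:
        return preferred_feed
    if feeds_up[other_feed]:
        return other_feed
    return None  # both feeds dead

def storage_supplies(feeds_up):
    """Storage PSU 1 hangs off ATS A (prefers feed A), PSU 2 off ATS B (prefers feed B)."""
    return {
        "psu1 (via ATS A)": ats_output("A", "B", feeds_up),
        "psu2 (via ATS B)": ats_output("B", "A", feeds_up),
    }

print(storage_supplies({"A": True,  "B": True}))   # normal: PSU1 on A, PSU2 on B
print(storage_supplies({"A": False, "B": True}))   # UPS A fails: both PSUs still fed, now from B
print(storage_supplies({"A": False, "B": False}))  # only a double failure goes dark
```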

  • @ahmedifhaam7266 · 1 year ago · +1

    I just wanted to point out that, both nose and mouth work together as a resilient system. XD

    • @MyPlayHouse · 1 year ago

      Well, they do overlap for some functions :-)

  • @elminster8149 · 2 years ago · +3

    You only really need one UPS; the other route should bypass it in case of failure. The likelihood of losing utility power and your UPS at the same time is very low.

    • @MyPlayHouse · 2 years ago · +1

      I know,, But I guess it becomes difficult to power two sides with one generator..

    • @morosis82 · 2 years ago · +1

      As long as the UPS is regularly tested. I worked for a company with a datacentre on site that had the UPS fail the moment it took over because the batteries were bad. They had, as you might say, all the gear and no idea. Had a couple days off because the client servers were the first ones to be brought back online.
      This is, I think, where cloud can be a great asset to an org, in that you can host your redundant solution there, with db and app servers ready to take over in case of failure. Could be a lot cheaper than creating the fully redundant solution yourself.

    • @guywhoknows · 2 years ago

      @@morosis82 normal then.
      The cloud doesn't work, oddly enough.
      It's mostly due to how it's used.
      Before the cloud it was SaaS.
      The problem for us old folks was that the dial-up info exchange was never live enough. And there would be transmission loss.
      I've only had minor issues; a cleaner wiping a server is the best guess. But it was down for a short time.
      Lead-acid UPSes are an old thing which should be consigned to history. LFP systems have a much better life expectancy (16 years) as opposed to SLA's 3 years.
      But load testing is important.

    • @morosis82 · 2 years ago

      @@guywhoknows "cloud doesn't work"
      Uh, not sure what that means. I lead a team maintaining a global customer info system with millions of profiles that is 100% cloud based providing that data to many other parts of the business. It works because while you can employ the people and buy the kit, our business is not datacentre focused, so we pay someone else to do that for us.
      I actually think the best way is a hybrid setup, with private/public cloud. The trick though is making sure your private cloud is compatible with the way modern applications are designed for the cloud.

    • @guywhoknows · 2 years ago · +1

      @@morosis82 hybrid would work.
      But kill the internet connection and there is no access to the cloud; that's why they didn't work.. also some businesses have a considerably slow uplink. Depending on the loads it can cause a lot of problems.

  • @liveyourbestlife1513 · 2 years ago · +1

    LOCKSS: Lots of Copies Keeps Stuff Safe.
    Ceph: resilient and extensible storage system
    If you need to redesign the world to get a little resiliency, consider changing your software to accommodate the possibility that things fail. You would save a lot of money, effort, and complexity if your resiliency model started with software rather than leaving the resiliency only to hardware-based solutions. Since the core software is what you really care about, why not ask the software to exhibit better behavior in the case of hardware failures? (A small sketch of this idea follows the thread.)

    • @MyPlayHouse · 2 years ago · +1

      The systems and the software running on them,, clustering and safety,, is a whole other chapter :-/
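
In the spirit of the "start resiliency in software" comment above, here is a minimal sketch of a client that tolerates a dead storage node by retrying against replicas; the node names and the fetch function are hypothetical stand-ins, not a real storage API.

```python
# Minimal software-level resiliency sketch: try replicas in turn instead of
# assuming the hardware never fails. Endpoint names are hypothetical.
import random

REPLICAS = ["storage-a.example", "storage-b.example", "storage-c.example"]

def fetch_from(node, key):
    """Stand-in for a real read; randomly fails to simulate an unreachable node."""
    if random.random() < 0.3:
        raise ConnectionError(f"{node} unreachable")
    return f"value-of-{key} from {node}"

def resilient_read(key):
    """Return the first successful read; only give up if every replica fails."""
    errors = []
    for node in REPLICAS:
        try:
            return fetch_from(node, key)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas failed: " + "; ".join(errors))

try:
    print(resilient_read("customer:42"))
except RuntimeError as exc:
    print("read failed everywhere:", exc)
```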

  • @guywhoknows · 2 years ago · +1

    I was hanging on in there.... Sorry, your solution failed.
    And it went over budget.
    Here is why.
    Back before the web was a thing, we met and talked about redundancy and resilience.
    So there are three areas:
    Building. Data. Production.
    Most of what you have said is right, including your cross-connected switches. But you missed a load balancer and monitor just after the main switch.
    What happens is that your data flows from a to z.
    The paths (network) can route via h, e, l and back. (Do you get that joke?)
    The monitor detects and monitors h, e, l as well as a and z.
    It routes; if a route is broken it changes the route. This checks all the networks and points.
    This is how I warned an ISP before its failure; it didn't listen and it cost itself a few million.
    The storage and the backup, which are two different things holding the same data, must replicate. This is done off the main service to lighten the main production load.
    So you get d1 and d1a.
    But it was/is common that you get a RAM or CPU fault which then causes d1 and d2 to copy corrupt data.
    So you have to roll back to the d1+d2 daily backup, or the bi-hourly one, until you find safe data.
    Then you have to pull d1, find the missing data and enter the data into d2.
    In the meantime d3, with its separate server and data source, takes over.
    When the fix is done you combine the data from d1, d2 and d3 to make the replacement single data set, and restart all systems and backups.
    Meanwhile the load balancing knows that d1+d2 and their server are offline and does not route traffic there, but to d3.
    Once back online most people will never notice, and the IT guys look like they've been on a three-week bender: messy hair, unshaven, smelly and in the same clothes.
    Anyhow. I thought with VMs the usual approach was to fire up another instance when the main server fails, and have high redundancy on the data/storage network?
    I've got grid, generator, solar and battery.
    I also have two networks + storage.
    I have spares ready, and VMs as well as physical hardware and servers.
    Why? I wrote the book.
    Zero data loss. And that's not a good thing, as an "update and upgrade" takes a long time and there is much of the same...
    (A small sketch of the monitor/routing idea follows this thread.)

    • @MyPlayHouse · 2 years ago · +2

      Hi Rory
      Please keep it a little shorter :-/
      I can't go into every detail,, and I also do not know all the details,, there was lots of stuff I did not mention,,, fire suppression, cabling, power strips,, etc.. Can't fix all the problems in one video :-)
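
A rough sketch of the "load balancer and monitor just after the main switch" idea from the long comment above: health-check every path and every data node, and route around whatever is down. The path names and the d1/d2/d3 nodes are placeholders, not a real topology.

```python
# Sketch of monitor-driven routing: prefer the primary path/node, fall back when a
# health check fails. Paths and nodes are placeholders for illustration only.

PATHS = {
    "a-z direct":  True,    # True = health check currently passing
    "a-h-e-l-z":   False,   # broken route; the monitor should avoid it
    "a-backup-z":  True,
}

DATA_NODES = {"d1": False, "d2": False, "d3": True}   # d1+d2 offline after corruption

def pick(candidates, healthy):
    """Return the first candidate whose health check passes, else None."""
    for name in candidates:
        if healthy.get(name):
            return name
    return None

route = pick(["a-z direct", "a-h-e-l-z", "a-backup-z"], PATHS)
node  = pick(["d1", "d2", "d3"], DATA_NODES)
print(f"traffic goes via '{route}' to node '{node}'")   # via 'a-z direct' to node 'd3'
```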

  • @DAVIDGREGORYKERR · 1 year ago · +1

    Are you saying that the data centre should move to a place north of Helsinki, Finland?

    • @MyPlayHouse · 1 year ago

      I do not remember saying so :-)

    • @DAVIDGREGORYKERR · 1 year ago

      You didn't, but maybe that's the way to go, as there is natural cooling.

  • @matthiaslange392 · 2 years ago · +1

    But what if a big asteroid hits the earth? There's no (geo-)redundancy for that.
    I like hosting my own problems myself and not somewhere in the cloud, because where clouds are, there's often stormy weather.
    Nothing helps with a worst-case scenario. But for the smaller bad scenarios I trust in local geo-redundancy and a good backup solution with offline and offsite backups. UPSes help with short power outages. And after an hour or so, the databases have to stop carefully and the servers have to shut down gracefully. Because it's nice to know that the servers are still running, but that doesn't help the employees: they have no UPSes on every workstation and on every switch. The printers also need power, and the lights in the office, and the electric doors. You can't fix every problem before it occurs. (A small shutdown sketch follows this thread.)

    • @morosis82 · 2 years ago

      If your redundant solution is the cloud, then your employees can continue to work from home when the main system dies.
      It can also be hosted in a physically separate geolocation, for that redundancy. What if, for example, a fire takes out the datacentre for an extended time?

    • @MyPlayHouse · 2 years ago · +1

      Big asteroid,,, we might have bigger problems than our data :-)
      No cloud is the "Clear Sky Strategy".

    • @matthiaslange392 · 2 years ago

      @@morosis82 like the OVH datacenter...

    • @matthiaslange392 · 2 years ago

      @@MyPlayHouse Cloud is a buzzword in a world of hiding other administrators' mistakes in fog.
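
On the point above about stopping databases carefully and shutting servers down gracefully once the UPS has carried them for a while, here is a minimal watchdog sketch. It assumes a NUT-managed UPS whose remaining runtime can be read with `upsc`; the UPS name, runtime threshold and service names are placeholders to adapt.

```python
# Sketch of a UPS watchdog: once battery runtime drops below a threshold, stop the
# database cleanly and power the host off. Assumes a NUT-managed UPS ("myups");
# the threshold and service names are placeholders, not a recommendation.
import subprocess, time

UPS = "myups@localhost"
MIN_RUNTIME_SECONDS = 600   # start shutting down with ~10 minutes of battery left

def battery_runtime():
    out = subprocess.run(["upsc", UPS, "battery.runtime"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

def on_battery():
    out = subprocess.run(["upsc", UPS, "ups.status"],
                         capture_output=True, text=True, check=True)
    return "OB" in out.stdout          # "OB" = on battery in NUT status flags

while True:
    if on_battery() and battery_runtime() < MIN_RUNTIME_SECONDS:
        subprocess.run(["systemctl", "stop", "postgresql"], check=False)  # stop the DB first
        subprocess.run(["systemctl", "poweroff"], check=False)            # then the host
        break
    time.sleep(30)
```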

  • @skorpionas1 · 2 years ago · +1

    How do you not use Azure? But I need geolocation backups and some servers?

    • @MyPlayHouse · 2 years ago · +1

      I am not an expert on Azure,, but I would assume that petabytes of storage/backup need to be paid for in Azure at some point.

  • @zx8401ztv · 2 years ago · +2

    Hmm complicated, yep there is always a power cut created by digging in the wrong place.
    I've been cut off twice because of the digger of death lol.
    Your drawings are funny but make the understanding of server hell more logical.
    Oh hell cling film on servers, what a pillock lol.

  • @GeoffSeeley · 2 years ago · +1

    TL;DW: Double everything at a minimum.

    • @MyPlayHouse · 2 years ago · +1

      That is extremely expensive.

  • @relaxingnature2617 · 2 years ago · +1

    I want to buy an Excavator now

    • @MyPlayHouse · 2 years ago

      Why,, have you lost your power or fiber cables :-)

  • @leadiususa7394 · 2 years ago · +1

    LOL You design for resiliency, which will lead you to redundancy.... /:> Been doing that kind of stuff for over 30 years now... lol

    • @MyPlayHouse · 2 years ago

      Yes,, but you do not really want redundancy,, you want resiliency. If the system never failed, you would not need redundancy.

  • @f1r3man1000 · 2 years ago · +1

    hi
    good explanation about redundancy (and ears xD)

    • @MyPlayHouse · 2 years ago

      Thank You - Glad you liked it! :-)

  • @relaxingnature2617 · 2 years ago · +1

    Many fish have 4 nostrils

    • @MyPlayHouse · 2 years ago

      I will have to take your word for that :-)

  • @EmmanuelRAYMOND69 · 2 years ago · +1

    Very good video Morten ! (IT for morons.... ^^ )

    • @MyPlayHouse · 2 years ago · +1

      ahh you do not need to be a moron,, a pretty standard idiot will do :-)
      it is important,, sometimes redundancy can be made obsolete by logical thinking.

    • @EmmanuelRAYMOND69 · 2 years ago

      @@MyPlayHouse PS: you have a mail with an idea about your playhouse 🙂

  • @richardfuller7506 · 2 years ago · +1

    Well that should trigger the climate change/save the planet nutters 😀

    • @MyPlayHouse · 2 years ago

      I do not really follow :-/

    • @morosis82 · 2 years ago

      Not sure why. None of this is triggering to me.

  • @Ahmed-gm8li · 2 years ago · +2

    First

    • @MyPlayHouse · 2 years ago

      ╭━━━┳╮╱╱╱╱╱╱╱╱╱╭╮╭╮╭╮╭╮╱╱╱╱╱╱╭━━━╮
      ┃╭━╮┃┃╱╱╱╱╱╱╱╱╱┃┃┃┃┃┃┃┃╱╱╱╱╱╱┃╭━╮┃
      ┃┃╱┃┃╰━┳╮╭┳━━┳━╯┃┃┃┃┃┃┣━━┳━━╮┃┃╱┃┣━╮╭━━┳━━┳━━╮
      ┃╰━╯┃╭╮┃╰╯┃┃━┫╭╮┃┃╰╯╰╯┃╭╮┃━━┫┃┃╱┃┃╭╮┫╭━┫┃━┫━━┫
      ┃╭━╮┃┃┃┃┃┃┃┃━┫╰╯┃╰╮╭╮╭┫╭╮┣━━┃┃╰━╯┃┃┃┃╰━┫┃━╋━━┃
      ╰╯╱╰┻╯╰┻┻┻┻━━┻━━╯╱╰╯╰╯╰╯╰┻━━╯╰━━━┻╯╰┻━━┻━━┻━━╯
      ╭━━━╮╱╱╱╱╱╱╱╱╱╱╭━━━╮
      ┃╭━╮┃╱╱╱╱╱╱╱╱╱╱┃╭━╮┃
      ┃┃╱┃┣━━┳━━┳┳━╮╱┃┃╱┃┣╮╭╮╭┳━━┳━━┳━━┳╮╭┳━━╮
      ┃╰━╯┃╭╮┃╭╮┣┫╭╮╮┃╰━╯┃╰╯╰╯┃┃━┫━━┫╭╮┃╰╯┃┃━┫
      ┃╭━╮┃╰╯┃╭╮┃┃┃┃┃┃╭━╮┣╮╭╮╭┫┃━╋━━┃╰╯┃┃┃┃┃━┫
      ╰╯╱╰┻━╮┣╯╰┻┻╯╰╯╰╯╱╰╯╰╯╰╯╰━━┻━━┻━━┻┻┻┻━━╯
      ╱╱╱╱╭━╯┃
      ╱╱╱╱╰━━╯
      ╱╱╱╱╱╱╱╱╭━━━┳━━┳━━━┳━━━┳━━━━╮
      ╱╱╱╱╱╱╱╱┃╭━━┻┫┣┫╭━╮┃╭━╮┃╭╮╭╮┃
      ╱╱╱╱╱╱╱╱┃╰━━╮┃┃┃╰━╯┃╰━━╋╯┃┃╰╯
      ╭━━╮╭━━╮┃╭━━╯┃┃┃╭╮╭┻━━╮┃╱┃┃╱╱╭━━╮╭━━╮
      ╰━━╯╰━━╯┃┃╱╱╭┫┣┫┃┃╰┫╰━╯┃╱┃┃╱╱╰━━╯╰━━╯
      ╱╱╱╱╱╱╱╱╰╯╱╱╰━━┻╯╰━┻━━━╯╱╰╯