Highly Available Storage in Proxmox - Ceph Guide

  • Published 20 Dec 2024

COMMENTS • 108

  • @TechnoTim
    @TechnoTim 6 months ago +28

    Nice work! Thanks for making this easy! I need to try it out someday!

    • @Jims-Garage
      @Jims-Garage 6 months ago +3

      Thanks, Tim. I'm finding it particularly useful for K3S Servers and my firewall. Having the VMs failover automatically means there's no disruption to the cluster, no pulling pods etc.

  • @ewenchan1239
    @ewenchan1239 6 months ago +13

    1) You don't TECHNICALLY need a separate drive, you just need a separate PARTITION that Ceph can take over and have full control over.
    For example, in my OASLOA Mini PC (N95, 16 GB, 512 GB NVMe 2242 M.2 SSD), I partitioned the 512 GB NVMe SSD on each of my 3 nodes such that 128 GB is given for the Proxmox install, and the local-lvm, and then the rest is a separate partition that is given to Ceph to have dominion over.
    (My OASLOA Mini PC doesn't HAVE another slot where I can add additional storage devices, so I had to make do with what it has.)
    Once you have it partitioned like that, you can put the 3 nodes into a Proxmox HA cluster as usual, then use the Proxmox GUI to perform the initial Ceph install and set up your first monitor.
    2) re: iGPU passthrough
    This is why I DON'T recommend you install any VMs/CTs until the infrastructure has been set up to be what you want it to be.
    Set up the clustering and Ceph first, THEN set up your VMs/CTs.
    That way, the IOMMU groups will stabilise, such that it will be USABLE for what you're trying to do with it before deploying VMs/CTs/services.

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      Thanks for the tips, I'll consider that on the next deployment.

    • @ewenchan1239
      @ewenchan1239 6 months ago

      @Jims-Garage
      No problem.
      In my case, my storage depended on Ceph RBD/CephFS being up and running before I could store any VM/CT disks, so the clustering and Ceph had to be in place before I could do anything else.
      I know that you are storing the VM/CT disks on local storage rather than on the Ceph storage system, so you were able to start installing VMs/CTs before your Ceph system was set up.

    • @0xKruzr
      @0xKruzr 6 months ago

      yeah, but you don't want to write-exhaust the device if it's also booting the node.

    • @ewenchan1239
      @ewenchan1239 6 months ago

      @0xKruzr
      Depends on how much traffic you're putting on the system/cluster.
      In my case, my 3-node HA Proxmox cluster running Ceph exists only to serve Windows AD DC, DNS, and AdGuard Home.
      So none of that is intensive.
      The monthly backups are probably more write-intensive than anything else that happens during the rest of the month.
      (My N95 Mini PC, with only 16 GB of RAM, is too slow to really do much of anything else.)

    • @MrNGm
      @MrNGm 4 months ago

      In the constrained setup ewenchan1239 describes, using a separate partition on a single drive may be acceptable. Readers with other setups and/or reliability requirements should take into account that Ceph's reliability stems from (among other things) being able to spread data chunks across a larger number of OSDs (object storage daemons), such that the unavailability of 1, 2, or 10 OSDs doesn't impact the cluster. The latter depends on the configured rules regarding failure domains (further reading in the Ceph documentation: CRUSH maps). I would always advise reading a bit more on Ceph, its high-level architecture, and its failure modes.
      In the setup ewenchan1239 describes (3-replicated Ceph with Proxmox), the cluster will become unavailable if, for example, you're performing maintenance on 1 host and the disk of another one fails. Nevertheless, with a setup where VM data is accessible on all hypervisors through shared (network) storage, maintenance on a single hypervisor becomes much simpler.
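The failure scenario described above can be sketched with a little arithmetic. This is a minimal illustration, not Ceph tooling: it assumes a replicated pool with `size=3` and `min_size=2`, one OSD per host, and a host-level failure domain. The function names are hypothetical.

```python
def replicas_available(size: int, hosts_down: int) -> int:
    """Replicas still reachable, assuming exactly one replica per host."""
    return max(size - hosts_down, 0)


def pool_serves_io(size: int, min_size: int, hosts_down: int) -> bool:
    """I/O continues only while at least min_size replicas are up."""
    return replicas_available(size, hosts_down) >= min_size


if __name__ == "__main__":
    # Assumed pool settings: size=3, min_size=2.
    for down in range(3):
        state = "serving I/O" if pool_serves_io(3, 2, down) else "I/O blocked"
        print(f"{down} host(s) down: {state}")
```

Under these assumptions, one host in maintenance still leaves two replicas, but a second failure at the same time drops below `min_size` and I/O stops until a replica returns.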

  • @Layer2Clouds
    @Layer2Clouds 3 months ago +3

    Great video! We support hosted Proxmox clusters in the US and your guides are a go-to for our clients. Thank you, Jim.

    • @Jims-Garage
      @Jims-Garage 3 months ago

      @Layer2Clouds wow, thanks for sharing. That's great to hear.

  • @Chris-rm1pn
    @Chris-rm1pn 6 months ago +13

    MS-01s also have vPro, which supports Serial over LAN, so if you lock yourself out and the host has no GPU to use, you can use that to fix issues.

    • @Jims-Garage
      @Jims-Garage 6 months ago +6

      Thanks, I'm still to get that working. It's quite buggy from my limited trialling.

    • @Chris-rm1pn
      @Chris-rm1pn 6 months ago

      @Jims-Garage I recommend using MeshCentral and their guides if you haven't tried them. It's the best working solution I've found so far.

    • @Andy-fd5fg
      @Andy-fd5fg 6 months ago

      Long live the serial port!
      'Tis a shame they don't have a physical 9-pin serial connector.

    • @cschwartz
      @cschwartz 6 months ago

      @Jims-Garage agreed. The implementation is unfortunately lacking and quirky. I loaded the MeshCommander firmware on it to get web-based KVM without needing the MeshCommander software running on a client or as a hosted app. Even that had quirks, though it did add functionality. I ended up giving up, going LACP with the 2.5 ports, and reverting to a trusty Raritan IP KVM and a USB TTY console. I never could get the WoL aspect of it working, and the machine had to be in a booted state for it to function.

    • @cschwartz
      @cschwartz 6 months ago

      @Andy-fd5fg TTY to USB... No need for a DB9.

  • @muhammadabidsaleem7048
    @muhammadabidsaleem7048 6 months ago +4

    Thank you, Jim.
    Keep posting new videos, especially on SDN, please.

  • @davidbuchaca
    @davidbuchaca 4 months ago +2

    Very nice and detailed tutorial! abbadon, sanguinius, dorn, proposing names for the following nodes: lion, khan, corax

    • @Jims-Garage
      @Jims-Garage 4 months ago

      @davidbuchaca awesome! Sage choices too!

  • @DS-ou7xm
    @DS-ou7xm 6 months ago +1

    It's OK, mate, nothing wrong with having cold and flu symptoms... And awesome video, thanks.

  • @johnwalshaw
    @johnwalshaw 6 months ago +5

    I opted for 3x Nextorage NEM-PA2TB for the 2 GB DDR4 SDRAM. Very happy so far. It's great having a 3-node Ceph cluster.

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      That's great, sounds like a solid setup.

  • @nadtz
    @nadtz 6 months ago +2

    If I hadn't already built a new Proxmox host before the MS-01 came out, I might have gone this route (though with dedicated hardware for OPNsense). It's kind of crazy what Minisforum was able to pack into the MS-01 for the price, and that Ceph + Proxmox HA is available to home users for free.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      I agree. There are quirks but it's impressive.

    • @Carlos-Rodrigues
      @Carlos-Rodrigues 1 month ago

      I was waiting for this machine for so many years. Now I have four MS-01s: 3 for the cluster and another just for OPNsense. It's fast. It's stable. It's amazing. I just wonder if I can create a network with the MS-A1 through Thunderbolt so I can use it as a backup server with PBS.

  • @Insightfill
    @Insightfill 6 months ago +1

    Oh! I've been looking forward to this one!

  • @NickS34252
    @NickS34252 2 months ago +1

    Excellent video - I've been following along while tinkering with my own cluster. When it comes to fast nodes like the MS-01, it's a bit tricky to figure out what to put into ceph vs local storage given the performance limitations.

    • @Jims-Garage
      @Jims-Garage 2 months ago

      @NickS34252 thanks. I totally agree! I'm often scratching my head wondering which I should use.

  • @cschwartz
    @cschwartz 6 months ago +4

    If you are going to continue to do iGPU passthrough, have you thought of passing a TTY console via USB to serial? That way you can still connect if the hardware changes and PVE decides to move your NIC naming around.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      Good idea, I'll look into that. Thanks

  • @fbifido2
    @fbifido2 6 months ago +1

    @4:33 - the thunderbolt backhaul does not show up as a network bridge inside Proxmox ???

    • @Jims-Garage
      @Jims-Garage 6 months ago

      Eno5 and eno6 are the thunderbolt adapters. You could create a bridge if you wanted.

  • @rodneykahane4994
    @rodneykahane4994 3 months ago +1

    Not sure what the performance implications are, but the NVMe OSDs that were created were classified as ssd. In the advanced tab, you can manually select the drive type (hdd, ssd, or nvme).

    • @Jims-Garage
      @Jims-Garage 3 months ago +1

      @rodneykahane4994 thanks, let me check that!

  • @johnvandenhurk8650
    @johnvandenhurk8650 1 month ago +1

    First of all, I love your videos and have watched many of them.
    I have had a similar Ceph configuration on an MSI Cubi Proxmox cluster using Samsung 990 Pro NVMe SSDs. I was pretty happy with it until I noticed that, less than six months in, SMART monitoring is failing on two of the NVMe drives. Wearout for the three 990 Pros is 150%, 255%, and 6%. On the Proxmox forum I'm told that this is due to consumer-grade SSDs.
    The 255% is from the node that does the most I/O, but by no means are these heavily loaded systems.
    I wonder what your experience with wearout due to Ceph has been so far?

    • @Jims-Garage
      @Jims-Garage 1 month ago

      @johnvandenhurk8650 thanks. It does chew through consumer SSDs. Mine is on about 40%, I think it's good for about 4 years in total.

    • @johnvandenhurk8650
      @johnvandenhurk8650 1 month ago

      @Jims-Garage Thanks for the swift response!
      Perhaps it is only mine that have an issue, but mine are failing within a year. I will reach out to my vendor and create a ticket.
      I hope yours are better!
      How happy are you with your MS-01s? I'm considering an upgrade to an MS-01 (i9-12900) cluster for the SFP+.
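The wearout figures in this thread can be turned into a rough lifetime estimate with a linear extrapolation. This is only a sketch under a strong assumption: that wear accumulates at a constant rate, which real workloads rarely do. The function name is hypothetical. If the percentages come from the NVMe "Percentage Used" SMART field, note that the field can exceed 100 and saturates at 255, so a reading of 255% may simply be the maximum the field can report.

```python
def months_until_worn(wear_percent: float, months_in_service: float) -> float:
    """Linear estimate of months left until reported wear reaches 100%.

    Returns 0.0 when the drive already reports 100% or more.
    """
    if wear_percent <= 0 or months_in_service <= 0:
        raise ValueError("wear_percent and months_in_service must be positive")
    rate_per_month = wear_percent / months_in_service
    return max(0.0, (100.0 - wear_percent) / rate_per_month)


if __name__ == "__main__":
    # Figures quoted in the thread, assuming roughly six months in service.
    for wear in (150, 255, 6):
        print(f"{wear}% after 6 months -> {months_until_worn(wear, 6):.0f} months left")
```

The spread between 6% and 255% on identical drives is itself a hint that the load is unevenly distributed or that one of the readings is not trustworthy.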

  • @jeffersonsantos4603
    @jeffersonsantos4603 6 months ago +2

    Great job, man. Do you get full network performance for OPNsense via the VirtIO bridges?

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      Yeah, it maxes out 10Gb via iperf3 and full 2Gb up/down via speedtest.net

    • @romseaaccthree1448
      @romseaaccthree1448 6 months ago

      @Jims-Garage I'm assuming this iperf test was within the same VLAN. Would you also be able to test iperf results for inter-VLAN traffic?

  • @georgelza
    @georgelza 11 days ago

    ... have you done a video where you expose Ceph storage to a K8S cluster via a CSI driver? I have a Proxmox cluster with Ceph configured on it, running a K8S cluster, and would like to place my shared block storage for the EBS onto my Ceph pool.

  • @MarkConstable
    @MarkConstable 6 months ago +4

    I'm pretty sure that if you had used the GUI to set up Ceph you would have had fewer problems. I've done it a number of times and did not have to use the CLI at all.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      The CLI is necessary for the backhaul network. If it were simply the vmbr0 route then you're right, the GUI would be a good choice.

  • @sku2007
    @sku2007 6 months ago +3

    There's some PCIe passthrough translation in PVE 8, meaning you can set the hardware for each node and give it a "friendly name" for use in the VM (I don't know their wording right now, it's in Datacenter somewhere).

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      Thanks, wasn't aware of that. I'll take a look

    • @sku2007
      @sku2007 6 months ago +2

      it's called resource mappings, right below metric server

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      @sku2007 thanks, I took a look just now and the i226-v isn't on the node. Very odd!

    • @sku2007
      @sku2007 6 months ago +1

      @Jims-Garage very odd! Even when forwarded, the hardware gets listed by lspci in the host shell. With lspci -v you'll see a line reading "Kernel driver in use: vfio-pci".

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      @sku2007 I've tried all of those to no avail. I'm going to load a live Linux installation. If I don't see it I'll RMA.

  • @vonwerderc
    @vonwerderc 6 months ago +2

    Very interesting. I'm curious how HA with OPNsense would work. Wouldn't the WAN connection from your modem only go into one node? If that one dies, how would the other nodes be connected?

    • @Jims-Garage
      @Jims-Garage 6 months ago +3

      The WAN connection goes into a switch that splits the internet to the nodes via a VLAN. They are all members.

    • @headlibrarian1996
      @headlibrarian1996 6 months ago +1

      How does routing work then? Only one member of the cluster should get the traffic and the switch wouldn’t know which one that is.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      @headlibrarian1996 well, there's only one firewall at a time.

  • @zxxz-ob7ll
    @zxxz-ob7ll 4 months ago +1

    The grim reality of the universe requires a grim order. The machine requires perfection. Any error can become a catastrophe

  • @CastilloCrasher
    @CastilloCrasher 3 months ago +1

    How would one tap into this Ceph cluster from a Kubernetes cluster running on VMs in the HA Proxmox cluster?

    • @Jims-Garage
      @Jims-Garage 3 months ago

      @CastilloCrasher you'd simply select the storage volume on Ceph as the storage volume for the VM. You can see that in my OPNsense video afterwards, whereby OPNsense uses the Ceph storage to make it HA with a single node.

  • @majoryoshi
    @majoryoshi 6 months ago +2

    I could be mistaken on this, but with regard to your HA OPNsense, is there any reason why you couldn't run your WAN into a switch (even an unmanaged one would do the trick) and plug the WAN ports on your nodes into said switch? Since you're doing HA through Proxmox/Ceph and not through OPNsense, I see no reason why that wouldn't work. Please correct me if I'm wrong though.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      That's what I'm going to try.

  • @dimitristsoutsouras2712
    @dimitristsoutsouras2712 5 months ago

    Nice presentation of the procedure and of your special-case scenario as well.
    At the part where you created a CephFS (after you created the individual Ceph managers), where is that filesystem created? On the same 1 TB NVMe storage? If so, shouldn't there be some kind of partition separation between VM storage and ISOs, or do the object storage services arrange that automatically (what goes where)?

  • @hyperprotagonist
    @hyperprotagonist 6 months ago +6

    He’s only gone and bloody done it 👏

    • @Jims-Garage
      @Jims-Garage 6 months ago +4

      Haha, thanks. A lot of late nights behind this one for something that on the surface is quite straightforward!

    • @hyperprotagonist
      @hyperprotagonist 6 months ago +1

      @Jims-Garage kudos for persevering. On Twitter you highlighted the setbacks, on Discord you kept everyone reassured, and in the video your demeanour was as if it was merely a hiccup. You weren't lying when you said I didn't know the half of it 😂

  • @Copernicus22
    @Copernicus22 6 months ago +1

    Hi, very impressive work! Are those Ceph benchmark speeds normal though? I was expecting more given 25 Gbit networking and NVMe.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      Normal for consumer devices. Ceph isn't about performance, it's about reliability. It's perfectly fine from my experience. Anything super heavy you want local.

    • @Copernicus22
      @Copernicus22 6 months ago

      @Jims-Garage OK, thanks. Yeah, I did it once years ago; I think I had similar results with Ceph using microk8s.

  • @DavidC-rt3or
    @DavidC-rt3or 5 months ago

    After setting up a test PBS server and backing up the nodes of the cluster, I'm trying to find the steps to restore a node that is in a cluster and runs Ceph, just to make sure all of the needed information was backed up and that I know how to restore it (ahead of time :) ). Ideas?

  • @Eli-q5z9h
    @Eli-q5z9h 29 days ago

    In the system file /etc/hosts, should I put the IP addresses of the public network or of the Ceph network?

  • @janstasik9094
    @janstasik9094 2 months ago +1

    Hello, may I ask about the stability of the MS-01 since you deployed TB4 and Ceph? I've ordered the boxes, but meanwhile I've read horror stories about the MS-01: how hard it is to deploy vPro, that Proxmox installation is a nightmare, that BIOS upgrades and microcode deployment are nearly unrealistic, that the TB4 ports are impossible to configure and run, and that overall Ceph and box stability is a nightmare, with a reboot needed every 3 days, etc. What is your real-life experience? Is it worth buying them? From my side, it looks like the best hardware for a homelab. Thank you.

    • @Jims-Garage
      @Jims-Garage 2 months ago +1

      I haven't had a single issue since buying them about 3 months ago. They've been on all that time, are on the stock BIOS, and are running Ceph via TB4. Proxmox installation is the same as on any other device. I don't use vPro as I don't have a need for it, but I've heard it's a nightmare. The only issue I had was needing to disable ASPM in the BIOS.

    • @janstasik9094
      @janstasik9094 2 months ago

      @Jims-Garage Thanks...

  • @JonatanCastro
    @JonatanCastro 6 months ago

    This is amazing, I just got the MS-01 to create some content for my channel, but definitely would love to have the needed hardware to do a CEPH setup. Anyway, I digress; just want to ask you how quick it is to move a CT, considering you can't live migrate them, but on the other hand, the storage is already shared!

  • @kienanvella
    @kienanvella 6 months ago

    You can absolutely run with spinning disks with ceph, but you need quite a few of them, and definitely want some SSD DB/WAL devices.
    I'm running a cluster of 4 nodes, with 24 spinning disks, 6 per node. 3:1 OSD to DB/WAL drive ratio (3 OSDs share one DB/WAL SSD).
    Having said that, it's not stupendously fast - especially for my write-heavy workload, but it's fast 'enough'. I've got about 35 guests, which includes a Zabbix server with DB, 3x elasticsearch, and a graylog system.
    It was quite affordable however, buying used drives in bulk.
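The layout described above implies a specific number of DB/WAL SSDs, which a few lines of arithmetic make explicit. This is a minimal sketch using only the figures from the comment (4 nodes, 6 spinning OSDs per node, 3 OSDs per DB/WAL device); the function name is hypothetical.

```python
import math


def db_wal_devices_per_node(osds_per_node: int, osds_per_db_device: int) -> int:
    """Number of DB/WAL SSDs a node needs at a given sharing ratio."""
    return math.ceil(osds_per_node / osds_per_db_device)


if __name__ == "__main__":
    nodes, osds_per_node, ratio = 4, 6, 3
    per_node = db_wal_devices_per_node(osds_per_node, ratio)
    print(f"DB/WAL SSDs per node: {per_node}")
    print(f"DB/WAL SSDs in the cluster: {per_node * nodes}")
    # Every OSD sharing a DB/WAL device depends on it, so one SSD failure
    # takes that many OSDs offline together.
    print(f"OSDs affected by one DB/WAL SSD failure: {ratio}")
```

The sharing ratio is a trade-off: a higher ratio saves SSDs but widens the impact of a single SSD failure.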

    • @Jims-Garage
      @Jims-Garage 6 months ago

      That's awesome, thanks for sharing. I'll do some more testing.

  • @monish05m
    @monish05m 6 months ago

    May I ask for a video on how to set up that virtual NIC you have running on your OPNsense?
    Thanks, and I really loved your video.

  • @simuman
    @simuman 6 months ago

    Hey Jim, really like your videos. I tried this a few months back and I'm not sure whether I got this Ceph system wrong, but I couldn't get it to work with an external NAS connected through a mapped CIFS mount: on failover, HA did not recognize the IP address for the Plex media. Do you know if this is possible, or have I got the wrong end of the stick about HA and how it works?

  • @Irish2086
    @Irish2086 6 months ago

    I have been looking for this answer for a while... How would one figure out the right number for a 5-, 7-, or 9-node Ceph configuration? I have only found information about a 3-node config.

    • @headlibrarian1996
      @headlibrarian1996 6 months ago

      I like 5 more than 3, but 5 MS-01s is fairly pricey and you can't do a full-mesh Thunderbolt network with 5. With five, shutting down a node for maintenance doesn't completely degrade the cluster, and erasure coding works better with more nodes.
      A 5-node Qotom cluster is interesting because they have 2 SFP+ 10G ports, but I don't know how well it would actually perform. You could have one set of SFP interfaces on a dumb switch for the private backhaul network, and you need 5 ports on your main switch for the public-facing interfaces.
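The two constraints mentioned above, mesh cabling and how many nodes can be lost, both follow from simple counting. This is a rough sketch, assuming each node has two Thunderbolt ports (which is what limits a switchless full mesh to 3 nodes) and that cluster membership is decided by a simple majority vote; the function names are hypothetical.

```python
def mesh_ports_per_node(nodes: int) -> int:
    """A full mesh links every node directly to every other node."""
    return nodes - 1


def mesh_links(nodes: int) -> int:
    """Total number of point-to-point cables in a full mesh."""
    return nodes * (nodes - 1) // 2


def quorum(nodes: int) -> int:
    """Smallest strict majority of voting members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Members that can fail while a majority remains."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    ports_available = 2  # assumption: two Thunderbolt ports per node
    for n in (3, 5, 7, 9):
        fits = mesh_ports_per_node(n) <= ports_available
        print(
            f"{n} nodes: {mesh_ports_per_node(n)} ports/node, "
            f"{mesh_links(n)} cables, full mesh possible: {fits}, "
            f"quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)"
        )
```

Odd node counts are the usual choice because adding one more node to an odd-sized cluster raises the quorum without raising the number of failures it can survive.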

  • @lsimsdj
    @lsimsdj 2 months ago

    My mini PCs have one 512 GB NVMe SSD each... Will this not work? Does it mean I need to buy one additional NVMe SSD for each mini PC in the cluster?

    • @Jims-Garage
      @Jims-Garage 2 months ago

      Correct, Ceph requires a dedicated drive.

  • 1 month ago

    @9:33 Try to _ALWAYS_ have a serial console. That never fails.

  • @orgind7778
    @orgind7778 6 months ago +1

    Thanks great video

  • @RoiskiaFilms
    @RoiskiaFilms 6 months ago

    I just noticed that naming scheme and I am confused. Failbaddon the Harmless and then the two primarchs? Anyway, great video. Looking forward to trying this myself in the future.

    • @Jims-Garage
      @Jims-Garage 6 months ago

      Thanks 👍 Cadia stands (oh wait!) 😲

  • @voldllc9621
    @voldllc9621 6 months ago +1

    I did not see you creating a shared storage for vm and ct disks. Cephfs cannot host these because that gives you posix file storage only, not block storage. You need RADOS block storage.

    • @Jims-Garage
      @Jims-Garage 6 months ago +1

      Thanks, as mentioned that was in the previous video.

    • @voldllc9621
      @voldllc9621 6 months ago

      Sorry, I missed that, probably because I saw you installing Ceph from scratch and, after creating a replicated pool, going straight to CephFS for ISO and CT template file storage. ISOs and CT templates are not crucial for HA.

    • @DavidC-rt3or
      @DavidC-rt3or 6 months ago

      In my setup I've got one CRUSH rule and pool set up on SSDs for the VM disks, and another on HDDs for the VMs' virtual data disks. Not a high volume/performance need.

  • @snowballeffects
    @snowballeffects 4 months ago +1

    SO... that lock out problem when you pass through the GPU - I have a standby PCI (yup PCI 😂) GPU that I popped into that previously annoyingly unused slot - leaving the original gpu in place. plug in the SVGA monitor 😂 and boom - hello cli 😅

    • @Jims-Garage
      @Jims-Garage 4 months ago

      @snowballeffects nice, that's a good failsafe!

  • @cberthe067
    @cberthe067 6 months ago +1

    Is there no erasure coding in the CRUSH rule?

    • @Jims-Garage
      @Jims-Garage 6 months ago

      It's a trade off from my understanding. Erasure coding ensures better replication (data loss prevention) but impacts on performance. As I always abstract my data I'm less worried about it as a long term storage mechanism (more for failover capability).
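One way to look at the replication versus erasure coding trade-off is in terms of usable capacity. This is a minimal sketch of the standard overhead arithmetic only. It says nothing about performance, and the profile `k=2, m=1` is an assumption chosen because it is the smallest layout that fits on three hosts; the function names are hypothetical.

```python
def replicated_usable_fraction(size: int) -> float:
    """Share of raw capacity that holds unique data with `size` full copies."""
    return 1 / size


def ec_usable_fraction(k: int, m: int) -> float:
    """Share of raw capacity that holds data with k data and m coding chunks."""
    return k / (k + m)


if __name__ == "__main__":
    print(f"replicated, size=3: {replicated_usable_fraction(3):.0%} usable, "
          f"survives losing 2 copies")
    print(f"erasure coded, k=2 m=1: {ec_usable_fraction(2, 1):.0%} usable, "
          f"survives losing 1 chunk")
```

Under these assumptions, erasure coding on a 3-node cluster trades fault tolerance for capacity, which is part of why it becomes more attractive as the node count grows.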

  • @BenjaminBenStein
    @BenjaminBenStein 6 months ago +1

    🎉

  • @MelroyvandenBerg
    @MelroyvandenBerg 5 months ago +1

    is covid back again in the country? blehh.

    • @Jims-Garage
      @Jims-Garage 5 months ago +2

      @MelroyvandenBerg yeah, I think there has been a summer spike.

    • @dazealex
      @dazealex 5 months ago

      @Jims-Garage Even here in California.

  • @mridulranjan1069
    @mridulranjan1069 5 months ago +1

    You didn't show or guide through the setup of anything, just talked, showed your face and a couple of screenshots. Seriously man, what CRAP!

    • @Jims-Garage
      @Jims-Garage 5 months ago

      @mridulranjan1069 did you ensure that your monitor was on and that the sound wasn't muted?

  • @randallsalyer
    @randallsalyer 3 months ago +1

    The fix for your IPv4 issue is now in the setup documentation; you have it after your source line. Just FYI, hope you see this.
    Also add the following to /etc/network/interfaces as the last line of the file, unless the file has a source line, in which case put it immediately before the source line (or delete the source line):
    # This must be the last line in the file unless there is a sources line in which case put this immediately above the sources line (or delete the sources line)
    post-up /usr/bin/systemctl restart frr.service

    • @Jims-Garage
      @Jims-Garage 3 months ago +1

      @randallsalyer thanks, I will look at that!