This is a really great video, I appreciate how you explained the basics first and then went into detail. May your kitty cats be merry and meowy!
Glad you enjoyed it!
Something to bear in mind when doing RAID5 or 6: the number of data drives should ideally be a power of two. The reason comes down to how the parity is calculated; it's simply more efficient when the number of data drives is a power of two. So for RAID5, 2D+P, 4D+P and 8D+P, and for RAID6, 4D+2P and 8D+2P, are the sensible options in an array. There is also a scheme where you divide your drives up into, say, 8GB chunks and create loads of RAID6 8D+2P arrays from chunks scattered over all the drives; all the arrays are then concatenated together to make one large volume. It goes by various names, but I don't think any free software does it to my knowledge. It gives really fast rebuilds from failed drives, as every drive you have participates in the rebuild. You also don't get any hotspots, so performance is on average much better.
Thanks for sharing your thoughts! Yes, I chose RAID-0 in this demonstration to keep things simple. There are definitely other factors to consider when using RAID levels that involve parity.
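To illustrate the power-of-two point above with a quick Python sketch (the 256 KiB strip size here is just an assumed example, not a figure from the video): with a power-of-two number of data drives, the full stripe works out to a power-of-two size, which divides evenly into the power-of-two I/O and filesystem block sizes most workloads use.

# Sketch: full-stripe size for a few RAID5/6 layouts, assuming a 256 KiB strip per data drive.
STRIP_KIB = 256  # assumed strip (chunk) size

layouts = {
    "RAID5 4D+1P": 4,
    "RAID5 5D+1P": 5,    # non-power-of-two data count for comparison
    "RAID5 8D+1P": 8,
    "RAID6 8D+2P": 8,
    "RAID6 10D+2P": 10,  # non-power-of-two data count for comparison
}

for name, data_drives in layouts.items():
    full_stripe = data_drives * STRIP_KIB
    power_of_two = (full_stripe & (full_stripe - 1)) == 0
    print(f"{name:>13}: full stripe {full_stripe:5d} KiB, power-of-two: {power_of_two}")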
Appreciate the time and effort you put into this presentation!
Thanks! Hope it was helpful! :-)
...and, if you have random parallel reads, your strip size needs to be bigger than your read size to minimize IOPS. (Awesome visual presentation btw, really really impressed with the setup and the effort!)
Thanks!
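A small sketch to put numbers on that (assumed read and strip sizes, not measurements from the video): once the strip size comfortably exceeds the read size, most random reads cost a single disk I/O instead of fanning out across several members.

# Sketch: how many strips (and so disks) one random read can touch, best vs worst case.
import math

def strips_touched(read_kib, strip_kib):
    best = math.ceil(read_kib / strip_kib)                    # read aligned to a strip boundary
    worst = (strip_kib - 1 + read_kib - 1) // strip_kib + 1   # read straddling strip boundaries
    return best, worst

for read_kib in (4, 16, 64, 256):
    for strip_kib in (64, 128, 1024):
        best, worst = strips_touched(read_kib, strip_kib)
        print(f"read={read_kib:3d}K strip={strip_kib:4d}K -> disks touched per read: {best}..{worst}")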
Awesome experiment, thank you!
I'm assuming your Supermicro 836 does not use a one-to-one set of connections from the initiator(s) to the backplane for the disks under test - that is, the number of initiator ports (narrow PHYs) used, and the matching number of backplane input ports, is smaller than the number of disks (16). (IMHO, obviously the number of initiator ports used should equal the number of backplane input ports used.)
If that's correct, here's my question: for the block sizes that incur a penalty relative to the chunk size, i.e. blocksize < chunksize (chunk=128K with bs=1K or bs=16K, and chunk=1M with bs=1K, bs=16K or bs=128K), what would happen if one-to-one connections, not multiplexed through the backplane's expander, were used from 16 initiator ports to 16 backplane ports for the 16 drives?
I can only hope I've expressed what I mean correctly.
This 836 has the BPN-SAS2-836EL1 backplane. However, unless the test saturates the 8x6Gbps link to the HBA, I don't think there's much difference compared to the "TQ" backplane version.
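Rough bandwidth arithmetic behind that reply (assumed figures: a 6Gbps SAS lane carries roughly 600 MB/s of payload after 8b/10b encoding, and ~200 MB/s is an assumed sequential rate per HDD):

# Sketch: is the 8-lane 6Gbps link between HBA and expander backplane the bottleneck?
LANES = 8
MB_PER_LANE = 600     # ~6 Gbps per lane after 8b/10b encoding (approx. MB/s)
N_DRIVES = 16
MB_PER_DRIVE = 200    # assumed sequential throughput per HDD (MB/s)

link_mb = LANES * MB_PER_LANE
drives_mb = N_DRIVES * MB_PER_DRIVE
print(f"HBA<->backplane link: ~{link_mb} MB/s, 16 drives combined: ~{drives_mb} MB/s")
print("expander link is the bottleneck" if drives_mb > link_mb else "drives saturate before the link does")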
Thanks for the video, some very useful information here.
Glad it was helpful!
Thanks for the video, gave it a thumbs up
Thanks for watching!
OMG!!!! Dancing Bear Stripper! Good times!
LOL.. just having some fun! :-) thanks for watching!
It would be really educational if you added some diagrams of how file allocation happens in today's storage systems. I mean, if I consider a single hard disk, the physical sector size, the logical sector size, and the filesystem cluster size all must be chosen optimally. Now let's add RAID to the equation. For example, is the RAID stripe size seen by the OS as the physical sector size? If I want to store a 1-byte, 1-KB, 1-MB, or 1-GB file, how many sectors, clusters, and RAID strips are filled with actual data? Also, how much storage will be wasted due to cluster size inefficiency? I hope my question is clear.
Thanks for the suggestion. I'll put this on my list of future videos.
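As a quick taste of the cluster-waste part of the question, here's a rough sketch assuming 4 KiB filesystem clusters on a 128 KiB RAID strip (metadata and small-file inlining ignored):

# Sketch: clusters and strips consumed, plus slack (wasted) space, for a few file sizes.
import math

CLUSTER = 4 * 1024      # assumed filesystem allocation unit (bytes)
STRIP = 128 * 1024      # assumed RAID strip size (bytes)

for size in (1, 1024, 1024**2, 1024**3):
    clusters = max(1, math.ceil(size / CLUSTER))
    strips = math.ceil(clusters * CLUSTER / STRIP)
    slack = clusters * CLUSTER - size
    print(f"{size:>10} B file -> {clusters:>6} clusters, {strips:>5} strips touched, {slack} B slack")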
I believe I saw a Dell SC200 Compellent in your rack. It would be great if you could do a video on that unit. I have one and would like my NUT server to be able to command a shutdown; also, the fans don't seem to throttle at all. I really like the unit, but I'm having trouble finding info on it.
Good eye. Yes, it is an SC200. It is very loud and there's no good way to control the fans. Otherwise it works great as a disk enclosure.
Nice video! Can you explain what happens to performance in different configurations? For example, a RAID 6 (14+2) vs. a RAID 6 volume made of 2x (6+2)?
That's a great question, but perhaps for a future video. Briefly, I'll just say that the number of "effective" spindles helps improve throughput, but IOPS is still limited since an entire stripe must be read/written for each I/O operation. Large sequential reads/writes benefit more as the strip size and stripe width grow.
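To make the 14+2 vs 2x(6+2) comparison concrete, here's a back-of-the-envelope sketch (assuming a 128 KiB strip size; real controller behaviour will vary):

# Sketch: full-stripe size and usable capacity for RAID6 14+2 vs two RAID6 6+2 arrays.
STRIP_KIB = 128   # assumed strip size per data drive

def raid6(data_drives, arrays=1, parity=2):
    full_stripe_kib = data_drives * STRIP_KIB        # data per full-stripe write, per array
    usable = data_drives / (data_drives + parity)    # capacity efficiency
    total_drives = arrays * (data_drives + parity)
    return total_drives, full_stripe_kib, usable

for label, (data, arrays) in {"RAID6 14+2": (14, 1), "2 x RAID6 6+2": (6, 2)}.items():
    drives, fs_kib, usable = raid6(data, arrays)
    print(f"{label:>13}: {drives} drives total, {fs_kib} KiB full stripe per array, {usable:.0%} usable")

Same 16 drives either way, but the single wide array gives more usable capacity, while the split arrays keep each full stripe smaller and a rebuild only involves the drives of one array.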
This was a nice video. I have a related question: how does the allocation unit size chosen during formatting affect these scenarios? What would be the ideal allocation size when formatting the virtual disk? I'm guessing it should be equal to or smaller than the strip size.
That's a really good question, because in an ideal situation the various data management layers should all align exactly, and should be sized for the transaction sizes most frequently used by the applications. When the layers are not aligned, a single I/O can amplify into multiple I/Os in an underlying layer. And of course, trying to match the applications, of which there could be many that do vastly different things when it comes to data I/O, is always challenging. But even if you went for the "average", the objective is to minimize the amount of I/O needed to complete the entire transaction stack. This is particularly important for magnetic storage, where IOPS capacity is limited, but is less of an issue for solid-state storage.
@@ArtofServer Okay, then I guess the ideal allocation size would be the size of the strip written to one of the disks, i.e. (total stripe size ÷ number of disks), or a factor of it. Like you pointed out, an allocation size equal to the strip size might only work when all layers are aware of each other and work in tandem to make the best use of the resources.
For sequential reads/writes, it could end up having little to no effect at all. But this is certainly interesting from the point of view of random reads/writes.
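Here's a tiny sketch of the alignment point in this thread (all sizes assumed): an allocation unit that divides evenly into the strip size never crosses a strip boundary, while one that doesn't can spill into a second strip and amplify a single logical write.

# Sketch: worst-case number of strips a single filesystem allocation unit spans.
def strips_spanned(start_kib, length_kib, strip_kib):
    first = start_kib // strip_kib
    last = (start_kib + length_kib - 1) // strip_kib
    return last - first + 1

STRIP_KIB = 128
for alloc_kib in (4, 32, 64, 128, 192):
    # allocation units sit on a grid of their own size, starting at offset 0
    starts = range(0, STRIP_KIB, alloc_kib)
    worst = max(strips_spanned(s, alloc_kib, STRIP_KIB) for s in starts)
    print(f"{alloc_kib:3d}K allocation unit on a {STRIP_KIB}K strip -> worst case {worst} strip(s) per write")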
Unless you are doing incredibly large files all the time, 128k or even 256k is the absolute biggest. For HDDs I will bounce between 64k and 128k depending on the use case. For file servers with tons of small files I will go with 64k. For backup servers, or file servers with a mix of or all large files, I will go with 128k. For SSD-based servers I stick with 128k... it's just simpler that way. I only use ZFS for my filesystems now for non-Windows servers, and I no longer use hardware RAID at all. I will put Linux (ZFS) or TrueNAS on the bare metal. If I intend to run VMs, then I will use Linux and run ZFS off that. For storage-only servers I use TrueNAS Core/Enterprise. The concepts are similar though. :)
Thanks for sharing your thoughts. The default chunk size (strip size) for Linux software RAID is 512KB; I chose 128KB to get a wide spread between the 128KB and 1MB configurations for the demonstration.
Please note that ZFS recordsize is not the same as strip/chunk size; the behavior of a ZFS record being written to a raidz vdev is not like traditional RAID (hardware or software).
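For anyone curious how the chunk size in the demonstration translates into which member disk gets hit, here's a simplified sketch of plain RAID-0 striping arithmetic (16 disks assumed; md internals ignored):

# Sketch: map a byte offset on a RAID-0 volume to (member disk, offset on that disk).
def raid0_location(offset, chunk, n_disks):
    chunk_index = offset // chunk
    disk = chunk_index % n_disks
    disk_offset = (chunk_index // n_disks) * chunk + (offset % chunk)
    return disk, disk_offset

N_DISKS = 16
for chunk_kib in (128, 1024):
    chunk = chunk_kib * 1024
    # how many member disks a 1 MiB sequential read starting at offset 0 touches
    touched = {raid0_location(off, chunk, N_DISKS)[0] for off in range(0, 1024 * 1024, 4096)}
    print(f"chunk={chunk_kib:4d}K: a 1 MiB sequential read touches {len(touched)} disk(s)")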
💯 Thanks so much! I'd love to take a 4TB x 4 stripe, partition it, and use the DAS for hosting VMs lol
Good luck!
@@ArtofServer lol base calculations and price are roadblocks lol
What is this chassis? Do you have any recommendations for a decent chassis with lots of hot-swappable drive bays? How are the red LEDs on the front panel activated? Can they be used to identify faulted or specific drives under Linux? Your rig is very cool... I'm finally outgrowing my old Antec 300 case in terms of 3.5" bays.
This machine is a Supermicro 836. I've made some videos about it on my channel so search around if you want to know more about it.
@Art of Server
Quick question - I have 10x 2TB HDDs ready for hardware-based RAID 10 (dedicated controller). What are the best settings for it, assuming all types of file sizes are present (videos, small 1KB TXT files, anything in between)?
That's a good question and what I was hoping to demonstrate in this video. You can't really optimize traditional RAID strip size for all use cases. You sort of have to pick where you want to optimize and move your strip size to cater to your most common use case. If you're going to be handling equal amounts of all types of I/O sizes, then I would just aim for the middle or defaults.
What's the best strip size for hardware RAID hosting VMs on ReFS?
Does it matter what kind of file sizes I have within the VHDX containers?
In general, a 2-way mirror will get you to about 99.99% reliability per pool per year (assuming 44-vdev mirrored zpools). But that means that in one out of 10,000 pools per year, you're going to lose the whole pool to a double-disk failure. Scheduled scrubs can reduce the risk somewhat, but never eliminate it.
I've researched some interesting statistics from our internal ZFS data supporting this assertion but am not at liberty to share them yet...
Not sure what this has to do with the video, but I find mirror vdevs to be wasteful in terms of storage efficiency. Good for IOPS perhaps, but if you're after IOPS these days you should just be using flash storage. If you're using magnetic storage, you should not expect high IOPS. If you're using flash, it seems too wasteful to throw half of it away for mirror redundancy.
@@ArtofServer Mirror vdevs are basically ZFS's software equivalent of RAID 10, which I'm sure you already know, but that's my favorite setup. Like hardware RAID, you get some form of redundancy running 2-way ZFS mirrors, but you also get scalability in both random and sequential IOPS, something you don't get with parity RAID, hardware or even ZFS I believe - there only sequential scales, random stays stagnant.
I use lz4 compression in ZFS, which gives me significant storage gains on data that isn't already heavily compressed like video. So while a mirror-vdev configuration only has 50% usable capacity on paper (technically slightly less, since ZFS adds checksums to every block - think T10-PI), I'm getting closer to 70%-75% effective capacity thanks to lz4, while getting sequential and random read/write gains and still keeping the redundancy.
I use Lustre as my parallel file system since it's open source and the best of its kind (it's used by more than half of the supercomputers in the Top500), and it lets me use ZFS as a backend with lz4 compression, chaining together multiple mirror-vdev pools across multiple systems (24x 4TB configurations across 3 systems). It's like three software RAID 10s wired together over an InfiniBand network in Linux, and I also use U.3 TLC NAND SSDs as a cache, since Lustre supports tiered storage configurations like the one I'm using.
Agreed, I feel they are safer than the often-used z* vdevs when properly set up. Each of my mirrors is one Exos and one WD Ultrastar - what are the chances of them failing at the same time? If I do lose a drive, pool performance seems entirely unaffected, and resilvers are quick and painless with no parity involved, so there's less stress on the remaining drive. I don't mind the hit to storage efficiency; the peace of mind and the IOPS are worth it.
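For what it's worth, here's a very rough sketch of where a figure like the 99.99%-per-pool number at the top of this thread can come from. All inputs are illustrative assumptions (about a 2% annual drive failure rate and a 24-hour resilver), not the internal data mentioned above:

# Sketch: yearly chance that some 2-way mirror vdev in a 44-vdev pool loses both
# disks before its resilver completes. Purely illustrative assumptions.
AFR = 0.02              # assumed annual failure rate per drive
RESILVER_DAYS = 1.0     # assumed resilver time after a failure
VDEVS = 44              # 2-way mirror vdevs per pool

p_first = 1 - (1 - AFR) ** 2                  # some drive in a given mirror fails this year
p_partner = AFR * (RESILVER_DAYS / 365.0)     # its partner also fails during the resilver
p_vdev_loss = p_first * p_partner
p_pool_loss = 1 - (1 - p_vdev_loss) ** VDEVS
print(f"approx. pool-loss probability per year: {p_pool_loss:.4%}")   # ~0.01%, i.e. ~1 in 10,000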
Hello, I like your wallpaper. Can you share it?
artofserver.com/downloads/wallpapers/aos_wallpaper.png
Thank you.
any metric on disk failures?
Not sure what you mean? This video is about RAID strip size...
Hmm, so the kitty likes the size of the sausage? lol
lmao
From my experience, a strip size of 64k or 128k is optimal for the majority of workloads.
Thanks for sharing!
Thank u
sub'd
Thanks! :-)
Larger font please
What device were you watching this on?
Thanks, thank you, and thanks again!
Warmest greetings from Baden-Württemberg,
27.08.'23
Thanks for your comment. I hope you find my videos helpful!
I don't remember for sure, but I think my strip size right now is probably the ZFS default, which I *think* is 128 KiB.
Please also bear in mind that the size of your files relative to the native physical sector/block size of your HDD (whether that's 512, 512e, 520, 528, or 4K native, a.k.a. 4Kn) will also make a difference to the performance of your file system.
If you have a wide spread of file sizes (ranging from really small files to really large files), trying to optimise for that is virtually impossible, as a configuration that would be good for really small files would be terrible for really large files and vice versa.
For that reason, I tend to stick with either a 64 KiB or 128 KiB strip size, because writing a lot of tiny files hurts performance much more than having a less-than-optimal strip size for writing large files.
(i.e. if the strip size is 64 KiB, it's not great for writing large files, but it isn't going to completely kill it either; whereas if you set the strip size too large when you're writing a lot of tiny files, your throughput can drop to single-digit MiB/s.)
Keep in mind that ZFS works differently than traditional RAID; the ZFS record size does not behave like the RAID strip size.
@@ArtofServer
Yeah. They introduced the concepts of ashift and record size, and I'm not really sure how those relate to the more "traditional" definitions of the various sizes in more traditional RAID implementations, per @2:01 in your video here.
Nevertheless, the optimisation of the performance is a function of all of that, plus the histogram, by size, of the data that you plan on putting onto your ZFS pool.
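Circling back to the "single-digit MiB/s with lots of tiny files" point earlier in this thread, the arithmetic is a simple IOPS bound. A sketch with assumed figures (~150 random write IOPS per HDD, 4 KiB files, 8 data drives):

# Sketch: tiny-file write throughput is bounded by IOPS, not by bandwidth.
IOPS_PER_DRIVE = 150    # assumed random write IOPS of one HDD
FILE_KIB = 4            # assumed tiny file size
N_DRIVES = 8            # assumed data drives

# if each tiny file costs roughly one I/O on one drive, aggregate throughput is:
throughput_mib_s = N_DRIVES * IOPS_PER_DRIVE * FILE_KIB / 1024
print(f"~{throughput_mib_s:.1f} MiB/s when writing {FILE_KIB} KiB files across {N_DRIVES} drives")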