Chapters:
00:00 - Introduction
01:27 - Storage Requirements for Small to Medium-Sized Businesses
02:05 - Introducing ZFS as a Solution
02:32 - What is the required workload for this ZFS system?
04:46 - The Hardware
06:44 - Architecting a ZFS Storage Server for a Mixed Workload
22:40 - Building a ZFS Storage Server for a Mixed Workload
25:14 - Creating the RAIDZ2 Vdevs (storage pool)
29:44 - Adding Additional Virtual Devices
31:10 - Adding a Special Vdev
33:38 - Creating/Configuring File Systems
37:22 - Outro (Including a Look at ZFS Send & Receive)
While I know this is not the point of the video it honestly cracks me up watching two storage experts play the role of the customer acting like they don’t know what’s going on. Pure comedy and I love it. Ranks right up there with Doug’s love for his colored dry erase markers.
I’m 14 minutes in and I have to comment. Thank you for the knowledge in this video. As much as I know about ZFS, it’s always nice to get a lesson from two brilliant guys that know it very well. I mostly run ZFS at home for my ESXi backend/all in one box for my home really. I wish I got to play with ZFS more in the wild, but I work at an MSP, so it’s not really a solution they would sell for their various reasons. Ok I’m gonna continue, but thanks again gents.
This is what I have been looking for for many days. I was searching various blogs and forums to find a way to use HDDs to host my Proxmox.
LOL You can totally hear the Canadian East Coast accent in there. Subbed, and loved it.
Love this kind of format, thanks a lot!
I love this video, you guys make learning zfs very entertaining
this is some damn good info, thanks guys!!
Great job love it, storinators are very nice 😊
Very nice info guys! Thanks for the video.
Finally something other than ceph :phew Thank you 🙏
When you added your slog @30:38 and selected the 4 ssds as mirror, did it create a 4way mirror, or stripe+mirror?
It created a single 4 drive mirror. Great catch. You probably don’t need to go that crazy with a SLOG; you would get better IOPS by making the first 2 disks a mirror and then going back and adding another LOG with the other 2 disks. That would give you two 2-drive mirror vdevs in your SLOG.
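To make the two layouts concrete, here is a minimal sketch that only builds the command strings; it does not run anything. The pool name `tank` and the device names are hypothetical, and the helper `slog_add_commands` is made up for illustration. It assumes the standard `zpool add <pool> log mirror <devices>` form.

```python
from typing import List


def slog_add_commands(pool: str, disks: List[str], mirror_width: int = 2) -> List[str]:
    """Build one `zpool add ... log mirror ...` command per mirror group."""
    if mirror_width < 2 or len(disks) % mirror_width != 0:
        raise ValueError("disk count must be a multiple of the mirror width (>= 2)")
    # Split the disk list into consecutive groups of `mirror_width` devices.
    groups = [disks[i:i + mirror_width] for i in range(0, len(disks), mirror_width)]
    return [f"zpool add {pool} log mirror {' '.join(g)}" for g in groups]


# Hypothetical pool and device names, for illustration only.
disks = ["sda", "sdb", "sdc", "sdd"]
four_way = slog_add_commands("tank", disks, mirror_width=4)    # one 4-way mirror
two_by_two = slog_add_commands("tank", disks, mirror_width=2)  # two 2-way mirrors
print(four_way)
print(two_by_two)
```

The first layout is what the video ended up with; the second is the two-step approach described in the reply above, where each command adds a separate mirrored log vdev.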
Worth noting: If you add a special vdev, it's best to mirror, or perhaps two mirrors (RAIDZ10?), because if that special vdev fails you will lose the entire pool!
And maybe because this is a new feature and doesn't have all the kinks worked out, I have found that during testing, when the special vdev fills up, it will just start writing new data to the HDD vdev. And it will not re-arrange that data in the future, like to promote frequently used data or metadata to the special. I found that setting the special vdev to only store metadata was the best use in light of this.
@@jdeee.mp3 We actually explain both of these things in the video :) That special VDEV functionality is actually by design. Since its first and most important use case is for metadata, once it starts to fill up it will eventually cut off the small block writes and send them back to the HDDs to ensure that it has enough space for the metadata.
@@mitcHELLOworld Ok good to know, thanks. Edit: Must not have been paying attention. Re-watching the video I did see that you mentioned it!
@@jdeee.mp3 yeah I believe the threshold is around 75% but don’t quote me on it. It can be tuned also! Thanks for commenting and watching !
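A rough model of that cutoff, as a sketch only: small data blocks are admitted to the special vdev while its free space stays above a reserve. The 25% reserve used here is an assumption that matches the "around 75%" figure above. As far as I know it corresponds to the OpenZFS module parameter `zfs_special_class_metadata_reserve_pct`, but check the documentation for your version before relying on it. The function name is made up for illustration.

```python
def small_blocks_allowed(special_size_bytes: int, special_used_bytes: int,
                         reserve_pct: int = 25) -> bool:
    """Return True while small data blocks may still land on the special vdev.

    reserve_pct is an assumed default; metadata keeps using the special vdev
    either way, only small *data* blocks are redirected to the HDD vdevs.
    """
    free_pct = 100 * (special_size_bytes - special_used_bytes) / special_size_bytes
    return free_pct > reserve_pct


TIB = 1024 ** 4
print(small_blocks_allowed(1 * TIB, int(0.50 * TIB)))  # half full: still admitted
print(small_blocks_allowed(1 * TIB, int(0.80 * TIB)))  # 80% full: sent to the HDDs
```

This also matches the behaviour reported in the earlier comment: once the cutoff is hit, new small blocks go to the HDD vdevs, and nothing in this model moves them back later.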
I understand some of this and am yet further confused. But that is ok :P A couple of questions. Where does the OS that controls all of this live? Is it on the platter and spindle disks? Or on the SSDs? Or on the NVMe devices? Also, how do you access the drives for replacement? Does the unit slide out with enough slack on the power/network cabling, so you can take the top off and then access the drives to replace the faulty one while the device is up? Or do you leave space above the device in the rack so that you can take the top off to access the drives? Or do you have to take the device offline to replace a faulty disk?
Amazing ... Video about ZFS storage....
I like that ZIL. It should have battery-backed NVRAM, I guess.
Curious why Z2 instead of a bunch of mirrored vdevs. In your experience, has a 2nd drive gone bad while the replacement resilvered that often? So thankful for the knowledge and experience you guys are sharing. I loved how this video tied in all the special Vdevs and ZFS components into a practical build!
Hi guys,
I don't quite understand why you only see the recordsize when setting up the vdevs? I haven't seen anything about a blocksize (ashift) setting. What are these for the different workloads such as database, vms and shares? Or what are the recommendations for this?
Thanks
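For context on the two settings in this question: `ashift` is fixed per vdev when the pool is created and is the base-2 logarithm of the sector size, while `recordsize` is a per-dataset property that can differ for each file system on the same pool. A minimal sketch of the `ashift` arithmetic follows; the function name is made up for illustration, and it does not recommend values for any particular workload.

```python
def ashift_for_sector_size(sector_bytes: int) -> int:
    """Return log2(sector size), i.e. the ashift value for that sector size."""
    # Reject zero, negatives, and anything that is not a power of two.
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


print(ashift_for_sector_size(512))   # 512-byte sectors
print(ashift_for_sector_size(4096))  # 4 KiB sectors
```

So 512-byte sectors give an ashift of 9 and 4 KiB sectors give 12, which is why it is set once at pool creation rather than per workload.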
Thank you for the lab presentation of your system. My question is how the system in this presentation would be connected to a host as a datastore. Fabric or iSCSI? Or just NFS? Thank you.
pardon, I found the answer myself on the replay, Michelle has already mentioned that it can be hooked up to Proxmox via iSCSI.
This is my exact scenario of needing a system for mixed use. A small SMB, roughly 15 users. The system will do file storage (SMB - mostly MS Office and some images), host a few VMs (domain controller, DB server), and a database for client appointments. Needs are small enough that I can just go all flash, something like 8 4TB SSDs. Do you think a SLOG or special vdev would be beneficial in my case? If so, I guess NVMe SSD? I can do 10 or 25G NICs.
So, so informative and HELPFUL, guys.
I am in the beginning stages of setting up a home NAS, and I'll be using TrueNAS CORE, too.
A couple questions, please. First, what I have in terms of hardware: 4 1-TB M.2 NVMe drives, 2 SATA III SSDs, and a mirrored zpool of 2 6-TB HDD NAS-level (WD Red Plus) devices already hosting the data for an existing Nextcloud server I'm running on another machine.
1) can one have the ZIL (SLOG) and the ARC (or L2ARC) on the same SATA III SSD, but with separate partitions for each, like sdx1 and sdx2, and
2) what's the difference between SCALE's apps and CORE's plugins? Does the Nextcloud plugin on CORE start up and run a Docker instance of Nextcloud or the snap version of Nextcloud, or a plain server not in Docker, but in its own VM, or something else like a jail on the boot disk where the TrueNAS CORE OS is running?
Do I need the special VDEV if I have enough RAM?
Hey Gerald,
Appreciate the question!
If you have lots of RAM, you can probably get away with no special vdev. However, you will want to watch the ARC hit rate: if it is north of 90%, then your ARC cache is serving most of the I/O requests. In this case, a special vdev will only be for the other 10% or colder data on the zpool.
The special vdev is only going to serve metadata faster. So, if faster listing times and/or very fast searches are crucial to your workflow, then keeping your pool simple, with lots of RAM and only regular vdevs, is a great choice.
If you check out our video on NVMe special devices ( ua-cam.com/video/0aM1iZJkOaA/v-deo.html ), you will see that even a regular HDD pool can serve metadata workloads pretty darn fast with just spinners and a good ARC cache.
Hope this answer clears things up for you.
Thanks again!
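One way to check the hit rate mentioned above is to divide ARC hits by hits plus misses. The sketch below parses kstat-style `name type data` lines; on Linux with OpenZFS this text is usually available at `/proc/spl/kstat/zfs/arcstats`, but treat that path and the field names `hits` and `misses` as assumptions for your platform. The sample numbers are invented, not measurements from the video.

```python
def arc_hit_rate(arcstats_text: str) -> float:
    """Compute the ARC hit rate in percent from 'name type data' lines."""
    stats = {}
    for line in arcstats_text.splitlines():
        parts = line.split()
        # Keep only rows shaped like: <name> <type> <integer value>
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    total = stats["hits"] + stats["misses"]
    return 100.0 * stats["hits"] / total if total else 0.0


# Made-up sample numbers, for illustration only.
sample = """name type data
hits 4 9500
misses 4 500
"""
rate = arc_hit_rate(sample)
print(f"ARC hit rate: {rate:.1f}%")
```

With a result north of 90%, most reads are already served from RAM, which is the situation described in the reply above.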
can we do high availability for zfs (iscsi) using ceph ?
What OS is your NAS using? I don't recognize the menu system.
The OS we are using is actually just Rocky Linux, but the menu system is actually our own Houston UI ! It is totally free and open source and easy to install if you'd like to try it. We officially support Ubuntu and Rocky... Might be some small hoops to jump through for some other Debian or RHEL derivatives :)
@@mitcHELLOworld thank you for the answer. My bad, I asked this question a few comments up.
Great video guys!
I'm actually building out a Q30 for mixed use now and found this very helpful. My build out is very similar except that we have historically kept our database data in the VM stack. Are you presenting the DB storage as iSCSI LUNs to your VMs?
How durable are those SATA SSDs? Would it have made sense to have 2 of them be the SLOG and the other two be your DDT? I'm thinking that data store could make great use of dedupe.
The Micron 5300s we use are definitely robust and reliable, certainly enough to be used as a SLOG or Special VDEV. In general though, a SLOG will only be useful in very specific circumstances, generally when your workload involves a lot of sync writes (databases, VM hosting). Oftentimes adding one will not provide any significant performance benefit.
As for your DDT/Special VDEV, you can certainly add SSDs as these devices, but you'll generally get more benefit out of using NVMe for these purposes. As a note, we recommend against dedupe, as the storage efficiency benefit is usually fairly small but the performance impact can be substantial. You will also want solid resiliency on a special VDEV; we'd recommend at least a 3 way mirror rather than 2 way. If your special VDEV fails, so does the rest of your pool.
The LOG seems huge at 4 SSDs, typically for millionaires without any financial constraints. Since 2018 my LOG has been an SSD partition of 5GB, enough to cover all my writes during 5 seconds :) The interesting part of the video was about the special device for metadata and small files. Maybe I will use my 30GB SSD partition for that purpose instead of using it as L2ARC :) I assume that, like on the HDD data pool, they will write all metadata twice.
My ZFS system is a $349 desktop: Ryzen 3 2200G, 16GB DDR4, 512GB NVMe SSD (3400/2300MB/s) as ZFS boot device and to store my main VMs. The other storage is 2 data pools:
- one for VMs on the first 1TB partition of the HDD with an L2ARC partition of 90GB and that LOG of 5GB and
- the other one for my data on the last 1TB partition of the HDD with an L2ARC partition of 30GB and a LOG of 3GB.
All L2ARCs and LOGs together are on the 128GB of my SATA SSD.
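The sizing rule in this comment (a LOG big enough for about 5 seconds of writes) is simple arithmetic. A minimal sketch, assuming the write rate is the sustained sync-write rate you expect and that 5 seconds is the window you want to cover; both inputs and the function name are assumptions, not figures from the video.

```python
def slog_size_gb(sync_write_gb_per_s: float, seconds_buffered: float = 5.0) -> float:
    """Rule-of-thumb SLOG size: write rate multiplied by the buffered window."""
    return sync_write_gb_per_s * seconds_buffered


print(slog_size_gb(1.0))   # 1 GB/s of sync writes for 5 s
print(slog_size_gb(1.25))  # roughly a saturated 10GbE link for 5 s
```

By this rule a 5GB partition covers about 1 GB/s of incoming writes, which is why a small partition can be enough on a single-HDD desktop.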
I love this stuff, I wish I could work on it professionally. I'd end up spending thousands for a home setup like this to run my plex and like 1 vm... sigh
L2ARC is not required?
Dear 45Drives Team, regarding mixed workloads: I’m currently planning a Storinator with Houston UI for SMB file shares for an Active Directory, in combination with Proxmox Backup Server on a separate drive array, but with everything in the same Storinator. Maybe your team has some experience with this config and is able to share the key configurations with me/us. I’ve tried installing Ubuntu Server 20.04 with Houston UI and adding PBS, but this did not work. The other way around (installing PBS and adding Houston UI) didn’t work either. Maybe you have had this use case before.
Proxmox Backup Server runs on Debian, so I can't imagine it would run on Ubuntu without significant modification, if it is even possible. You could install Proxmox Backup Server from their official ISO, then install Cockpit and all of our Houston modules on top. We do not officially package for Debian, but the Ubuntu packages may work with minimal modifications.
Generally this is not a use case we're familiar with, as having backup infrastructure and file sharing infrastructure on the same device is not our recommendation. Backup infrastructure should always be on its own gear.
@@45Drives ok thanks for the clarification 👌
Could you do an architecture and build video on a dual controller ZFS storage server? Would really like to see a setup with higher availability than recovering from a replica but without having to build a 3+ node ceph cluster.
We'll add it to the queue
"Is that eight?"... nope, add another line and now it's 7. In all seriousness, great video.
sick khamzat shirt
Most assurance on ZFS would be a mirrored drive setup.
It is NOT Copy On Write! Don't have a COW.
ZFS is redirect on write like B Tree File System.
19:24 sounds like AccuBattery (:
Amen 🙏
you’ll honestly lose your jobs in 5 years, when U.3/E1 NVMe drives get a lot faster. ZFS is already a bottleneck on something like 8x Micron 7450 Max.
We have CPUs now with 128 PCIe 5.0 lanes, which allows a large number of ultra-fast SSDs for almost all companies. If ZFS/Ceph becomes a bottleneck, no one is going to use it anymore.
Cheers
Can you explain this more?
@@sbagel95 There is not much to explain. ZVOLs (the ZFS block devices that you use to split ZFS over iSCSI) are 20 times slower than every alternative that exists. You can share the whole pool as it is over whatever you like; for simplicity let's take iSCSI, and it's still at least 2x slower than any alternative way.
Let's say it differently: ZFS is the slowest file and block storage that exists for NVMe drives at the moment, simply because of the extreme ZIL and cache overhead.
While for spinning drives, which is what ZFS was developed for, it's a superior filesystem.
To explain this in detail, a youtube comment wouldn’t be enough.