A Red Hat certified Enterprise course didn't even cover this, even when directly asked. Excellent video.
Of course they didn't cover it. Cache tiering was deprecated over 2 years ago...
@@bpx2798 So this doesn’t apply in Ceph 5?
@@cmarotta82 That's not true. In fact, Ceph's official documentation doesn't mention this.
Nice video Daniel! Is caching preferred over journal disks or am I mixing apples and oranges?
Hi Jens.
Thank you for watching my videos.
Well, in some sense you are. I'm not exactly sure at which level you mean the journaling, since journaling happens in Ceph at more than one point.
So, the two concepts. Journaling means writing data to a journal and establishing the source of truth by reading back the full log and deciding what the current state should be. In some sense, the WAL (write-ahead log) feature in Ceph does that. In that setup, you split your data across two drives: one with the database and the data objects, and one with the WAL, which can be a faster device. That helps with write bursts if you have faster drives available. The data objects and the database can also be placed on separate drives.
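As a rough sketch of that split (the device paths below are placeholders for your own drives), creating a BlueStore OSD with the data on an HDD and the database and WAL on faster devices could look like this:

  # data objects on the HDD, RocksDB database and WAL on NVMe partitions
  # (placeholder device names, adjust to your hardware)
  ceph-volume lvm create --bluestore \
      --data /dev/sdb \
      --block.db /dev/nvme0n1p1 \
      --block.wal /dev/nvme0n1p2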
Another way to do journaling is to use a journaling filesystem. Before BlueStore became the default storage backend, you could run FileStore on top of any filesystem and let that filesystem handle the journaling at the bottom.
The cache tier in Ceph adds an extra pool where you can store data, and you decide which drives that extra pool is placed on. So, for instance, we have long-term storage with petabytes of spinning disks that we keep all our data on, and a cache tier with hundreds of gigabytes of NVMe storage, so we get faster access to recently stored or retrieved objects.
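And as a minimal sketch of wiring up a cache tier (hot_storage as in the video, cold_storage as a placeholder for the backing pool), the tiering itself boils down to:

  # attach the fast pool as a cache tier in front of the backing pool
  ceph osd tier add cold_storage hot_storage
  # cache both reads and writes
  ceph osd tier cache-mode hot_storage writeback
  # route client traffic for the backing pool through the cache tier
  ceph osd tier set-overlay cold_storage hot_storage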
I hope this helps.
Best regards
Daniel
Hiya. Thanks for the great videos. I do have a doubt about cache tiers. If I have 3 hdd and 3 ssd disks, would it be best to use 1 ssd for the cache tier pool and 3hdd+2ssd for the data pool(set as cold hdd storage)? Thanks
Hi C J.
If you are doubtful about the process and don't need the actual feature, then I would not use it. If I had 3 SSDs and 3 HDDs, I might instead set up the SSDs as database or WAL (write-ahead log) devices for the HDD OSDs on the different hosts, depending on whether I had a heavy write- or read-centric load.
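A rough sketch of that alternative, assuming one SSD and several HDDs per host (device names are placeholders), is to let ceph-volume carve the SSD into database volumes for the HDD OSDs:

  # one OSD per HDD, with their RocksDB databases placed on the shared SSD
  ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc \
      --db-devices /dev/sdd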
I hope this helps. Thank you for watching my videos.
Best regards
Daniel
Thank you for the guide. I am a Ceph beginner.
At 3:48, how does Ceph know to put the hot_storage pool on the SSD OSDs (instead of on the HDD OSDs)? Is this automatic? I already have 2 SSDs and 2 HDDs set up as OSDs in the host.
Hi A P.
Yes, if it's set up correctly, it should be automatic. There is a map of object identifiers that holds information about which pool holds the data, and you can create rules for when you want to move objects between the pools.
In simple terms, you can improve write speed by waiting a bit before flushing objects to the backend storage, and you can increase read speed by promoting objects into the cache when they are read from storage.
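As a sketch of those rules (the values are only examples and need tuning for your own workload), promotion is controlled with settings on the cache pool, for instance:

  # track object hits so the tier knows what has been used recently
  ceph osd pool set hot_storage hit_set_type bloom
  ceph osd pool set hot_storage hit_set_count 12
  ceph osd pool set hot_storage hit_set_period 14400
  # how many recent hit sets an object must appear in before it is promoted
  ceph osd pool set hot_storage min_read_recency_for_promote 2
  ceph osd pool set hot_storage min_write_recency_for_promote 2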
I hope this helps. Thank you for watching my videos.
Best regards
Daniel
Will Ceph automatically use the SSD drives/OSDs for the hot_storage pool you have created, or do we have to associate the SSDs with that particular pool?
Hi
The rule that governs which hardware is used for each pool is a CRUSH rule. I've made a video that covers that subject.
ua-cam.com/video/lFHi66F6g08/v-deo.html
I hope this helps.
Best regards
Daniel Persson
I think you can test plain HDD and SSD cache tiering + HDD separately.
Hi Daniel, in writeback mode, do you write data directly to the cache pool, or write to the base pool first and then call promote_object() to promote the object to the cache pool?
Hi Abo.
Thank you for watching my videos.
It's not something I can answer definitively; it depends on the configuration. You could configure the writeback to write to both pools simultaneously, or you could use the cache as a write cache and flush the changes back to your base pool at some point.
But the idea is to use it as a cache in both directions.
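If you want to see it for yourself on a test cluster, a small experiment (cold_storage here is a placeholder for the base pool, the object and file names are arbitrary) is to write through the base pool and watch the per-pool object counts:

  # write an object through the base pool while the cache overlay is active
  rados -p cold_storage put testobj ./somefile
  # rados df lists per-pool object counts, so you can watch where the object
  # lands first and when it is flushed down to the base pool
  rados df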
I hope this helps.
Best regards
Daniel
What do you think would happen in a read-intensive pool if the criteria you set for data to be promoted to cache in local RAM far exceed the amount of local RAM? Do these caching settings care whether the pools are replicated or erasure coded?
Hi Chris
My experience is that it will introduce swapping on your system, and in the worst case you will have unstable OSDs that crash.
When it comes to the pool setup, it doesn't really matter, but replicated pools are easier to handle when you need to modify them later. The documentation raises a lot of red flags about migrating and working with cache pools on erasure-coded systems. It might be solved in later versions, but I would not try it today.
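One thing that helps in that situation (the sizes and ratios below are made-up examples) is to give the cache pool a hard size target, so flushing and eviction kick in well before the fast storage is exhausted:

  # cap the cache pool at roughly 200 GB
  ceph osd pool set hot_storage target_max_bytes 214748364800
  # start flushing dirty objects at 40% of that target and evicting clean
  # objects at 80%
  ceph osd pool set hot_storage cache_target_dirty_ratio 0.4
  ceph osd pool set hot_storage cache_target_full_ratio 0.8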
Thank you for watching my videos.
Best regards
Daniel
Hi again. Do you know what the default time is for cache_min_flush_age if you don't manually set it? You set it to 10 mins. How can I reset it to the default or none? Thanks
Hi
The default value is 0, which means there is no minimum age requirement (flushing is then governed by the other cache settings). You can reset it by setting the value back to 0.
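For reference (hot_storage as the cache pool name from the video), setting and resetting it looks like this:

  # require dirty objects to be at least 10 minutes old before they may be flushed
  ceph osd pool set hot_storage cache_min_flush_age 600
  # back to the default: no minimum age
  ceph osd pool set hot_storage cache_min_flush_age 0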
I hope this helps. Thank you for watching my videos.
Best regards
Daniel
In Proxmox, with your guide, I have created SSD caching for my HDD pool.
Which pool do I add as the storage pool for VMs, the HDD pool?
Hi A P.
Following this guide, you create all the original pools with the HDD replication rule and then add the caching layer on top of that. When you enable the tiering overlay, Ceph switches which pool is actually used for the traffic, so you keep pointing your clients at the base pool.
I hope this helps. Thank you for watching my videos.
Best regards
Daniel
It was a nice video, but it would have been nice to see statistics about the change in performance.
Hi Andrés
Well, it depends on what you are running. I can say that running our cluster on HDDs alone is not possible, because we couldn't serve our clients. We see a real benefit from running a cache layer with NVMe drives in front of our HDD drives, but you might want to add WAL devices as well for maximum performance.
Thank you for watching my videos.
Best regards
Daniel
Hi! Can you tell me, please: if the storage pool contains RBD images, will all the data inside the RBD images be moved to the cache tier? Thanks
Hi Student.
Thank you for watching my videos.
The short answer is no.
Ceph is an object storage solution that stores objects as small as 4 KB and as large as 128 MB; the lower and upper bounds are configurable. So when you store larger images, they are split into smaller objects, and the metadata keeps track of what is used, so only the more recently fetched or stored data is kept in the cache.
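As a small illustration (the pool name rbd and the image name are made up), you can see the splitting by creating an image and listing its backing objects:

  # a 10 GiB image with a 4 MiB object size is backed by up to ~2560 RADOS
  # objects, allocated as data is actually written
  rbd create --size 10G --object-size 4M rbd/test-image
  # the backing objects are named rbd_data.<image id>.<object number>
  rados -p rbd ls | grep rbd_data | head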
I hope this helps.
Best regards
Daniel
I have a similar question, as the majority of my storage is VM/LXC as well.
you are awesome! 😎
How do you force hot_storage to only use the NVMe disks?
Hi Damien
To force specific pools to use specific drive types, you can create a replication rule limited to a device class and set that rule as the CRUSH rule for the pool.
The concept is described on this page
docs.ceph.com/en/quincy/rados/operations/crush-map/
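As a rough sketch (the rule name is a placeholder, hot_storage as in the video), it could look like this:

  # a replicated rule that only selects OSDs with the nvme device class
  ceph osd crush rule create-replicated fast-nvme default host nvme
  # assign the rule to the cache pool
  ceph osd pool set hot_storage crush_rule fast-nvme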
It seems like I have not touched on this topic in my videos, so I will probably create a video about it as well :)
Best regards
Daniel
How do I use an NVMe disk for cache?
Hi
Yes, you could, but don't. This video is now outdated: in later versions of Ceph, this feature is deprecated and will be removed in a future release.
My suggestion would be to use NVMe drives for pools with metadata and similar tasks, or to split each OSD across multiple drives. If you can have two NVMe devices and one normal drive per OSD, set up one NVMe for the database, one for the write-ahead log, and use the hard drive for storage. That gives you a good middle ground between performance and capacity.
If you need speed for the whole pool, then use only NVMe for all of these roles.
I hope this helps. Thank you for watching my videos.
Best regards
Daniel
Is it possible to use ONE cache pool for multiple pools with data?
Hi Piotr.
Thank you for watching my videos.
To my knowledge, you cannot do that. What you can do is use multiple pools and keep some of them on faster hardware, since they are usually relatively small; I'm thinking of metadata pools and similar. Then use cache pools for the larger ones where you only access some of the data regularly.
I hope this helps.
Best regards
Daniel
I really wish that ceph developers didn't seem hell-bent on removing/deprecating/telling everyone NOT to use this feature. After using several enterprise commercial SANs in my career that use both spinning rust and SSDs - because we're not billionaires and no, we can't afford all-flash, Pure! Stop sending me marketing crap for all-flash SANs! - anyway, it just seems crazy to me that ceph doesn't want to put more development hours into making cache tiering BETTER rather than warning everyone away from it. Why wouldn't I want a pool of SSDs capturing incoming writes at high speed, then demoting it down to the HDDs in the background? This functionality is built into every single SAN of the past 15 years that has multiple speeds of drive in them. Ceph is so good and so functional, and they have a tiering feature... yet they want people to stop using it, rather than improving it? I just don't get it.
Hi HeiseHeise.
If they have deprecated it, that would be really unfortunate. We are running it at work and are really happy with the setup. There are challenges that you need to overcome when it comes to scaling, but using it has its benefits.
I've read that Red Hat no longer recommends it for their enterprise customers, but the documentation doesn't mention deprecation. If you have any good sources that I could quote, I would love to look into this subject more.
Thanks for watching my videos.
Best regards
Daniel
Hi, can you help me fix some problems with Ceph on Proxmox? How can I contact you?
Hi Roberto
Yes, you may. There is an email address in the channel description for similar inquiries: hello@danielpersson.dev
Best regards
Daniel