Thanks Ravi! Great explanation
Thank you 🙏
Liquid clustering concepts are nicely explained with practical examples. Using the Databricks sample dataset makes it easy to follow along, and the comparison with Hive-style partitioning and Z-order techniques shows the performance impact of liquid clustering. Thank you for sharing :)
What a great explanation. Ravi, day by day the value of your presentations gets higher and higher. It would be great if you could share the notebook as well.
github.com/raveendratal/PysparkRaveendra/blob/master/Liquid%20Clustering.ipynb
The first table was created using partitionBy on origin while filtering on dayofWeek = 1, and the second table was clustered by dayofWeek with the same filter on dayofWeek = 1, so the partitioned table will obviously take more time. I agree that partitioning creates files based on the total number of partitions, and it would skip more files during reads if the table had been created using partitionBy on dayofWeek and filtered on that same column.
Partition By is not good for small tables. The old approach was partitioning plus OPTIMIZE with ZORDER BY. Instead of Partition By we can use Cluster By and then apply OPTIMIZE. There is no need for Partition By and ZORDER BY on tables smaller than 1 TB.
Cluster By is an alternative to Partition By and Z-ordering, and the recommended minimum table size for partitioning and Z-ordering is 1 TB. So does this mean we should not apply liquid clustering to tables smaller than 1 TB?
Totally agree with @jeetash1. If you want to correctly compare and benchmark partitionBy and CLUSTER BY, you should use the same column; otherwise the comparison doesn't make sense. If you created one table using partitionBy on dayofWeek and filtered on dayofWeek = 1, and a second table clustered by origin with the same filter on dayofWeek = 1, partitionBy would take less time.
@TRRaveendra If this comparison is between Hive partitioning + Z-order and clustering, then the keys for clustering should be (origin, dayofWeek), right? (ref: official documentation: Use liquid clustering for Delta tables)
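For an apples-to-apples benchmark like the one suggested above, both tables would need to use the same column. A minimal sketch in Databricks SQL (table and dataset names here are hypothetical, not from the video):

```sql
-- Hive-style partitioned table, partitioned on the filter column
CREATE TABLE flights_partitioned
PARTITIONED BY (dayofWeek)
AS SELECT * FROM samples_flights;

-- Liquid-clustered table, clustered on the SAME column
CREATE TABLE flights_clustered
CLUSTER BY (dayofWeek)
AS SELECT * FROM samples_flights;

-- Rewrite data files according to the clustering keys
OPTIMIZE flights_clustered;

-- Now the same predicate is a fair comparison on both tables
SELECT count(*) FROM flights_partitioned WHERE dayofWeek = 1;
SELECT count(*) FROM flights_clustered  WHERE dayofWeek = 1;
```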
I want personalized training from you. Could you please let me know about it?
Hi Ravi, is Photon acceleration enabled on your cluster?
No, OPTIMIZE was executed without a Photon cluster.
Sir, please share the code and also the dataset to practice with.
github.com/raveendratal/PysparkRaveendra/blob/master/Liquid%20Clustering.ipynb
Hi Ravi,
This video was of great use. I have one question: is it possible to convert an existing partitioned table with data to liquid clustering? If so, can you please suggest the steps?
As of now you can only enable liquid clustering through SQL table DDL, i.e., while creating a table: CREATE TABLE table_name (col ...) CLUSTER BY (col1, col2).
After that you can change the cluster-by columns with ALTER TABLE ....
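The DDL described above can be sketched as follows (table and column names are illustrative). Note that, as I understand the documentation, ALTER TABLE ... CLUSTER BY works on an unpartitioned Delta table; a table that is already Hive-partitioned typically has to be rewritten (e.g., via CTAS) to adopt liquid clustering:

```sql
-- Create a table with liquid clustering keys (hypothetical schema)
CREATE TABLE sales (id INT, region STRING, sale_date DATE)
CLUSTER BY (region, sale_date);

-- Change the clustering keys later
ALTER TABLE sales CLUSTER BY (sale_date);

-- Or turn clustering off entirely
ALTER TABLE sales CLUSTER BY NONE;

-- Rewrite existing data files according to the current keys
OPTIMIZE sales;
```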
Hello Rajesh,
Did you find an answer? Did you try directly applying clustering on the existing table? I was about to try it on one of my tables.
@TRRaveendra can you share the dataset link please?
It’s 📌 pinned in the comments.
Verify the link
Thank you, Sir! One question: will liquid clustering behave the same as Z-order for a non-partitioned table?
With partitionBy, why not use coalesce during the write so you end up with fewer files?
On implementing liquid clustering, when I run DESCRIBE DETAIL table_name I see the clustering columns. But when I insert data into the liquid-clustered table using dataframe.write and then execute the same DESCRIBE DETAIL, the clustering columns are lost. I ran OPTIMIZE but it didn't help. I'm on Databricks Runtime 13.2.
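One possible explanation, assuming the DataFrame writer was used in overwrite mode: on early runtimes the DataFrame writer had no notion of clustering keys, so an overwrite can replace the table definition and drop the clustering metadata. A hedged workaround sketch, appending through SQL so the existing table definition stays intact (table names hypothetical):

```sql
-- Clustering keys live in table metadata; check the clusteringColumns field
DESCRIBE DETAIL my_clustered_table;

-- Appending via SQL preserves the table's CLUSTER BY definition
INSERT INTO my_clustered_table SELECT * FROM staging_data;

-- Re-cluster the newly added files
OPTIMIZE my_clustered_table;
```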