truly an amazing video
Thank you 👍 Please make sure to share with your network over LinkedIn 🙂
Very nice, so far the best video for beginners on joins
thanks ❤️
Amazingly explained
Most expected video😊
Thank you
PySpark Coding Interview Questions and Answer of Top Companies
ua-cam.com/play/PLqGLh1jt697zXpQy8WyyDr194qoCLNg_0.html
Nice explanation
Thanks! Please make sure to share with your network on LinkedIn ❤️
How are 16 partitions (tasks) created when the partition size is 128 MB and we have only 94.8 MB of data?
Please explain.
Hello,
The number of partitions is not determined by the partition size alone; other factors, such as the file open cost and the number of available cores, also play a part.
Check out this article: blog.devgenius.io/pyspark-estimate-partition-count-for-file-read-72d7b5704be5
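To make the reply above concrete, here is a rough pure-Python sketch of how the partition count for a file read can be estimated. It follows my understanding of Spark's file split logic, using the documented defaults of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB). The function name, the single splittable 94.8 MB file, and the default parallelism of 16 are assumptions for illustration, not details confirmed by the video.

```python
MB = 1024 * 1024


def estimate_read_partitions(file_sizes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Estimate the partition count for a read of splittable files (hypothetical helper)."""
    # Every file is padded with the open cost before the per-core share is computed.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))

    # Cut every file into chunks of at most max_split_bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split_bytes, size - offset))
            offset += max_split_bytes

    # Pack chunks into partitions, largest first; each chunk also costs open_cost_bytes.
    chunks.sort(reverse=True)
    partitions = 0
    current_size = 0
    partition_open = False
    for chunk in chunks:
        if partition_open and current_size + chunk > max_split_bytes:
            partitions += 1
            current_size = 0
            partition_open = False
        current_size += chunk + open_cost_bytes
        partition_open = True
    if partition_open:
        partitions += 1
    return partitions


# Assumed setup: one 94.8 MB file read with 16 cores available.
print(estimate_read_partitions([int(94.8 * MB)], default_parallelism=16))
```

Under these assumptions the split size drops to roughly 6 MB (the padded data size divided by 16 cores), which is why a file smaller than 128 MB can still be read as 16 partitions.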
Good stuff. Can you provide me the dataset?
Thanks 👍 The datasets are huge and it's very difficult to upload them. However, you can find most of them at this GitHub URL:
github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets
If you like my content, please make sure to share it with your network on LinkedIn 👍 This helps a lot 💓
high cardinality --- bucketing and low cardinality --- partition?
Yes
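A small pure-Python sketch of why this rule of thumb holds: partitioning creates one directory per distinct value, while bucketing hashes every value into a fixed number of buckets. The helper names, the sample columns, and the use of zlib.crc32 as a stand-in hash are assumptions for illustration only (Spark uses its own hash function).

```python
import zlib


def partition_dirs(values):
    """partitionBy-style layout: one directory per distinct value."""
    return {f"col={value}" for value in values}


def bucket_ids(values, num_buckets):
    """bucketBy-style layout: every value hashes into a fixed set of buckets."""
    return {zlib.crc32(str(value).encode()) % num_buckets for value in values}


user_ids = range(100_000)            # high cardinality column (assumed sample)
countries = ["IN", "US", "DE"] * 10  # low cardinality column (assumed sample)

print(len(partition_dirs(user_ids)))   # one directory per id: far too many
print(len(bucket_ids(user_ids, 8)))    # never more than 8 buckets
print(len(partition_dirs(countries)))  # a handful of directories is fine
```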
How do I join a small table with a big table when I want to fetch all the data in the small table?
The small table has 100k records and the large table has 1 million records.
df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
It runs out of memory and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
df = largedf.join(broadcast(smalldf), smalldf.id == largedf.id, how='right') may work here
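A note on this thread, offered as a sketch rather than a definitive answer: the valid PySpark names for this join type are 'left', 'left_outer' and 'leftouter'. As far as I know, a broadcast hash join for an outer join builds its hash table from the side that is not preserved and streams the preserved side, which may be why a broadcast hint on the small, preserved table is not honoured. The toy function below is plain Python, not Spark code, and its names are hypothetical.

```python
from collections import defaultdict


def left_outer_hash_join(left_rows, right_rows, key):
    """Toy left outer join: hash the right side, stream the preserved left side."""
    # Build side: the right table goes into an in-memory hash table.
    build = defaultdict(list)
    for row in right_rows:
        build[row[key]].append(row)

    # Stream side: every left row is emitted, matched or not.
    joined = []
    for left in left_rows:
        matches = build.get(left[key])
        if matches:
            for right in matches:
                joined.append((left, right))
        else:
            joined.append((left, None))
    return joined


small = [{"id": 1}, {"id": 2}]
large = [{"id": 1, "v": "x"}, {"id": 1, "v": "y"}, {"id": 3, "v": "z"}]
for pair in left_outer_hash_join(small, large, "id"):
    print(pair)
```

Every row of the small table survives, with None where the large table has no match. An unmatched row can only be emitted by whoever streams the preserved side, so the hash table has to come from the other table.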
@23:03, only 4 tasks showed up here. Usually there would be 16 tasks given the actual config of the cluster, but only 4 tasks are used because the data was bucketed before reading. Is that correct?
Yes, bucketing restricts the number of tasks to avoid shuffling. So it's important to choose the number of buckets carefully.
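The idea in this reply can be sketched in plain Python: when both tables are bucketed on the join key with the same bucket count, matching keys land in the same bucket number, so each bucket pair can be joined by one task without moving rows between buckets. The bucket count of 4 mirrors the question, and zlib.crc32 is a stand-in for Spark's own hash; both are assumptions for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed, to mirror the 4 tasks seen in the question


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Stand-in bucket assignment: hash of the key modulo the bucket count."""
    return zlib.crc32(str(key).encode()) % num_buckets


def to_buckets(rows, key):
    """Group rows by the bucket their join key falls into."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key])].append(row)
    return buckets


def bucketed_join(left_rows, right_rows, key):
    """Inner join done bucket by bucket; returns the rows and the task count."""
    left_buckets = to_buckets(left_rows, key)
    right_buckets = to_buckets(right_rows, key)
    joined = []
    tasks = 0
    for bucket in range(NUM_BUCKETS):
        tasks += 1  # one task per bucket pair, no cross-bucket lookups
        lookup = defaultdict(list)
        for right in right_buckets.get(bucket, []):
            lookup[right[key]].append(right)
        for left in left_buckets.get(bucket, []):
            for right in lookup.get(left[key], []):
                joined.append({**left, **right})
    return joined, tasks


left = [{"id": i, "a": i} for i in range(10)]
right = [{"id": i, "b": i * 2} for i in range(5, 15)]
rows, tasks = bucketed_join(left, right, "id")
print(tasks, sorted(row["id"] for row in rows))
```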
Hello Subham, why didn't you cover Shuffle Hash Join practically here? As far as I can see, you have explained it only in theory.
Hello,
There is very little chance that someone will run into issues with Shuffle Hash Join. The majority of challenges come when you have to optimize Sort Merge Join, which is usually used for bigger datasets. And for smaller datasets, nowadays everyone prefers broadcasting.
@@easewithdata Suppose we don't choose any join behavior; do you mean that Shuffle Hash Join is the default join?
AQE would optimize and choose the best possible join
@@easewithdata Hello Subham, can you please do a session showing how we can use a Delta table (residing in the gold layer) for Power BI reporting, or import it into Power BI?
@@alishmanvar8592 Save the table in Delta format, open Power BI, load that file and do your visualisation
Hi,
I have noticed that you use "noop" to perform an action. Any particular reason to not use ".show()" or .display()?
Hello,
show and display don't process the complete dataset. The best way to trigger the complete dataset is to use count or write. And for write we use the noop format.
This was already explained in earlier videos of the series. Have a look.
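For reference, a minimal sketch of the noop write mentioned above. The function name is hypothetical; df is assumed to be an existing PySpark DataFrame.

```python
def trigger_full_evaluation(df):
    """Run the full plan for df without storing any output (hypothetical helper)."""
    # The "noop" format processes every row and discards the result,
    # whereas show() only needs enough rows to fill its preview.
    df.write.format("noop").mode("overwrite").save()
```

Calling trigger_full_evaluation(df) after a join makes the whole job appear in the Spark UI, which is what makes it useful for benchmarking.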
Bucketing can't be applied when the data resides in a Delta Lake table, right?
Delta Lake tables don't support bucketing. Please avoid using it with Delta Lake tables. Try other optimizations such as Z-ordering when working with Delta Lake tables.
@@easewithdata So, in real-world projects, does bucketing need to be applied on RDBMS tables or files?
@@svsci323 On DataFrames and Datasets