22. Databricks | Spark | Performance Optimization | Repartition vs Coalesce

  • Published Sep 11, 2024

COMMENTS • 83

  • @Aramakishore
    @Aramakishore 2 years ago +3

    I have never seen any video elaborated like this.. Really appreciate you.. It is explained very clearly

  • @Akshaykumar-pu4vi
    @Akshaykumar-pu4vi 2 years ago +5

    Follow this playlist, it is tremendous sir, and you explain concepts in a very good way. Thank you sir.

  • @mynamesathish
    @mynamesathish 3 years ago +6

    Nice explanation! In the mentioned example I can see that repartition(2) created partitions of unequal size (one with 8 records and another with 2 records), but I expected them to be of almost equal size.

    • @riyazalimohammad633
      @riyazalimohammad633 2 years ago +1

      @Sathish I had the same doubt when watching the video. repartition(2) created partitions of unequal size, but coalesce(2) had partitions with 5 records each. Got me confused.
      @Raja sir, please clarify the same.

    • @rajasdataengineering7585
      @rajasdataengineering7585  2 years ago +7

      @@riyazalimohammad633 Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video) whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible-size) dataset, so we cannot see that here. But when we work on real big data projects, the difference is very evident. Thanks for your comment

    • @rajasdataengineering7585
      @rajasdataengineering7585  2 years ago +3

      @Sathish, sorry for the late reply. Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video) whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible-size) dataset, so we cannot see that here. But when we work on real big data projects, the difference is very evident. Thanks for your comment

    • @riyazalimohammad633
      @riyazalimohammad633 2 years ago +4

      @@rajasdataengineering7585 Thank you for your prompt response! Much appreciated.

    • @somesh512
      @somesh512 1 year ago +1

      I just watched the video and had the exact same doubt. But Raja sir already provided the answer

  • @avinash1722
    @avinash1722 8 days ago +1

    Very informative. Way better than paid courses

  • @gurumoorthysivakolunthu9878
    @gurumoorthysivakolunthu9878 1 year ago +1

    Great, Sir...
    1. What is the maximum value that can be set for maxPartitionBytes?
    2. What parameters should be considered to decide the partition bytes and the repartition count?
    Thank you, Sir...

  • @vipinkumarjha5587
    @vipinkumarjha5587 3 years ago +2

    Very nice video sir, it cleared all my basic doubts about partitioning. Hope to see videos on optimization approaches like cache, persist, Z-order. Thanks again

    • @rajasdataengineering7585
      @rajasdataengineering7585  3 years ago

      Thank you Vipin. Sure, will post videos on optimization concepts such as cache, persist, Z-order in Delta, etc.

  • @arindamghosh3787
    @arindamghosh3787 1 year ago +1

    This is the video I was searching for.. thanks a lot ❤

  • @shaileshsondawale2811
    @shaileshsondawale2811 1 year ago

    Wow.. Wonderful delivery sir...!!!! Wonderful content

  • @mrpoola49
    @mrpoola49 1 year ago +1

    That was amazingly explained! You rock!

  • @AIFashionistaGuide
    @AIFashionistaGuide 1 year ago +5

    ****************************** 1. Performance Tuning *****************************************
    1. Performance Optimization | Repartition vs Coalesce
    --Spark is known for its speed. That speed comes from parallel computing, and parallel computing comes from partitioning.
    --Partitioning is the key to parallel processing.
    --If we design the partitions well, performance improves automatically.
    --Hence partitioning plays an important role in error handling, debugging, and performance.
    --While partitioning we must get two things right:
    1. The right partition size
    --Scenario: 2 partitions of 1000 MB and 10 MB. The core with the 10 MB partition finishes quickly and then sits idle, which is not good.
    2. The right number of partitions
    --Scenario: we have a 16-core executor but only 10 partitions.
    Then:
    1. Out of 16 cores, only 10 pick up a partition each. Partitions cannot be shared among cores, so 6 cores remain idle. Hence the right number of partitions must be chosen.
    2. Choose 16 partitions, or at least a multiple of the available cores. In the 1st iteration all 16 cores pick up 16 partitions, and in the 2nd iteration they pick up the next 16, so no core sits idle.
    spark.default.parallelism
    spark.default.parallelism was introduced with RDDs, so this property applies only to RDDs. Its default value is the total number of cores on all nodes in the cluster; in local mode it is the number of cores on your machine. For RDDs, wide transformations like reduceByKey(), groupByKey(), and join() trigger data shuffling.
    On this cluster the value is 8, so 8 partitions are created by default.
    spark.sql.files.maxPartitionBytes
    When data is read from external files, partitions are created based on this parameter: the maximum number of bytes to pack into a single partition when reading files. This configuration is effective only for file-based sources such as Parquet, JSON, and ORC.
    The default size is 128 MB.
    Both parameters are configurable depending on your needs.
    DataFrame.repartition()
    pyspark.sql.DataFrame.repartition() is used to increase or decrease the number of RDD/DataFrame partitions, by a partition count, a single column name, or multiple column names. This function takes 2 parameters, numPartitions and *cols; when one is specified the other is optional. repartition() is a wide transformation that involves shuffling the data, hence it is considered an expensive operation.
    Key points
    • repartition() is used to increase or decrease the number of partitions.
    • repartition() creates more even partitions than coalesce().
    • It is a wide transformation.
    • It is an expensive operation, as it involves a data shuffle and consumes more resources.
    • repartition() can take an int or column names as parameters to define how to perform the partitioning.
    • If no parameters are specified, it uses the default number of partitions.
    • As part of performance optimization, avoid using this function unnecessarily.
    coalesce()
    --Spark DataFrame coalesce() is used only to decrease the number of partitions.
    --It is an optimized or improved version of repartition(): the movement of data across partitions is lower with coalesce().
    --coalesce() does not require a full shuffle; it combines a few partitions, or shuffles data from only a few partitions, thus avoiding a full shuffle.
    --Due to the partition merge, it produces partitions of uneven size.

  • @vutv5742
    @vutv5742 4 months ago +1

    Great explanation...🎉🎉🎉

  • @lokeshv4348
    @lokeshv4348 10 months ago +1

    At 5:30 there is a mention that snappy and gzip are both not splittable. But snappy is splittable and can have partitions.

    • @rajasdataengineering7585
      @rajasdataengineering7585  10 months ago +1

      Not all snappy files are splittable. Snappy with Parquet/Avro is splittable, but snappy with JSON is not.
      We can't generalise that all snappy files are splittable or non-splittable

  • @gulsahtanay2341
    @gulsahtanay2341 6 months ago +1

    Very helpful content, thank you!

  • @varun8952
    @varun8952 2 years ago +1

    Very detailed explanation, sir.

  • @phanisrikrishna
    @phanisrikrishna 1 year ago +2

    Hi sir, I was looking for a complete PySpark series with more emphasis on the architecture and its components.
    I am having a good time learning with your YouTube series on PySpark.
    I was wondering if I could get the slides for this course, which would help me refer back quickly when attending interviews.

  • @vydudraksharam5960
    @vydudraksharam5960 1 year ago +1

    Raja sir, very well explained with examples. I would like to know: in the pictures you have shown 2 executors for repartition and coalesce, but in the same picture's output you labelled both as executor 1. Is that a mistake, or did I not understand properly? Could you please clarify? This difference is there in both slides. -- Thank you, Vydu

  • @vidhyalakshmiparthasarathy8573

    Thank you so much sir for making such great videos. I'm learning a lot of nuances and best practices for practical applications.😊🙏

  • @sameludhanaraj
    @sameludhanaraj 4 months ago +1

    Well explained. Thanks

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Really nice 👍

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Wow amazing

  • @ririraman7
    @ririraman7 2 years ago +1

    awesome tutorial

  • @vedantbopardikar3507
    @vedantbopardikar3507 6 months ago +1

    All credits to you sir

  • @vishalaaa1
    @vishalaaa1 1 year ago +1

    excellent

  • @gauthamn2844
    @gauthamn2844 5 months ago

    It was a good session. Is there any keyword to indicate whether a repartition will increase or decrease the partitions? Because with repartition(20), how will we know whether it increased or decreased? Only after execution will we come to know.

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Helpful tips

  • @bollywoodbadshah.796
    @bollywoodbadshah.796 1 month ago +1

    Please make a video on liquid clustering..

  • @maurifkhan3029
    @maurifkhan3029 1 year ago

    QQ: will the change to the default partition size apply at the cluster level, or only for the notebook? If other jobs are running on the cluster, will they also be impacted by the change in settings?

  • @ayushiagarwal528
    @ayushiagarwal528 6 months ago

    In the example, repartition produces an uneven output for 2 partitions but coalesce produces an even result. Please explain??

  • @avisinha2844
    @avisinha2844 1 year ago +1

    Hello sir, I have a small doubt: when we supply 3 separate files to a single df at 14:03, why is the number of partitions 3, when the default partition size is 128 MB and the df containing the 3 files is a lot smaller than 128 MB?

  • @da8233
    @da8233 1 year ago +1

    Thank you so much, it's a wonderful explanation

  • @raghavendarsaikumar
    @raghavendarsaikumar 1 year ago +1

    I see executors 1 and 2 in the picture before coalesce or repartition, but after the operation I see both of them labelled executor 1. Is this pictorially wrong, or does this operation reduce the number of executors as well?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Good catch. It's a pictorial mistake. Repartition or coalesce has nothing to do with the number of executors

  • @vamsi.reddy1100
    @vamsi.reddy1100 1 year ago +1

    One doubt...!
    When we used repartition(2), we got unevenly distributed partitions, i.e. 8 records in the 1st partition and 2 in the other.
    But repartition should give us evenly distributed partitions, right? Please help me understand.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Vamsi, good question.
      Data does get evenly distributed by repartition. Here we see some difference because of the small dataset: from Spark's point of view, 2 rows or 8 rows are almost the same. We can see the difference between repartition and coalesce when dealing with huge amounts of data, like millions or billions of rows

    • @vamsi.reddy1100
      @vamsi.reddy1100 1 year ago +1

      @@rajasdataengineering7585 thank you for the clarification..

    • @vamsi.reddy1100
      @vamsi.reddy1100 1 year ago +1

      @@rajasdataengineering7585 your videos are so good...

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Thank you

    • @amiyaroy6789
      @amiyaroy6789 2 months ago

      @@rajasdataengineering7585 had the same question, thank you for explaining!

  • @robinshaw4641
    @robinshaw4641 9 months ago

    In a real-time scenario, when will we use coalesce and when repartition?

  • @CoopmanGreg
    @CoopmanGreg 1 year ago +1

    👍

  • @kalyanreddy496
    @kalyanreddy496 1 year ago

    Good evening. I recently came across a question in a Capgemini client interview. Consider a scenario: a 2 GB file is distributed in Hadoop. After doing some transformations we got 10 DataFrames. By applying repartition(1), all the data sits in one DataFrame of size 1.8 GB, but your data node size is only 1 GB. Will this 1.8 GB fit in the data node or not? If yes, how? If no, what error will be given?
    Requesting you sir, please tell me the answer to this question

  • @suresh.suthar.24
    @suresh.suthar.24 1 year ago +1

    Hello Raja sir, a few days ago I gave an interview in which they asked: if we want to create 1 partition from multiple partitions, which method would you choose, coalesce or repartition? I answered coalesce, but they said we would use repartition. Is that correct?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago +2

      Hi Suresh,
      In this case the number of partitions needs to be reduced. Coalesce and repartition can both be used to reduce the number of partitions, but choosing one of them depends highly on the use case. So you should have asked the interviewer for more input to understand the use case better. If many transformations would be applied after resizing the partitions, repartition would be the better choice; otherwise coalesce is the better choice

    • @suresh.suthar.24
      @suresh.suthar.24 1 year ago +1

      @@rajasdataengineering7585 thanks 🙏

  • @kalyanreddy496
    @kalyanreddy496 1 year ago +1

    Good afternoon sir
    Requesting you to answer this question, which I recently faced in an interview, sir please
    Consider that you have read a 1 GB file into a DataFrame.
    The max partition bytes configuration is set to 128 MB.
    You have applied repartition(4) or coalesce(4) on the DataFrame; either method will decrease the number of partitions. If you apply repartition(4) or coalesce(4), the partition size increases beyond 128 MB, but maxPartitionBytes is configured to 128 MB. Does it throw an error or not? If it throws an error, what error will we get when we execute the program? If not, what is Spark's behaviour in this scenario?
    Could you tell me the answer to this question sir. I recently faced this question. Requesting you sir please

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      The configuration maxPartitionBytes plays a role only while ingesting data from an external system into Spark memory. Once data is loaded into Spark memory, the partition size can vary according to various transformations and has nothing to do with maxPartitionBytes.
      So in this case it won't throw any error. Coalesce would produce unevenly distributed partitions, whereas repartition would create evenly distributed partitions in this case.
      Hope it clarifies your doubts.
      Thanks for sharing your interview experience; others in this community can benefit
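      The read-time arithmetic behind the 1 GB / 128 MB scenario can be sketched in plain Python. This is a rough estimate only, not Spark's exact planner logic (which also weighs spark.sql.files.openCostInBytes and the default parallelism); the helper name is made up for illustration:

```python
import math

def estimated_read_partitions(file_size_bytes: int,
                              max_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough count of read partitions for a single splittable file."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

# The interview scenario: a 1 GB file read with the 128 MB default.
print(estimated_read_partitions(1 * 1024**3))  # 8
```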

    • @kalyanreddy496
      @kalyanreddy496 1 year ago +1

      @@rajasdataengineering7585 thank you very much sir. I understand. If possible, please do a video on this question sir, so we get more understanding visually. If possible please do it sir 🙏

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Sure Kalyan, will create a video on this requirement

  • @a2zhi976
    @a2zhi976 1 year ago +1

    In the code I see sc.parallelize(range(100), 1); where is the reference for sc?

  • @tarunpothala2071
    @tarunpothala2071 1 year ago

    Hi sir, it was a great explanation and good to see the practical implementation of it. But the only question is: theoretically it was said that repartition will evenly distribute the data and coalesce will unevenly distribute it. When it was practically implemented, I saw the opposite result: coalesce gave evenly distributed values in two partitions but repartition didn't. Can you please check?

  • @shreyanvinjamuri
    @shreyanvinjamuri 1 year ago

    sc.defaultParallelism is for RDDs and will only work with RDDs? spark.sql.shuffle.partitions was introduced with DataFrames and it only works with DataFrames?