22. Databricks | Spark | Performance Optimization | Repartition vs Coalesce

  • Published Sep 11, 2024

COMMENTS • 83

  • @Aramakishore
    @Aramakishore 2 years ago +3

    I have never seen any video elaborated like this.. Really appreciate you.. It is explained very clearly

  • @Akshaykumar-pu4vi
    @Akshaykumar-pu4vi 2 years ago +5

    Follow this playlist, it is tremendous sir, and you explain concepts in a very good way. Thank you sir.

  • @mynamesathish
    @mynamesathish 3 years ago +6

    Nice explanation! In the mentioned example I can see that repartition(2) created partitions of unequal size (one with 8 records and another with 2 records), but I expected them to be of almost equal size.

    • @riyazalimohammad633
      @riyazalimohammad633 2 years ago +1

      @Sathish I had the same doubt when watching the video. repartition(2) created partitions of unequal size, but coalesce(2) had partitions with 5 records each. Got me confused.
      @Raja sir, please clarify the same.

    • @rajasdataengineering7585
      @rajasdataengineering7585  2 years ago +7

      @@riyazalimohammad633 Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video) whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible-size) dataset, so we cannot see that here. But when we work on real big data projects, the difference is very evident. Thanks for your comment

    • @rajasdataengineering7585
      @rajasdataengineering7585  2 years ago +3

      @Sathish, sorry for the late reply. Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video) whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible-size) dataset, so we cannot see that here. But when we work on real big data projects, the difference is very evident. Thanks for your comment

    • @riyazalimohammad633
      @riyazalimohammad633 2 years ago +4

      @@rajasdataengineering7585 Thank you for your prompt response! Much appreciated.

    • @somesh512
      @somesh512 1 year ago +1

      I just watched the video and had the exact same doubt. But Raja sir already provided the answer

  • @avinash1722
    @avinash1722 8 days ago +1

    Very informative. Way better than paid courses

  • @gurumoorthysivakolunthu9878
    @gurumoorthysivakolunthu9878 1 year ago +1

    Great, Sir...
    1. What is the maximum value that can be set for maxPartitionBytes?
    2. What parameters should be considered to decide the partition bytes and the repartition count?
    Thank you, Sir...

  • @vipinkumarjha5587
    @vipinkumarjha5587 3 years ago +2

    Very nice video sir, it cleared all my basic doubts about partitioning. Hope to see videos on optimization approaches like cache, persist, Z-order. Thanks again

    • @rajasdataengineering7585
      @rajasdataengineering7585  3 years ago

      Thank you Vipin. Sure, will post videos on optimization concepts such as cache, persist, Z-order in Delta, etc.

  • @arindamghosh3787
    @arindamghosh3787 1 year ago +1

    This is the video I was searching for.. thanks a lot ❤

  • @shaileshsondawale2811
    @shaileshsondawale2811 1 year ago

    Wow.. Wonderful delivery sir...!!!! Wonderful content

  • @mrpoola49
    @mrpoola49 1 year ago +1

    That was amazingly explained! You rock!

  • @AIFashionistaGuide
    @AIFashionistaGuide 1 year ago +5

    ****************************** 1. Performance Tuning *****************************************
    1. Performance Optimization | Repartition vs Coalesce
    --Spark is known for its speed. That speed comes from parallel computing, and parallel computing comes from partitioning.
    --Partitioning is the key to parallel processing.
    --If we design the partitions well, performance improves automatically.
    --Hence partitioning plays an important role in error handling, debugging, and performance.
    --While partitioning we must get two things right:
    1. The right partition size
    --Scenario: 2 partitions of 1000 MB and 10 MB. The core with the 10 MB partition finishes quickly and then sits idle, which is not good.
    2. The right number of partitions
    --Scenario: we have a 16-core executor but only 10 partitions.
    Then:
    1. Out of 16 cores, only 10 pick up a partition each. Partitions cannot be shared among cores, so 6 cores remain idle. Hence the right number of partitions must be chosen.
    2. Choose 16 partitions, or at least a multiple of the available cores. In the 1st iteration all 16 cores pick up 16 partitions, and in the 2nd iteration they pick up the next 16, so no core sits idle.
    spark.default.parallelism
    spark.default.parallelism was introduced with RDDs, so this property applies only to RDDs. Its default value is the total number of cores on all nodes in the cluster; in local mode it is the number of cores on your machine. For RDDs, wide transformations like reduceByKey(), groupByKey(), and join() trigger data shuffling.
    On this cluster the value is 8, so 8 partitions are created by default.
    spark.sql.files.maxPartitionBytes
    When data is read from external files, partitions are created based on this parameter: the maximum number of bytes to pack into a single partition when reading files. This configuration is effective only for file-based sources such as Parquet, JSON, and ORC.
    The default size is 128 MB.
    Both parameters are configurable depending on your needs.
    DataFrame.repartition()
    pyspark.sql.DataFrame.repartition() is used to increase or decrease the number of RDD/DataFrame partitions, by a partition count, a single column name, or multiple column names. This function takes 2 parameters, numPartitions and *cols; when one is specified the other is optional. repartition() is a wide transformation that involves shuffling the data, hence it is considered an expensive operation.
    Key points
    • repartition() is used to increase or decrease the number of partitions.
    • repartition() creates more even partitions than coalesce().
    • It is a wide transformation.
    • It is an expensive operation, as it involves a data shuffle and consumes more resources.
    • repartition() can take an int or column names as parameters to define how to perform the partitioning.
    • If no parameters are specified, it uses the default number of partitions.
    • As part of performance optimization, avoid using this function unnecessarily.
    coalesce()
    --Spark DataFrame coalesce() is used only to decrease the number of partitions.
    --It is an optimized or improved version of repartition(): the movement of data across partitions is lower with coalesce().
    --coalesce() does not require a full shuffle; it combines a few partitions, or shuffles data from only a few partitions, thus avoiding a full shuffle.
    --Due to the partition merge, it produces partitions of uneven size.

  • @vutv5742
    @vutv5742 4 months ago +1

    Great explanation...🎉🎉🎉

  • @lokeshv4348
    @lokeshv4348 10 months ago +1

    At 5:30 there is a mention that snappy and gzip are both not splittable. But snappy is splittable and can have partitions.

    • @rajasdataengineering7585
      @rajasdataengineering7585  10 months ago +1

      Not all snappy files are splittable. Snappy with Parquet/Avro is splittable, but snappy with JSON is not.
      We can't generalise that all snappy files are splittable or non-splittable

  • @gulsahtanay2341
    @gulsahtanay2341 6 months ago +1

    Very helpful content, thank you!

  • @varun8952
    @varun8952 2 years ago +1

    Very detailed explanation, sir.

  • @phanisrikrishna
    @phanisrikrishna 1 year ago +2

    Hi sir, I was looking for a complete PySpark series with more emphasis on the architecture and its components.
    I am having a good time learning with your YouTube series on PySpark.
    I was wondering if I could get the slides for this course, which would help me refer back quickly when attending interviews.

  • @vydudraksharam5960
    @vydudraksharam5960 1 year ago +1

    Raja sir, very well explained with examples. I would like to know: in the pictures you have shown 2 executors for repartition and coalesce, but in the same picture's output you labelled both as executor 1. Is that a mistake, or did I not understand properly? Could you please clarify? This difference is there in both slides. -- Thank you, Vydu

  • @vidhyalakshmiparthasarathy8573

    Thank you so much sir for making such great videos. I'm learning a lot of nuances and best practices for practical applications.😊🙏

  • @sameludhanaraj
    @sameludhanaraj 4 months ago +1

    Well explained. Thanks

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Really nice 👍

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Wow amazing

  • @ririraman7
    @ririraman7 2 years ago +1

    awesome tutorial

  • @vedantbopardikar3507
    @vedantbopardikar3507 6 months ago +1

    All credits to you sir

  • @vishalaaa1
    @vishalaaa1 1 year ago +1

    excellent

  • @gauthamn2844
    @gauthamn2844 5 months ago

    It was a good session. Is there any keyword to indicate whether a repartition will increase or decrease the partitions? Because with repartition(20), how will we know whether it increased or decreased? Only after execution will we come to know.

  • @kamalbhallachd
    @kamalbhallachd 3 years ago +1

    Helpful tips

  • @bollywoodbadshah.796
    @bollywoodbadshah.796 1 month ago +1

    Please make a video on liquid clustering..

  • @maurifkhan3029
    @maurifkhan3029 1 year ago

    QQ: will the change to the default partition size apply at the cluster level, or only for the notebook? If other jobs are running on the cluster, will they also be impacted by the change in settings?

  • @ayushiagarwal528
    @ayushiagarwal528 6 months ago

    In the example, repartition produces an uneven output for 2 partitions but coalesce produces an even result. Please explain??

  • @avisinha2844
    @avisinha2844 1 year ago +1

    Hello sir, I have a small doubt: when we supply 3 separate files to a single df at 14:03, why is the number of partitions 3, when the default partition size is 128 MB and the df containing the 3 files is a lot smaller than 128 MB?

  • @da8233
    @da8233 1 year ago +1

    Thank you so much, it's a wonderful explanation

  • @raghavendarsaikumar
    @raghavendarsaikumar 1 year ago +1

    I see executors 1 and 2 in the picture before coalesce or repartition, but after the operation I see both of them labelled executor 1. Is this pictorially wrong, or does this operation reduce the number of executors as well?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Good catch. It's a pictorial mistake. Repartition or coalesce has nothing to do with the number of executors

  • @vamsi.reddy1100
    @vamsi.reddy1100 1 year ago +1

    One doubt...!
    When we used repartition(2), we got unevenly distributed partitions, i.e. 8 records in the 1st partition and 2 in the other.
    But repartition should give us evenly distributed partitions, right? Please help me understand.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Vamsi, good question.
      Data does get evenly distributed by repartition. Here we see some difference because of the small dataset: from Spark's point of view, 2 rows or 8 rows are almost the same. We can see the difference between repartition and coalesce when dealing with huge amounts of data, like millions or billions of rows

    • @vamsi.reddy1100
      @vamsi.reddy1100 1 year ago +1

      @@rajasdataengineering7585 thank you for the clarification..

    • @vamsi.reddy1100
      @vamsi.reddy1100 1 year ago +1

      @@rajasdataengineering7585 your videos are so good...

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Thank you

    • @amiyaroy6789
      @amiyaroy6789 2 months ago

      @@rajasdataengineering7585 had the same question, thank you for explaining!

  • @robinshaw4641
    @robinshaw4641 9 months ago

    In a real-time scenario, when will we use coalesce and when repartition?

  • @CoopmanGreg
    @CoopmanGreg 1 year ago +1

    👍

  • @kalyanreddy496
    @kalyanreddy496 1 year ago

    Good evening. I recently came across a question in a Capgemini client interview. Consider a scenario: a 2 GB file is distributed in Hadoop. After doing some transformations we got 10 DataFrames. By applying repartition(1), all the data sits in one DataFrame of size 1.8 GB, but your data node size is only 1 GB. Will this 1.8 GB fit in the data node or not? If yes, how? If no, what error will be given?
    Requesting you sir, please tell me the answer to this question

  • @suresh.suthar.24
    @suresh.suthar.24 1 year ago +1

    Hello Raja sir, a few days ago I gave an interview in which they asked: if we want to create 1 partition from multiple partitions, which method would you choose, coalesce or repartition? I answered coalesce, but they said we would use repartition. Is that correct?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago +2

      Hi Suresh,
      In this case the number of partitions needs to be reduced. Coalesce and repartition can both be used to reduce the number of partitions, but choosing one of them depends highly on the use case. So you should have asked the interviewer for more input to understand the use case better. If many transformations would be applied after resizing the partitions, repartition would be the better choice; otherwise coalesce is the better choice

    • @suresh.suthar.24
      @suresh.suthar.24 1 year ago +1

      @@rajasdataengineering7585 thanks 🙏

  • @kalyanreddy496
    @kalyanreddy496 1 year ago +1

    Good afternoon sir
    Requesting you to answer this question, which I recently faced in an interview, sir please
    Consider that you have read a 1 GB file into a DataFrame.
    The max partition bytes configuration is set to 128 MB.
    You have applied repartition(4) or coalesce(4) on the DataFrame; either method will decrease the number of partitions. If you apply repartition(4) or coalesce(4), the partition size increases beyond 128 MB, but maxPartitionBytes is configured to 128 MB. Does it throw an error or not? If it throws an error, what error will we get when we execute the program? If not, what is Spark's behaviour in this scenario?
    Could you tell me the answer to this question sir. I recently faced this question. Requesting you sir please

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      The configuration maxPartitionBytes plays a role only while ingesting data from an external system into Spark memory. Once data is loaded into Spark memory, the partition size can vary according to various transformations and has nothing to do with maxPartitionBytes.
      So in this case it won't throw any error. Coalesce would produce unevenly distributed partitions, whereas repartition would create evenly distributed partitions in this case.
      Hope it clarifies your doubts.
      Thanks for sharing your interview experience; others in this community can benefit
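      The read-time arithmetic behind the 1 GB / 128 MB scenario can be sketched in plain Python. This is a rough estimate only, not Spark's exact planner logic (which also weighs spark.sql.files.openCostInBytes and the default parallelism); the helper name is made up for illustration:

```python
import math

def estimated_read_partitions(file_size_bytes: int,
                              max_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough count of read partitions for a single splittable file."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

# The interview scenario: a 1 GB file read with the 128 MB default.
print(estimated_read_partitions(1 * 1024**3))  # 8
```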

    • @kalyanreddy496
      @kalyanreddy496 1 year ago +1

      @@rajasdataengineering7585 thank you very much sir. I understand. If possible, please do a video on this question sir, so we get more understanding visually. If possible please do it sir 🙏

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Sure Kalyan, will create a video on this requirement

  • @a2zhi976
    @a2zhi976 1 year ago +1

    In the code I see sc.parallelize(range(100), 1); where is the reference for sc?

  • @tarunpothala2071
    @tarunpothala2071 1 year ago

    Hi sir, it was a great explanation and good to see the practical implementation of it. But the only question is: theoretically it was said that repartition will evenly distribute the data and coalesce will unevenly distribute it. When it was practically implemented, I saw the opposite result: coalesce gave evenly distributed values in two partitions but repartition didn't. Can you please check?

  • @shreyanvinjamuri
    @shreyanvinjamuri 1 year ago

    sc.defaultParallelism is for RDDs and will only work with RDDs? spark.sql.shuffle.partitions was introduced with DataFrames and it only works with DataFrames?