Spark Join Without Shuffle | Spark Interview Question

  • Published 16 Jan 2025
  • #Spark #Join #Internals #Performance #optimization #DeepDive #Shuffle: In this video, we discuss how to perform a join without a shuffle.
    About us:
    We are a technology consulting and training provider specializing in areas such as Machine Learning, AI, Spark, Big Data, NoSQL, graph databases, Cassandra, and the Hadoop ecosystem.
    Mastering Spark : • Spark Scenario Based I...
    Mastering Hive : • Mastering Hive Tutoria...
    Spark Interview Questions : • Cache vs Persist | Spa...
    Mastering Hadoop : • Hadoop Tutorial | Map ...
    Visit us :
    Email: techwithviresh@gmail.com
    Facebook : / tech-greens

COMMENTS • 29

  • @shivrajsingh5559 3 years ago +2

    That's what I was looking for. It's a great help, Viresh.

  • @mrkrish501 4 years ago +1

    I'm really happy with your deep dive into Spark. Thank you.

  • @gemini_537 3 years ago +2

    small2 is not defined. Also, why is the shuffle cost of partitioning the two RDDs separately lower than the shuffle cost of joining them directly? Both basically do the same thing: move the rows with the same join key to the same executor.

  • @gemini_537 3 years ago +2

    I feel the title is misleading: repartitioning the two RDDs involves a shuffle.

  • @MohitKumar-st3ms 4 years ago +3

    Let's say you have two large DataFrames. How would you optimize the join? And why are you using RDDs, which are very slow compared to DataFrames?

  • @adamantnams 4 years ago +1

    Any suggestions for DataFrames?

  • @Trip-Train 1 year ago

    Why are you converting the DataFrame to an RDD? It is very bad practice in terms of performance.

  • @gemini_537 3 years ago +1

    What's the benefit of persisting the two RDDs?

  • @SpiritOfIndiaaa 4 years ago +1

    Thanks, Viresh. RDDs are used here; how do we do the same with a Dataset/DataFrame? And where did "small2" come from?

  • @SpiritOfIndiaaa 4 years ago +2

    Really nice, thanks bro. In line 14, should it be "small.partition.get" instead of "small2.partition.get"? And why is shuffle.partitions set to only 2?

    • @TechWithViresh 4 years ago

      Otherwise the remaining 198 partitions would be empty.

    • @SpiritOfIndiaaa 4 years ago

      @TechWithViresh Did you mean "otherwise" or "in other words"? Do we want to keep 198 partitions empty?

  • @naveenkumar-tb1de 4 years ago +1

    I have been asked this: if I have two tables with the same volume of data, but one has 10 columns and the other has 3 columns, how do I optimize the join?

  • @gemini_537 3 years ago

    What's the book/picture in the video?

  • @monku1821 3 years ago +1

    I have been following the series and it's pretty good, but this video is not at all clear. You should make another one on the same question.

  • @Mryajivramuk 3 years ago

    The concept is really worth testing.
    The code is incomplete in places.
    I took time to fill the gaps.
    The last line, display(): will it work in Scala Spark? 🙄

    • @TechWithViresh 3 years ago

      This code will run fine on Azure Databricks.

  • @IndianCoupleinUKBLR 4 years ago

    Where did small2 come from? There are typos; can you please update it?

  • @keyaar3393 3 years ago

    Shuffling during the join versus repartitioning before the join: you are saying the second one is better, right? What's the difference? You have not mentioned why it is better. Something has to take care of the repartitioning: either the join will shuffle or we have to repartition ourselves. That's fine, but please let us know why this approach is better.

  • @rishigc 4 years ago

    Even with repartitioning we have to move data to different partitions, which causes a shuffle, doesn't it?

  • @shankargs7685 4 years ago +1

    partition.get is returning None for largeRDD at line 14.

  • @rohinirithe1522 4 years ago

    I am getting an error for line 14:
    error: value partitioner is not a member of org.apache.spark.sql.DataFrame
    Kindly suggest a fix.

  • @saurabhgarud6690 4 years ago +1

    Very nice content on this channel, thanks for that. Q: Can range partitioning work here?

  • @dipanjansaha6824 4 years ago +1

    How to connect with you?

  • @sagarrawal7740 1 year ago

    The video recommendations at the end are blocking the content...

  • @dheerendrakumarjain6672 3 years ago

    Your example is not up to the mark, and what you describe in the lecture is not understandable; it seems the video was made only for the sake of making a video. I did not get your point about how the join happens and what happens during it. Please explain it in a much more understandable manner.