Apache Spark Joins for Optimization | PySpark Tutorial

  • Published 27 Oct 2024

COMMENTS • 5

  • @KiranJadhav-pu8gi 11 months ago +1

    Nice

    • @ampcode 9 months ago

      Thank you so much! Subscribe for more content 😊

  • @isharkpraveen 1 month ago +1

    Where is the code? Could you at least show a demo?

  • @ahmedaly6999 6 months ago

    How do I join a small table with a big table while keeping all of the data from the small table? The small table is 100k records and the large table is 1 million records:
    df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
    It runs out of memory, and I can't broadcast the small df; I don't know why. What is the best approach here? Please help.

    • @manishshaw1002 5 months ago

      Ideally, the broadcast join has a default configuration that broadcasts the smaller df (which should be less than or equal to 10MB), so if you are getting an error, change your spark-submit config to adjust the broadcast size and it might work. Also, your code doesn't actually broadcast the smaller df; it should be something like df.join(broadcast(smallerdf), smallerdf.id == df.id, "left_outer").
      You can increase spark.sql.autoBroadcastJoinThreshold up to your big table's size (by default it is 10MB); below that threshold a broadcast hash join will be performed.
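
A minimal PySpark sketch of the advice in the last reply, for reference: it raises spark.sql.autoBroadcastJoinThreshold and adds an explicit broadcast() hint. The file paths, the id join column, and the 50MB threshold are illustrative assumptions, not taken from the video or the thread.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

    # Raise the auto-broadcast threshold (default 10MB, value in bytes) so
    # Spark may pick a broadcast hash join on its own for a bigger "small" table.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

    small_df = spark.read.parquet("/data/small")  # hypothetical paths
    large_df = spark.read.parquet("/data/large")

    # Explicit hint, as in the reply's snippet: build a hash table from the
    # small side and ship a copy to every executor, avoiding a shuffle of large_df.
    joined = large_df.join(broadcast(small_df), large_df.id == small_df.id, "left_outer")

One caveat: a left outer join preserves every row of its left-hand table, and Spark's broadcast hash join can only broadcast the non-preserved side. So the sketch above keeps all rows of large_df, matching the reply's snippet; to keep every row of the small table, as the question asked, the small table has to be the preserved side of the join, and then only the large table is eligible for broadcasting.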