Getting started with Apache Spark / PySpark setup | ETL with Pyspark

  • Published 3 Dec 2024

COMMENTS • 11

  • @BiInsightsInc
    @BiInsightsInc  2 years ago +1

    Link to second session: ua-cam.com/video/DnUn9u_q5LQ/v-deo.html

  • @satishmajji481
    @satishmajji481 2 years ago +4

    Thank you so much for the video. Please make videos on developing real-time ETL jobs using PySpark and AWS.

  • @richmondnyamekye6383
    @richmondnyamekye6383 2 years ago

    You're doing a great job here. Thank you.

  • @guidysoll
    @guidysoll 2 years ago

    Loved the tutorial

  • @stookie222
    @stookie222 1 year ago

    What would be the advantage of using this whole Java / Spark environment if I can connect to an Oracle DB directly from Python, using cx-Oracle for example? Perhaps I can find something similar for MS SQL, SAP DBs, etc. Maybe I'm missing the point, but I do not see the ELT part, which is what I was looking for. Thank you for your response.

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      Hi stookie222, you can use Python/Pandas when processing small to medium-sized data loads. If your data fits within the memory of a single machine, go with that approach (I go over this in the intro of the video). PySpark is the Python API for Spark, a distributed engine designed for large datasets. It is meant to run on a cluster, for example three or more PCs (EC2 instances), to overcome the constraints of a single node. Once you hit memory limits or notice performance degradation, you can consider PySpark.

  • @ManojKumar-vp1zj
    @ManojKumar-vp1zj 2 years ago +1

    Thanks for this series. I'm doing the same but getting an error: "An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$". Can you please help me with this?

    • @BiInsightsInc
      @BiInsightsInc  2 years ago

      Hi Manoj, Spark supports Java versions 8-11. Make sure you have a supported Java version installed and that the JAVA_HOME variable points to the correct install location. Hope this helps.
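
The version check suggested above can be sketched in plain Python. This is only an illustration: the helper names `java_major_version` and `in_supported_range` are made up for this example, and the default range of 8 to 11 is simply the one quoted in the reply, so it may differ for other Spark releases.

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x, e.g. "1.8.0_281".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def in_supported_range(major: int, low: int = 8, high: int = 11) -> bool:
    """Check the major version against an assumed supported range (inclusive)."""
    return low <= major <= high


# Sample strings in the format `java -version` prints.
for sample in ('java version "1.8.0_281"',
               'openjdk version "11.0.2" 2019-01-15',
               'java version "18.0.2.1" 2022-08-18'):
    major = java_major_version(sample)
    print(major, in_supported_range(major))
```

On a real machine the input text would come from running `java -version`, which writes its output to stderr, for example via `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`.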

    • @ManojKumar-vp1zj
      @ManojKumar-vp1zj 2 years ago

      @BiInsightsInc Thanks for your reply. I actually installed Java (jdk-18.0.2.1_windows-x64_bin, as in your video) and also set JAVA_HOME as a system environment variable. I'm still getting the error.

    • @BiInsightsInc
      @BiInsightsInc  2 years ago

      @ManojKumar-vp1zj Try setting JAVA_HOME in your Python script (showcased in the video) and try again.
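
A minimal sketch of setting JAVA_HOME from inside a Python script, as suggested above. It is not the code from the video: the helper name `use_jdk` and the install path are hypothetical, and the idea is only that the variable must be set before the Spark session is created, since the JVM is launched from this process's environment.

```python
import os


def use_jdk(java_home: str) -> None:
    """Point this process, and any JVM it launches, at a specific JDK."""
    os.environ["JAVA_HOME"] = java_home
    # Prepend the JDK's bin directory so its `java` is found first on PATH.
    java_bin = os.path.join(java_home, "bin")
    os.environ["PATH"] = java_bin + os.pathsep + os.environ.get("PATH", "")


# Hypothetical install location; replace with the real path on your machine.
use_jdk(r"C:\Program Files\Java\jdk-11")

# Only after this point would the script import pyspark and build the session:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("etl").getOrCreate()
```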

    • @ManojKumar-vp1zj
      @ManojKumar-vp1zj 2 years ago

      @BiInsightsInc I did it the same way you demonstrated in the video and also tried setting the environment variable, but nothing works.