Link to second session: ua-cam.com/video/DnUn9u_q5LQ/v-deo.html
Thank you so much for the video. Please make videos on developing real-time ETL jobs using PySpark and AWS.
You're doing a great job here. Thank you
Loved the tutorial
What would be the advantage of using this whole Java/Spark environment if I can connect to an Oracle DB directly from Python, using cx_Oracle for example? I could probably find something similar for MS SQL or SAP DBs, etc. Maybe I'm missing the point, but I don't see the ELT part I was looking for. Thank you for your reaction.
Hi stookie222, you can use Python/Pandas when processing small-to-medium data loads. If your data fits within the memory constraints of a single machine, then go with that approach (I go over this in the intro of the video). PySpark is an API for Spark, which is a distributed engine designed for large datasets. It is meant to run as a cluster of three or more machines (e.g. EC2 instances) to overcome the constraints of a single node. Once you hit memory limits or notice performance degradation, you can consider PySpark.
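To make the contrast concrete, here is a minimal sketch of the two approaches side by side. The connection strings, credentials, and table name are placeholders, not anything from the video:

    # Single-machine approach: pandas pulls the whole table into local memory.
    import pandas as pd
    import cx_Oracle  # the Oracle client library mentioned above

    conn = cx_Oracle.connect("user/password@host:1521/service")  # placeholder DSN
    df = pd.read_sql("SELECT * FROM sales", conn)  # fine for small/medium tables

    # Distributed approach: Spark can partition the read and the
    # downstream transformations across the nodes of a cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl").getOrCreate()
    sdf = (spark.read.format("jdbc")
           .option("url", "jdbc:oracle:thin:@host:1521/service")  # placeholder URL
           .option("dbtable", "sales")
           .option("user", "user")
           .option("password", "password")
           .load())

The code looks similar, but the pandas DataFrame lives entirely in one machine's RAM, while the Spark DataFrame is split into partitions that can be processed on many machines, which is the whole point of moving to the Spark environment.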
Thanks for this series... I'm doing the same but getting an error: "An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$". Can you please help me with this?
Hi Manoj, Spark supports Java versions 8-11. Make sure you have a supported Java version installed, plus set the JAVA_HOME variable pointing to the correct install location. Hope this helps.
@@BiInsightsInc Thanks for your reply. I actually installed Java (jdk-18.0.2.1_windows-x64_bin, per your video) and also set JAVA_HOME as a system environment variable. Still getting the error.
@@ManojKumar-vp1zj try setting JAVA_HOME in your Python script (showcased in the video) and try again.
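For reference, a minimal sketch of setting it inside the script itself. The install path below is a placeholder; per the reply above it must point at a Java 8-11 install, since the JDK 18 mentioned earlier is newer than Spark supports:

    import os

    # Point Spark at a supported JDK (8-11); this path is a placeholder,
    # adjust it to wherever your JDK is actually installed.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"
    os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]

    from pyspark.sql import SparkSession

    # JAVA_HOME must be set before the first Spark call launches the JVM.
    spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()

Note that the environment variables have to be set before the SparkSession (or SparkContext) is created, because that is the moment the JVM is launched.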
@@BiInsightsInc I did it the same way you demonstrated in the video, and also tried setting the env variable, but nothing works.