Great job breaking this down with clear examples
OMG, I even tried many Udemy courses to understand this. None of the tutors explained it this clearly... I am loving it... Sir, please start a full Databricks course to help us. Please.. 🙏
Thank you so much :) Sure, will do :)
Fantastic. Thank you for such an easy and efficient explanation. The restaurant example is apt for spark. Great work👌👍🙏❤❤
Excellent!! Thank you for the explanation!! That's why I subscribed and became a member of your channel!!
This is what I was looking for, well explained. Thank you.
Thank you so much :)
Simply great explanation of the Spark architecture; step by step it connects all the dots in Spark.
Thank you so much :)
@@mr.ktalkstech Looking forward to more Spark concepts; it would be great if you did a full course.
Excellent, and one of the best explanations of Spark architecture...
Simple and brilliant analogy Mr K
This is such a simple and clear explanation
that I had to share it with my friends.
Keep making videos;
your efforts are making a great impact on our lives.
Thank you so much :)
Simplest and excellent explanation Mr K.
Thank you so much :)
Wow! Just mind blowing brother💥💥!! Looking for more DE fundamentals videos ✨♥️👌
Thank you so much :)
Excellent! Thank you for explaining this.
Thank you so much :)
Clear and well explained
Thank you so much :)
Very well explained!!!
Thank you so much :)
Awesome explanation bro.
Thank you so much :)
I appreciate your explanation; it has clarified the topic for me. Thank you. 🙏🏼
However, I have one question: if the CSV files are split into two, how will one worker determine whether there are any duplicates in another worker's portion of the work?
Excellent!
Thank you so much :)
Great primer @Mr. K! Thanks. Quick question - How does the driver program create task partitions for the plan? For example, if there are duplicates across two worker nodes, wouldn't the count be misrepresented if it simply adds 4500 and 5500? Does this get auto-handled or do we have to control the partitioning logic?
It depends on the number of partitions of the files. You can also control the number of tasks by configuring the partition limit used after each transformation, with the code below:
spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
The number of tasks always depends on the number of partitions.
Your question is: if each worker node has duplicates, won't the count operation just sum the results?
Ans - after getting the results from each worker node, the driver program aggregates them again and then gives the final result.
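Here is a minimal PySpark sketch of that flow (the folder path, the column layout, and the distinct() step are my own illustration, not from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-example").getOrCreate()

# Optional tuning knob: how many partitions are used after a shuffle
spark.conf.set("spark.sql.shuffle.partitions", 8)

# Hypothetical input: a folder of CSV files that Spark splits into partitions
df = spark.read.csv("orders/*.csv", header=True)

# count(): each worker counts the rows in its own partitions,
# then the driver sums those partial counts into the final number
total_rows = df.count()

# distinct().count(): Spark first shuffles identical rows onto the same
# partition, so duplicates that started on different workers are still removed
unique_rows = df.distinct().count()

print(total_rows, unique_rows)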
Useful presentation
Thank you so much :)
Great explanation. Do you have a full PySpark tutorial?
Really good
Hey man, it may be worth checking out LakeSail's PySail, built on Rust. Supposedly 4x faster with 90% less hardware cost according to their latest benchmarks, and it can migrate existing Python code. Might be cool to make a vid on!
Love your content!
thanks for this
Hi, what tools did you use to create this type of video? Please help.
Final Cut Pro, CapCut, PowerPoint and After Effects.
@@mr.ktalkstech thank you for the info
Waiting for your PySpark playlist :)
Very soon :)
This is my understanding:
- Apache Spark falls under the compute category.
- It's related to MapReduce but is faster due to in-memory processing.
- Spark can read large datasets from object stores like S3 or Azure Blob Storage.
- It dynamically scales compute resources, similar to autoscaling and Kubernetes orchestration.
- It processes the data to deliver analytics, ML models, or other results efficiently.
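A rough PySpark sketch of that read-from-object-store-and-aggregate flow (the s3a:// bucket path and the event_time column are hypothetical, and reading from S3 also assumes the S3 connector and credentials are configured):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-analytics-sketch").getOrCreate()

# Hypothetical object-store path; needs the S3 connector and credentials set up
events = spark.read.parquet("s3a://my-bucket/events/")

# Transformations run in executor memory rather than writing to disk between steps
daily = (
    events
    .withColumn("day", F.to_date("event_time"))  # assumed column name
    .groupBy("day")
    .count()
)

daily.show()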
Respect++
Will this same topic be covered on the other channel (Mr.K Talks Tech Tamil)?
No brother :)
RESPECT++++++++
You did not say anything about RDDs.