Your teaching skills are amazing.
You are doing amazing work; I really appreciate your teaching skills and knowledge. Keep it up.
Very nice explanation. I was getting confused between the stages and tasks part, but now it's cleared up. Thanks for this 😊.
I have one question. I am using AWS EMR, and in that cluster one worker node can have more than one executor. In Databricks, is it a hard rule that one worker node = one executor?
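(Not from the video, just a hedged sketch: a quick way to see how a particular cluster is actually laid out is to print the executor settings from a notebook. This assumes a Databricks notebook where `sc` is predefined; the config keys may be unset on some clusters, hence the defaults.)

```python
# Hedged sketch: inspect executor-related settings from a Databricks notebook.
# Assumes the notebook-provided `sc` (SparkContext); keys may be unset on some clusters.
print(sc.defaultParallelism)                                  # total cores available across executors
print(sc.getConf().get("spark.executor.memory", "not set"))   # memory per executor, if configured
print(sc.getConf().get("spark.executor.cores", "not set"))    # cores per executor, if configured
```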
Say I have a heavy Parquet file lying in S3 and I want to bring that file (COPY INTO command) into Databricks as a Delta table. What would be the ideal worker and driver type in that case if I have no transformations at all while moving the data, but the dataset is very huge?
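(For reference, a minimal sketch of the COPY INTO pattern described above, run via `spark.sql` from a notebook. The catalog, table, and S3 path are hypothetical placeholders, and the target Delta table is assumed to already exist.)

```python
# Hedged sketch: land a large Parquet dataset from S3 into an existing Delta table
# with COPY INTO, no transformations. Table name and S3 path are hypothetical.
spark.sql("""
    COPY INTO main.raw.events
    FROM 's3://my-bucket/landing/events/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```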
Very informative video, please continue to upload more videos on Databricks... Thank you.
Super! Cleared up a lot of confusion. Thanks 😊
Great explanation, Thank you
Very nice explanation
Thanks Ma'am for explaining.
Nice Explanation Bhawana. Thank you!!
Thanks a lot... Very well explained. Please upload videos on optimization techniques in Databricks.
ua-cam.com/video/a2ehHq3DJrw/v-deo.html
Here is the link that might help you.
Well explained. Thanks!
The explanation is not very crisp; it could be done better. After watching this video, the viewer is easily confused about the relationship between the concepts you mention. Here is a summary...
If a Dataset is given to Spark, it will divide the Dataset into multiple Partitions.
Each of these Partitions is assigned to an Executor in the Cluster.
This way, the Executors can work in parallel.
Example - If there are 1000 rows of data to process, then those 1000 rows can be divided into 10 Partitions, where each of the Partitions would contain 100 rows.
Each of these Partitions is then assigned to an Executor in the Cluster so they can be processed in parallel.
When a Developer submits a program to Apache Spark for execution, the program actually goes to the Driver Node, which is the Driver Machine of the Cluster.
The Driver then works with the Cluster Manager, which allocates Executors on the worker nodes for the application.
The Driver divides the submitted program into the natural hierarchy of Jobs, Stages, and Tasks, and sends each of these Tasks to the Executors of the Cluster for processing.
Each Task is assigned a Partition of data to process.
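(To make the partition-to-task mapping concrete, here is a minimal PySpark sketch mirroring the 1000-rows / 10-partitions example above; it assumes a running SparkSession named `spark`.)

```python
# Minimal sketch of the "1000 rows -> 10 partitions" example above.
df = spark.range(0, 1000)          # a tiny Dataset of 1000 rows
df = df.repartition(10)            # split into 10 partitions of ~100 rows each
print(df.rdd.getNumPartitions())   # 10 -> one task per partition when an action runs
print(df.count())                  # an action: the Driver builds a Job, splits it into
                                   # Stages and Tasks, and each Task processes one partition
```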
What happens if...
1. In a Databricks cluster, if a worker node goes down, what happens to the data that resides on that worker node?
2. In continuation of the above scenario, if Databricks spins up a new worker node, what happens if a select query goes to that new node, which doesn't have the data (as it was newly added in place of the node that went down and previously held the data)? Will this cause data inconsistency?
We use the metastore for that purpose; it contains all the information about how the data is stored, such as which partition is on which node. So if a node fails, it automatically recovers, as the data is replicated across nodes.
Very in-depth explanation. Keep up the good work. But I have one doubt: where are we defining the partitions? No. of tasks = no. of partitions. Where does the number of partitions come from? Are we defining it somewhere?
We can define the number of partitions in code and then choose the cluster configuration according to the number of partitions set in the code (other factors are also taken into consideration while choosing the cluster). If you do not specify partitions in code, Spark in Databricks will create partitions for you by default: typically 200 shuffle partitions, and input partitions of around 128 MB each when reading files.
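(As a hedged illustration of the reply above; the path and numbers are arbitrary examples, not recommendations.)

```python
# Sketch: controlling partition counts in code. Path and values are arbitrary examples.
spark.conf.set("spark.sql.shuffle.partitions", "64")   # shuffle partitions; the default is 200
df = spark.read.parquet("s3://my-bucket/some/path/")   # input splits default to ~128 MB each
                                                       # (spark.sql.files.maxPartitionBytes)
df = df.repartition(64)                                # or set an explicit partition count yourself
print(df.rdd.getNumPartitions())
```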
@cloudfitness Thanks a ton!! You've gained a new subscriber. Please upload more videos pertaining to Databricks, PySpark, and SQL. It would be helpful.
Very nice explanation.
Great
Super