20. Runtime Architecture of Spark In Databricks

  • Published 31 Dec 2024

COMMENTS • 23

  • @Prashanth-yj6qx · 11 months ago

    Your teaching skills are amazing.

  • @shrutikansal9831 · 9 months ago

    You are doing an amazing job; I really appreciate your teaching skills and knowledge. Keep it up.

  • @ashutoshdeshpande3525 · 2 years ago

    Very nice explanation. I was getting confused between the stages and tasks part, but it's clear now. Thanks for this 😊.

  • @Dinesh-g1o · 5 months ago

    I have one question: I am using AWS EMR, and in that cluster one worker node can have more than one executor. In Databricks, is it a hard rule that one worker node = one executor?

  • @ayushsrivastava6494 · 6 months ago

    Say I have a heavy Parquet file lying in S3 and I want to bring that file (COPY INTO command) into Databricks as a Delta table. What would be the ideal worker and driver type in that case if I have no transformations at all while moving the data, but the dataset is very huge?
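
    A minimal sketch of the COPY INTO pattern referred to above, assuming a Databricks notebook where spark is the provided SparkSession; the bucket, schema, and table names are placeholders, not values from the question.

      # The target Delta table must already exist; COPY INTO then loads the
      # Parquet files as-is (no transformations) and skips files it has
      # already ingested on re-runs.
      spark.sql("""
          COPY INTO main.raw.events_delta
          FROM 's3://example-bucket/events/parquet/'
          FILEFORMAT = PARQUET
          COPY_OPTIONS ('mergeSchema' = 'true')
      """)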

  • @ramaraju3273 · 3 years ago

    Very informative video, please continue to upload more videos on Databricks... Thank you.

  • @Ravi_Teja_Padala_tAlKs · 1 year ago

    Super, this cleared up a lot of confusion. Thanks 😊

  • @Dinesh-g1o · 5 months ago

    Great explanation, Thank you

  • @ganeshshinde4905 · 4 months ago

    Very nice explanation

  • @lifewithtarun · 1 year ago

    Thanks Ma'am for explaining.

  • @MohammedKhan-np7dn · 3 years ago

    Nice Explanation Bhawana. Thank you!!

  • @deepikasoni7423 · 2 years ago

    Thanks a lot... Very well explained... Please upload videos on optimization techniques in Databricks.

    • @cloudfitness · 2 years ago +1

      ua-cam.com/video/a2ehHq3DJrw/v-deo.html
      Here is the link that might help you.

  • @nagabadsha · 11 months ago

    Well explained. Thanks.

  • @T-Radi · 4 days ago +1

    The explanation is not very crisp; it could be done better. After watching this video, the viewer is easily confused about how the concepts you mention relate to each other. Here is a summary...
    If a Dataset is given to Spark, it divides the Dataset into multiple Partitions.
    Each of these Partitions is assigned to an Executor in the Cluster.
    This way the Executors can work in parallel.
    Example: if there are 1000 rows of data to process, those 1000 rows can be divided into 10 Partitions, where each Partition contains 100 rows.
    Each of these Partitions is then assigned to an Executor in the Cluster to be processed in parallel.
    When a Developer submits a program to Apache Spark for execution, the Developer actually submits the program to the Driver Node, which is the Driver Machine of the Cluster.
    From the Driver Machine, a request goes to the Cluster Manager, which allocates Executors for the application.
    The Driver then divides the submitted program into the natural hierarchy of Jobs, Stages and Tasks, and sends each Task to an Executor of the Cluster for processing.
    Each Task is assigned a Partition of data to process.
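
    A minimal PySpark sketch of the flow summarized above, assuming a plain SparkSession; the 1000 rows and 10 partitions mirror the example in the comment and are illustrative only.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("partition-demo").getOrCreate()

      # A dataset of 1000 rows; Spark splits it into partitions.
      df = spark.range(0, 1000)

      # Explicitly split it into 10 partitions of roughly 100 rows each.
      df10 = df.repartition(10)
      print(df10.rdd.getNumPartitions())   # -> 10

      # An action creates a job; the driver breaks the job into stages and
      # tasks (one task per partition), and the tasks run in parallel on the
      # executors.
      print(df10.count())                  # -> 1000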

  • @abhishek310195 · 2 years ago

    What happens if...
    1. In a Databricks cluster, if a worker node goes down, what happens to the data that resides on that worker node?
    2. In continuation of the above scenario, if Databricks spins up a new worker node, what happens if a select query goes to that new node, which doesn't have the data (as it was newly added in place of the node that went down and previously held the data)? Will this cause data inconsistency?

    • @billcates4048 · 1 year ago

      We use the metastore for that purpose; it contains information about how the data is stored, such as which partition is on which node. So if a node fails, it recovers automatically, as the data is replicated across nodes.

  • @shivamjha9720 · 3 years ago

    Very in-depth explanation. Keep up the good work. But I have one doubt: where are we defining the partitions? No. of tasks = no. of partitions. Where does the number of partitions come from? Are we defining it somewhere?

    • @cloudfitness · 3 years ago +6

      We can define the number of partitions in code and then choose the cluster configuration as per the number of partitions set up in code (other factors are also taken into consideration while choosing a cluster)... If you do not specify partitions in code, Spark in Databricks will create partitions for you by default; usually it's 200 partitions, each around 128 MB in size.
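
      A minimal sketch of setting the number of partitions in code, assuming a plain SparkSession; 200 shuffle partitions and ~128 MB input partitions are the usual Spark defaults rather than values confirmed in the video.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("partition-config").getOrCreate()

        # Number of partitions produced after a shuffle (defaults to 200).
        print(spark.conf.get("spark.sql.shuffle.partitions"))

        # Override the shuffle partition count in code.
        spark.conf.set("spark.sql.shuffle.partitions", "64")

        # Or set the partition count on a specific DataFrame.
        df = spark.range(0, 1_000_000).repartition(64)
        print(df.rdd.getNumPartitions())     # -> 64

        # For file reads, the input partition size is governed by this
        # setting (about 128 MB by default).
        print(spark.conf.get("spark.sql.files.maxPartitionBytes"))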

    • @shivamjha9720 · 3 years ago

      @cloudfitness Thanks a ton!! You've gained a new subscriber. Please upload more videos pertaining to Databricks, PySpark, and SQL. It would be helpful.

  • @deepjyotimitra1340 · 2 years ago

    Very nice explanation.

  • @nagamanickam6604 · 1 year ago

    Great

  • @nagendraprasadreddy353 · 2 years ago

    Super