Top 15 Spark Interview Questions in less than 15 minutes Part-2

  • Published 14 Jul 2024
  • To enhance your career as a Cloud Data Engineer, check trendytech.in/?src=youtube&su... for curated courses developed by me.
    I have trained more than 20,000 professionals in the field of Data Engineering over the last 5 years.
    Want to master SQL? Learn SQL the right way through the most sought-after course - the SQL Champions Program!
    "An 8-week program designed to help you crack the interviews of top product-based companies by developing a thought process and an approach to solving an unseen problem."
    Here is how you can register for the program -
    Registration Link (Course Access from India): rzp.io/l/SQLINR
    Registration Link (Course Access from outside India): rzp.io/l/SQLUSD
    These are the most commonly asked interview questions when applying for data-based roles such as data analyst, data engineer, data scientist, or data manager.
    Links to the free SQL & Python series developed by me are given below -
    SQL Playlist - • SQL tutorial for every...
    Python Playlist - • Complete Python By Sum...
    Don't miss out - Subscribe to the channel for more such informative interviews and unlock the secrets to success in this thriving field!
    Social Media Links :
    LinkedIn - / bigdatabysumit
    Twitter - / bigdatasumit
    Instagram - / bigdatabysumit
    Student Testimonials - trendytech.in/#testimonials
    Tags
    #mockinterview #bigdata #career #dataengineering #data #datascience #dataanalysis #productbasedcompanies #interviewquestions #apachespark #google #interview #faang #companies #amazon #walmart #flipkart #microsoft #azure #databricks #jobs

COMMENTS • 4

  • @vaibhavj12 • 2 months ago

    Helpful❤

  • @piyushjain5852 • 1 month ago • +4

    How is the number of stages = the number of wide transformations + 1?

    • @sugunanindia • 1 month ago

      In Apache Spark, the number of stages in a job is determined by the wide transformations present in the execution plan. Here's a detailed explanation of why the number of stages is equal to the number of wide transformations plus one:
      ### Transformations in Spark
      #### Narrow Transformations
      Narrow transformations are operations where each input partition contributes to exactly one output partition. Examples include:
      - `map`
      - `filter`
      - `flatMap`
      These transformations do not require data shuffling and can be executed in a single stage.
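      For illustration, here is a minimal sketch (assuming a standalone PySpark setup; the RDD contents and lambdas are made up for the example) showing narrow transformations pipelined together without any shuffle:
      ```python
      from pyspark import SparkContext

      sc = SparkContext.getOrCreate()

      # Three narrow transformations chained on one RDD: each output partition
      # depends on exactly one input partition, so no shuffle is needed and all
      # of them run inside a single stage when an action is triggered.
      lines = sc.parallelize(["a b", "c d", "e f"], 3)
      words = lines.flatMap(lambda line: line.split())   # narrow
      upper = words.map(lambda w: w.upper())             # narrow
      only_a = upper.filter(lambda w: w == "A")          # narrow
      print(only_a.collect())  # the action launches the (single-stage) job
      ```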
      #### Wide Transformations
      Wide transformations are operations where each input partition can contribute to multiple output partitions. These transformations require data shuffling across the network. Examples include:
      - `reduceByKey`
      - `groupByKey`
      - `join`
      Wide transformations result in a stage boundary because data must be redistributed across the cluster.
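      Again as a rough sketch (reusing the `sc` from the narrow-transformation sketch above; the pair RDDs are made up), a wide transformation forces a shuffle and therefore introduces a stage boundary:
      ```python
      # reduceByKey must bring all values for the same key together, so the
      # data is shuffled across partitions - this ends the current stage.
      pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
      summed = pairs.reduceByKey(lambda x, y: x + y)      # wide: shuffle by key

      # join is also wide: both sides are shuffled (or co-partitioned) by key.
      lookup = sc.parallelize([("a", "x"), ("b", "y")])
      joined = summed.join(lookup)                        # wide: another shuffle
      print(joined.collect())
      ```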
      ### Understanding Stages
      #### Stages
      A stage in Spark is a set of tasks that can be executed in parallel on different partitions of a dataset without requiring any shuffling of data. A new stage is created each time a wide transformation is encountered because the data needs to be shuffled across the cluster.
      ### Calculation of Stages
      Given the nature of transformations, the rule "number of stages = number of wide transformations + 1" can be explained as follows:
      1. **Initial Stage**: The first stage begins with the initial set of narrow transformations until the first wide transformation is encountered.
      2. **Subsequent Stages**: Each wide transformation requires a shuffle, resulting in the end of the current stage and the beginning of a new stage.
      Thus, for `n` wide transformations, there are `n + 1` stages:
      - The initial stage.
      - One additional stage for each wide transformation.
      ### Example
      Consider the following Spark job:
      ```python
      from pyspark import SparkContext
      sc = SparkContext.getOrCreate()
      # Sample RDD
      rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
      # Narrow transformation: map
      rdd1 = rdd.map(lambda x: (x[0], x[1] * 2))
      # Wide transformation: reduceByKey (requires shuffle)
      rdd2 = rdd1.reduceByKey(lambda x, y: x + y)
      # Another narrow transformation: filter
      rdd3 = rdd2.filter(lambda x: x[1] > 4)
      # Wide transformation: groupByKey (requires shuffle)
      rdd4 = rdd3.groupByKey()
      # Action: collect
      result = rdd4.collect()
      print(result)
      ```
      **Analysis of Stages**:
      1. **Stage 1**: `parallelize` and `map` - all narrow transformations. The stage ends at the shuffle boundary introduced by `reduceByKey`.
      2. **Stage 2**: begins by reading the shuffled data for `reduceByKey` and continues through `filter`; since `filter` is a narrow transformation, it stays in the same stage. The stage ends at the shuffle boundary introduced by `groupByKey`.
      3. **Stage 3**: begins by reading the shuffled data for `groupByKey` and finishes with the `collect` action.
      So there are two wide transformations (`reduceByKey` and `groupByKey`) and three stages (`number of wide transformations + 1`).
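      As a quick way to see these boundaries (a sketch based on the example above), the RDD lineage can be printed; in the `toDebugString()` output, each additional level of indentation corresponds to a shuffle dependency, i.e. a stage boundary:
      ```python
      # Two indentation jumps should appear - one for reduceByKey and one for
      # groupByKey - matching the three stages described above.
      print(rdd4.toDebugString().decode("utf-8"))
      ```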
      ### Conclusion
      The number of stages in a Spark job is driven by the need to shuffle data between transformations. Each wide transformation introduces a new stage due to the shuffle it triggers, resulting in the formula: `number of stages = number of wide transformations + 1`. This understanding is crucial for optimizing and debugging Spark applications.
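      If you want Spark itself to report the count, one option (a hedged sketch: it assumes the example job above was just run with `collect()` and that no custom job group was set) is to read the stage IDs of the most recent job from the status tracker, or simply to look at the Jobs tab of the Spark UI:
      ```python
      tracker = sc.statusTracker()
      job_ids = tracker.getJobIdsForGroup()            # jobs run without an explicit job group
      if job_ids:
          job_info = tracker.getJobInfo(max(job_ids))  # most recent job
          if job_info is not None:
              print(len(job_info.stageIds))            # expected: 3 for the example above
      ```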

    • @epicdigger4110 • 2 days ago • +1

      bhai ne bola bapu dikhta toh bapu dikhta (roughly: "if bhai says so, then so it is")