Big Data Engineer Mock Interview | Questions on Data Skewness | Salting | Out of Memory Error

  • Published 4 Oct 2024
  • To enhance your career as a Cloud Data Engineer, check trendytech.in/... for curated courses developed by me.
    Want to master SQL? Learn SQL the right way through the most sought-after course - SQL Champions Program!
    "An 8-week program designed to help you crack the interviews of top product-based companies by developing a thought process and an approach to solving an unseen problem."
    Here is how you can register for the program -
    Registration Link (Course Access from India) : rzp.io/l/SQLINR
    Registration Link (Course Access from outside India) : rzp.io/l/SQLUSD
    I have trained over 20,000 professionals in the field of Data Engineering in the last 5 years.
    BIG DATA INTERVIEW SERIES
    This mock interview series is launched as a community initiative under Data Engineers Club, aimed at aiding the community's growth and development.
    Our highly experienced guest interviewer, Chandrali Sarkar, / chandrali-sarkar-4570a... shares invaluable insights and practical guidance drawn from her extensive expertise in the Big Data Domain.
    Our expert guest interviewee, Soumya Ranjan Parida, / soumya-parida has an interesting approach to answering the interview questions on Apache Spark, SQL and Azure Cloud Services.
    Links to the free SQL & Python series developed by me are given below -
    SQL Playlist - • SQL tutorial for every...
    Python Playlist - • Complete Python By Sum...
    Don't miss out - Subscribe to the channel for more such informative interviews and unlock the secrets to success in this thriving field!
    Social Media Links :
    LinkedIn - / bigdatabysumit
    Twitter - / bigdatasumit
    Instagram - / bigdatabysumit
    Student Testimonials - trendytech.in/...
    TIMESTAMPS : Questions Discussed
    00:35 Introduction
    01:40 Explain your project's end-to-end pipeline and overview.
    03:17 What is the data source for your project?
    03:36 Where does the data get ingested?
    04:36 What types of data are being processed?
    05:04 How do you capture incremental data in an OLTP environment?
    07:52 What is the frequency and volume of the incoming data?
    08:28 Which file formats have you worked with?
    09:00 What is predicate pushdown?
    10:14 What optimizations have you applied in Spark?
    10:45 Define broadcast join.
    11:10 List some transformations you've used in Spark.
    11:27 Explain narrow and wide transformations.
    12:03 What is the difference between reduceByKey and groupByKey?
    12:56 Have you encountered "out of memory" errors in Spark? How did you resolve them?
    14:22 How will salting help in resolving out of memory error?
    14:46 What is data skewness?
    15:22 Explain cache and persist in Spark.
    16:57 What happens if both memory and disk are full?
    17:40 When would you use coalesce and repartition?
    18:00 Provide a scenario where coalesce and repartition can be used.
    18:38 Where does repartition happen: at the driver or at the executor level?
    19:30 What is the difference between rank, dense rank, and row number functions?
    22:06 Describe the internal process of submitting a Spark job.
    Music track: Retro by Chill Pulse
    Source: freetouse.com/...
    Background Music for Video (Free)
    Tags
    #mockinterview #bigdata #career #dataengineering #data #datascience #dataanalysis #productbasedcompanies #interviewquestions #apachespark #google #interview #faang #companies #amazon #walmart #flipkart #microsoft #azure #databricks #jobs

COMMENTS • 17

  • @jithindev9185 4 months ago +3

    No idea about skewness, but explaining how salting reduces OOM.. 😊
    Just highlighting the points where an interviewer can easily catch ...

    • @soumyaparida9231 4 months ago +1

      Yes, friend, I have not faced it and I have not fluked it. I have already said that it is my first DE project and our data volume is small.

  • @jameskhan6972 3 months ago +1

    I think repartitioning happens at the executor level. Executors perform the actual data movement and redistribution: they read the data from the existing partitions, shuffle it across the network, and write it into new partitions as specified by the repartitioning logic.

    • @kch8278 3 months ago

      I agree with you. Repartitioning happens on the executors.
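The point made in this thread matches how Spark divides the work: the driver plans the job, while the tasks that actually move records run on the executors. The following is a minimal plain-Python sketch of a round-robin redistribution, not Spark code. The function name `repartition_round_robin` and the sample data are hypothetical, and the fixed starting offset per input partition is a simplification (Spark picks a pseudo-random starting position for each task).

```python
def repartition_round_robin(partitions, num_partitions):
    """Redistribute records from existing partitions into num_partitions new ones.

    Each input partition plays the role of one map task running on an executor:
    it reads its own records and deals them out to the target partitions in turn.
    """
    new_partitions = [[] for _ in range(num_partitions)]
    for part_index, partition in enumerate(partitions):
        # Simplifying assumption: each task starts at an offset equal to its index.
        target = part_index % num_partitions
        for record in partition:
            new_partitions[target].append(record)
            target = (target + 1) % num_partitions
    return new_partitions


# Three unevenly sized input partitions (hypothetical data).
skewed_input = [[1, 2, 3, 4, 5, 6], [7], [8, 9]]
balanced = repartition_round_robin(skewed_input, 3)
print([len(p) for p in balanced])
```

Every record is written to a different partition from the one it was read from, which is why repartition implies a full shuffle, whereas coalesce can merge existing partitions without one.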

  • @hdr-tech4350 3 months ago +1

    Predicate pushdown
    Optimizations used in Spark
    Transformations used
    groupByKey and reduceByKey
    Faced OOM error?
    Salting
    Data skewness
    Data spill
    Cache and persist
    LRU
    Repartition vs coalesce
    rank, dense_rank, row_number
    What happens when you submit a Spark job

  • @gudiatoka 4 months ago +1

    10:50
    AQE also switches the join strategy to a broadcast join if the smaller DataFrame is small enough to be broadcast. If it is not, we can change the broadcast threshold to the desired value, depending on the cluster config.

    • @soumyaparida9231 4 months ago

      Yes, I specifically mentioned that our data volume is small. Please listen to that part. As a result, AQE will automatically choose a broadcast join.
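The two settings discussed in this thread are real Spark SQL configuration keys: `spark.sql.adaptive.enabled` turns AQE on, and `spark.sql.autoBroadcastJoinThreshold` is the size in bytes below which a table may be broadcast (10 MB by default, `-1` disables broadcasting). The sketch below is plain Python that only builds `--conf` arguments for `spark-submit`. The 50 MB value and the helper `to_submit_args` are illustrative assumptions, not values from the video.

```python
# Both keys are real Spark SQL settings; the chosen values are only an example.
spark_conf = {
    "spark.sql.adaptive.enabled": "true",
    # Hypothetical threshold: allow tables up to 50 MB to be broadcast.
    "spark.sql.autoBroadcastJoinThreshold": str(50 * 1024 * 1024),
}


def to_submit_args(conf):
    """Turn a settings dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(spark_conf)))
```

A larger threshold only helps if the broadcast table really fits in the memory of the driver and of every executor, so the right value depends on the cluster, as the comment above says.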

    • @mohammediqbal2406 4 months ago

      @soumyaparida9231
      Hi brother, how many years of experience do you have?

    • @soumyaparida9231 4 months ago

      @mohammediqbal2406 1.6 yrs full time and 1 yr internship

  • @gudiatoka 4 months ago +2

    14:30 Salting never reduces OOM exceptions; rather, it can cause OOM, because it replicates the smaller table's data multiple times. Salting helps us reduce skewness (though from what I have observed it is a hoax, for me only 😊).

    • @soumyaparida9231 4 months ago

      It will cause an OOM error if too little memory is allocated to each executor. It is not a hoax. And data skewness is one of the reasons for OOM errors, in case you don't know.

    • @soumyaparida9231 4 months ago

      Salting is for improving the partitioning, so its effect varies from project to project and with how well you have partitioned the data. So your conclusion is not correct, because you are trying to generalize.
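Both sides of this thread can be illustrated with a small simulation. The sketch below is plain Python, not Spark: CRC32 stands in for the hash partitioner, and `partition_of`, `add_salt`, `replicate_for_join`, the key names, and the constants are all hypothetical. It assumes one hot key carries 90% of the records.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8  # hypothetical salt range; tune per workload


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys):
    """Number of records that land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def add_salt(key, rng, num_salts=NUM_SALTS):
    """Append a random suffix so one hot key becomes several distinct keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def replicate_for_join(key, num_salts=NUM_SALTS):
    """The other side of a join needs one copy of each row per salt value."""
    return [f"{key}_{s}" for s in range(num_salts)]


rng = random.Random(42)
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

plain = partition_sizes(keys)
salted = partition_sizes([add_salt(k, rng) for k in keys])

print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
print("copies of each small-table row:   ", len(replicate_for_join("hot_customer")))
```

Without salting, all 9,000 hot-key records fall into a single partition, which is the kind of oversized task that can run an executor out of memory. Salting spreads them out, at the price of replicating the other join side once per salt value, which is the extra memory cost raised at 14:30.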

  • @jithindev9185 4 months ago +2

    Repartitioning happens at the driver? 18:58

  • @rushirajkadge3995 3 months ago

    Are the row_number values shown at 22:00 correct?
    I mean, if we are partitioning by marks, how can the output look as shown in the video?
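The question above turns on the difference between ordering by marks and partitioning by marks. The following plain-Python sketch mimics the SQL window functions on a hypothetical list of marks; it does not reproduce the query or the data shown in the video, and the function names are made up for illustration.

```python
def window_ranks(marks):
    """Mimic ROW_NUMBER, RANK and DENSE_RANK over (ORDER BY marks DESC).

    Returns tuples of (mark, row_number, rank, dense_rank).
    """
    rows = []
    rank = dense_rank = 0
    previous = None
    for row_number, mark in enumerate(sorted(marks, reverse=True), start=1):
        if mark != previous:
            rank = row_number   # rank jumps to the current position after ties
            dense_rank += 1     # dense_rank increases by one with no gaps
            previous = mark
        rows.append((mark, row_number, rank, dense_rank))
    return rows


def row_number_partitioned_by_mark(marks):
    """Mimic ROW_NUMBER over (PARTITION BY marks): numbering restarts per mark."""
    seen = {}
    rows = []
    for mark in sorted(marks, reverse=True):
        seen[mark] = seen.get(mark, 0) + 1
        rows.append((mark, seen[mark]))
    return rows


marks = [85, 90, 70, 85]
for row in window_ranks(marks):
    print(row)
print(row_number_partitioned_by_mark(marks))
```

With ORDER BY only, row_number runs 1, 2, 3, 4 across the whole result. With PARTITION BY marks, each distinct mark is its own partition, so row_number restarts at 1 and exceeds 1 only for tied marks.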

  • @gudiatoka 4 months ago +2

    Brother says they get the data from Azure SQL and then perform the transformations on top of it.
    If the company uses an Azure cloud database for its project, how come he takes only one fact table? In general, an application consists of more than one fact table, and even if there is only one fact table, there are multiple dimension tables. So if a company has the money to use an Azure SQL DB VM, the database must be normalized. These are common mistakes.
    Please brush up on them properly, as it will not be easy in a real-life interview.

    • @SandeepRajChinnakandukur 4 months ago

      Hi bro, I need some tips on explaining a project, as I have an interview scheduled for a Data Engineer role. I appreciate your time. Please DM.

    • @soumyaparida9231 4 months ago +1

      Hi, maybe I missed that point about one dimension table, but do you really think adding one dimension has any impact on cost? Also, the transformations are performed in stored procedures here; only basic transformations are performed in Databricks. There is nothing to brush up on here. Do you really think I am working at Accenture without clearing the interview?