Trending Big Data Interview Question - Number of Partitions in your Spark Dataframe

  • Published 20 Aug 2024

COMMENTS • 38

  • @Anonymous-fe2ep
    @Anonymous-fe2ep 11 months ago +3

    Hello Sir, I was asked the following questions for an AWS Developer role. Please make a video on this. Thanks.
    Q1. We have *sensitive data* coming in from a source and an API. Help me design a pipeline to bring in the data, clean and transform it, and park it.
    Q2. So where does PySpark come into play in this?
    Q3. Which libraries will you need to import to run the above Glue job?
    Q4. What are shared variables in PySpark?
    Q5. How do you optimize Glue jobs?
    Q6. How do you protect sensitive data in your data?
    Q7. How do you identify sensitive information in your data?
    Q8. How do you provision an S3 bucket?
    Q9. How do I check if a file has been changed or deleted?
    Q10. How do I protect my file containing sensitive data stored in S3?
    Q11. How does KMS work?
    Q12. Do you know S3 Glacier?
    Q13. Have you worked on S3 Glacier?

  • @DEwithDhairy
    @DEwithDhairy 7 months ago +1

    PySpark Scenario Based Interview Question And Answers:
    ua-cam.com/play/PLqGLh1jt697zXpQy8WyyDr194qoCLNg_0.html&si=Ddhve6jjcy0ZvaLV

  • @arunsundar3739
    @arunsundar3739 4 months ago

    I was curious why Spark handles smaller files differently, and I also had a fixed view that the partition size is 128 MB at all times. That view of mine is debunked now. Beautifully explained, thank you very much, sir :)
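
    A minimal PySpark sketch of how to verify this yourself; the path is hypothetical, and the settings named are standard Spark configuration keys:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-check").getOrCreate()

    # spark.sql.files.maxPartitionBytes (default 128 MB) is only an upper bound
    # on the size of a read partition; several small files can be packed into
    # one partition, so the final count also depends on the file sizes,
    # spark.sql.files.openCostInBytes and the default parallelism.
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # Inspect the partitions Spark actually created for a (hypothetical) folder.
    df = spark.read.parquet("/data/events/")
    print(df.rdd.getNumPartitions())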

  • @Asyouwish145
    @Asyouwish145 1 month ago

    I loved your presentation and understood more about configuration from just one video ❤

  • @himanshupatidar9413
    @himanshupatidar9413 1 year ago +1

    Thanks for the simplified explanation. Please make the next video on deciding the configuration for our jobs, e.g., which is the better config: i) 10 executors with 4 cores and 4 GB RAM each, or ii) 5 executors with 8 cores and 8 GB RAM each? There is no proper explanation of this concept anywhere.
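
    A hedged sketch of how option (i) from the comment above could be requested when building the session; the numbers come from the comment and are not a recommendation:

    from pyspark.sql import SparkSession

    # Option (i) from the comment: 10 executors with 4 cores and 4 GB RAM each.
    # These values are the commenter's example, not a recommendation; which
    # layout is better depends on the workload, per-task memory needs and GC.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-example")
        .config("spark.executor.instances", "10")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )

    # Roughly instances x cores on YARN/Kubernetes.
    print(spark.sparkContext.defaultParallelism)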

  • @tarunpothala6856
    @tarunpothala6856 1 year ago +1

    Sir,
    Great to see such scenarios explained so clearly. We would love to watch some interview questions on Databricks. Kindly post them.

  • @soumikdutta77
    @soumikdutta77 1 year ago +1

    Insightful and informative concept, thank you Sir for clarifying it with such ease ✅

  • @kavyasri6654
    @kavyasri6654 1 year ago

    Thank you, Sir. Also, please continue the advanced SQL playlist. I have completed both the basic and advanced playlists; they are very helpful.

  • @kirtisingh7698
    @kirtisingh7698 1 year ago

    Thank you Sir for explaining the answers with a scenario. It's really helpful.

  • @Rajesherukulla
    @Rajesherukulla 1 year ago

    Was literally waiting for your video series... Congrats on a great start, Sumit sir.

  • @siddheshkankal7567
    @siddheshkankal7567 1 year ago

    Thank you so much for the great, detailed explanation. Could you discuss more of what interviewers often ask: how large a dataset you have worked on, what cluster configuration you would choose for it and how you decide, what an optimized solution would look like, and what kind of data and sizes are involved? Also looking forward to an upcoming video on Spark optimization techniques.

  • @sonurohini6764
    @sonurohini6764 5 months ago

    Good explanation, sir. Please make a video on more possible scenario-based questions like this.

  • @bharanidharanm2653
    @bharanidharanm2653 2 months ago

    The 3rd scenario is not clear. Are we updating any configuration setting to avoid the small files problem?

  • @arpittapwal4651
    @arpittapwal4651 1 year ago

    Great explanation as always. Thank you, Sumit sir, waiting for many more such videos in the future 😊

  • @user-pp4pu8kp7v
    @user-pp4pu8kp7v 5 months ago

    Please continue the series

  • @AbhishekVerma-hx8bq
    @AbhishekVerma-hx8bq 1 year ago

    Excellent explanation, highly informative!

  • @RohanKumar-mh3pt
    @RohanKumar-mh3pt 1 year ago

    Very insightful. Please cover more Spark internals scenario-based questions.

  • @Momlifeindia
    @Momlifeindia 1 year ago

    Well explained as always. I was asked the same question in one of the interviews.

  • @deepakpatil4419
    @deepakpatil4419 4 months ago

    Hi Sir, thank you for the explanation.
    I have a situation: I am executing a Databricks pipeline through Airflow. In one of the tasks, I write the data from a DataFrame to a path (as Parquet files). The write operation is supposed to create the path on a daily basis and write the data into it. The path is being created, but when I check the count after writing, it shows zero. It doesn't give any error either, so it's really difficult to identify the issue.
    However, when I reprocess the same task, it writes the data.

  • @eyecaptur
    @eyecaptur 1 year ago

    Great explanation sir as always

  • @25683687
    @25683687 1 year ago

    Really very well explained!

  • @virajjadhav6579
    @virajjadhav6579 1 year ago

    Thank you Sir, the start of the series is great. Do we have to explain each answer with scenarios?

  • @anandattagasam7037
    @anandattagasam7037 1 year ago +1

    Hi sir, I wanted to confirm: are you saying the number of partitions depends on the CPU cores? Like you said, for 1 GB of data there would be 8 partitions, and then due to the parallelism it would be 4 partitions, correct? Please correct me if I am wrong.
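
    A small sketch (with a hypothetical path) for checking how the available cores feed into the initial partition count:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallelism-check").getOrCreate()

    # Total cores available to the application; this default parallelism is one
    # of the inputs Spark uses when deciding how many partitions to create for
    # a file-based DataFrame read.
    print(spark.sparkContext.defaultParallelism)

    # Compare with what a ~1 GB file actually produces (hypothetical path).
    df = spark.read.csv("/data/one_gb_file.csv", header=True)
    print(df.rdd.getNumPartitions())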

  • @Ronak-Data-Engineer
    @Ronak-Data-Engineer 1 year ago

    Very well explained

  • @sufiyaanbhura6343
    @sufiyaanbhura6343 1 year ago

    Thank you sir!

  • @suvabratagiri9978
    @suvabratagiri9978 4 months ago

    Where is the next part?

  • @ritumdutta2438
    @ritumdutta2438 1 year ago

    A very interesting start to an exciting series :) ... appreciate all your effort. Just wanted to confirm one thing: in the case of RDDs, the partition size is always 128 MB, right (what you explained applies to DataFrames/higher-level APIs)?

    • @sumitmittal07
      @sumitmittal07 1 year ago +1

      That's correct. In the case of an RDD it depends on the block size of the underlying filesystem; in the case of HDFS it will be 128 MB.
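
      A minimal sketch contrasting the RDD and DataFrame readers under these assumptions; the HDFS path is hypothetical and a default 128 MB block size is assumed:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("rdd-vs-df-partitions").getOrCreate()
      sc = spark.sparkContext

      # RDD reads follow the input splits of the underlying filesystem, so on a
      # default HDFS setup a ~1 GB text file gives roughly 8 partitions of ~128 MB.
      rdd = sc.textFile("hdfs:///data/one_gb_file.txt")
      print(rdd.getNumPartitions())

      # The DataFrame reader instead uses spark.sql.files.maxPartitionBytes,
      # openCostInBytes and the default parallelism, so its count can differ.
      df = spark.read.text("hdfs:///data/one_gb_file.txt")
      print(df.rdd.getNumPartitions())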

    • @localmartian9047
      @localmartian9047 5 months ago

      @@sumitmittal07 And in the case of an object store/S3, will it be the default parallelism or the number of splits in the source file from S3?

  • @vusalbabashov8242
    @vusalbabashov8242 11 months ago

    In the example I have, df.rdd.getNumPartitions() returns 200, which seems to be the default. I have 160 cores available in the cluster. How should we understand this in light of what you say in the video? I feel like this part is missing. Also, when should we use spark.conf.set("spark.sql.shuffle.partitions", "auto")?

    • @rohitshingare5352
      @rohitshingare5352 7 months ago

      In the context of the video he only explained the initial stage of partitioning; in your case the data gets shuffled, which is why it created 200 partitions by default.
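
      A small sketch illustrating this reply; the AQE settings are standard Spark options, while treating the "auto" value as Databricks-specific is my understanding rather than something stated in the video:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("shuffle-partitions").getOrCreate()

      # Any wide transformation (groupBy, join, ...) repartitions the data into
      # spark.sql.shuffle.partitions pieces -- 200 by default, regardless of
      # how many cores the cluster has.
      print(spark.conf.get("spark.sql.shuffle.partitions"))

      # With Adaptive Query Execution, Spark can coalesce those shuffle
      # partitions at runtime based on the actual data size.
      spark.conf.set("spark.sql.adaptive.enabled", "true")
      spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

      # Setting spark.sql.shuffle.partitions to "auto" is, as far as I know, a
      # Databricks-specific option; on open-source Spark rely on the AQE
      # settings above instead.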

    • @dipeshchaudhary2188
      @dipeshchaudhary2188 3 months ago

      As per my understanding, 160 tasks will be performed in parallel and the remaining 40 tasks will wait in the queue. Those 40 tasks will run when any of the 160 cores become available again.

  • @sameersam4476
    @sameersam4476 1 year ago

    Sir, I have watched your complete SQL playlist. Can I face the SQL interview now?