Data Engineering Interview | Apache Spark Interview | Live Big Data Interview

  • Published 17 Dec 2024

COMMENTS • 236

  • @tradingtransformation346
    @tradingtransformation346 2 years ago +77

    Questions:
    1) Why did you shift from MapReduce development to Spark development?
    2) How is the Spark engine different from the Hadoop MapReduce engine?
    3) What are the steps for Spark job optimization?
    4) What are an executor and an executor core? Explain in terms of processes & threads.
    5) How do you identify that your Hive script is slow?
    6) When do we use partitioning and bucketing in Hive?
    7) The small file problem in Hive? ---> Skewness
    8) How do you handle a high-cardinality issue in a dataset, with respect to Hive?
    9) How do you handle code merging with other teams? Explain your development process.
    10) Again, the small files issue in Hadoop?
    11) Metadata size in Hadoop?
    12) How is Spark differentiated from MapReduce?
    13) A class has 3 fields (name, age, salary) and you create a series of objects from this class. How do you compare the objects? ---- (I didn't get the question exactly)
    14) Scala: what is === in join conditions? What does it mean?
    Hope it helps!

  • @bramar1278
    @bramar1278 4 years ago +25

    I must really appreciate you posting this interview in the public domain. This is a really good one. It would be really great to see a video on the process of optimizing a job.

  • @Nasirmah
    @Nasirmah 2 years ago +12

    Thank you guys, you are a big reason why I got a job as an AWS data engineer. Spark and optimization are the most asked questions, and partitioning and bucketing with Hive as well. I would also add that the interviewers are similar to a real setting, because they usually point you in the direction of the answer they are looking for, so always listen to their follow-ups.

    • @karna9156
      @karna9156 1 year ago

      How are you feeling now? Did you transition your career from some other tech? Do you face complexities in your day-to-day activities?

  • @tradingtexi
    @tradingtexi 3 years ago +19

    Really great video. It would have been even better if you could answer the questions the candidate was not able to answer, like: what are the symptoms of a job that tell you to increase the number of executors or the memory per executor? Can anyone please answer here, so that it may benefit other candidates? Thanks a lot for this video.

  • @MageswaranD
    @MageswaranD 3 years ago +37

    How do you optimize a job?
    - Check the input data size and output data size and correlate them to the operating cluster memory
    - Check the input partition size, output partition size, and number of partitions, along with the shuffle partitions, and decide the number of cores
    - Check for disk memory spills during stage execution
    - Check the number of executors used for the given cluster size
    - Check the available cluster memory and the memory in use by the application/job
    - Check the average run time of all stages in the job, to identify any data-skewed stage tasks
    - Check whether the table is partitioned by column or bucketed
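
    A minimal Scala sketch of the kind of checks this list describes (the input path and the shuffle-partition value are illustrative assumptions, not from the video):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("job-tuning-checks").getOrCreate()
        val df = spark.read.parquet("/data/input")   // hypothetical input path

        // How many partitions did the input actually produce?
        println(s"Input partitions: ${df.rdd.getNumPartitions}")

        // Align shuffle parallelism with the cores you plan to run on.
        spark.conf.set("spark.sql.shuffle.partitions", "400")

        // Disk spills and per-stage run times are then read off the Spark UI
        // (Stages tab: "Spill (Disk)" and the task-duration distribution).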

  • @kaladharnaidusompalyam851
    @kaladharnaidusompalyam851 4 years ago +11

    Hadoop is meant for handling big files in small numbers; the small file problem arises when file sizes are less than the HDFS block size [64 or 128 MB]. Moreover, handling a bulk number of small files increases pressure on the NameNode when we have the option to use big files instead. So in Hadoop file size matters a lot, which is why partitioning and bucketing came into the picture. Correct me if I made a mistake.

    • @sank9508
      @sank9508 3 years ago

      Partitioning and bucketing are related to YARN (the processing side of Hadoop).
      HDFS small files explained: blog.cloudera.com/the-small-files-problem/ (the storage side of Hadoop).
      Also, to handle a huge number of small files we need to increase the NameNode heap (~1 GB per 1 million blocks), which then causes GC issues and makes things complicated.

  • @amansinghshrinet8594
    @amansinghshrinet8594 3 years ago +5

    @Data Savvy
    It can be watched in one stretch. Really helpful. 👍🏻🙌🏻

  • @ajithkannan522
    @ajithkannan522 2 years ago +2

    Since this is a mock interview, the interviewers should have given feedback at the end of the call itself, so it would be helpful for viewers.

  • @JaiGurujiAshwi
    @JaiGurujiAshwi 3 years ago +2

    Hi sir, it's really helpful for me because I have faced lots of the questions you asked there, thank you so much sir.
    Please make one more video on an advanced-level Spark series, please.

  • @rohitrathod8150
    @rohitrathod8150 4 years ago +6

    Awesome Harjeet sir!!
    I can watch even a thousand such videos at a stretch😁
    Very informative!!!
    Can't wait for long, please upload as much as you can sir.

    • @DataSavvy
      @DataSavvy  4 years ago +2

      Thanks Rohit... Yes, I will try my best :)

  • @mayanksrivastava4121
    @mayanksrivastava4121 2 years ago

    Amazing .. thanks @Data Savvy for your efforts :)

  • @kiranmudradi26
    @kiranmudradi26 4 years ago +5

    Nice video. The purpose of using '===' while joining is to make sure that we are comparing the right values (the join key values) and the right data types as well. Please correct me if my understanding is wrong.

    • @DataSavvy
      @DataSavvy  4 years ago +1

      You are right... Using more keywords here will help in giving a better answer

    • @Deekshithnag
      @Deekshithnag 1 year ago

      The triple equals (===) is a method defined on the Column class in Spark's Scala API, specifically designed to compare columns in DataFrames.
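
      A small hedged illustration of this (the DataFrames and column names below are invented):

          // === builds a Column equality predicate for Spark to evaluate per row;
          // Scala's own == would just compare the two Column objects and return a Boolean.
          val joined = ordersDf.join(customersDf,
            ordersDf("customerId") === customersDf("id"), "inner")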

    • @Varnam_trendZ
      @Varnam_trendZ 1 year ago

      @@Deekshithnag Hi.. are you working as a data engineer?

  • @chaitanya5869
    @chaitanya5869 4 years ago +3

    Your interview is very helpful.
    Keep up the good work 👍👍👍

  • @kranthikumarjorrigala
    @kranthikumarjorrigala 4 years ago +1

    This is very useful. Please make more videos like this.

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks Kranthi... We will create more videos like this

  • @sukanyapatnaik7633
    @sukanyapatnaik7633 4 years ago +1

    Awesome video. Thank you for putting this out. It's helpful.

  • @sujaijain4511
    @sujaijain4511 2 years ago +1

    Thank you very much, this is very useful!!!

  • @rahulpandit9082
    @rahulpandit9082 4 years ago +2

    Very Informative.. Thanks a lot Guys...

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Thanks Rahul... Sathya and Arindam helped with this :)

  • @RahulRawat-wu1vv
    @RahulRawat-wu1vv 4 years ago +10

    Can you do interview questions on Scala? I believe these are really important for cracking tough interviews.

    • @DataSavvy
      @DataSavvy  4 years ago +3

      Yes Rahul... I will plan for that

  • @ramkumarananthapalli7151
    @ramkumarananthapalli7151 3 years ago

    Very much helpful. Thanks a lot for uploading.

  • @rajeshkumardash1222
    @rajeshkumardash1222 4 years ago +2

    @Data Savvy Nice one, very informative

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks Rajesh... More videos like this will be posted

  • @sathyansathyan3213
    @sathyansathyan3213 4 years ago +1

    Keep up the excellent work 👍 Expecting more such videos.

  • @srinivasjagalla7864
    @srinivasjagalla7864 3 months ago

    Nice discussion

  • @rohitkamra1628
    @rohitkamra1628 4 years ago +1

    Awesome. Keep it up 👍🏻

  • @MoinKhan-cg8cu
    @MoinKhan-cg8cu 4 years ago

    Very, very helpful, and please have 1-2 more interviews of the same level.
    Great effort by the interviewer and interviewee.

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks for your kind words... Yup, more interviews are planned

  • @shubhamshingi4657
    @shubhamshingi4657 4 years ago +11

    It would be really helpful if you could make more such mock interviews. I think we have only 3 live interviews on the channel yet.

  • @ShashankGupta347
    @ShashankGupta347 2 years ago +2

    The default block size is 128 MB; when small files get created by partitioning, a lot of storage goes to waste, and more horizontal scaling is required (defeating the purpose of distribution).

    • @kartikeyapande5878
      @kartikeyapande5878 7 months ago

      But we can configure the block size as well, right?

    • @ShashankGupta347
      @ShashankGupta347 6 months ago

      @@kartikeyapande5878 Yes,
      When dealing with small files in a distributed storage system with a default block size of 128MB, indeed, there can be inefficiencies and wasted storage space due to the space allocated for each block. This issue is commonly known as the "small files problem" in distributed storage systems like Hadoop's HDFS.
      Here are a few strategies to mitigate this problem:
      1. **Combine Small Files**: One approach is to periodically combine small files into larger ones. This process is often referred to as file compaction or consolidation. By combining multiple small files into larger ones, you can reduce the overhead of storing metadata for each individual file and make better use of the storage space.
      2. **Adjust Block Size**: Depending on your workload, you might consider adjusting the default block size. While larger block sizes are more efficient for storing large files, smaller block sizes can be more suitable for small files. However, this adjustment requires careful consideration since smaller block sizes can increase the overhead of managing metadata and may impact performance.
      3. **Use Alternate Storage Solutions**: Depending on your specific requirements, you might explore alternative storage solutions that are better suited for managing small files. For example, using a distributed object storage system like Amazon S3 or Google Cloud Storage might be more efficient for storing and retrieving small files compared to traditional block-based storage systems.
      4. **Metadata Optimization**: Optimizing the metadata management mechanisms within your distributed storage system can help reduce the overhead associated with storing small files. Techniques such as hierarchical namespace structures, metadata caching, and efficient indexing can improve the performance and scalability of the system when dealing with small files.
      5. **Compression and Deduplication**: Employing compression and deduplication techniques can help reduce the overall storage footprint of small files. By compressing similar files or identifying duplicate content and storing it only once, you can optimize storage utilization and reduce wastage.
      6. **Object Storage**: Consider using object storage solutions that are designed to efficiently store and retrieve small objects. These systems typically offer features such as fine-grained metadata management, scalable architectures, and optimized data access patterns for small files.
      Each of these strategies has its own trade-offs in terms of complexity, performance, and overhead. The most suitable approach depends on the specific requirements and constraints of your application and infrastructure.
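
      For strategy 1, a minimal Spark (Scala) sketch of compaction, assuming an active SparkSession `spark` and invented paths:

          // Read many small files and rewrite them as fewer, larger ones.
          spark.read.parquet("/data/small_files")
            .repartition(16)               // pick so each output file lands near 128 MB
            .write.mode("overwrite")
            .parquet("/data/compacted")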

  • @Rohit-r1q1h
    @Rohit-r1q1h 3 months ago

    Can you post a video now for data engineering interviews, and also post question sets as well?

  • @tusharsonawane3055
    @tusharsonawane3055 4 years ago +6

    Hello sir, this is the first time I am getting in touch with you. It was a great interview; I have seen so many tricky questions. I am preparing for a Spark administrator interview. Do you have some Spark tuning interview questions and some advanced interview questions related to Spark?

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Tushar... I am happy this video is useful for you. There is a playlist on my channel for Spark performance tuning, and I am also working on a new series... Let me know if you need anything extra

  • @Ahlambabes
    @Ahlambabes 4 years ago +1

    Amazing job, really interesting. Thank you for sharing this interview with us.

  • @saeba7528
    @saeba7528 4 years ago +2

    Sir, can you please make a video on which language is best for data engineering: Scala or Python?

  • @vidyac6775
    @vidyac6775 4 years ago

    I like all your videos :) very informative

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks Vidya... I am happy that these videos are useful for you :)

  • @sssaamm29988
    @sssaamm29988 3 years ago

    What is the answer to the Scala question at 31:00, eliminating duplicate objects in a set on the basis of name?

  • @deepikalamba1058
    @deepikalamba1058 3 years ago +1

    Hey, it was really helpful. Thank you 👍

  • @nivedita5639
    @nivedita5639 4 years ago +1

    Thank you sir... it is very helpful

  • @Karmihir
    @Karmihir 4 years ago +2

    This is good, but these are just basic questions for DE. It would be great if you shared some code and advanced logic questions from daily DE use.

    • @DataSavvy
      @DataSavvy  4 years ago +2

      There will be more videos... With more complex problems.

  • @davidgupta110
    @davidgupta110 4 years ago +1

    Good logical questions 👌👌

  • @nitishr5197
    @nitishr5197 4 years ago +2

    Informative. Also, it would be good if the correct answers were mentioned.

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Thanks Nitish... Most of the answers are available as dedicated videos on the channel

    • @chaitanyachimakurthi2370
      @chaitanyachimakurthi2370 4 years ago

      @@DataSavvy Sorry, I could not get that; do we have a separate video with the answers?

  • @digwijoymandal5216
    @digwijoymandal5216 3 years ago +1

    Seems like most questions are about how to optimize jobs, not much on the technical side. Do data engineer interviews go like this, or are other technical questions asked too?

  • @surajyejare1627
    @surajyejare1627 3 years ago

    This is really helpful

  • @abhinee
    @abhinee 4 years ago +3

    Also, Spark does dynamic executor allocation, so you don't need to pass 800 executors as input. Size your job by running test loads.
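
    A hedged Scala sketch of what enabling that looks like (the config keys are Spark's; the min/max values are just examples):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("dynamic-allocation-example")
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") // needed without an external shuffle service
          .config("spark.dynamicAllocation.minExecutors", "2")
          .config("spark.dynamicAllocation.maxExecutors", "100")
          .getOrCreate()   // Spark then scales the executor count with pending tasks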

  • @gauravlotekar660
    @gauravlotekar660 4 years ago +5

    It's fine... but it should be more of a discussion than a question-answer session.

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Gaurav... Are you suggesting the way of interviewing is not appropriate?

    • @gauravlotekar660
      @gauravlotekar660 4 years ago

      @@DataSavvy No no... definitely not that.
      I was saying the discussion way of interviewing is more effective, in my opinion.
      I feel more comfortable and able to express myself that way.

  • @arindampatra6283
    @arindampatra6283 4 years ago +16

    Wish I didn't have the haircut that day😂😂😀😀😂😂😂

  • @newbeautifulworldq2936
    @newbeautifulworldq2936 1 year ago +1

    Any new video? I would appreciate it.

    • @DataSavvy
      @DataSavvy  1 year ago

      Just posted: ua-cam.com/video/pTFkjdNng-U/v-deo.html

  • @KahinBhiKuchBhi
    @KahinBhiKuchBhi 1 year ago

    We can use bucketing when there are a lot of small files... Correct me if I'm wrong...

  • @abhinee
    @abhinee 4 years ago +2

    Please cover estimating Spark cluster size on cloud infrastructure like AWS

    • @DataSavvy
      @DataSavvy  4 years ago

      Sure Abhinee... Thanks for suggestion

  • @nivedita5639
    @nivedita5639 4 years ago +1

    Can you explain this question with a video?
    What is the best way to join 3 tables in Spark?

  • @TheMan.0010
    @TheMan.0010 4 years ago +2

    Please make a course on the Databricks certification for PySpark

    • @DataSavvy
      @DataSavvy  4 years ago

      Sure Mani... Added in my list. Thanks for suggestion :)

    • @TheMan.0010
      @TheMan.0010 4 years ago

      And please make:
      1. hands-on schema evolution with all formats
      2. Databricks Delta Lake
      3. how to connect with different data sources
      You are the only creator we expect this from.

  • @rajdeepsinghborana2409
    @rajdeepsinghborana2409 4 years ago +2

    Sir, can you please provide us Hadoop & Spark developer videos with Scala, from beginner to expert?
    They would be very, very useful for us, sir, because I have checked all types of videos on YouTube and no one does it...
    And sir, whatever does exist is paid courses like Edureka, Intellipaat, Simplilearn, etc. So sir, please make it soon... the students need it a lot.

    • @rajdeepsinghborana2409
      @rajdeepsinghborana2409 4 years ago

      Students are able to learn and gain more and more knowledge but don't have money 😭

    • @DataSavvy
      @DataSavvy  4 years ago

      Sure Rajdeep... I will plan a Spark course

  • @msbhargavi1846
    @msbhargavi1846 4 years ago +2

    Hi, we can use the distinct method in Scala for getting unique names, right??
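
    For what it's worth, a small Scala sketch of the nuance (class and values invented): distinct compares whole objects, so keeping one object per name needs a key-based step.

        case class Person(name: String, age: Int, salary: Double)

        val people = Seq(
          Person("asha", 30, 50000), Person("asha", 31, 60000), Person("ravi", 25, 40000))

        people.distinct                             // compares all three fields: removes nothing here
        people.distinctBy(_.name)                   // Scala 2.13+: one Person per name
        people.groupBy(_.name).values.map(_.head)   // pre-2.13 equivalent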

  • @bhavaniv1721
    @bhavaniv1721 3 years ago

    Can you please explain the roles and responsibilities of a Spark and Scala developer?

  • @vijeandran
    @vijeandran 3 years ago

    Hi Data Savvy, I am unable to join Telegram, authorization issue...

  • @lavuittech3136
    @lavuittech3136 4 years ago +2

    Can you teach us big data from scratch? Your videos are really useful.

    • @DataSavvy
      @DataSavvy  4 years ago +2

      Sure... Which topics are you looking for

    • @lavuittech3136
      @lavuittech3136 4 years ago +2

      @@DataSavvy from basics.

    • @DataSavvy
      @DataSavvy  4 years ago +3

      Ok... Planning that :)

    • @lavuittech3136
      @lavuittech3136 4 years ago +1

      @@DataSavvy Thanks... waiting for it.

    • @rudraganesh1507
      @rudraganesh1507 4 years ago

      @@DataSavvy Sir, please do it. Love you, thanks in advance

  • @enishaeshwar7617
    @enishaeshwar7617 2 years ago

    Good questions

  • @Venky-u3y
    @Venky-u3y 2 years ago

    In Spark it is not possible to apply bucketing without partitioning the tables. So if we do not find a suitable column to partition the table, how will we proceed ahead with optimization?

  • @MANGESHpawarsm42
    @MANGESHpawarsm42 1 year ago

    Please add videos for freshers also

  • @kashifanwar4034
    @kashifanwar4034 4 years ago

    How can we make only the name the deciding factor for removing duplicates in a Set, instead of all the fields it takes, in Scala?

  • @Anonymous-fe2ep
    @Anonymous-fe2ep 1 year ago +1

    Hello, I was asked the following questions in an AWS developer interview:
    Q1. We have *sensitive data* coming in from a source and an API. Help me design a pipeline to bring in the data, clean and transform it, and park it.
    Q2. So where does PySpark come into play in this?
    Q3. Which libraries will you need to import to run the above Glue job?
    Q4. What are shared variables in PySpark?
    Q5. How to optimize Glue jobs?
    Q6. How to protect sensitive data in your data?
    Q7. How do you identify sensitive information in your data?
    Q8. How do you provision an S3 bucket?
    Q9. How do I check if a file has been changed or deleted?
    Q10. How do I protect my file having sensitive data stored in S3?
    Q11. How does KMS work?
    Q12. Do you know S3 Glacier?
    Q13. Have you worked on S3 Glacier?

  • @abhinee
    @abhinee 4 years ago +6

    Actually, asking to compare Spark and Hadoop is incorrect; one should ask MR vs Spark. Also, Spark has improved insanely, so please, interviewers, RIP this question.

    • @KiranKumar-um2gz
      @KiranKumar-um2gz 4 years ago

      It's correct. HDFS has its own framework and Spark has its own. HDFS works by batch processing, whereas Spark works by in-memory computation. Consider everything as an info dump.

    • @abhinee
      @abhinee 4 years ago +1

      @@KiranKumar-um2gz Spark is a framework in itself, and it also does batch processing. Please don't spread half-knowledge.

    • @alokdutta4712
      @alokdutta4712 4 years ago

      True

    • @hasifs8139
      @hasifs8139 4 years ago

      I will still ask this question. When comparing Hadoop with Spark, it is assumed that we are comparing 2 processing engines, not a processing engine to a file system. We expect a candidate to be sane enough to understand that. Also, Spark is built on top of the MR concept, so this is a very good question to test your understanding of it.

    • @abhinee
      @abhinee 4 years ago +2

      @@hasifs8139 Anyone who has picked up Spark in the last 3 years does not need to understand MR to be good at Spark or data processing. Spark's implementation is too different from MR to make any comparison. Do you take the same or similar steps to optimize joins in Spark and MR? No. You can keep asking this question and rejecting good candidates.

  • @yugandharpulicherla4078
    @yugandharpulicherla4078 4 years ago

    Nice and informative video. Can you please answer the question asked in the interview: how to compare two Scala objects based on one variable's value?

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Hi... Answers to all the questions are available as separate videos on the Data Savvy channel... Let me know if you find anything missing and I will add it... I will add the answer to the Scala question also

  • @chetankakkireni8870
    @chetankakkireni8870 4 years ago

    Sir, can you please do more interviews like this, as it is helpful?

    • @DataSavvy
      @DataSavvy  4 years ago

      Yes Chetan, I am already planning for more videos on this

  • @dramaqueen5889
    @dramaqueen5889 3 years ago

    I have been working on big data for quite some time now, but I don't know why I can't clear interviews.

  • @ravurunaveenkumar7987
    @ravurunaveenkumar7987 4 years ago +1

    Hi, can you do an interview on Scala and Spark?

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi, I am sorry, I did not understand your question completely... are you asking to do a mock interview with me on Scala and Spark? If yes, please drop me a message at aforalgo@gmail.com and we can work this out

  • @MrRemyguy
    @MrRemyguy 4 years ago

    I'm moving from web development to Spark development. Any inputs on that, please? Can I switch without any experience working with Spark?

  • @hgjhhj3491
    @hgjhhj3491 2 years ago

    That broadcast join example looked cooked up 😆😆

  • @adhikariaman01
    @adhikariaman01 4 years ago +1

    Can you answer the question of how you decide when to increase executors or memory, please? :)

    • @DataSavvy
      @DataSavvy  4 years ago +7

      You have to find out whether your job is compute-intensive or IO-intensive... you will get hints of that in the log files... I realised after taking this mock interview that I should create a video on that... I am working on it... thanks for asking this question 😀

    • @NithinAP007
      @NithinAP007 4 years ago +3

      @@DataSavvy I do not completely agree with what you said, or maybe the question is a bit off. To start with, increasing executor memory has a limit, restricted by the total memory available (depending on the instance type you are using). Memory tuning at the executor level needs to consider 1) the amount of memory used by your objects (you may want your entire dataset to fit in memory), 2) the cost of accessing those objects, and 3) the overhead of garbage collection (if you have high turnover in terms of objects). Now, when we say increasing the number of executors, I consider this the scaling needed to meet the job requirements. IO-intensive doesn't directly mean "increase the number of executors"; rather, it means increasing the level of parallelism (dependent on the underlying part files/sizes, etc.), which starts at the executor level. So I would rather look at this as optimizing the executor config for a (considerably) small dataset first, and then assessing the level of scaling needed, viz. increasing the number of executors to meet the scale of the actual data. I would like to discuss this further with you.
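
      To make the sizing side of this concrete, a hedged worked example (the node shape and the 5-cores-per-executor rule of thumb are assumptions, not from this thread):

          // Nodes: 16 cores, 64 GB RAM; reserve 1 core + 1 GB for OS/daemons.
          val usableCores = 16 - 1                                // 15
          val usableMemGB = 64 - 1                                // 63
          val executorsPerNode = usableCores / 5                  // 3 executors of 5 cores
          val memPerExecutorGB = usableMemGB / executorsPerNode   // 21 GB
          // Subtract ~10% for memoryOverhead => spark.executor.memory ≈ 19g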

  • @darshitsuthar6653
    @darshitsuthar6653 4 years ago +5

    Sir, this was really helpful and informative... I'm a fresher seeking an opportunity to work with big data technologies like Hadoop, Spark, Kafka, etc. Please guide me on how to enter the corporate world starting with these technologies, as there are very few firms that hire freshers for such technologies...

    • @satyanathparvatham4406
      @satyanathparvatham4406 3 years ago

      Have you done any projects?

    • @darshitsuthar6653
      @darshitsuthar6653 3 years ago

      @@satyanathparvatham4406 I worked on a Hive project, and other than that I keep practicing some Spark scenarios on Databricks.

  • @rheaalexander4798
    @rheaalexander4798 4 years ago

    Could you please answer: how to achieve optimization in a Hive query with columns that have high cardinality?

    • @sakshamsrivastava9894
      @sakshamsrivastava9894 3 years ago +1

      Maybe we can use vectorization as well in such scenarios, and as he said, bucketing on the id column can help drastically; apart from that, choosing the right file format can work too.
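
      A brief Scala sketch of the bucketing idea (table and column names invented; Spark's bucketBy requires saveAsTable):

          // Cluster rows by the high-cardinality key so later joins and
          // aggregations on user_id can avoid a full shuffle.
          eventsDf.write
            .bucketBy(64, "user_id")
            .sortBy("user_id")
            .saveAsTable("events_bucketed")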

  • @msbhargavi1846
    @msbhargavi1846 4 years ago

    Hi sir, will you please explain the difference between map and foreach?

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Will create a video on this...

  • @Anandhusreekumar
    @Anandhusreekumar 4 years ago +1

    Very informative. Thanks :) Can you suggest some small Spark project for portfolio building?

    • @DataSavvy
      @DataSavvy  4 years ago

      What is your profile and which language do you use?

    • @Anandhusreekumar
      @Anandhusreekumar 4 years ago +1

      @@DataSavvy I'm a Scala Spark engineer.
      I am familiar with Cassandra, Kafka, HDFS, Spark.

    • @sindhugarlapati2776
      @sindhugarlapati2776 4 years ago

      Same request... can you please suggest some small Spark projects using Python?

    • @DataSavvy
      @DataSavvy  4 years ago +2

      I am currently working on creating videos explaining an end-to-end project...

  • @biswadeeppatra1726
    @biswadeeppatra1726 4 years ago

    Can you please suggest a correct way to determine executor cores and executor memory by looking at the input data, without hit and trial, instead of going for the thumb rule that assumes 5 is the optimal number of cores? Any other way to determine it?

    • @sank9508
      @sank9508 3 years ago

      It purely depends on the size of the input data and the kind of processing/computation, like a heavy join versus a simple scan of the data.
      In general, for worker nodes (data nodes) of size {40 cores (80 with hyperthreading); 500 GiB memory}, plan on ~1 vCore for every 5 GB.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 year ago

    This guy was interviewing for a TCS data science role once.

  • @shivamgupta-bc7fn
    @shivamgupta-bc7fn 3 years ago

    Can you guys tell me which companies interview in this pattern?

  • @NithinAP007
    @NithinAP007 4 years ago +1

    I liked some of the questions, but not the answers completely; say, the HDFS block size, the memory used per file in the NameNode, and the type-safe equality. How do you plan to publish the right content/answers?

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Nithin... This was a completely impromptu interview... Are you looking for the answer to any specific question?

  • @RanjanKumar-ue5id
    @RanjanKumar-ue5id 4 years ago +1

    Any resource link for doing a Spark-related mini project?

    • @DataSavvy
      @DataSavvy  4 years ago

      I am creating a series for a Spark project... Will post soon

  • @raviteja1875
    @raviteja1875 3 years ago

    Attach a feedback video to this; it will go a long way in knowing what could have been answered better.

  • @lavanyareddy310
    @lavanyareddy310 4 years ago

    Hello sir, your videos are very helpful... I am unable to join your Telegram group... please help me sir

  • @MaheshKumar-yz7ns
    @MaheshKumar-yz7ns 3 years ago

    @4:40 is the interviewer expecting the answer "DAG"?

  • @nikhilmishra7572
    @nikhilmishra7572 4 years ago +1

    What's the purpose of using '===' while joining? Nice video btw.

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Thanks Nikhil... Will post a video about the answer in few days :)

    • @harshavardhanreddyakiti4655
      @harshavardhanreddyakiti4655 4 years ago

      @@DataSavvy Can you post something like this on Airflow?

    • @abhirupghosh806
      @abhirupghosh806 3 years ago

      @@DataSavvy My best guess is that = and == are already reserved operators.
      = is the assignment operator, like val a = 5.
      == is the comparison operator when the object types are the same, like how we use it in a normal string comparison, for example.
      === is the comparison operator when the object types are different, like when we compare two columns of different datasets/DataFrames.

  • @pratikj2538
    @pratikj2538 4 years ago

    Can you make one interview video for a big data developer with 2-3 years of experience?

    • @DataSavvy
      @DataSavvy  4 years ago

      Sure Pratik... That is already in the plan... This interview also fits the less-than-4-years experience category

  • @mohammadjunaidshaik2664
    @mohammadjunaidshaik2664 3 years ago

    I am a fresher; can I start my career in big data?

  • @taherkutarwadli8368
    @taherkutarwadli8368 2 years ago

    I am new to the data engineering field. Which language should I choose, Scala or Python? Which language has more job roles?

  • @msbhargavi1846
    @msbhargavi1846 4 years ago

    Hi, why doesn't Hadoop support small files?

    • @DataSavvy
      @DataSavvy  4 years ago

      He meant to ask why small files are not good for Hadoop... Hadoop can store small files, though

    • @msbhargavi1846
      @msbhargavi1846 4 years ago

      @@DataSavvy Thanks, but why is it not good? A performance issue? How exactly?

    • @rajeshkumardash1222
      @rajeshkumardash1222 4 years ago +2

      @@msbhargavi1846 If you have too many small files, then your NameNode has to keep metadata for each of them, and each piece of metadata takes around 100-150 bytes. So just think: if you have millions of small files, how much memory does the NameNode have to spend to manage this...
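
      A rough back-of-the-envelope (the file count is an assumption; the ~150 bytes per namespace object matches the figure above):

          val files = 10000000L                      // ten million small files
          val objectsPerFile = 2L                    // ~1 file object + 1 block object each
          val bytesPerObject = 150L
          val heapGB = files * objectsPerFile * bytesPerObject / 1e9
          // ≈ 3 GB of NameNode heap spent on namespace metadata alone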

    • @msbhargavi1846
      @msbhargavi1846 4 years ago

      @@rajeshkumardash1222 Yes, got it... thanks

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks Rajesh

  • @anudeepyerikala8517
    @anudeepyerikala8517 3 years ago

    Arindam Patra, the king of Data Savvy

  • @ashutoshrai5342
    @ashutoshrai5342 4 years ago +2

    Sathiyan is a genius

  • @jeevithat6038
    @jeevithat6038 3 years ago

    Hi, it would be nice if you gave the correct answers when the candidate's answer is wrong...

  • @lavakumar5181
    @lavakumar5181 4 years ago +1

    Hi sir, if you provide interview guidance personally, please let me know... I'll contact you personally... I need guidance

    • @DataSavvy
      @DataSavvy  4 years ago

      Join our group... It's very vibrant and people help each other a lot

    • @subhaniguddu2870
      @subhaniguddu2870 4 years ago

      Please share the group link; we will join

    • @DataSavvy
      @DataSavvy  4 years ago +1

      chat.whatsapp.com/KKUmcOGNiixH8NdTWNKMGZ

    • @DataSavvy
      @DataSavvy  4 years ago

      @@hasnainmotagamwala2608 Hi, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

    • @arunnautiyal2424
      @arunnautiyal2424 4 years ago

      @@DataSavvy It is giving "not authorized to access".

  • @TheMan.0010
    @TheMan.0010 4 years ago

    Wow, kindly make a video on Hive integration with Spark

    • @DataSavvy
      @DataSavvy  4 years ago

      Hmmmm... What is the problem that you are facing in integration...

    • @TheMan.0010
      @TheMan.0010 4 years ago +1

      @@DataSavvy In Databricks, when I create a managed Hive table with the "USING JSON" keyword it creates fine, but when I create an external table it shows an error

    • @TheMan.0010
      @TheMan.0010 4 years ago

      @@DataSavvy Why doesn't the "USING" keyword work with external tables?

  • @ferozqureshi5228
    @ferozqureshi5228 1 year ago +1

    If we use 800 executors for 100 GB of input data like you've mentioned in your example, Spark would then be busy managing the high volume of executors rather than processing data. So it could be better to use one executor per 5-10 GB, which would leave us using 10-20 executors for 100 GB of data. If you have any explanation for using 800 executors, then do post it.

    • @DataSavvy
      @DataSavvy  1 year ago

      Let me look into this

    • @kshitizagarwal8389
      @kshitizagarwal8389 5 months ago

      Not 800 executors; he said to use 800 cores for maximum parallelism. Keep five cores in each executor, resulting in 160 executors in total.
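
      The arithmetic behind that reading, as a hedged sketch (128 MB partitions and 5 cores per executor are the usual rules of thumb, assumed here):

          val inputMB = 100 * 1024          // 100 GB of input
          val partitions = inputMB / 128    // ≈ 800 input partitions
          val cores = partitions            // one task per core => 800 cores for full parallelism
          val executors = cores / 5         // 160 executors of 5 cores each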

  • @anudeepk7390
    @anudeepk7390 4 months ago

    Did the participant consent to posting this online? If not, you should blur his face.

    • @DataSavvy
      @DataSavvy  4 months ago +1

      Yes... It was agreed with the participants

  • @atanu4321
    @atanu4321 4 years ago

    Small file issue 16:45

  • @GauravSingh-dn2yx
    @GauravSingh-dn2yx 3 years ago

    Everyone is in nightwear 😂😂😂

  • @THEPOSTBYLOT
    @THEPOSTBYLOT 4 years ago

    Please create a new WhatsApp group as it is full

    • @DataSavvy
      @DataSavvy  4 years ago +2

      Hi XYZ, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

  • @sangeethanagarajan278
    @sangeethanagarajan278 3 years ago

    How much experience does this candidate have?

  • @seaofpoppies
    @seaofpoppies 4 years ago +2

    There is no such thing as in-memory processing... Memory is used to store data that can be reused. 4 years back I was grilled on this 'in-memory processing' stuff at one of the Big 4 firms.

    • @arindampatra6283
      @arindampatra6283 4 years ago

      You should Google the meaning of in-memory processing once... It doesn't mean that your memory will process the data for you 😂😂😂😂 Even kids in school know that the CPU does the actual computations...

    • @b__05
      @b__05 4 years ago

      No, in-memory usually means your RAM, where data is stored and computed on in parallel; hence it is fast.
      Can you just let me know how you got grilled on this?

    • @EnimaMoe
      @EnimaMoe 4 years ago +2

      Hadoop works in batches by moving data through HDFS. Meanwhile, Spark does its operations in memory: the data is cached in memory and all operations are done live. Unless you were asked questions about Hadoop, I don't see how you could get grilled on this...
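
      A tiny Scala illustration of that caching point (the path is invented; a sketch, not code from the video):

          val events = spark.read.parquet("/data/events").cache()
          events.count()   // the first action materializes the cache in executor memory
          events.filter(events("status") === "ok").count()   // reuses cached data, no re-read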

    • @seaofpoppies
      @seaofpoppies 4 years ago

      @@EnimaMoe Spark does not do operations in memory. In fact, no processing happens in memory. I am not talking about the concept; I am talking about the phrase that is used, "in-memory processing".
      For those advising me to Google Spark, just an FYI: it's been years that I have been using Spark.
      You can always challenge whatever is written or told by someone. Take care.

  • @omkarkulkarni9202
    @omkarkulkarni9202 4 years ago

    Q1: What made you move to Spark from Hadoop/MR? Both the question and the answer are wrong. Hadoop is a file system, whereas Spark is a framework/library to process data in a distributed fashion. There is no such thing as one being better than the other; it's like comparing apples and oranges.

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Omkar... Hadoop is a combination of MapReduce and HDFS: HDFS is the file system and MR is the processing engine... The interviewer wanted to know why you prefer Spark compared to the Hadoop MR style of processing... You will generally see people working in big data use this kind of language, generally people who started with Hadoop and then moved to the Spark processing engine later

    • @omkarkulkarni9202
      @omkarkulkarni9202 4 years ago

      @@DataSavvy Can you tell me where and how you do MR using Hadoop? And can you elaborate on what exactly you mean by "Hadoop MR style of programming"? If the interviewer is using this language, clearly he has learnt from and limited his knowledge to tutorials. Someone who has worked on large-scale clusters using EMR or his own EC2 cluster won't use such vague language.

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Omkar... Please read en.m.wikipedia.org/wiki/MapReduce, or Google "Hadoop MapReduce"...

    • @omkarkulkarni9202
      @omkarkulkarni9202 4 years ago +1

      @@DataSavvy I understand what MapReduce is... it's a paradigm, not the framework/library you are asking about. The interviewer wanted to know why you prefer Spark compared to the Hadoop MR style of processing. This question itself is wrong, as Spark is a framework that allows you to process data using the MapReduce paradigm. There is no such thing as a "Hadoop MR style of processing".

    • @DataSavvy
      @DataSavvy  4 years ago +1

      I see... You seem to have an issue with the words used to frame the question... I think we should focus on the intent of the question rather than thinking too much about the words...

  • @sheshkumar8502
    @sheshkumar8502 8 months ago

    Hi how are you

  • @iam_krishna15
    @iam_krishna15 2 years ago

    This one can't be considered a Spark interview.

    • @DataSavvy
      @DataSavvy  2 years ago

      Hi Krishna... Please share your expectations... I will cover them in another video

  • @gauthamn1603
    @gauthamn1603 4 years ago

    Please don't use the word "we"; use "I".

    • @hasifs8139
      @hasifs8139 4 years ago +1

      Never use 'I' unless you are working alone.

    • @MrTeslaX
      @MrTeslaX 3 years ago

      @@hasifs8139 Always use "I" in an interview

    • @hasifs8139
      @hasifs8139 3 years ago

      @@MrTeslaX Thanks for the advice; luckily I rarely have to go for interviews nowadays. Personally, I don't hire people who use a lot of "I" in their answers, because they are just ignoring the existence of others on the team. Most likely they are bad team players, and I don't want such people on my team.

    • @MrTeslaX
      @MrTeslaX 3 years ago +1

      @@hasifs8139 Thanks for your response. I live and work in the US and have interviewed at FAANG companies and other smaller companies as well. One of the most important things I was told to keep in mind was to highlight my contribution and achievements, and not talk about the overall work done. Be specific, talk about the work you have done, and use "I" while talking about it.

    • @hasifs8139
      @hasifs8139 3 years ago +1

      @@MrTeslaX Thanks for your explanation. Yes, you must definitely highlight your contributions and achievements within the project. All I am saying is that you should not give the impression that you did it all on your own. Also, what difference does it make if you are living in the US or Germany (where I am) or anywhere else?

  • @ldk6853
    @ldk6853 6 months ago

    Terrible accent… 😮

  • @call_me_sruti
    @call_me_sruti 3 years ago +2

    Hey 👋... thank you for this awesome initiative. Btw, one thing: the WhatsApp link does not work!!