Incremental Data Load in Hive | Big data interview questions

  • Published 19 Dec 2024

COMMENTS • 117

  • @ArtAlive
    @ArtAlive 4 years ago +6

    Thanks a lot, that was an awesome explanation. I was searching for the answer to this. Nice, thank you so much.

  • @venkatramana7980
    @venkatramana7980 4 years ago +9

    Bro, really you are a hero. Helping others without expecting anything is really a big, big thing. Thanks a lot bro.

  • @kumarrk6343
    @kumarrk6343 5 years ago +7

    9:40 to 12:40. No words. What a simple explanation. Really mind-blowing. I have been rejected in more than 25 interviews so far even though I have 2 years of genuine big data experience. I have come to know where I am lacking. I can definitely crack my next interview with the help of your videos.

  • @sivak9750
    @sivak9750 4 years ago +3

    Best and simplest explanation. I didn't find this solution anywhere. Thanks a lot!!

  • @aa-kj9zm
    @aa-kj9zm 4 years ago

    Seems like real-time work. I am learning Hadoop but lost my way because I am not taking any training. This is very helpful. I will check all your videos. Thanks for this awesome video.

  • @DeepakSharma_youtube
    @DeepakSharma_youtube 4 years ago +6

    Very good explanation, but I have a few questions, because I've used a slightly different approach in our prod environment and this approach will also not solve our issue (Q3 below):
    Q1: @14:42 you didn't update the date to 2019-04-23, but it shows in your view. How?
    Q2: The other question I have is, how would you handle 'DELETES' on the source system?
    Q3: As we approach Day 30, or Day 365, etc., the main EXT table would be huge. Is there a way to kind of 'reset' that base table at some point so it doesn't grow every time? (See the sketch after this thread.)

    • @nareshj6370
      @nareshj6370 3 years ago

      I have the same question:)
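
    One possible answer to Q3 above, sketched in HiveQL with the table and column names used elsewhere in this thread (inc_table, empid, modified_date); the compacted table and the overall reset step are assumptions for illustration, not something shown in the video. The idea is to periodically materialize only the latest row per key, overwrite the base table with it, and archive the old daily delta files:

    -- Run periodically (e.g. monthly) to keep the base table from growing unbounded.
    -- 1. Materialize only the latest record per empid into a compacted table.
    CREATE TABLE inc_table_compacted AS
    SELECT t1.*
    FROM inc_table t1
    JOIN (SELECT empid, MAX(modified_date) AS max_modified
          FROM inc_table
          GROUP BY empid) s
      ON t1.empid = s.empid AND t1.modified_date = s.max_modified;

    -- 2. Overwrite the base table with the compacted data; the old daily files in its
    --    HDFS location can then be archived so future incremental loads start small.
    INSERT OVERWRITE TABLE inc_table
    SELECT * FROM inc_table_compacted;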

  • @sunshinemoon922
    @sunshinemoon922 2 years ago

    Awesome video sir. Very useful for interviews. Thank you very much.

  • @hemanthreddykolli
    @hemanthreddykolli 3 years ago

    This video is very helpful to understand the CDC concept. Thanks for sharing your knowledge.

  • @ramkumarananthapalli7151
    @ramkumarananthapalli7151 1 year ago

    Quite useful!! Thank you for making this 💐💐

  • @subramanianchenniappan4059
    @subramanianchenniappan4059 5 years ago

    I am a Java developer with Hadoop hands-on experience. I will watch all your videos, thanks for your help.

  • @debatrii
    @debatrii 4 years ago

    Very good, well explained, thanks.

  • @gauravpathak7017
    @gauravpathak7017 4 years ago +1

    Just wow! This is the best that anyone can have on incremental load in Hive. Cheers :)

  • @RaviKumar-uu4ro
    @RaviKumar-uu4ro 5 years ago +1

    Tons of thanks for your valuable videos. Really marvelous and incomparable to any other.

  • @bsrameshonline
    @bsrameshonline 4 years ago

    Very good explanation of incremental loading.

  • @sourav7413
    @sourav7413 3 years ago

    Thanks for your good, informative video...

  • @ririraman7
    @ririraman7 2 years ago

    Brilliant video! Much needed... to the point!

  • @bobbyvenkatesan3657
    @bobbyvenkatesan3657 4 years ago

    Thanks for sharing these kinds of videos. Very helpful.

  • @pravinmahindrakar6144
    @pravinmahindrakar6144 7 months ago

    Thanks a lot. I think we can use the row_number window function to get the updated records, partitioning by emp_id and ordering by date desc, and finally filter for row_number = 1.
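
    A minimal HiveQL sketch of that row_number approach, assuming the inc_table / empid / modified_date names used elsewhere in this thread (not the exact query from the video):

    SELECT *
    FROM (
      SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY empid ORDER BY modified_date DESC) AS rn
      FROM inc_table t
    ) ranked
    WHERE rn = 1;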

  • @puneetbhatia
    @puneetbhatia 3 years ago

    Explained amazingly. Thank you so much!

  • @ajinkyahatolkar6518
    @ajinkyahatolkar6518 2 years ago

    Quite informative video. Thanks!

  • @christiandave100
    @christiandave100 3 years ago +4

    Why the extra subquery t2? We can remove the second subquery, e.g.:
    select t1.* from
    (select * from inc_table) t1 join
    (select empid, max(modified_date) max_modified from inc_table group by empid) s
    on t1.empid = s.empid and t1.modified_date = s.max_modified

  • @udaynayak4788
    @udaynayak4788 2 years ago

    Thank you so much for the detailed explanation.

  • @astropanda1623
    @astropanda1623 1 year ago

    Very good explanation

  • @Sumit261990
    @Sumit261990 4 years ago

    Nice video, really useful. Thanks a lot.

  • @rajnimehta9189
    @rajnimehta9189 3 years ago

    Awesome explanation

  • @naveenvinayak1088
    @naveenvinayak1088 4 years ago

    Well explained... helping others without expectations.

  • @adityapratapsingh7649
    @adityapratapsingh7649 3 years ago +4

    Thanks for the detailed video. I have one question: we can do the same with window functions, right, like using row_number()? So which approach is the optimized one?
    select * from (select *, row_number() over (partition by id order by modifiedDate desc) as rk from v1) a where rk=1

    • @dhivakarsathya3918
      @dhivakarsathya3918 3 years ago

      I would prefer the group by and inner join that GK has used, which runs much faster than window functions in Hive. Better to follow sqoop incremental import if possible; otherwise the HDFS storage size will become massive and your view will take a lot of time to process.

  • @narasimharao3665
    @narasimharao3665 4 years ago +2

    If we do it like this, duplicate records will keep piling up in the underlying files and the file size will grow enormously, and whenever we run that view, its subqueries will also decrease performance. Instead of this we can use sqoop's incremental option ("sqoop incremental") to import the incremental data into an HDFS directory or cloud storage (like AWS S3).

  • @tallaravikumar4560
    @tallaravikumar4560 2 years ago +1

    Good explanation, but your text application is unclear; at least full black with a white font would have been clearer. Also, what if the modified date is not updated?

  • @kilarivenkatesh9844
    @kilarivenkatesh9844 3 years ago

    Nice explanation, bro.

  • @bigdatabites6551
    @bigdatabites6551 2 years ago

    Great job, bro...

  • @svdfxd
    @svdfxd 5 years ago

    I am preparing for big data interviews and such an interview series would be really helpful. Please add Spark interview questions as well. The way you explained patiently with examples is really good.

    • @GKCodelabs
      @GKCodelabs  5 years ago +1

      Sure, my interview series will cover a wide range of interview questions across all big data technologies. Hive, Spark, HBase, and data warehousing concepts will be a major part of those, as these are the most important skills in demand for most interviews.
      #KeepWatching :)

  • @the_high_flyer
    @the_high_flyer 4 years ago

    Super... thanks a ton for your video.

  • @sumitkumarsahoo
    @sumitkumarsahoo 4 years ago

    Thanks a lot! I was actually looking for something like this for loading incremental data.

  • @Sagar-gi5zq
    @Sagar-gi5zq 2 years ago +1

    What if we don't have a modified_date column...?

  • @NextGen_Tech_Hindi
    @NextGen_Tech_Hindi 9 months ago

    Thanks man, you are amazing 😍❤❤❤

  • @ANUKARTHIM
    @ANUKARTHIM 4 years ago

    Thanks for the video. Good work. Looking for more videos on HBase and its internals: how regions work in HBase, how to define regions while creating an HBase table, and much more, bro.

  • @junaidansari675
    @junaidansari675 5 years ago

    Very helpful. Please make videos about the other components and theory, and Hadoop admin job related videos...

  • @akshaychoudhari5641
    @akshaychoudhari5641 2 years ago +1

    How is incremental load in Hive different from the incremental load we do with a sqoop operation? Can you explain?

  • @abhiganta
    @abhiganta 5 years ago +4

    Hi Sir,
    What is the need to create t2?
    We can directly query as (select empid, max(modDate) from inc_table group by empid) s and then join t1 and s?
    Please correct me if I'm wrong.

    • @vermad6233
      @vermad6233 2 years ago

      Same question from my side!

  • @sagarsinghrajpoot6788
    @sagarsinghrajpoot6788 5 years ago

    You are awesome, man ;) I liked your videos, I feel like I am watching Netflix, so easy to understand :)

  • @vru5696
    @vru5696 3 years ago

    Awesome video... Any videos related to SCD and SCD revert in Hive? Please share the link.

  • @arunkumar-th8vy
    @arunkumar-th8vy 4 years ago +1

    Can you please help me: if we don't have any date column, then after loading day 2 into my history table (day 1), how do I make sure it doesn't contain any duplicates?

  • @ajaythedaredevil7220
    @ajaythedaredevil7220 5 years ago +2

    1. Can we use a merge statement for simplification? (See the sketch after this thread.)
    2. What if an employee id has been deleted in the new data set and now we don't want it in our final table?
    I can see the join will keep that deleted employee id as well.
    Many thanks!

    • @abhiganta
      @abhiganta 5 years ago

      I have the same doubt. What if some of the records are deleted from the source DB and we need to remove those records in Hive?

    • @arindampatra6283
      @arindampatra6283 4 years ago

      That would mean the new data has all the employees' info, and you can simply filter on the latest date 😊
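
    For reference on point 1 above: newer Hive versions (2.2+, with ACID enabled and an ORC transactional target table) do support MERGE, which can also cover the delete case raised in this thread. A hedged sketch, where emp_target, emp_staging, the non-key columns, and the op_type delete flag are all assumptions for illustration, not names from the video:

    -- emp_staging holds the latest extract; op_type = 'D' marks rows deleted at the source.
    MERGE INTO emp_target t
    USING emp_staging s
    ON t.empid = s.empid
    WHEN MATCHED AND s.op_type = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET name = s.name, salary = s.salary, modified_date = s.modified_date
    WHEN NOT MATCHED THEN INSERT VALUES (s.empid, s.name, s.salary, s.modified_date);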

  • @sumitkhandwekar6021
    @sumitkhandwekar6021 2 years ago

    Amazing brooooo

  • @sagarsinghrajpoot6788
    @sagarsinghrajpoot6788 5 years ago

    I got this real-time use case. Thanks :) Now we know how to handle incremental data, but do you have any video on a different use case, a "data transformation use case" using Hive (applying business transformations)? If yes, please tell me. I became a fan of you, man. From now on I will also practise like this on my own...

  • @M-Fash0070
    @M-Fash0070 2 years ago

    Please make one more video on RDBMS to Hive, maintaining history, updated data, and new data...

  • @MrManish389
    @MrManish389 5 years ago

    Sir, easily and simply explained.

  • @rakshithbs882
    @rakshithbs882 4 years ago +1

    Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?

    • @manikandanl4909
      @manikandanl4909 3 years ago

      We are not selecting all the columns in the S alias table, so we join with the t1 alias table to get all the columns.

  • @anilpatil6783
    @anilpatil6783 5 years ago +1

    Thank you GK. This incremental data load is the basis of millions of ETL jobs. Thank you for such a pitch-perfect explanation.
    I have a question: how is the logic after 9:40 put into production? I mean, how is it actually made to run every day?
    Here I can see only a view. Is this view used to load data from staging to some other layer each day? (See the sketch after this thread.)

    • @rathnakarlanka2624
      @rathnakarlanka2624 4 years ago

      Thank you GK. If we miss the incremental data extract a couple of times and then use the max date to join, there is a chance of missing records, right? So how do we overcome this problem?
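
    On the production question above, one common pattern is to schedule a daily job that rebuilds a consumable snapshot table from the dedup view after each day's file lands, so downstream jobs never read the raw delta files. A sketch under assumed names (latest_emp_view for the view built in the video, emp_snapshot for the reporting table), not the exact setup shown:

    -- Run by a scheduler (Oozie, Airflow, cron) after the daily file is copied to HDFS.
    INSERT OVERWRITE TABLE emp_snapshot
    SELECT * FROM latest_emp_view;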

  • @tarunreddy5917
    @tarunreddy5917 1 year ago

    @GK Codelabs, is there any difference between incremental data and delta data?

  • @deepikakumari5369
    @deepikakumari5369 4 years ago

    Nice explanation. Please upload a video on "How to handle multiple small files generated by Hive as output?". Thank you :)

  • @naveengupta7268
    @naveengupta7268 4 years ago

    Great Explanation!

  • @prashantahire143
    @prashantahire143 5 years ago +1

    Good explanation!
    How to perform an incremental Hive load from HDFS for a partitioned table? The table does not have a date/timestamp column.

    • @GKCodelabs
      @GKCodelabs  5 years ago +3

      Hi Prashant,
      Thanks for your comment ☺️
      We can use many other internal checkpoints in such cases. Thanks for sharing the scenario; I will surely explain this in one of my coming videos.
      #KeepWatching

    • @arindampatra6283
      @arindampatra6283 4 years ago +1

      I think you would have got your answer by now? If not, let's discuss: what is the partition column, and how are you loading new data into that table?

    • @mahesh.h1b339
      @mahesh.h1b339 1 year ago

      @GKCodelabs What type of join is this, bro?

  • @richalikhyani7204
    @richalikhyani7204 3 years ago

    Do you have any course playlist?

  • @ravikumark6746
    @ravikumark6746 4 years ago

    Great, sir. Thank you.

  • @saurav0777
    @saurav0777 4 years ago

    Is this the implementation for SCD type 2 as well?

  • @rajeshkumardash611
    @rajeshkumardash611 4 years ago +1

    @GK Codelabs This may not work if the data has deleted records.

  • @gandlapentasabjan9115
    @gandlapentasabjan9115 3 years ago

    Hi Bro,
    Very helpful videos, thank you so much for sharing this with us. I have a small doubt:
    if we don't have a date column, then how do we do it?

  • @Kutub2005
    @Kutub2005 3 years ago

    Hello GK Codelabs, thanks for this awesome video.
    Would you please make a video on adding a column where the modified data will be reflected?
    The scenario is that I don't have a modified_date column in my existing Hive table, so if I want to use the strategy you have shown in this video, how do I add a modified_date column to the existing Hive table and the HDFS data?

  • @ArunKumar-gw2ux
    @ArunKumar-gw2ux 4 years ago +1

    Good explanation! But this won't work on enterprise-scale data; it is not a scalable solution. For instance, if the incremental data is maintained for 12 months and updates come in every day, this deduping will take a long time to complete.

    • @kiranmudradi26
      @kiranmudradi26 4 years ago +1

      Can you please let us know a better solution for such scenarios? Thanks.

  • @rajeshreddy906
    @rajeshreddy906 4 years ago

    Hi, your videos are great. If you don't mind, could you please post a video on the sort-merge-bucket (SMB) join?

  • @gsp4420
    @gsp4420 3 years ago

    Hi, if we don't have a sequence ID and the CSV/table data contains duplicates, but the full combination of row values is unique, how do we do the incremental load in this situation? Thank you.

    • @gsp4420
      @gsp4420 3 years ago

      And we will not get a load date or a unique column in the source table and target table.

  • @snagendra5415
    @snagendra5415 2 years ago

    Could you do a video on the small file problem?

  • @kumarraja4759
    @kumarraja4759 4 years ago

    Nice explanation, but I have a question here: is the final join query really required to pick the latest records? I believe selecting all the columns with max(modified_date) would give the desired output. Correct me if I am wrong.

    • @manikandanl4909
      @manikandanl4909 3 years ago

      Bro, when we aggregate by some column (here mod_date), we need to list every other column in the "group by". If we have 100s of columns, we would have to list all of them. That's why it is joined back with the original table. (See the sketch after this thread.)

    • @ririraman7
      @ririraman7 2 years ago

      @manikandanl4909 Thank you for clearing the doubt, brother!
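
    A small HiveQL sketch of why the direct aggregate suggested above falls short (assuming inc_table has more columns than just empid and modified_date; the extra column names are illustrative): grouping by every column keeps each distinct version of a row as its own group, so old versions still come back, which is why the max-per-key subquery is joined back to the full table instead.

    -- If name or salary changed between versions, both versions survive this query:
    SELECT empid, name, salary, MAX(modified_date) AS modified_date
    FROM inc_table
    GROUP BY empid, name, salary;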

  • @mahammadshoyab9717
    @mahammadshoyab9717 5 years ago

    Don't think I'm asking too much, but if you could explain one end-to-end scenario, from pulling data from Kafka to HDFS landing and Hive loading, it would be very helpful for those who are struggling to clear interviews.

  • @naveenvinayak1088
    @naveenvinayak1088 4 years ago

    Can you do a video about Kafka?

  • @deepikakumari5369
    @deepikakumari5369 4 years ago

    Sir, will you please give me an answer to this: what approach should we take to load thousands of small 1 KB files using Hive? Do we load them one by one, or should we merge them together and load them at once, and how do we do this?

    • @ririraman7
      @ririraman7 2 years ago

      I believe Hive is not meant for small files!

  • @rohitaute9928
    @rohitaute9928 1 year ago

    What if we need to maintain versions in HBase?

  • @ravikirantuduru1061
    @ravikirantuduru1061 4 years ago

    Can you create videos on Spark joins when the data is skewed, and on joining small data with large data, explaining how Spark does sort-merge and shuffle joins?

  • @routhmahesh9525
    @routhmahesh9525 3 years ago

    Can you please make a video on handling small files in Apache Spark?

  • @arunsakkumar8463
    @arunsakkumar8463 4 years ago

    Please post the top interview questions for Hive.

  • @bhushanmayank
    @bhushanmayank 5 years ago +1

    I have a doubt: how long are we going to store the daily files in HDFS? Don't you think the performance of the view is going to suffer as more CSV files accumulate in the HDFS location it runs on top of? Is there any way to keep only the relevant records in a fresh file for Hive to process, and move the rest to cold storage?

  • @zeeshan42007
    @zeeshan42007 4 years ago

    Can you please share this Cloudera image? The one I have downloaded from CDH is very heavy and I am not able to work with it.

  • @The_Code_Father_v1.0
    @The_Code_Father_v1.0 4 years ago

    @gkcodelabs Can you please make some similar videos on PySpark with use cases asked in interviews?

  • @haranadhsanka9699
    @haranadhsanka9699 5 years ago

    Really appreciated. Can you also explain the Spark way of doing an incremental load?
    Thanks in advance.

    • @GKCodelabs
      @GKCodelabs  5 years ago +1

      Sure Haranadh, I will explain it in one of my coming videos.
      #KeepWatching ☺️

    • @rakshithbs882
      @rakshithbs882 4 years ago

      @GKCodelabs Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?

  • @nlaxman5091
    @nlaxman5091 4 years ago

    Hi bro, please do one video on how to choose memory, cores, and executors in a Spark cluster.

  • @mahammadshoyab9717
    @mahammadshoyab9717 5 years ago

    Hi Bro, how do we perform an incremental load when there are no primary key and datestamp columns in the table?
    Thanks in advance.

    • @arindampatra6283
      @arindampatra6283 4 years ago

      If there's no primary key, the data is gibberish 😊

    • @rakshithbs882
      @rakshithbs882 4 years ago

      @arindampatra6283 Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?

  • @vikky7480
    @vikky7480 5 years ago

    Please make a video on accumulators and broadcast variables, as well as aggregateByKey() with a good example.

    • @GKCodelabs
      @GKCodelabs  5 years ago +2

      Awesome Vicky, somehow you cracked what the next video is going to be about ☺️☺️ It's the very next video, which you requested.. coming soon (in a couple of days) ☺️☺️☺️💐

  • @arunkumarreddy9736
    @arunkumarreddy9736 4 years ago

    What if we don't have a date column? Can you please help?

  • @seetharamireddybeereddy222
    @seetharamireddybeereddy222 5 years ago

    Can I know how to work with a staging table in Hive?

  • @arupanandaprasad2202
    @arupanandaprasad2202 3 years ago

    How to do an incremental load in Spark?

  • @BigDataWithSky
    @BigDataWithSky 1 year ago

    Why are you not uploading videos regularly?

  • @sathishanumaiah6907
    @sathishanumaiah6907 5 years ago

    Could you please explain the same process with RDBMS data instead of files?

  • @rohitsotra2010
    @rohitsotra2010 5 years ago

    What if we don't have any date column like modified_date?

    • @suhaskolaskar552
      @suhaskolaskar552 4 years ago

      You need to first learn about slowly changing dimensions, then you won't ask this question.

  • @ravishankarrallabhandi531
    @ravishankarrallabhandi531 6 months ago

    How can we handle the case where source records are closed/deleted?

  • @kalyanis6886
    @kalyanis6886 3 years ago

    Hello, can you share the VM image?

  • @ambikaprasadbarik6400
    @ambikaprasadbarik6400 4 years ago

    Thanks a lot!

  • @sudhakarsubramani1528
    @sudhakarsubramani1528 2 years ago

    Thank you.

  • @PramodKhandalkar5
    @PramodKhandalkar5 3 years ago

    Thanks, man.

  • @swapnilpatil1422
    @swapnilpatil1422 2 years ago

    I was asked this question twice...

  • @Shiva-kz6tn
    @Shiva-kz6tn 4 years ago

    16:10 You don't need to ask that :)

  • @dineshughade6570
    @dineshughade6570 4 years ago

    The screen needs to be clearer. I am barely managing to see your screen.