Thanks a lot, that was an awesome explanation. I was searching for the answer to this. Thank you so much.
Bro, really you are a hero. Helping others without expecting anything is a really big thing. Thanks a lot, Bro.
9:40 to 12:40. No words. What a simple explanation, really mind-blowing. I have been rejected in more than 25 interviews so far, even though I have 2 years of genuine big data experience. Now I know where I am lacking. I can definitely crack my next interview with the help of your videos.
What did you do for those 2 years?
Best and simplest explanation. I didn't find this solution anywhere. Thanks a lot!!
This seems like real-time work. I am learning Hadoop but lost my way because I am not taking any training. This is very helpful. I will check out all your videos. Thanks for this awesome video.
Very good explanation but I have a few questions, because I've used a slightly different approach in our prod environment, and this approach will also not solve our issue (Q3 below):
Q1: @14:42 you didn't update the date to 2019-04-23 but it shows in your view. How?
Q2: The other question I have is, how would you handle 'DELETES' on the source system?
Q3: As we approach Day 30 or Day 365, the main EXT table will become huge. Is there a way to 'reset' that base table at some point so it doesn't keep growing every time?
I have the same question:)
Awesome video sir. Very useful for interviews. Thank you very much.
This video is very helpful for understanding the CDC concept. Thanks for sharing your knowledge.
Quite useful!! Thank you for making it 💐💐
I am a Java developer with Hadoop hands-on experience. I will watch all your videos, thanks for your help.
Very good..well explained..thanks
just wow! This is the best that anyone can have on incremental Load in Hive. cheers :)
Tons of thanks for your valuable videos. Really marvelous and incomparable to any others.
very good explanation on incremental loading
thanks for your good informative video ...
Brilliant video! much needed...to the point!
Thanks for sharing these kinds of videos. Very helpful.
Thanks a lot. I think we can use the row_number window function to get the updated records, partitioning by emp_id and ordering by date desc, and finally filtering for row_number = 1.
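A minimal sketch of that row_number approach in HiveQL, assuming a source table or view named emp_inc with columns emp_id and modified_date (the names here are illustrative, not necessarily the ones used in the video):
select *
from (
  -- rank each employee's rows newest-first by modified_date
  select t.*,
         row_number() over (partition by emp_id order by modified_date desc) as rn
  from emp_inc t
) ranked
where rn = 1;  -- keep only the latest row per emp_id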
Explained amazingly. Thank you so much!!
quite informative video. thanks !
Why the extra subquery t2? We can remove the second subquery, e.g.
select t1.*
from (select * from inc_table) t1
join (select empid, max(modified_date) as max_modified
      from inc_table group by empid) s
  on t1.empid = s.empid and t1.modified_date = s.max_modified
Thank you so much for the detailed explanation.
Very good explanation
Nice video, Really useful. Thanks a lot
Awesome explanation
well explained..helping others without expectations
Thanks for the detailed video. I have one question: we can do the same with window functions, right, e.g. using row_number()? So which approach is the more optimized one?
select * from (select *, row_number() over (partition by id order by modifiedDate desc) as rk from v1) a where rk = 1
I would prefer the group by and inner join approach that GK has used, which runs much faster than window functions in Hive. Better to follow a sqoop import if possible, otherwise the HDFS storage size will become massive and your view will take a lot of time to process.
If we do it like this, the duplicate records stay in the underlying files and the data size keeps growing, and every time we run the view its subqueries also hurt performance. Instead of this, we can use sqoop's incremental option ("sqoop incremental") to import only the incremental data into an HDFS directory or cloud storage (like AWS S3).
Good explanation, but your text application is unclear; at least full black with a white font would have been clearer. Also, what if the modified date is not updated?
Nice explanation.. Bro
Great job Bro....
I am preparing for Big Data interviews and such interview series would be really helpful. Please add Spark interview questions as well. The way you explained patiently with example is really good.
Sure, My interview series will cover a wide range of interview questions in all BD technologies. Hive, Spark, HBase, and Datawarehousing concepts will be a major part of those, as these are the most important skills in demand for most of the interviews.
#KeepWatching :)
Super...thanks a ton for your video.
Thanks a lot! I was actually looking for something like this for loading incremental data.
What if we don't have a modified_date column...?
Thanks man, you are amazing 😍❤❤❤
Thanks for the video. Good work. Looking for more videos on HBase and related topics: how regions work in HBase, how to define regions while creating an HBase table, and many more, Bro.
Very helpful. Please make videos about the other components and theory, and Hadoop admin job related videos...
How is incremental load in Hive different from the incremental load we do with a sqoop operation? Can you explain?
Hi Sir,
What is the need to create t2?
Can we directly query as (select empid, max(modDate) from inc_table group by empid) s and then join t1 and s?
Please correct me if I am wrong.
Same question from my side!
You are awesome, man ;) I liked your videos; I feel like I am watching Netflix, it's that easy to understand :)
Awesome video... Any videos related to SCD and SCD revert in Hive? Please share the link.
Can you please help me: if we don't have any date column, then after loading day 2 into my history table (day 1), how do I make sure it doesn't contain any duplicates?
1. Can we use a merge statement for simplification?
2. What if an employee id has been deleted in the new data set and we don't want it in our final table?
I can see the join will keep that leftover employee id as well.
Many thanks!
I have the same doubt.. What if some of the records are deleted from the source db and we need to remove those records in Hive?
That means the new data has all the employees' info, and you can simply filter on the latest date 😊
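For reference (not covered in the video), Hive does have a MERGE statement, but only on ACID (transactional, ORC) tables; a rough sketch of how updates and deletes could be handled in one pass, with hypothetical table and column names (emp_final as the transactional target, emp_inc as the latest extract carrying a del_flag for source deletes):
merge into emp_final as t
using emp_inc as s
on t.empid = s.empid
-- drop rows the source marked as deleted
when matched and s.del_flag = 'Y' then delete
-- refresh rows that already exist
when matched then update set name = s.name, city = s.city, modified_date = s.modified_date
-- add brand-new employees
when not matched then insert values (s.empid, s.name, s.city, s.modified_date);
If the extract does not carry a delete flag, deletes generally have to be detected separately, e.g. by comparing against a full snapshot.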
Amazing brooooo
I got this real-time case. Thanks :) Now we know how to handle incremental data, but do you have any video on a different use case - a "Data Transformation use case" using Hive (applying business transformations)? If yes, please tell me. I became a fan of you, man. From now on I will also practise like this on my own....
Please make one more video on RDBMS to Hive, maintaining history, updated data and new data....
Sir, easily and simply explained.
Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?
We are not selecting all the columns in the S alias table, so we join with the t1 alias table to get all the columns.
Thank you GK. This incremental data load is the basis of millions of ETL jobs. Thank you for such a pitch-perfect explanation.
I have a question: how is the logic after 9:40 put into production? I mean, how is it actually made to run every day?
Here I could see the view only. Is this view used to load data from staging into some other layer each day?
Thank you GK. If we miss the incremental data extract a couple of times and we use the max date to join, then there is a chance of missing records, right? So how do we overcome this problem?
@GK Codelabs, is there any difference between incremental data and delta data?
Nice explanation. Please upload a video on "How to handle multiple small files generated as Hive output?".. Thank you :)
Great Explanation!
Good Explanation !
How do we perform an incremental Hive load from HDFS for a partitioned table? The table does not have a date/timestamp column.
Hi Prashant,
Thanks for your comment.. ☺️
We can use many other internal checkpoints in such cases. Thanks for sharing the scenario, I will surely explain this in one of my coming videos..
#KeepWatching
I think you would have got your answers by now? If not then let's discuss. What is the partition column? How are you loading new data in that table?
@GKCodelabs what type of join is this, bro?
Do you have any course playlist?
Great sir..thank you
Is this the implementation for SCD type 2 as well?
@GK Codelabs This may not work if the data has deleted records.
Hi Bro,
Very helpful videos, thank you so much for sharing this with us. I have a small doubt:
if suppose we don't have a date column, then how do we do this?
Hello GK Codelabs, thanks for this awesome video.
Would you please make a video on adding a column in which the modified date will be reflected?
The scenario is that I don't have a modified_date column in my existing Hive table, so if I want to use the strategy you have shown in this video, how do I add a modified_date column to the existing Hive table and the HDFS data??
Good explanation! But this won't work on enterprise-level data; it is not a scalable solution. For instance, if the incremental data is kept for 12 months and updates come in every day, this deduplication will take a long time to complete.
Can you please let us know a better solution for such scenarios? Thanks.
Hi, your videos are great. If you don't mind, could you please post a video on the sort merge bucket (SMB) join?
Hi, if we don't have a sequence id and the CSV/table data contains duplicates, but the total combination of row values is unique, how do we do the incremental load in this situation? Thank you.
And we will not get a load date, and there is no unique column in the source table or target table.
Could you do a video on the small file problem
Nice explanation, but I have a question here. Is the final join query required to pick the latest records? I believe selecting all the columns with max(modified_date) would give the desired output. Correct me if I am wrong.
Bro, when we aggregate by some column (here mod_date), we need to specify all the other columns in the "group by". If we have 100s of columns, we would have to specify every one of them. That's why it is joined back with the original table.
@manikandanl4909 Thank you for clearing up the doubt, brother!
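To illustrate the point above: the aggregate query on its own returns only the key and the max date, so it has to be joined back to the full table to recover the remaining columns. A sketch with illustrative names:
-- this alone gives just empid and the latest modified_date
select empid, max(modified_date) as max_modified
from inc_table
group by empid;

-- joining back to the base table recovers all the other columns of that latest row
select t1.*
from inc_table t1
join (select empid, max(modified_date) as max_modified
      from inc_table group by empid) s
  on t1.empid = s.empid and t1.modified_date = s.max_modified;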
I don't think I'm asking too much: if you could explain one end-to-end scenario, from pulling from Kafka to HDFS landing and Hive loading, it would be very helpful for people who are poor and struggling to clear interviews.
Can you do a video about Kafka?
Sir, will you please give me an answer to this? What approach should we take to load thousands of small 1 KB files using Hive? Do we load them one by one, or should we merge them together and load them at once, and how do we do this?
I believe Hive is not meant for small files!
What if we need to maintain versions in HBase?
Can you create videos on Spark joins when the data is skewed, and on joining small data with large data: how to do those joins, and how Spark does sort-merge and shuffle joins?
Can you please make a video on handling small files in Apache Spark?
Please post the topmost interview questions for Hive.
I have a doubt: how long are we going to store the daily files in HDFS? Don't you think the performance of the view will suffer as more and more CSV files pile up in the HDFS location it runs on top of? Is there a way to keep only the relevant records in a fresh file for Hive to process, and move the rest to cold storage?
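One way this is often handled (not shown in the video; the table and view names below are hypothetical) is to periodically materialize the de-duplicated view into a compacted table and then archive or delete the old daily files:
-- compacted copy holding only the latest record per employee
create table if not exists emp_compacted (empid int, name string, city string, modified_date date)
stored as orc;

insert overwrite table emp_compacted
select empid, name, city, modified_date
from emp_latest_view;  -- the de-duplicating view (name assumed)

-- the processed daily CSVs can then be moved from the external table's HDFS
-- location to cold storage, and the base location re-seeded from emp_compacted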
Can you please help with this Cloudera image? I have downloaded it from CDH; it is very heavy and I am not able to work with it.
@gkcodelabs can you please make some similar videos on PySpark, with use cases asked in interviews?
Really appreciate it; can you also explain the Spark way of doing an incremental load?
Thanks in advance.
Sure Haranadh, I will explain it in one of my coming videos.
KeepWatching ☺️
@GKCodelabs Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?
Hi bro, please do one video on how to choose memory, cores, and executors in a Spark cluster.
Hi Bro, how do we perform an incremental load when there are no primary key and datestamp columns in the table?
Thanks in advance
If there's no primary key, the data is gibberish 😊
@arindampatra6283 Hi, very nice explanation. I have one doubt: what if we use only the S alias table query? Will it return the same output?
Please make a video on accumulators and broadcast variables, as well as aggregateByKey(), with a good example.
Awesome Vicky, somehow you guessed what the next video is going to be about ☺️☺️. It's the very next video, which you requested.. coming soon (in a couple of days) ☺️☺️☺️💐
What if we don't have a date column? Can you please help?
Can I know how to work with a staging table in Hive?
How do we do an incremental load in Spark?
Why are you not uploading videos regularly?
Could you please explain the same process with RDBMS data instead of files?
What if we don't have any date column like modified date??
You need to first learn about slowly changing dimensions; then you won't ask this question.
How can we handle the case where source records are closed / deleted ?
Hello, can you share the VM image?
Thanks a lot!
thank you
thanks man
I was asked this question twice...
16:10 You don't need to ask that :)
The screen needs to be clearer. I am barely managing to read your screen.