00:01 Data pipelines automate data collection, transformation, and delivery.
00:38 A data pipeline involves stages like collect, ingest, store, compute, and consume.
01:18 A data pipeline captures live data feeds for real-time tracking.
01:56 A data pipeline involves batch and stream processing of ingested data.
02:40 Data pipeline tools like Apache Flink and Google Cloud are used for real-time processing of data streams.
03:23 Data is transformed for analysis in the storage phase.
04:05 Data pipelines enable various end users to leverage data for predictive modeling and business intelligence tools.
04:47 A data pipeline enables continuous learning and improvement using machine learning models.
Crafted by Merlin AI.
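To make those five stages concrete, here is a toy sketch of them as plain Python functions. This is purely illustrative: the sample records and function bodies are made up, and real pipelines replace each function with tools like the ones named in the video.

```python
# Toy sketch of collect -> ingest -> store -> compute -> consume.
# Illustrative only: real pipelines swap each function for a tool
# (e.g., Kafka for ingest, a warehouse for store, Spark/Flink for compute).

def collect():
    # Raw events from apps, devices, or logs (made-up sample data).
    return [{"user": "u1", "action": "click"},
            {"user": "u2", "action": "view"},
            {"user": "u1", "action": "view"}]

def ingest(events):
    # Hand records to the pipeline, e.g., by publishing to a stream.
    return list(events)

def store(events):
    # Land the raw records somewhere durable (data lake / warehouse).
    return {"raw_events": events}

def compute(lake):
    # Batch or stream transformation: here, count actions per type.
    counts = {}
    for event in lake["raw_events"]:
        counts[event["action"]] = counts.get(event["action"], 0) + 1
    return counts

def consume(metrics):
    # End users consume results via BI dashboards or ML models.
    print(metrics)  # {'click': 1, 'view': 2}

consume(compute(store(ingest(collect()))))
```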
I work as a PM in data enablement; this video was amazing for understanding each component in a data pipeline.
3:13 typo *AWS Glue.
Love these vids, thanks!
bruh had me googling what's AWS Glow
just learned more in 5 minutes than I learned in 5 years. instant subscribe. thank you!
0:49 Shouldn't the last one be 'Consume'?
Yeah... an error, but the video has already been published. They can't go back and edit it from the beginning.
I love the short video format, as I can dive deeper on topics and terms I am interested in on my own time :)
Amazing explanation, by far the easiest-to-digest video about data pipelines.
Great video. Showed me the fundamentals of data pipelines and processes from collection to consumption. There are so many tools/applications extensively used for data processing at various stages that I have never heard of, or only encounter in job descriptions, but since I am not a data specialist, I had no idea of! Thanks for putting these short summaries online. Helpful for people like myself!
The best animated introduction to data pipelines in just five minutes.
Great overview of data pipelines! Thanks!
Spark is widely used in stream processing too, not only batch; see Spark Structured Streaming.
For stream processing, Apache Flink is better suited, even though both can do stream and batch processing.
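For anyone curious what Spark Structured Streaming looks like, a minimal sketch, assuming a Spark install with the spark-sql-kafka connector package available and a broker at localhost:9092; the topic name is made up:

```python
# Minimal Structured Streaming sketch: the batch DataFrame API applied
# to an unbounded Kafka source. Assumes the spark-sql-kafka connector
# package is on the classpath and a broker runs at localhost:9092.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read the (hypothetical) "events" topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Same API as batch Spark: a running count of records per key.
counts = events.groupBy("key").count()

# Continuously emit the updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```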
Your channel is a blessing.
What do you use to create these animations/info graphics
I think it could be either Figma or Canva.
@@Biostatistics is there a video out there that shows how that is done in power point? I see these data like infographics a lot these days
@@user-data_junkieit’s says in the description of this video, he used Adobe illustrator and after effects. 😊
@@Biostatistics thanks. I did check at the time and did not see anything. Appreciate the update
Which tool do you use to create these animated presentations?
Trade secret 😂
I also want to know what he uses to create the presentation illustrations. They look neat
💯 Looking like your channel is on track for 1 million subscribers by year end! Great stuff! 😎✌️
This video was amazing for understanding, thank you 🤗🤗
Loved how simply you explained this complicated concept! Also what are your thoughts on Irys, world's only provenance layer ensuring the data integrity and accountability.
Why do we mostly talk about data pipelines for BI or ML when many times we also need them for functional applications?
Those functional applications should likely use the same data platform; the only difference is how you're serving the transformed result. What difference do you think should be talked about, then?
Functional applications most likely consume a very small amount of data, while BI and AI/ML models require far more, often GBs to TBs of data to work with.
There's no practical way to load 1 GB of data into your web app or query it all with SQL; it just clogs your app and wastes time.
Because more and more non-traditionally technical business roles are leveraging data for business intelligence - so the demand for understanding these concepts is greater there (than in complex application architectures where more traditional technical skill accumulates).
Just call it messaging and you’re good to go
@@manishshaw1002 this isn't always true. At the health insurance company I work at, we have functional applications that internal users and providers use to view data about members, and there are vast amounts of data streaming to and from these applications.
I think you meant AWS Glue at 3:18. Appreciate these informative videos
Very good discussion
Why is Apache Flink not an option for batch processing? As I understand it, it makes more sense to use the same computation frameworks when doing both, so why not use Flink for both given Flink can support batch jobs?
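It is an option; since Flink 1.12 the same APIs run in both modes. As a sketch of the point, assuming PyFlink is installed, the Table API pipeline below differs between batch and streaming only in the EnvironmentSettings line (the toy rows are made up):

```python
# Sketch: the same Flink Table API code in batch vs. streaming mode.
# Only the EnvironmentSettings line changes between the two modes.
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch mode: bounded inputs, results produced once at the end.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
# Streaming would be: EnvironmentSettings.in_streaming_mode()

# A toy bounded source (made-up rows) works in either mode.
table = t_env.from_elements([(1, "click"), (2, "view")], ["id", "action"])
table.execute().print()
```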
Thank you very much. Mind-blowing explanation!
Top quality work as always
thanks for the knowledge you share
I like your presentations. What do you use to make them?
I also want to know what he uses to create the presentation illustrations. They look neat
@@chrisalmighty Adobe Illustrator and After Effects
Amazing video. Thanks for your great efforts!
Love it. This jargon is all cleared up now.
One-stop-shop video. Loved it ♥
Love the presentation. Do you recommend any resources for making these?
this was very useful. thanks for sharing.
Very Good Video!! Easy to get!
Fantastic video and graphics, what program do you use to animate your graphics? It's great stuff.
Suppose we have 100 microservices deployed as different AWS Lambda functions. Out of these, more than 30 Lambda functions need to write data to MongoDB Atlas. Each of these 30 functions is triggered simultaneously via SNS (Simple Notification Service), and each function will be invoked 200,000 times to process different data.
Given this setup, the MongoDB Atlas connection limit will likely be exhausted due to the large number of simultaneous requests.
What would be the best approach to handle this scenario without running into connection problems with MongoDB Atlas? Could you create a video for this scenario, sir?
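Not the author, but one common mitigation, sketched with hedges below: create the MongoClient once per Lambda container (module scope) and cap its pool size, so warm invocations reuse connections instead of opening new ones; beyond that, putting a queue (SQS) between SNS and the writers throttles concurrency. The env var, database, and collection names here are hypothetical:

```python
# Hedged sketch: reuse one pooled MongoClient across warm Lambda
# invocations instead of creating a client per call. The URI env var,
# database, and collection names are hypothetical.
import os
from pymongo import MongoClient

# Module scope: created once per Lambda container, reused on warm starts.
client = MongoClient(os.environ["MONGODB_URI"], maxPoolSize=5)
collection = client["mydb"]["events"]

def handler(event, context):
    # SNS-triggered Lambda: each record carries one message to persist.
    for record in event["Records"]:
        collection.insert_one({"payload": record["Sns"]["Message"]})
    return {"ok": True}
```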
Is GA4 considered a data stream? And is BigQuery a storage and transform tool?
Video illustrations look neat. What tool did you use to create the presentation illustrations?
3:13, what is AWS Glow? Typo??
Thank you for doing this!
Great video. Small remark: the AWS service for ETL is called AWS Glue, not Glow
Very informative!! But how do you do all these animations?? What product do you use?!
Maybe some examples of a simplified pipeline for a specific application would make this video even better.
So I need to build a way to retrieve many emails, categorize them with an ML model, and then save them in the right system. Do I build this with Kafka and PySpark? Or how can this be done easily?
Kafka dear
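Kafka can carry the emails, but Spark isn't required at modest volume. A minimal sketch, assuming kafka-python, a pre-trained scikit-learn text pipeline (vectorizer + classifier) saved with joblib, and a made-up topic and message layout; route_to_system() is a hypothetical helper for the "save in the right system" step:

```python
# Hedged sketch: a plain Kafka consumer feeding an ML classifier.
# Assumes: kafka-python installed, a broker at localhost:9092, a
# pre-trained scikit-learn pipeline saved with joblib, and a JSON
# message layout -- all of these are hypothetical.
import json
import joblib
from kafka import KafkaConsumer

model = joblib.load("email_classifier.joblib")  # hypothetical model file

consumer = KafkaConsumer(
    "incoming-emails",  # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def route_to_system(email, category):
    # Hypothetical: write to whichever downstream system owns this category.
    print(f"email {email.get('id')} -> {category}")

for message in consumer:
    email = message.value
    # predict() takes raw text if the pipeline embeds its own vectorizer.
    category = model.predict([email["body"]])[0]
    route_to_system(email, category)
```

PySpark only earns its keep once one consumer can't keep up; Kafka consumer groups scale this horizontally first.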
Always so so good
How are these beautiful diagrams drawn? They're fantastic!
Intimidating!
Thanks!
I want to learn system design for data pipelines
Could you please suggest how to proceed? What books?
Thanks
No mention of Apache Iceberg and such technology?
AWS Glow or AWS Glue?
Bravo!
Your diagram showed 'compute' arrows twice, while you verbally said 'compute' and 'consume' for the last two phases.
Why would ETL here be considered real-time, when ETL is slower because you need to transform every single extraction before you load it into a data warehouse?
Looks like your examples are only from the AWS or Google stacks. Why not cover examples from the MS Azure stack as well?
No mention of the data lakehouse, either.
You have an error in the diagram: two 'compute' labels. It should be 'compute' and 'consume'.
❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤
Tomorrow I have an interview :)
Do data analysts build data pipelines?
AWS Glow or Glue?
Leaving out all Azure tools... really a shame
Maybe it's intentional. Many serious data scientists aren't fond of the Azure UI for big data pipelines.
Microsoft training has that covered
So basically a data pipeline is similar to a system flowchart?
"Trade Secret" name of the tool used to create the animations ...😂
I don't know why, but the gain of the microphone is too high; there is a little background noise and it's a bit noticeable. Keep it in check.
Great video, as always on this channel.
This seems so complicated
AWS Glue*
AWS Glue, not Glow
apache hive logo is on acid
😎🤖
REST API
looks like you need to change the mic you are currently using. there is some crackling noise when you talk.
Hadoop is dead
Why, what's the reason?
They said that about mainframe computers 30 years ago, but they are still here, in production. Large organizations are not going to adopt the latest solutions for all their data needs (for instance, for data that isn't accessed often, or for specific use cases; or they might have support staff more familiar with legacy tools and not see the need to adopt the latest methods at the moment). So I can guarantee Hadoop is NOT completely dead.
Lol it’s not dead at all, and its ecosystem tools are still widely used
😂 Most use HDFS as a data lake. When you say Hadoop is dead, be precise and say MapReduce is dead, because the Hadoop ecosystem is large and still functioning.
@@shilashm5691 Most use AWS S3 as storage for their data lake, others Azure Data Lake Storage. MapReduce is dead and HDFS is on the brink of obscurity as well. I pity those who still have to work with some in-house HDFS from the darkest and most painful era of data engineering (the Hadoop era).
I like your content a lot, but you have a lot of mistakes, not only in this video but also in the others.
Mislabeling, duplications. It can get very confusing for a beginner. Similarly, if you are using acronyms, I would recommend explaining them or at least stating the full name.