Anirvan Decodes
India
Joined Jan 14, 2022
Hi All,
I am Anirvan Sen, and I love to talk about SQL, Python, Apache Spark, and Data Engineering.
My passion is teaching complex software engineering topics in a very simplified way.
That is why the channel is called "Anirvan Decodes": we decode complex topics and make them simple.
I am starting this YouTube channel to build a community where we learn from each other and grow together.
Join me to learn different Software Engineering skills in a fun and simple way.
See you in my videos!
Handling Late Arriving Data in Spark Structured Streaming with Watermarks
Handling Late Arriving Data in Spark Structured Streaming with Watermarks
In this video, we explore what late-arriving data is and how to handle it in Spark Structured Streaming using watermarks and the state store (a minimal code sketch follows the chapter list and code link below).
📽️ Chapters to Explore
0:00 Introduction
0:13 What is late data
3:29 State store
5:36 RocksDB state store
5:59 Handle late data using watermarks
💻 Code is available on GitHub: github.com/anirvandecodes/Spark-Structured-Streaming-with-Kafka
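For readers who want a concrete starting point before watching, here is a minimal PySpark sketch of the watermark technique covered in the chapters. The broker address, topic name, schema, and paths are illustrative assumptions, not the exact code from the repository above.

```python
# Minimal watermark sketch (illustrative: broker, topic, schema and paths
# are assumptions; see the GitHub repo above for the code used in the video).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("late-data-watermark").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
    .option("subscribe", "events")                        # assumption
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Events arriving more than 10 minutes behind the latest event_time seen
# so far are treated as too late; their window state can then be dropped
# from the state store instead of growing forever.
late_safe_counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "device_id")
    .count()
)

query = (
    late_safe_counts.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/chk/late-data")   # assumption
    .start()
)
```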
🌟 Stay Connected and Level Up Your Data Engineering Skills!
🔔 Subscribe Now: www.youtube.com/@anirvandecodes?sub_confirmation=1
🤝 Let's Connect: www.linkedin.com/in/anirvandecodes/
🎥 Explore Playlists Designed for You:
🚀 Spark Structured Streaming with Kafka: ua-cam.com/play/PLGCTB_rNVNUNbuEY4kW6lf9El8B2yiWEo.html
🛠️ DBT (Data Build Tool): ua-cam.com/play/PLGCTB_rNVNUON4dyWb626R4-zrLtYfVLa.html
🌐 Apache Spark for Everyone: ua-cam.com/play/PLGCTB_rNVNUOigzmGI6zN3tzveEqMSIe0.html
📌 Love the content? Show your support! Like, share, and subscribe to keep learning and growing in data engineering. 🚀
Song: Dawn
License: Creative Commons (CC BY 3.0) creativecommons.org/licenses/by/3.0
open.spotify.com/artist/5ZVHXQZAIn9WJXvy6qn9K0
Music powered by BreakingCopyright: breakingcopyright.com
Views: 46
Videos
Streaming Aggregates in Spark : Tumbling vs Sliding Windows with Kafka
63 views · 1 month ago
Streaming Aggregates in Spark - Tumbling vs Sliding Windows with Kafka In this video, we break down the concept of streaming aggregates in Spark Structured Streaming and explain the difference between tumbling and sliding windows. Using Kafka as the data source, we demonstrate how to effectively process and aggregate real-time data. 📽️ Chapters to Explore 0:00 Introduction 0:20 Use Case for str...
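A minimal sketch of the two window types this video contrasts; it uses Spark's built-in rate source as a stand-in stream (in the video the source is Kafka), and the column names are assumptions.

```python
# Tumbling vs sliding windows on a stand-in rate-source stream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("windows-demo").getOrCreate()

events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumnRenamed("timestamp", "event_time")
)

# Tumbling window: fixed, non-overlapping 5-minute buckets;
# each event lands in exactly one window.
tumbling = events.groupBy(window("event_time", "5 minutes")).count()

# Sliding window: 5-minute windows starting every 1 minute;
# each event can fall into several overlapping windows.
sliding = events.groupBy(window("event_time", "5 minutes", "1 minute")).count()

query = (
    tumbling.writeStream.outputMode("complete")
    .format("memory").queryName("tumbling_counts")
    .start()
)
```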
Spark Structured Streaming Sinks and foreachBatch
72 views · 1 month ago
Spark Structured Streaming Sinks and foreachBatch Explained This video explores the different sinks available in Spark Structured Streaming and how to use the powerful foreachBatch sink for custom processing. 📽️ Chapters to Explore 0:00 Introduction 0:25 Types of sinks 0:50 Memory Sink 2:08 Kafka Sink 4:00 Delta Sink 4:30 toTable does not support update mode 4:53 How to use foreachBatch 💻 Code ...
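A short sketch of the foreachBatch pattern this video covers: the function receives each micro-batch as a regular DataFrame, so any batch sink can be used. The output and checkpoint paths are assumptions.

```python
# foreachBatch sketch: each micro-batch arrives as a plain DataFrame,
# so any batch API works (JDBC, Delta MERGE, writing to several sinks...).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-demo").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

def write_batch(batch_df, batch_id):
    # batch_df is a normal (non-streaming) DataFrame for this micro-batch.
    batch_df.write.mode("append").format("parquet").save("/tmp/out/rate")

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/chk/foreachbatch")
    .start()
)
```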
Spark Structured Streaming Output Mode | Append| Update | Complete Modes
49 views · 1 month ago
Spark Structured Streaming Output Modes Explained In this video, we explore the output modes in Spark Structured Streaming, a crucial concept for controlling how processed data is written to sinks. Understanding these modes helps in designing efficient and accurate streaming pipelines. 📽️ Chapters to Explore 0:00 Introduction 0:45 Complete Mode 2:30 Update Mode 4:24 Append Mode 5:32 How to use ...
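For context, a small sketch contrasting the three output modes on an aggregation, using a rate source as a stand-in stream; the bucketing column is an illustrative assumption.

```python
# Output-mode sketch on a simple aggregation over a rate source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-modes-demo").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

# complete: rewrite the whole result table on every trigger (aggregations only).
q = counts.writeStream.outputMode("complete").format("console").start()

# update: emit only the rows whose aggregate changed in this trigger.
# q = counts.writeStream.outputMode("update").format("console").start()

# append: emit only rows that will never change again; for aggregations
# this requires a watermark, while plain projections/filters default to it.
# q = stream.writeStream.outputMode("append").format("console").start()
```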
Spark Structured Streaming Checkpoint
62 views · 1 month ago
Understanding Spark Structured Streaming Checkpoints In this video, we dive deep into checkpoints in Spark Structured Streaming and their critical role in ensuring fault-tolerant and stateful stream processing. 📽️ Chapters to Explore 0:00 Introduction 0:20 Why Checkpoint is required? 1:50 How to define checkpoint 2:15 Content of a checkpoint folder 4:26 Kafka offset information in checkpoint 5:...
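A brief sketch of how a checkpoint location is attached to a query; the sink and paths are assumptions rather than the video's exact code.

```python
# Checkpoint sketch: the checkpoint folder stores source offsets, commit
# logs and operator state, so a restarted query resumes exactly where it
# left off. Use one dedicated folder per query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/out/checkpoint-demo")
    .option("checkpointLocation", "/tmp/chk/checkpoint-demo")
    .start()
)
```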
Spark Structured Streaming Trigger Types
68 views · 1 month ago
In this video, we dive into Spark Structured Streaming trigger modes, a key feature for managing how your streaming queries process data. Whether you're working with real-time data pipelines, ETL jobs, or low-latency applications, understanding trigger modes is essential to optimize your Spark jobs. 📽️ Chapters to Explore 0:00 Introduction 0:40 Why Do We Need Trigger Types? 1:50 Default Trigger ...
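A hedged sketch of the main trigger options (the availableNow trigger assumes Spark 3.3+; paths are illustrative).

```python
# Trigger-mode sketch on a rate source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

writer = (stream.writeStream.format("console")
          .option("checkpointLocation", "/tmp/chk/trigger-demo"))

# Default micro-batch: the next batch starts as soon as the previous ends.
query = writer.start()

# Fixed interval: one micro-batch every 30 seconds.
# query = writer.trigger(processingTime="30 seconds").start()

# One-shot: process everything currently available, then stop
# (handy for scheduled jobs).
# query = writer.trigger(availableNow=True).start()
```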
Spark Structured Streaming Introduction
102 views · 1 month ago
Welcome to this introduction to Spark Structured Streaming! In this video, we’ll break down the basics of Spark Structured Streaming and explain why it’s one of the most powerful tools for real-time data processing. Song: Dawn License: Creative Commons (CC BY 3.0) creativecommons.org/licenses/by/3.0 open.spotify.com/artist/5ZVHXQZAIn9WJXvy6qn9K0 Music powered by BreakingCopyright: breakingcopyr...
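As a first taste of the API, here is the classic word-count "hello world" of Structured Streaming; the socket host and port are assumptions (feed it with `nc -lk 9999`).

```python
# Word counts over a socket stream, queried with the ordinary DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("hello-streaming").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```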
Databricks Setup for Spark Structured Streaming
112 views · 1 month ago
In this tutorial, we’ll guide you through setting up Databricks for Spark Structured Streaming, enabling you to start building and running real-time streaming applications with ease. Databricks offers a powerful platform for big data processing, and Spark Structured Streaming makes it easy to process streaming data with Spark’s DataFrame API. By the end of this video, you’ll have your environme...
Kafka Consumer Tutorial - Complete Guide with Code Example
587 views · 1 month ago
In this in-depth Kafka Consumer tutorial, we’ll walk through everything you need to know to start building and configuring Kafka consumer applications. From understanding core concepts to exploring detailed configurations and implementing code, this video is your one-stop guide to Kafka consumers. Here's what you'll learn: Kafka Consumer Basics: Get an overview of Kafka consumers, how they work...
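A minimal consumer sketch using the confluent-kafka Python client; the broker address, topic, and group id are assumptions rather than the video's exact code.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",  # start from the beginning if no offset
})
consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 s for a record
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(msg.topic(), msg.partition(), msg.offset(),
              msg.value().decode("utf-8"))
finally:
    consumer.close()  # commits final offsets and leaves the group
```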
Kafka Producer Tutorial - Complete Guide with Code Example
68 views · 1 month ago
Welcome to this comprehensive Kafka Producer tutorial! In this video, we’ll dive deep into the fundamentals of Kafka producers and cover everything you need to know to get started building your own producer applications. Here's what we'll cover: Kafka Producer Basics: Learn what a Kafka producer is and how it fits into the Kafka ecosystem. Producer Workflow: Understand the steps for sending mes...
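A minimal producer sketch with the confluent-kafka Python client, including the asynchronous delivery-callback workflow the video describes; broker and topic names are assumptions.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Invoked asynchronously once the broker acks (or rejects) the record.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}")

for i in range(10):
    producer.produce("events", key=str(i), value=f"message-{i}",
                     callback=on_delivery)
    producer.poll(0)  # serve delivery callbacks from earlier produce calls

producer.flush()  # block until every buffered message is delivered
```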
Setting Confluent Cloud for Kafka and walkthrough
222 views · 2 months ago
☁️ Setting Up Confluent Cloud for Kafka | Spark Structured Streaming Series ☁️ In this video, we’re walking through the steps to set up Confluent Cloud for a seamless Kafka experience! Confluent Cloud offers a fully managed Kafka service, making it easier than ever to get started with real-time streaming without the hassle of self-managing Kafka infrastructure. Join me as we cover everything yo...
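A sketch of what a client config for Confluent Cloud typically looks like once the cluster is created; the endpoint and API key/secret below are placeholders you copy from the cluster's client-settings page.

```python
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "<your-cluster>.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",     # placeholder
    "sasl.password": "<API_SECRET>",  # placeholder
}

producer = Producer(conf)
producer.produce("demo-topic", value=b"hello from Confluent Cloud")
producer.flush()
```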
Kafka Fundamentals Part-2
41 views · 2 months ago
In this video, we’ll dive into the essential roles of Kafka Producers and Consumers, the backbone of any Kafka-powered streaming application. Whether you're just starting with Kafka or brushing up on streaming concepts, this session will break down how data is sent to and retrieved from Kafka, making real-time streaming possible. What We’ll Cover: Kafka Producers: Learn how Kafka Producers send ...
Kafka Fundamentals Part -1
102 views · 2 months ago
🦅 Bird's Eye View of Kafka Components | Spark Structured Streaming Series 🦅 In this video, we’ll take a high-level look at the critical components that make Apache Kafka a robust and reliable platform for real-time data streaming. Whether you're new to Kafka or want to solidify your understanding of its architecture, this video will provide a clear overview of Kafka’s inner workings and how eac...
Understanding Apache Kafka for Real-Time Streaming
114 views · 2 months ago
🎬 Understanding Apache Kafka for Real-Time Streaming | Spark Structured Streaming Series 🎬 In this video, we’ll explore the basics of Apache Kafka to understand why it’s become the go-to solution for real-time data streaming. Whether you’re new to streaming or looking to expand your data engineering skills, this session will introduce the core concepts of Kafka and how it powers modern streamin...
Spark Structured Streaming with Kafka playlist launch
233 views · 2 months ago
🔥 Welcome to my new YouTube series: “Spark Structured Streaming with Kafka!” 🔥 In this series, we’re diving deep into the powerful combination of Apache Kafka and Spark Structured Streaming to master real-time data processing. 🚀 Get ready to learn all about building scalable, fault-tolerant streaming applications for real-world scenarios like financial transactions, fraud detection, and more! Wh...
DBT Tutorial: Snapshot - SCD type 2 in DBT
564 views · 4 months ago
DBT Tutorial: DBT Tests | Generic and Singular Tests
1.1K views · 10 months ago
DBT Tutorial: How to generate automatic documentation in DBT
694 views · 10 months ago
DBT Tutorial: How to use Target | Deploy project to different environment
410 views · 10 months ago
DBT Tutorial: Templating using Jinja
992 views · 10 months ago
DBT Tutorial: Project and Environment variables
1.3K views · 11 months ago
DBT Tutorial: Incremental Model | Updates, Appends, Merge
4.6K views · 11 months ago
DBT Tutorial : How to structure DBT project
722 views · 11 months ago
DBT Tutorial : How does DBT run your query ?
754 views · 11 months ago
DBT Tutorial : Everything you need to know about Sources and Models
1K views · 11 months ago
DBT Tutorial : Setting up your first dbt project
1.5K views · 11 months ago
Spark Skewed Data Problem: How to Fix it Like a Pro
1.2K views · 1 year ago
How to add new columns to the existing incremental model
Are you getting an error while adding the new column in the configuration?
But not at column level
Yes, but dbt Cloud has that.
Super simple, great video
Glad you liked it, please share it with your friends!
Is this playlist enough to get full dbt knowledge?
Yes, watch the complete playlist. You will have more than enough knowledge to complete projects.
Your way of explaining is really good.
Glad you enjoyed it! Do share with your network.
I need a complete video on PySpark
Sure, will add more
Can you create a project video where IoT device data is processed using Kafka streams in real time? That would be great. Thanks in advance 😊
Thanks for the idea! Will try to create some project videos.
I have a query that you might be able to solve. I am saving data from Kafka to TimescaleDB, but for each offset's messages I need to query the DB to get the userId associated with the IoT sensor. So one query is executed for each offset processed, causing a max-connections error. Any solution for that? (For now I added Redis + connection pooling, but I don't think it will solve it in the long term.) 2. As the single table grows to 30-40 GB of data, inserts get slower in TimescaleDB; what should we do to make them fast?
Thanks in advance
You should try to batch your queries (the task is to minimize the DB calls), or you can copy the data from the DB to Databricks. Check out this one: ua-cam.com/video/n0RS7DB_y9s/v-deo.html
Just subscribed to your channel
Thank you! Please share the playlist with your LinkedIn network, which will help this channel grow.
Nice video bro
Thank you so much
It's short and sweet and very descriptive. I installed as per the video, but I encountered an error: Error from git --help: Could not find command, ensure it is in the user's PATH and that the user has permissions to run it: "git". Please let me know how to resolve this error.
Thank you. The git path is not set properly; check out this one: ua-cam.com/video/lt9oDAvpG4I/v-deo.html. Please share the playlist with your network, which will help this channel grow.
@@anirvandecodes Thank you so much
Great content, thank you for sharing! Special respect for the GitHub link
Thank you! Please share the playlist with your LinkedIn network so that it reaches a wider audience.
Hi, how do I set up Kafka? Do we have any video on this?
Yes, check out this video to set up Kafka on Confluent Cloud: ua-cam.com/video/miN4WLiJnRE/v-deo.html Playlist link: ua-cam.com/play/PLGCTB_rNVNUNbuEY4kW6lf9El8B2yiWEo.html
Hey, great series, thanks. How can I make the producer produce faster?
There are a few configuration changes you can make. Batching: set linger.ms (5-50 ms) and increase batch.size (32-128 KB). Serialization: opt for efficient formats like Avro or Protobuf. Partitions & proximity: add more partitions and deploy near the Kafka brokers. In production, people generally use more scalable solutions than a plain Python producer app; check out this: docs.confluent.io/platform/current/connect/index.html Do share the playlist with your LinkedIn community.
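To make the batching advice above concrete, a hedged sketch with the confluent-kafka Python client (values are illustrative; batch.size needs a reasonably recent librdkafka/confluent-kafka build):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption
    "linger.ms": 20,            # let records accumulate up to 20 ms per batch
    "batch.size": 65536,        # allow larger (64 KB) batches, as suggested
    "compression.type": "lz4",  # cheap compression also lifts throughput
})
```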
14 completed ❤
You are making great progress! Please share with your friends and colleagues.
13 completed ❤
12 completed ❤
11 completed ❤
10 completed ❤
9 completed ❤
8 completed ❤
6 completed ❤
5 completed ❤
4 completed ❤
3 completed ❤
3 completed ❤
2 completed ❤
1 completed ❤
I have copied the yml file into the staging and marts folders, and I am getting a conflict asking to rename the yml sources. How do we effectively define sources in the models?
Can you share the complete error text and project structure?
@@anirvandecodes So in your video you pasted the yml file containing the sources into all 3 folders. Since the source is the same for all 3 files, I just pasted the model sql files inside the folders and kept the yml file outside the folders, and this resolved the error. I believe with the new dbt version you cannot have 2 yml files with the same source referencing the same table at the same folder level. Currently my folder structure looks like: models/staging/staging_employee_details.sql, models/intermediate/intermediate_employee_details.sql, models/marts/marts_employee_details.sql, models/employee_source.yml. In the video you pasted the yml file into each of the 3 folders (staging, intermediate, marts), which gives a naming_conflict_error. Your videos have been very informative; I went through the whole playlist while struggling to install dbt on my system and understand it. Thank you so much! 😄😄
@ I think you might have the same source name mentioned in two places; take a look into that.
@@anirvandecodes dbt found two sources with the name "employee_source_EMPLOYEE". Since these resources have the same name, dbt will be unable to find the correct resource when looking for source("employee_source", "EMPLOYEE"). To fix this, change the name of one of these resources: - source.dbt_complete_project.employee_source.EMPLOYEE (models\marts\marts_employee_source.yml) - source.dbt_complete_project.employee_source.EMPLOYEE (models\staging\stg_employee_source.yml)
Should the source name always be unique?
Bro, what if I don't want to share my data with Confluent? Can we do the Confluent Kafka setup on premises?
Absolutely. They call it self-managed Kafka; check this out: www.confluent.io/get-started/?product=self-managed
Yes, you can do that. For that you need to install the Confluent Platform on premises, and you have to install and maintain the Kafka clusters and other artifacts by yourself. This is the self-managed offering.
Best dbt playlist, man! I searched a lot throughout YouTube; no one comes close to this clarity of explanation!
Made my day, thank you! Do share with your network.
@@anirvandecodes you deserve it man!
How to see the column lineage?
dbt Core does not have any out-of-the-box column-level lineage. You can explore column lineage in dbt Cloud, or check out this one: tobikodata.com/column_level_lineage_for_dbt.html
Hi @anirvan, thanks for your detailed explanation of dbt concepts, which has helped me a lot.
Glad to hear that! Please share the content with your network.
Completed the tutorials, and I loved them. Please create more tutorial playlists for more topics.
Thank you for the support. Yes, I will be publishing content on Spark Structured Streaming with Kafka.
Loved your video; it cleared my doubts about sources and models and how we create sources.
Glad it was helpful! Do share with your network.
Hi, I ran dbt debug from the command prompt and it worked well, but when I run it from PyCharm I get an error: The term 'dbt' is not recognized as the name of a cmdlet, function, script file.
Looks like this is a PyCharm path-related issue. Try to check whether the path is picked up properly in PyCharm, or select a different terminal such as Git Bash; you can get more info on Google.
This error generally comes when the path is not added to the system. Try Stack Overflow or ChatGPT, and you can also try it with Git Bash.
Hey my man, just wanna say thanks for this whole series you did. Extremely helpful to people who are specifically looking for guidance in this new tool. Really appreciate your hard work man.
Thank you so much, it really made my day :)
Hello, I have a question: dbt doesn't recognize my model as incremental. I am using incremental modeling to take a snapshot of a table's row count and insert it to build a time-series table containing the row count for every day.
I will upload one video on snapshots soon, check that out!
How to achieve an incremental insert in dbt without allowing duplicates based on specific columns?
You can apply DISTINCT in SQL to remove the duplicates, or use any other strategy to remove the duplicates.
Please teach Snowflake-dbt integration and how dbt works in the entire SCD type 2 process. Thank you.
Sure, will create one video on it.
Man, amazing work. Can't wait... Subscribed! Do keep the videos coming, please!
Thanks! Will do!
I was looking for this and you are like a saviour. Thanks
Glad I could help
Hiii
hello
Are you teaching dbt?
Yes, I have a complete dbt playlist here: ua-cam.com/play/PLGCTB_rNVNUON4dyWb626R4-zrLtYfVLa.html
Can you please share the dbt models as well?
Sorry, I lost the model file.
Awesome tutorials, keep the good work going! When can we expect tutorials on other tools like Airflow, Airbyte, etc.?
Thank you so much. I have two more dbt videos to complete the playlist; will plan after that.
Nice explanation
Keep watching
Very nicely explained
Thank you so much 🙂
How to deploy code from DEV to QA to PRD? Please make a video on this. Thank you.
Yes, I am in the process of making a video on how to deploy a dbt project to the cloud. Stay tuned!
Hey Anirvan, thanks for clearly explaining. I am currently learning dbt and I came across this question: can we keep multiple WHERE conditions in an incremental load?
Yes, definitely. Think of it as a SQL query with which you are filtering the data.
@@anirvandecodes Hello Anirvan, any code snippet or format suggestion from your end?