Processing 2000 TBs per day of network data at Netflix with Spark and Airflow
- Published 22 Oct 2024
- Check out my academy at www.DataExpert.io where you can learn all this in much more detail!
You can use code ZACH15 to get 15% off!
#dataengineering
#netflix
I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.
if all u registered was 60 mil gb & joins ur not flowing
You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.
@@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following
@@aripapas1098 this is a sad comment
Sarcasm ???? 😂
Can't wait to build hyperscale pipelines for my startup with 0 users
But it sounds powerful when you say it, like you mean business.
Based
1 user (me)
If you build it, they will come.
I have 1k TB of data just sitting around in my backyard. Glad your video came up to get me started on at least something.
What I absolutely love about your videos is that, as a beginner in the data engineering field, you often talk about things that I had no conception of. In this video for example, I had never heard of SMB or broadcast joins. This gives me an opportunity to learn these things, even just hearing them mentioned by someone as widely experienced as you.
You need not necessarily have to even go into detail, but these short form videos act as beacons of knowledge that I can throw myself into learning about.
Thanks a lot, and keep these coming Zach!
Really appreciate this comment! It reminds me that the value I'm putting out there is important!
@@EcZachly_ ✌
Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields, it piques my curiosity.
@@EcZachly_ thanks
@@EcZachly_ did you already know the importance of these two before Netflix or did you learn that while working at Netflix?
In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it
You mean an Electron app?
yeah okay crack smoker
And it will still lag and hit 99% singularities
@@dhillaz that will just show current time
@@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.
Thanks Zach, hopefully one day I will understand what all of that means
😂😂😂, I’m starting now
Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him
This is exactly correct
aww 🥰
After marriage they no longer pretend to listen
If only a girl would fall for me when I speak nerdy stuff 🫠
@@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive
I love that you kept it short and to the point.
He sure wanted to save some data… 😅
Holy crap. I’m currently learning about data science, the various roles, etc., with the hope of one day switching careers. But the current state of learning is all about the languages and software used, not about the infrastructure and what to do with massive datasets. So this just 🤯
its really about math but no one talks about it. get at least 1 year of university math comprehension and then get into the Python and tech tools. the most competent and successful data engineers are always people with a good STEM background. for example Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science, so he is a heavy numbers guy. That's what most Data Science / Engineering YouTubers don't tell their viewers cause that would cause them to lose viewers.
learning the tools can be very different from solving real world problems.
@@samuelisaacs7557 True asf
@@samuelisaacs7557 Yep, even a business administration bachelor's will have a lot of maths, and it's nowhere near data science, which is 3x that.
I am a regional IT installer who runs Cat6 Ethernet pipelines for managing 1gb loads on HP laptops, this video is really awesome and breaks down your workflow and mindset in a complicated field really efficiently. I would love to get more short videos about the industry like this.
I'll keep them coming. I make much more on Tiktok and Instagram since I like making vertical content!
@@EcZachly_ Ill check it out! Keep it up!
Great content, an honour to be able to listen to someone who has handled that volume of data.
literally 🎉
Have ChatGPT or some other LLM explain it to you.
2 pita bites a day, the same as me when I’m on a diet.😊
Thank you Zach for taking the time to give us the hard truth and hand down your experience. It helps a lot of enthusiastic students/people to know how we can in some way support or help others in the subjects we like. I don't imagine myself processing 2000 TBs per day, but it helps give a bigger picture. Once again, appreciate the short video and thank you for sharing
If you come across a scenario where you need to join 2 large datasets, you could do an iterative broadcast join. Basically you break one of the dfs into multiple dfs and join each one in a loop until all of them are joined.
You’ll require a lot of memory and have long start times, no?
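The iterative broadcast join described in this thread can be sketched in plain Python (all names and the chunk size are made up for illustration; in real Spark you would wrap each chunk with broadcast() inside a loop and union the partial DataFrames):

```python
from collections import defaultdict

def iterative_broadcast_join(big, medium, chunk_size):
    """Toy sketch of an iterative (chunked) broadcast join.

    `medium` is too large to broadcast at once, so it is split into
    chunks; each chunk becomes an in-memory hash map (the 'broadcast'
    piece), the big table is scanned against it, and the partial
    results are unioned."""
    results = []
    for i in range(0, len(medium), chunk_size):
        index = defaultdict(list)          # the broadcast piece for this pass
        for k, v in medium[i:i + chunk_size]:
            index[k].append(v)
        for k, v in big:                   # map-side join, no shuffle of `big`
            for rv in index.get(k, []):
                results.append((k, v, rv))
    return results
```

The per-pass memory cost is only one chunk's hash map, which is the trade-off the reply points out: you pay with repeated passes over the big table instead.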
The amount of knowledge you shared here is astonishing
Half of what you said I had no idea what you were taking about but I was very engaged and now I’m gonna look all this stuff up for centering my div!
Thanks for the info Zach. Could you please make a more elaborate video on the SMB join?
I've never heard of these terms, thank you for sharing your real case scenarios (the FB notification example)
As a guy struggling to get a job because entry level roles require experience, I have learned something new and valuable today. Broadcast and SMB joins.
Dude has beef with Bezos😂
Thanks Zach for the insightful video. I have a similar use-case. Hence, a few questions:
1. So, with the large volumes of the datasets, do you archive it/compress it/just set a TTL to it? What do you suggest would be the best way for this.
2. With such large datasets, while I join the two tables, bucketing along with partitioning would be the most viable option right? Can you make a video around the joins if possible.
Thanks!
In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA
What's wrong with a Peter bite?
Heya Peeda
Could be an accent or a slip 😂
Imma wait for Primeagen to confirm this as well when he reacts to this video inevitably 😁
My problem is how do people even find out about the careers that they go into?
I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂
Great points to remember!
There are a lot more underlying abstraction layers you can add at these different points to further optimize the second network hop. Caching is a simple one.
Can you implement an efficient snapshot system with delta encoding of entities and compress the message? Would be a cool video for you to implement!
I'd like to learn more about these pitabytes. What are they? What do they taste like?
instant subscribe - really appreciate the concise explanation and clear examples
Just started following you. Really appreciate you for sharing your knowledge with the community.
Thanks for sharing, now I can finally put some good numbers on my resume 🎉
Informative and straight to the point, great stuff as usual
Did Facebook use Databricks or did they have HPC Clusters for you to run Spark on?
My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)
Never thought broadcast join is a Netflix saviour
I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂
Whenever I hold on to more than 60 petabytes I just call the assistant to the regional manager and he runs a fix from his mainframe.
Excellent video, thanks Zach!
Thanks Zach, but I have a question: a broadcast join is used when we have a small dimension joined with a big table. Is this your case? Or did you use a hash join with two large tables?
Very important concept in such short time.. thank u so very much ❤
you can make a BIOS optimized for throughput and without interrupts, to speed things up 67x and more
I would be interested in the architecture and content delivery for pre and post cdn from a network design perspective. Are there any examples or presentations regarding networking at netflix?
Interesting! I would have thought something like sharding (or partitioning and clustering) so data processing and access can scale horizontally.
Bucketing and clustering are similar
Sir this is a Wendy's.
Insightful as always.💯
Appreciate that!
That's cool bro. Will it fix the Netflix app where it shows the title of one show but the preview and description of another?
It was to look at network traffic to keep your credit card data secure
I love pita bites as much as the next guy, but I don't think I can take more than 35 before I'm full
Please more data stuff!!! I hardly understood what you said, but it’s sounds interesting
Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.
That's what I'm saying, I completely agree!
(As a medical student who has done some codecademy 😅)
my data pipeline usually processes one pitabyte every other day and one shawarmabyte every week
Pretty interesting, even though I had no idea about most of what he was talking about.
Subscribing just for the britto. One of my favourite hoods
I’m trying to get into data analytics and most of this went over my head but this still sounds lit 🔥
Ah yes, data structures and sorting… but with the “can you even scale bro” tick enabled.
Wow, didn't know Owen Wilson was working on data
Wow I didn’t even know that such joins existed. No one taught me 😮
I love that I’m only a software engineer but I can understand all of this
gotta love a good pita byte
Honestly, 2000 TB per day isn't the problem. The problem is the cost and how bursty the data is. If it is not bursty, it is pretty much always cheaper to do it in-house with your own hardware than to pay and rent the cloud to do it.
optimizing selling personal data to minimize cost is something i never thought about
What engine were you using to do these massive joins? Spark?
Yep!
When I was hired to do data engineering, it was always data that could fit on a single hard drive and it was boring af. I hated it. This sounds way more challenging and interesting.
That's around 185 Gbit/s. Enough for roughly 37K 1080p streams or about 12K in 4K.
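The back-of-the-envelope conversion is easy to check; with decimal terabytes, 2000 TB/day works out to about 185 Gbit/s (the per-stream bitrates below are round-number assumptions, not Netflix's actual encodes):

```python
# Convert 2000 TB/day of processing into a sustained bit rate.
TB = 10**12                                   # decimal terabyte, in bytes
bytes_per_day = 2000 * TB
bits_per_second = bytes_per_day * 8 / 86_400  # 86,400 seconds per day
print(f"{bits_per_second / 1e9:.0f} Gbit/s")  # prints: 185 Gbit/s

# Rough stream-count equivalents (assumed bitrates)
streams_1080p = bits_per_second / (5 * 10**6)   # ~5 Mbit/s per 1080p stream
streams_4k = bits_per_second / (15 * 10**6)     # ~15 Mbit/s per 4K stream
print(f"{streams_1080p:,.0f} 1080p or {streams_4k:,.0f} 4K streams")
```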
I just felt like drinking from the fountain of knowledge and instantly drowning. Definitely haven't had to deal with these kinds of volumes yet...
Hold my beer while I cross join Amazon to Netflix
Thanks, looking forward to more such content
I just found ur stuff but thanks for the content mang keep it up 🙏
I might get 5 users on my site this month so this will come in handy
Wow, if I knew all this, it's pretty amazing content...
If only...
I really wonder how Netflix achieves 100 TB/hr with just streaming videos.
: multiple streams across entire ddrs directly accessible
Bucketing is a one time process. But what if everyday new data comes in?
For example, if our bucketing takes, say, 2 hrs per day for 10 GB of data (right table), and every next day this increases by 10 GB, don't you think it'll take more and more time as more data gets accumulated?
You have to partition your data. Unless your data is genuinely doubling every day (which I doubt it is)
The bucket joins should only be between events for that day and dimensions for that day. Not all the data going back
As the business grows, this can still get bigger because 10 GB/day might become 50 after some time and you need to account for that
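The replies above boil down to: keep the expensive bucketing work proportional to one day's partition, not the whole history. A hypothetical minimal sketch (the dates, bucket count, and row shape are made up):

```python
from collections import defaultdict
from datetime import date

NUM_BUCKETS = 4  # assumed; real pipelines use hundreds or thousands

def bucket_daily_partition(rows, day):
    """Bucket ONLY the partition for `day` by a hash of the join key.

    rows: iterable of (event_date, user_id, payload).
    Earlier days' buckets are already written, so the daily cost stays
    proportional to one day's data even as the table's history grows."""
    buckets = defaultdict(list)
    for event_date, user_id, payload in rows:
        if event_date != day:
            continue                          # other partitions are untouched
        buckets[hash(user_id) % NUM_BUCKETS].append((user_id, payload))
    for b in buckets.values():
        b.sort()                              # sorted within bucket -> SMB-ready
    return dict(buckets)
```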
Minimizing retention and broadcast joins could have been ten seconds of the video, and the rest could have been productively spent explaining SMB joins with a diagram
Make that video and share it with me!
Got it, data ordered in a fashion that facilitates merge / hash joins
The internet is not something you just dump something on, it's not a big truck. It's a series of tubes.
Hey Zach, your content is consistently amazing! As a newcomer to the field, I'm considering diving into data engineering. What roadmap would you recommend, and are there any certifications that could enhance my journey? I already have a solid grasp of Python and SQL in data analysis.
I still bite my gigas when my man hustling meta in peta
Very useful and interesting, even to a layman
Bro can figure out how to send my entire homework folder in 1/500th of a second but can’t flip the camera sideways
Now I just need a billion dollar company to have these kinda problems.
My question would be, why you have table that big? Can't you distribute or cluster your data?
I'm thinking like 10000 users per server. Only stuff around those 10k users gets stored.
No magic needed to query stuff.
Gotta analyze it all together though
My tech lead keeps talking about bucketing as our integration solution tends to get overloaded sometimes. This kinda puts things into perspective. Definitely dont need most of what he’s talking about but just to know the terms and how to implement them
No idea what this guy is talking about, but thankful YouTube sent me this
"FNA developer"
I'm sorry, my brain couldn't let go of it
Are those joins available in MySQL, or specific to the DBMS you worked with at Meta?
I think they're not available on MySQL because it's an OLTP database. Those joins are used for analytics
These are not database joins, they are processing joins. Frameworks such as Flink and Spark would leverage broadcasts.
It basically boils down to a single coordinator instance that publishes a small, often changing dataset to all parallel processors. Usually used to enrich, prune, or map the main dataset.
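That description can be sketched in plain Python (a toy model, not Flink/Spark internals; in PySpark the corresponding hint is pyspark.sql.functions.broadcast, and the names here are invented for illustration):

```python
def broadcast_join(event_partitions, small_dim):
    """Toy broadcast (map-side) hash join.

    `small_dim` plays the broadcast dataset: it is turned into a hash
    map once and copied to every partition worker, so the large event
    data is joined in place and never shuffled across the network."""
    dim_index = dict(small_dim)            # built once, shipped to all workers
    joined_partitions = []
    for partition in event_partitions:     # each loop = one parallel worker
        joined_partitions.append(
            [(k, ev, dim_index[k]) for k, ev in partition if k in dim_index]
        )
    return joined_partitions
```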
Hey are you familiar with cosmosDB from azure? Its a db like mongo but claims to be able to scale infinitely... What are your thoughts on that?
I use just a database with just value as field (long string) and nothing else
Should've used middle-out compression
Hey
Data with Zach
.. I have some questions.. So Netflix uses AWS servers all over the world.... I am wondering, how many GB is each 4K movie and each 1080p movie..? :) and what audio mix do they have.. Dolby Atmos, DTS etc.? :) Have a good day.. love from Sweden :)
For serving videos they use OpenConnect and CloudFront, not AWS servers. This allows them to serve the video from the closest regional spot to you.
Almost all videos can be served in 4K, but are downsampled depending on the current network conditions
You shouldn't write "s" in Terabyte per hour, just TB/hr
"TBs/hr" looks like "Terabyte*second / hour" 😅
Managing retention, storage and flow is always important. Im sitting on a toilet as im writing this.
I am a senior-year software engineering intern. I didn't understand anything you said except "joins". Not even the variants. Where can I learn things like this? Please
Short and informative
Thank you! What other videos would you like to see from me?
That's a ridiculous amount of data, but wait till you see my girlfriend's Messenger 😂
Love the way you tried to make it sound more complicated than it actually is and failed.
People don’t know the data they collect is very volatile, unless you are paying for it.
Wait, why is ordering a table and then joining it more efficient? Why have I never heard of this technique before? Well, guess it's time to do some digging
Ordering a table on the join keys. That's because for each key in one dataset, the entire other dataset doesn't have to be scanned.
As he said, whenever there is shuffling involved, performance gets really poor. You can try doing some computation in Spark/Dask against the NYC taxi dataset.
Because binary search becomes more efficient the bigger your dataset is, and binary search only works when the tables are sorted on your search keys.
It also depends on the sorting algorithm.
@ayyleeuz4892 because you don't know why it's faster 😉
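For the curious, the mechanism being debated in this thread is the classic sort-merge join: once both sides are sorted on the join key, a single linear merge pass suffices and no row is ever rescanned. A minimal plain-Python sketch (the input shapes are assumptions for illustration):

```python
def sort_merge_join(left, right):
    """Merge two key-sorted lists of (key, value) pairs in one pass.

    Because both sides are already ordered, mismatched keys are simply
    skipped past and matching runs are emitted directly; this is what
    makes a sorted-bucket (SMB) join cheap once the sort is paid for."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, _ = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit all right-side matches for this left row
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, lv, right[j2][1]))
                j2 += 1
            i += 1
    return out
```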
Petabyte was misspelled. Great video though.
Oh yeah that’s really great and insightful, now what’s a join?
Thank you Tony Hawk, very cool!
did you use Spark or Hive at Netflix to process 2000 TBs per day?
Spark
@@EcZachly_ a Spark batch, streaming, or structured streaming job? do you have the parameters that were passed handy? like number of executors etc ....
@@kali786516 hourly batch
@@EcZachly_ do you mind sharing the Spark parameters you passed, if you have them handy? probably a video might help
@@kali786516 that video sounds painfully boring
middle out compression