2: Instagram + Twitter + Facebook + Reddit | Systems Design Interview Questions With Ex-Google SWE

  • Published 24 Nov 2023
  • It's not easy being one of the most popular users on social media and being such an object of female desire (I'm scared of them)
  • Science & Technology

COMMENTS • 223

  • @raymondyang2018 7 months ago +54

    Thanks for the video. I don't think I can survive at Amazon for another month. I also barely have time outside work to study, so I don't think I can put in enough prep time for interviews. At this point, I'm strongly considering just quitting without another job lined up and spending a month or two grinding leetcode and system design. I know the job market is bad, I just don't care anymore.

    • @jordanhasnolife5163 7 months ago +13

      I believe in you man, you got this!

    • @Saurabh2816 7 months ago +1

      Rooting for you buddy. You got this!

    • @Doomer1234 7 months ago +3

      You got this mate!

    • @whyrutryingsohard 5 months ago +1

      You got this

    • @RSWDev 4 months ago +1

      Good luck to you man. I feel the same way about my job right now. Forced RTO is probably the final straw for me; it has really made me so much more stressed and less productive, and they're about to lose an engineer because of it.

  • @ekamwalia5757 23 days ago +1

    Love the video Jordan! On my way to binging your entire System Design 2.0 playlist.

  • @idiot7leon 3 months ago +10

    Brief Outline
    00:01:09 Objectives
    00:02:18 Capacity Estimation
    00:03:49 Fetching Follower/Following
    00:05:44 Follower/Following Tables
    00:07:56 Follower/Following Database
    00:09:24 Follower/Following Partitioning/Data Model
    00:10:34 News Feed (Naive)
    00:12:08 News Feed (Optimal)
    00:13:39 News Feed Diagram
    00:17:27 Posts Database/Schema
    00:18:39 Popular Users
    00:20:10 Caching Popular Posts
    00:22:35 Fetching popular users that a given user follows
    00:25:57 Security Levels on Posts
    00:28:15 Nested Comments
    00:30:08 Nested Comments Access Patterns
    00:31:44 Graph Database
    00:33:53 Alternatives to Graph Database
    00:34:41 DFS Index (similar to GeoHash!)
    00:36:48 Final Diagram
    Thanks, Jordan!

  • @hassansazmand1747 7 months ago +2

    Great work, I love how your videos have evolved!

  • @cc-to2jn 6 months ago +9

    man it took me 2.5 hours to digest this video, coming up with my own design and watching yours. How long did it take you? Great content as always, thanks for always putting out such high quality work!

    • @jordanhasnolife5163 6 months ago +20

      Glad to hear! Yeah, I'd say one of these videos takes me around 8 or so hours to make. I think it's awesome that you're taking the time to pause and think through it and decide whether you agree; that's a much better process than just mindlessly watching!

  • @alekseyklintsevich4601 4 months ago +4

    Another way the follower/followee table can be implemented is with DynamoDB, using a secondary index on follower. DynamoDB will do most of what you described under the hood when you add a secondary index on a field.

    • @jordanhasnolife5163 4 months ago +1

      Interesting - if that's how they actually implement it then seems totally reasonable to me!
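
      A minimal sketch of that DynamoDB setup with boto3 (table, index, and attribute names here are hypothetical, not from the video):

        import boto3

        dynamodb = boto3.client("dynamodb")

        # One table holds (user_id, follower_id) pairs: "follower_id follows user_id".
        # The base table, partitioned by user_id, answers "who follows X?"; the
        # global secondary index on follower_id answers "who does X follow?".
        dynamodb.create_table(
            TableName="follows",
            AttributeDefinitions=[
                {"AttributeName": "user_id", "AttributeType": "S"},
                {"AttributeName": "follower_id", "AttributeType": "S"},
            ],
            KeySchema=[
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "follower_id", "KeyType": "RANGE"},
            ],
            GlobalSecondaryIndexes=[{
                "IndexName": "by_follower",
                "KeySchema": [
                    {"AttributeName": "follower_id", "KeyType": "HASH"},
                    {"AttributeName": "user_id", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }],
            BillingMode="PAY_PER_REQUEST",
        )

      DynamoDB maintains the GSI asynchronously under the hood, which is essentially the CDC approach from the video with the plumbing hidden.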

  • @user-fw5pg1tm3t 5 months ago +4

    Thanks for the great content as usual. One question on Cassandra being the storage backend for followers: can that lead to write conflicts if you delete the row? If yes, do you think it makes sense to handle unfollows with a different table and periodically bulk-update the follower table?

    • @jordanhasnolife5163 5 months ago +3

      Yeah, it can, assuming you unfollow right after following. That being said, it's worth noting that Cassandra uses tombstones for deletes, so assuming those are there, when the various replicas perform anti-entropy the tombstone should reconcile with the original write and be removed from the SSTable.
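
      A toy illustration of that reconciliation rule (not real Cassandra internals, just last-write-wins applied to a tombstone):

        # Each record is (timestamp, value); value None marks a tombstone.
        def merge(a, b):
            # Last-write-wins: whichever replica's record is newer survives.
            return a if a[0] >= b[0] else b

        follow   = (1000, {"user": 1, "follows": 2})  # replica A saw the follow
        unfollow = (1005, None)                       # replica B saw the unfollow

        winner = merge(follow, unfollow)
        assert winner[1] is None  # the tombstone wins, so the row stays deleted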

  • @aforty1 1 month ago +2

    Your channel is amazing. Liked and commented for the algorithm!

  • @NBetweenStations 2 months ago +1

    Nice video Jordan! So for building news feeds, is the idea that when a post is made, the Flink node locally stores a mapping of which followers a user has and writes that post id to the Redis cache by those user ids? Thanks again!

  • @Didneodl 7 months ago +1

    Thanks for another great video! What do you think about supporting a "people you may know" feature like FB's?
    In that case we'd have to traverse more depth in the follower/followee relationships. Could a columnar db still be a good option? Would you go with a graph db in that case?

    • @jordanhasnolife5163 7 months ago +1

      Well, I guess the interesting thing here is that potential friends all have mutual friends, so every time you add a friend you could add all of their friends as potential ones.
      But agreed, generally speaking something like a graph db may be best

  • @eason7936 7 months ago +3

    Really nice video, it's very clear and impressive. Really appreciate your sharing

  • @pgkansas 3 months ago +1

    As an extension (or a new post), it would be good to add:
    - how to efficiently refresh the feed (algorithms to sort the feed => timeline, real-time ML recommender systems) and keep adding new items to a user's news feed. The DFS traversal also helps.
    - showing popular folks (VIPs) commenting on a post; this is usually shown instead of other normal comments.
    Excellent post!

    • @jordanhasnolife5163 3 months ago +1

      1) Good point! I imagine as a front end optimization we probably poll for the next set of feed the moment we open the app. The ML stuff I'll cover in a subsequent video.
      2) Covered displaying "top"/"hot" comments like how reddit does it in a separate video a couple weeks ago :)

  • @rezaasadollahi9880 16 days ago +1

    Thanks Jordan for preparing these videos. I noticed that in a couple of your videos you raised a concern about graph DBs being inefficient when we need to jump from one address on disk to another due to disk latency. However, this is not the case with the SSD drives that are everywhere these days. With an SSD, any address lookup on disk is done in O(1) time.

    • @jordanhasnolife5163 15 days ago

      Would have to look into it more as I can't remember my operating systems too well, but that would certainly help. SSDs are a bit more expensive though, so that's unfortunate.

  • @ugene1100 3 months ago +1

    definitely a ton to learn from your videos. Appreciate your work Jordan!

  • @AjItHKuMaR-ns5yu 5 months ago +6

    Hey, thanks for your efforts. I love the way you explain things. I have one doubt on the feed generation part. I am new to stream processing with Kafka and Flink, so pardon me if my question is stupid.
    You said that we use CDC and maintain the user : [followers] list in memory in Flink. I have 2 questions here.
    Firstly, there are 2.5 billion users on Instagram. Are we really going to maintain that many user:[follower] lists in Flink? Is it even capable of holding this much data in memory, as it's mainly a real-time stream processing framework?
    Secondly, I read that both Kafka and Flink are push-based systems. So, when a user's following is updated, I understand it can be consumed by Flink to make the necessary updates. However, suppose Flink goes down; since all the data was in memory, it is bound to get flushed. When it comes up again, are we going to load all the data into memory again?

    • @jordanhasnolife5163 5 months ago +3

      Good questions! I think that watching the video I made on Flink may help you, but the gist is:
      1) lots of partitions! We can hold this data if we do that, and you can also use disk with Flink if you need more storage.
      2) Flink checkpoints its state so that it is not lost!
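
      A toy sketch of both points, with plain Python standing in for Flink's keyed, checkpointed state (the partition count is made up):

        import hashlib, json

        NUM_PARTITIONS = 1024  # lots of partitions: each task holds one slice

        def partition_for(user_id: str) -> int:
            # A consistent hash of the key decides which task owns it, so no
            # single machine ever holds the whole 2.5B-user map.
            return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

        class FollowerState:
            """Stand-in for one task's state (Flink would back this with
            RocksDB on local disk and snapshot it to durable storage)."""
            def __init__(self):
                self.followers = {}  # user_id -> set of follower ids

            def apply(self, event):
                s = self.followers.setdefault(event["user_id"], set())
                (s.add if event["op"] == "follow" else s.discard)(event["follower_id"])

            def checkpoint(self) -> str:
                # A restarted task restores this snapshot, then replays Kafka
                # from the offsets recorded with it, so nothing is lost.
                return json.dumps({u: sorted(f) for u, f in self.followers.items()})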

  • @igorrybalka2611 6 months ago +2

    Thanks for such a detailed walkthrough, enjoyed the thought process.
    I have a question on whether we actually need separate derived data for "verified users someone follows". Couldn't this status be part of the user-follows table (similar to how access control is implemented on another table)? I suppose the downside is that we'd need to update a lot of relations when a user becomes "verified", but that probably doesn't happen too often.

    • @jordanhasnolife5163 6 months ago

      Yep totally could, and I think you outlined the tradeoff well!

  • @ramm5621 5 months ago +2

    Hey Jordan,
    great video as always. I had a question earlier until I realized why you did it, so here's a different question:
    the tradeoff you made here is a global secondary index on followerId vs having 2 DBs updated by CDC with an exactly-once processing guarantee using Flink?
    In terms of needing distributed transactions and having redundant data, both approaches would be about even, right? Maybe not having the redundant data (global index) eating up disk speeds up queries, or is that wrong?

    • @jordanhasnolife5163 5 months ago +3

      Hey Ram! Having a global secondary index here would mean that per write we'd need to update the secondary index right away (which could be on a different partition), hence we could be looking at a distributed transaction.
      With CDC, we avoid this, knowing that *eventually* our second table will be updated

    • @ramm5621 5 months ago

      @@jordanhasnolife5163 So supposing a downstream DB (followerId) write fails, we'd have to send a rollback event to roll back the change in the upstream DB (userId), right? But until then we're serving the data in a non-atomic manner. Vs. in a dist. transaction we never serve reads from a partially committed transaction, but we suffer in terms of write performance and we can't read from the modified rows.
      Assuming I understand this correctly, I really like this tradeoff in terms of providing signal in interviews.
      So we could go with CDC with rollbacks even for a reservation/Ticketmaster system (isolation is strict, but worst case we show a room/ticket as taken and then they can retry), whereas the only time you go with the dist. transaction is when you'd rather never show the non-atomic data than be slower.

  • @tarushreesabarwal6618 1 month ago +1

    Great video Jordan, had a quick question: at 25:23, where we have the users table and the user followers table, the event will be triggered in the User Followers table first, e.g. all the followers for user id 6, maybe [3, 5, 10].
    Then after this result reaches Flink, it would take every follower's id and check whether it is verified against the Users table.
    So the trigger events on these 2 DBs won't be simultaneous. I am imagining CDC to be like DynamoDB streams.
    Also, I didn't completely understand why we are using both Kafka and Flink; can't we send the trigger event on any DB change directly to Flink?
    I am a beginner, so pardon me for any obvious questions asked

    • @jordanhasnolife5163 1 month ago

      They don't have to be simultaneous; once a user gets verified we stop sending their posts to all followers and begin putting them in the popular posts db.
      As for question 2, all streaming solutions require some sort of message broker under the hood; it's just a question of how much you abstract away from the developer

  • @timchen6510 6 months ago +2

    Hey Jordan, great video, learnt a ton. One question on managing the user follows/following relationship: can we just update the two tables in one transaction instead of using CDC? For example, if user 1 follows 2, we create one entry in the user-follows table as 1 : 2 and one entry in the user-following table as 2 : 1, and everything happens at the same time.

    • @jordanhasnolife5163 6 months ago +1

      You can do this! That being said, these are big tables. So the chance that this transaction will be on multiple nodes is actually pretty high. In that case, you may find yourself having to do a two phase commit, which could slow down your write quite a bit.

    • @timchen6510 6 months ago

      @@jordanhasnolife5163 makes sense, thank you!

  • @harikareddy5759 6 months ago +1

    Jordan, awesome video once again! What is up with the video descriptions? 😂

  • @ariali2067 3 months ago +1

    What does the data in the news feed caches look like? Is each entry a tuple? Ideally we should only store post ids in the cache, I would assume?

    • @jordanhasnolife5163 3 months ago

      I think you'd actually want the post text here - it's pretty small anyways! But yeah something like a tree set based on timestamp of posts
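
      A minimal sketch of such a feed cache using redis-py sorted sets (the key format and the 500-entry cap are assumptions):

        import redis

        r = redis.Redis()

        def push_to_feed(follower_id: int, post_id: str, text: str, ts: float):
            key = f"feed:{follower_id}"
            # Score by post timestamp; store the (small) post text itself so a
            # feed read never fans out to the posts partitions.
            r.zadd(key, {f"{post_id}|{text}": ts})
            r.zremrangebyrank(key, 0, -501)  # keep only the newest ~500 entries

        def read_feed(user_id: int, limit: int = 50):
            # Newest-first page of the precomputed feed, one round trip.
            return r.zrevrange(f"feed:{user_id}", 0, limit - 1)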

  • @douglasgomes9144 15 days ago +1

    Hey, Jordan! Which videos or playlist do you recommend for someone who wants to start from scratch? BTW thanks for the amazing content!

    • @jordanhasnolife5163 15 days ago +1

      Perhaps I'd watch systems design concepts 2.0, and then system design problems 2.0!

    • @douglasgomes9144 15 days ago +1

      Thanks a lot! You are amazing!

  • @Robinfromthehood88 7 months ago +12

    Hi Jordan. Nice video. As someone who participated in a very famous news feed design and re-design, I can say that you explained some nice concepts here and assembled a legit design.
    With that being said, you did mention you were going to go way deeper than any other system design youtuber on such subjects. Personally, for me, a seasoned engineer (who's here for some humor as well), I found it not to my level. There are crazy, crazy issues when building a scalable news feed (even at the HLD stage).
    One example:
    - How do you make sure that a Middle East user can see his feed from Australia if a region is down? Or if he's using an Australian VPN for some reason? (assuming that user wants to see live updates fairly fast [say minutes?])
    News feeds work really hard to make distant regions as accessible as possible, very fast, and there are reasons for that.
    There are many parts to take into account if depth is what you are looking for in your designs.

    • @jordanhasnolife5163 7 months ago +9

      I appreciate the constructive criticism. I think you make a good point and will attempt to go further in future videos.
      At the same time, I am trying to strike a happy medium here between every failure scenario and instead teach for the interview specifically. I am trying to go more in depth about the decisions asked in interview questions, though maybe I'll devote some more time in the future to those edge cases.
      Also, just curious, which one did you help build? Would love to learn more!
      Have a good day!

    • @abhijit2614 7 months ago +1

      Maybe I exceeded the char limit 😂

    • @clintnosleep 7 months ago +1

      @@jordanhasnolife5163 Great video - I'd also appreciate more thoughts about geo replication / multi-region design. Fantastic channel.

    • @jordanhasnolife5163 7 months ago +3

      @@clintnosleep Thanks! The more I think about this, I'll try to cover it at a very high level, but at the end of the day this is mainly an interview prep channel, so I'd like to keep videos to the point!

    • @pavankhubani4191 1 month ago

      Great video first of all. I think the explanation in the video is in enough depth, but I would really like to hear about the additional points @Robinfromthehood88 mentioned.

  • @Summer-qs7rq 5 months ago +1

    Thanks for this lovely video. Appreciate your efforts here.
    However, I have a question regarding nested comments:
    what is the downside of using a document db for storing nested comments? Is there a situation that makes a document db more optimal than a DFS-based index store for nested comments? Could you please shed some light on the decision making for documents vs DFS?

    • @jordanhasnolife5163 5 months ago +1

      Yeah, I think with a document DB you lose the ability to query for a comment by ID. Instead you basically have to fetch the parent comment and traverse down, as opposed to being able to run that query off the bat.
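
      For comparison, a toy version of the DFS-index idea from the video: each comment's key is its parent's key plus its own id, so a whole subtree is one contiguous range scan (ids here are illustrative; real ones would be fixed-width):

        import bisect

        index = []  # sorted list of (path_key, comment_text)

        def add_comment(parent_key: str, comment_id: str, text: str):
            bisect.insort(index, (parent_key + comment_id, text))

        def subtree(path_key: str):
            # Every descendant shares the prefix, so one range scan (here a
            # bisect; in a wide-column store, a clustering-key range) gets all.
            lo = bisect.bisect_left(index, (path_key, ""))
            out = []
            while lo < len(index) and index[lo][0].startswith(path_key):
                out.append(index[lo][1]); lo += 1
            return out

        add_comment("", "aa", "root comment")
        add_comment("aa", "ab", "first reply")
        add_comment("aaab", "ac", "nested reply")
        assert subtree("aa") == ["root comment", "first reply", "nested reply"]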

  • @pbsupriya 1 month ago +1

    Hi Jordan, Thank you for all the content. I have a question - Why can't we use a secondary index for fetching followers/following mentioned at 5:40?

    • @jordanhasnolife5163 1 month ago +1

      Because they are sharded differently! Using a local secondary index means that one of the two tables would require a scatter/gather read request.

    • @pbsupriya 1 month ago

      @@jordanhasnolife5163 Oh okay. Thank you. Great content. I recommended it to 10 people in 2 days :)
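
      To make the scatter/gather cost above concrete, a tiny sketch (the shard objects and their query_local_index call are hypothetical stand-ins):

        from concurrent.futures import ThreadPoolExecutor

        # With only a local secondary index, matching rows can live on every
        # shard, so one logical read becomes N shard queries plus a merge.
        def get_following_scatter_gather(shards, follower_id):
            def query(shard):
                return shard.query_local_index(follower_id=follower_id)
            with ThreadPoolExecutor() as pool:
                results = pool.map(query, shards)
            return [row for shard_rows in results for row in shard_rows]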

  • @Summer-qs7rq 5 months ago +2

    Hey Jordan, I had a question related to the user-follower relation: why can't we use a graph db to store user-follower or user-following instead of storing it in a NoSQL Cassandra-like db?
    Also, when should I use a graph db? When the disk reads are sequential?
    Thanks a ton

    • @jordanhasnolife5163 5 months ago +3

      Hey Summer! A graph db isn't necessary for this relation because we aren't actually doing any traversals. We just want to know for a given user, who they follow, or who follows them.
      Generally speaking, I'd avoid using graph DBs if there is a way to model the problem other than by using a graph, as they're slow for general purpose tasks due to poor data locality. So to answer your question, you should use a graph db strictly if you plan on traversing multiple edges in a graph. So for example, "find all people who are separated from me by exactly two edges in a facebook friends graph".

    • @Summer-qs7rq 5 months ago

      @@jordanhasnolife5163 that makes sense.

  • @rishabhsaxena8096 5 months ago +2

    Hey Jordan,
    In the Follower/Following partition data model (9:55), you used userID as the partition key and the follower/following ID as the sort key. What I understand is that we will have 2 tables like you mentioned: one will be a userID -> FollowerID mapping and the other a FolloweeID -> userID mapping. I understand that a single user will be on a single partition, so we can quickly get all the followers of the user, but then if we want all the followees of the user it would again be slow, since we will have followees sitting in different partitions. Could you please let me know if I'm missing something here?

    • @jordanhasnolife5163 5 months ago +1

      Sure! The point here is that we have two different tables:
      1) user -> someone that follows the user
      2) user -> someone that the user is following
      For both of these tables, we partition on the user column. This allows us to quickly figure out, for a given user, who their followers are and who they follow, via one single network request (for each query). If we didn't do this, figuring out who I follow might take many different requests to different partitions which we'd then have to aggregate.

    • @rishabhsaxena8096 5 months ago +1

      Cool, that makes sense.
      Thanks 😊
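
      To make the two-table layout concrete, a minimal sketch in CQL via the Python driver (keyspace and column names are assumed):

        from cassandra.cluster import Cluster

        session = Cluster(["127.0.0.1"]).connect("social")  # keyspace assumed

        # The same (user, user) pairs stored twice, each copy partitioned by
        # the column we query on, so both lookups are single-partition reads.
        session.execute("""
            CREATE TABLE IF NOT EXISTS followers (
                user_id bigint, follower_id bigint,
                PRIMARY KEY ((user_id), follower_id))""")
        session.execute("""
            CREATE TABLE IF NOT EXISTS following (
                user_id bigint, followee_id bigint,
                PRIMARY KEY ((user_id), followee_id))""")

        # "Who follows user 6?" and "Who does user 6 follow?" each hit one partition:
        session.execute("SELECT follower_id FROM followers WHERE user_id = 6")
        session.execute("SELECT followee_id FROM following WHERE user_id = 6")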

  • @just_a_normal_lad 5 months ago +1

    Thanks for the wonderful video; I really loved it. I have one doubt regarding the part where you suggested using a CDC pipeline for updating the user-follows DB. The reason you provided is to avoid the overhead of maintaining distributed transactions, such as two-phase commit. For example, let's consider T1 as updating the user-follower DB and T2 as updating the user-follows DB. Maintaining transactionality between T1 and T2 is difficult without using SAGA or two-phase commit. That's why the suggestion is to use a CDC pipeline.
    However, my question is: by using a CDC pipeline, are we not just replacing T2 with a Kafka producer call? Doesn't this still pose the same issue? What happens if the Kafka producer call fails, even with its multiple retry mechanisms? My concern is whether replacing the DB call with a Kafka producer call truly addresses the distributed transaction issue.

    • @jordanhasnolife5163 5 months ago

      Yep, it's still an extra network call, you make a fair point there. But it's also non-blocking: if for whatever reason Kafka is down, I can still submit new follow requests, and then the other table can be updated "later", as opposed to needing both writes to go through at exactly the same time.
      If you need both tables to be perfectly consistent, sure, go ahead and use 2pc.

    • @just_a_normal_lad 5 months ago +1

      Thanks for the reply. I got the idea. Looking forward to the next video

  • @gangsterism 7 months ago +1

    this information is going to be very useful for my massive application with up to 0 to 1 users, achieving speed is certainly going to be a concern

    • @jordanhasnolife5163 7 months ago +2

      Users list: my mom
      Still need a hyperloglog to count them tho

  • @deepakshankar249 5 months ago +1

    Jordan, you are a rockstar.. Thanks buddy 🙏

  • @yuanshaoqian 1 month ago +1

    Great video, thanks a lot. I am also trying to study how to design an ML-recommendation-based news feed (FB/IG/Twitter these days mostly rank the feed by an ML score instead of post creation time). I couldn't find much material on it; could you make a video on this too?

  • @gaurangpateriya4879 10 days ago +1

    I also had another query: since we are using Cassandra, would we need to implement a mechanism to distinguish between write/edit operations and sync operations between nodes somewhere before new data is propagated to the Kafka queue? I mean, to make sure only writes and edits are propagated and not sync updates, since the CDC would be capturing data from all the replicas.

    • @jordanhasnolife5163 10 days ago

      Believe it's gonna handle it for us
      cassandra.apache.org/doc/stable/cassandra/operating/cdc.html

  • @user-no7wb5bj7r 4 months ago +1

    Thanks for your video. You suggested using Flink to manage the user-follows DB and the user-following DB. Why can't we just maintain one DB which holds the following relationship and have 2 indexes, one on the follower ID column and the other on the following ID column? Won't that solve our problem?

    • @jordanhasnolife5163 4 months ago

      Think about what happens when these tables are massive and need to be partitioned. All of a sudden if you want to index/partition by the first column of both tables you can't, hence you need two different tables.

  • @user-xu3nx9tk8v 4 months ago +1

    Hi, thanks for posting this. I have a question: I don't understand how we are using Cassandra to store the user-follower relationship at 9:42.
    1. Please correct me if I am wrong, but a Cassandra primary key has to be unique; you cannot represent "user 1 has user 2 and user 3 as followers" like this:
    user1 -> user2
    user1 -> user3
    Are we bundling (user, follower) into a single primary key, and querying by partition key?
    2. If user1 has 100 million followers, for example Elon Musk or Justin Bieber, storing them in the same partition might not be a good idea?
    I might be misunderstanding something; can you elaborate on how you represent the user-to-follower relationship in Cassandra?

    • @jordanhasnolife5163 4 months ago

      1) The Cassandra primary key is the combination of a partitioning key and the clustering (or sorting key). The partitioning key here is the first user, and the clustering key is the second user.
      2) Yeah, you're correct that for someone with 100 million followers we probably can't store this all on one partition. We can introduce a second component to our partitioning key which has some number (from say 1-100) that we only use for popular users. Then we can perform re-partitioning for our popular user data.
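
      A minimal sketch of that composite partition key (the bucket count and popularity flag are assumptions):

        import random

        POPULAR_BUCKETS = 100  # per the reply above: buckets only for popular users

        def write_partition_key(user_id: int, is_popular: bool):
            # Followers of a popular user are spread across N partitions, so no
            # single partition has to hold 100M rows.
            bucket = random.randrange(POPULAR_BUCKETS) if is_popular else 0
            return (user_id, bucket)

        def read_partition_keys(user_id: int, is_popular: bool):
            # Reads for a popular user query every bucket and merge the results.
            buckets = range(POPULAR_BUCKETS) if is_popular else [0]
            return [(user_id, b) for b in buckets]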

  • @tunepa4418 1 day ago +1

    Hello Jordan. If we want to support finding friends of friends in this problem, I guess a graph database would be ideal for the follows table?

  • @kaqqao 2 months ago +1

    For the follower/following case, why not index on both columns? It is possible to have a secondary index on a non-partition-key column, right?

    • @jordanhasnolife5163 2 months ago

      If there are tons of these relationships, not all rows will be on the same node. Then you'll have to make a local secondary index, so for getting the counts for a given user you'd need to distribute your query.

  • @firezdog 1 month ago +1

    So is the need to partition what prevents us from indexing on both user and follower? I wrote this in my notes:
    index key: by user or by follower?
    we need both (one to quickly find all the users I follow, the other to find all the users who follow me)
    if we have user-follower, I can quickly find everyone a given user is followed by (log n)
    but to find everyone a given user follows, I have to look at each row with that user as a follower and collect the results (n)
    does the need to partition prevent us from indexing on two keys in our db?
    would the expensive query to find all the users a user follows be classified as "scatter-gather"?
    by using CDC to solve this problem, are you essentially trading space for time?

    • @jordanhasnolife5163 1 month ago

      1) Yes
      2) Yep that would be a scatter gather
      3) You're trading work on the write path for less work on the read path, I wouldn't say you're trading space here really as we're using the same amount of space (barring kafka, but you could make a similar argument even if we used two phase commit)

  • @amitmannur8743 20 days ago +1

    Awesome video, you hit the core concepts. I have a few pieces of feedback:
    1: It felt too aggressive or fast; you touched complex topics, so it needs some pauses and a cleaner way of explaining.
    2: You might want to change to a different tool; this is too black and white. A color design would have worked well.
    3: lol, you acknowledged at the end that you know you nailed it and the areas to change. Please provide documentation of your design, that would be a good help as well.

  • @choucw1206 6 months ago +1

    I have a few questions regarding the way you shard the post data/table:
    1. In the video you propose to use userId as the shard key. I know in your design you had the user-verified and popular posts cached, but I think there may still be use cases where you need to query it directly. How do you alleviate the "hot partition" issue for popular users?
    2. Some other resources refer to using "Snowflake ID" as the shard key, which was used by Twitter to generate global IDs for tweets, users, etc. (The GitHub repo was archived, so they might have moved to something else.) However, none of them explain how using this shard key makes queries efficient. For example, a query like "find all tweets/posts in the last 7 days published by Elon" will have to hit every partition node. Did you look at this when you researched this topic?

    • @jordanhasnolife5163 6 months ago

      1) FWIW, I guess the hot partition here isn't an issue of too much data being posted, but more so too many reads. If we had too much data, I think we'd have to basically "inner shard", where users can be split across many partitions. Since it's probably just too many reads IRL, I'd probably just add more replicas for that partition.
      2) I didn't look into this, but to me it seems pointless lol. Maybe more of an optimization to balance writes evenly across partitions, not sure.

    • @learningwheel8442 15 days ago

      @@jordanhasnolife5163 I think sharding on userId vs postId depends on the use case. If we are looking to fetch all posts created by a user to populate their user timeline, then userId-based sharding makes sense.
      On the other hand, if we also want to get all new/top posts created in the last x hours/days (e.g. Reddit), or run some analytics on all new posts created in the last x hours/days, then sharding on postId and sorting by timestamp (or using a snowflake id with the timestamp embedded) makes sense, to efficiently get the new tweets without having to query all the post database partitions sharded on user id. Thoughts?

  • @jianchengli8517 5 months ago +1

    What if a user newly subscribes to another user and refreshes the timeline? How does this streaming solution work, since apparently there will be a cache miss for posts from this new following (posted before the subscription)?

    • @jordanhasnolife5163 5 months ago +1

      Yep, you just don't see them. Not the end of the world, there are no guarantees on the news feed.

  • @jen17a 3 months ago +1

    Hi Jordan!
    Thanks for the video. I do have some questions (might be basic, but I am just starting to learn these technologies).
    1) Say I created an account and followed a few folks 5 years back, and today I decide to post a video. How does Flink have the user:[follower] list from 5 years back? Does it fetch it from the main storage?
    2) There are billions of users on insta/fb; how is Flink handling this data?

    • @jordanhasnolife5163 3 months ago +1

      1) It doesn't go anywhere; Flink can store things in a persistent manner, see RocksDB persistence
      2) Partitioning!!!!

  • @zuowang5185 5 months ago +1

    Would you consider a graph db for the follower relationships?

    • @jordanhasnolife5163 5 months ago

      I think in this case no, as we aren't traversing the graph. I do think if this were like Facebook friends and we wanted to suggest people then graph traversal could be more useful.

  • @shibhamalik1274 3 months ago +1

    Hi Jordan, you make nice design videos.
    I have a question on why we have two different tables: one with a user id and its followers, and a second one with a user id and whom that user follows.
    It would duplicate data a bit and also create hot partitions though

    • @jordanhasnolife5163 3 months ago

      It would certainly create duplicate data (which is fine, we use replication all the time). I don't think that creating a second table makes our hot partition issue any worse here (a single table with a user and everyone that follows them could debatably be worse than a table with a user and everyone that they follow, but we discuss how to deal with this). It just allows us to make faster reads when we want to look up who our followers are or who we are following.

    • @shibhamalik1274 3 months ago

      @@jordanhasnolife5163 Thanks, yes. I think after watching your videos a couple of times it makes sense to me now

    • @shibhamalik1274 3 months ago +1

      I have another question. Where would the news feed cache live? Will it be on the user's device? Will the news feed be push or pull based? What are their trade-offs? Thanks for making such awesome content

    • @jordanhasnolife5163 3 months ago

      @@shibhamalik1274 Nope! It's just a pull-based server/database/redis, whatever you have the money for basically lol

  • @nhancu3964 1 month ago +1

    Great explanation. I wonder how these systems avoid re-recommending posts in news feeds (like TikTok does with video). Do they store all viewed history 🙄🙄

    • @jordanhasnolife5163 1 month ago

      Short answer is yes + using bloom filters as an optimization, long answer is that I have a video posting on this in 5 minutes :)
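
      A toy bloom filter showing the optimization (sizes here are arbitrary): it may occasionally claim an unseen post was seen, which just means we skip recommending it, but it never misses a post the user actually viewed.

        import hashlib

        class BloomFilter:
            def __init__(self, size_bits=1 << 20, hashes=5):
                self.size, self.hashes = size_bits, hashes
                self.bits = bytearray(size_bits // 8)

            def _positions(self, item: str):
                for i in range(self.hashes):
                    h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                    yield int.from_bytes(h[:8], "big") % self.size

            def add(self, item: str):
                for p in self._positions(item):
                    self.bits[p // 8] |= 1 << (p % 8)

            def seen(self, item: str) -> bool:
                return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

        viewed = BloomFilter()
        viewed.add("post:123")
        assert viewed.seen("post:123")  # a viewed post is always flagged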

  • @fr33s7yl0r 3 months ago +1

    Is it my YouTube app, or do these videos start to lag over time? I checked a few other videos from your list and all of them lag, but other videos on YouTube don't

  • @prashantgodhwani6304 7 months ago +2

    Hi Jordan,
    Great video! I have a quick question: you mentioned that if we have a (user, follower) relationship, it will cause one of the (followers, following) queries to be slow. Could you please elaborate on how this issue would be addressed? It appears that with Cassandra and CDC, the data is eventually brought into a similar schema.
    Keep making such great videos! :)
    Keep making such great videos! :)

    • @abhijit2614 7 months ago +4

      Not Jordan, but we use two different tables. So "Abhijit follows Jordan" becomes:
      User_Follows: Abhijit -> Jordan
      User_Following: Jordan -> Abhijit
      Both tables are sharded by the first column. If you don't duplicate the follows data like this, you will need to do a scatter-gather across all shards for one of the two queries, depending on the schema. Try both queries (follows/following) with an example for the two different table schemas at 5:35.

    • @prashantgodhwani6304 7 months ago +1

      @@abhijit2614 that makes sense. Thanks Abhijit!

    • @jordanhasnolife5163 7 months ago +1

      Great explanation Abhijit, thank you!

  • @manishasharma-hy5mj 1 month ago +1

    Hi Jordan, can you please explain the CDC part once more: how does it work, from which table to which table does it capture changes, and what advantage do we get? Please 🙏

    • @jordanhasnolife5163 1 month ago

      1) When a user posts, we want to send their post to all their followers (assuming they don't have too many)
      2) We have all of those relationships in a database table already
      3) We need to load their followers, but this can be an expensive call to the DB
      4) We instead pre-cache all of this information by using change data capture on the DB to get it into a Flink instance, which we shard on posterId, so all of their followers will already be there
      5) When the post comes in, look at the followers, and send the post to the appropriate place (see the sketch below).
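
      A conceptual sketch of steps 1-5 in plain Python (feed_cache, popular_posts_db, and is_verified are hypothetical stand-ins; a real version would be a Flink job keyed on posterId):

        followers_by_poster = {}  # state pre-built from the follows-table CDC stream

        def on_follow_change(event):
            # CDC event from the follows table, routed here by hash(poster_id).
            s = followers_by_poster.setdefault(event["poster_id"], set())
            (s.add if event["op"] == "follow" else s.discard)(event["follower_id"])

        def on_post(post, feed_cache, popular_posts_db, is_verified):
            if is_verified(post["poster_id"]):
                popular_posts_db.insert(post)       # pull model for popular users
                return
            for follower_id in followers_by_poster.get(post["poster_id"], ()):
                feed_cache.push(follower_id, post)  # push model for everyone else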

  • @kushalsheth0212 3 months ago +2

    the confidence at 20:49 that "all women follow me, they love me" 😂was amazing.

  • @levyshi 1 month ago +1

    One question on the post service: when a user is posting, do they send the photos to S3 first, and then after the upload is done make a call to the post service? Or do they send the photo to the post service, and the post service writes to S3? How should we handle the post service failing after the upload to S3 is complete but before it can write to the Cassandra db?

    • @jordanhasnolife5163 1 month ago

      Ultimately that's your call.
      You probably have less ability to do validation uploading right to S3, but also less latency than having to upload to a server first.
      You can just do a client retry for the second part. If your client is down, whatever, we have an extra image in S3, no big deal.

  • @kehsihba2716 2 months ago +1

    Awesome video. No one ever gives details such as "Cassandra is good for fast ingestion because of leaderless replication and LSM trees". Thanks.
    My query: are we storing the whole tweets in the cache or just the tweet ids, like {"user_id": [tweet1, tweet6, tweet9,...]}?
    If we store it like this and then store all the recent tweets in a separate cache (single copies, not replicated 100 times), we can pick the tweet ids from each user's list from cache1 and then fetch the actual tweet from cache2.
    Does this make sense or am I talking like a noob?

    • @jordanhasnolife5163 2 months ago +1

      Makes sense! I'd anticipate storing the whole tweet because:
      1) they're small
      2) if you don't, then you have to read from a lot of tweet partitions

    • @amanjain2659 1 month ago +1

      @@jordanhasnolife5163 For the 2nd point: the tweet ids are mostly sorted in descending order of creation time in the user feed cache, so we could shard the tweet table on creation date. With this, the read would hit a limited set of partitions, wdyt?

    • @jordanhasnolife5163 1 month ago

      @@amanjain2659 I think that depends on the number of tweets. If there are too many, writing all of them to a single db partition at a time could overload it (for reads, we can probably use replicas).
      And if there are too many tweets, the range per partition becomes very small, and then we have to do a scatter/gather.

  • @jameshunt1822 6 months ago +1

    I have a tough time sitting through your videos. Nothing wrong on your part; I don't feel confident I can use these words (technologies) in an interview without knowing the ins and outs of them. Yeah, I need to learn the basics first.

    • @jordanhasnolife5163 6 months ago

      Have you watched the concepts series? That's a bit of a prerequisite

  • @dreezydreez 3 months ago +1

    Hey! Amazing vids, can you make a Calendly-like system design?

    • @jordanhasnolife5163 3 months ago

      Hey! Where do you think the complexity comes from for Calendly?

    • @dreezydreez 2 months ago +1

      @@jordanhasnolife5163 I think it's mostly the db design, of the appointments and times available per user

  • @akbarkool 3 months ago +1

    Question on the common CDC pattern you use when adding data to 2 tables: why not just make 2 separate calls to these tables (with retries)? Why do you need 2-phase commit? (If rolling back is the issue, I don't see how the CDC pattern helps with that)

    • @jordanhasnolife5163 3 months ago +1

      1) You can do this, now you just have to spend some more time thinking about all of the partial failure scenarios, how long you want to retry for, etc etc. In practice it's probably fine.
      2) When a write gets undone to table 1, that also goes through table 1's CDC and gets propagated to table 2.

    • @akbarkool 3 months ago +1

      @@jordanhasnolife5163 All the extra thinking about retries and everything else is surely less than the effort to set up an entire Kafka queue + CDC pipeline? I think we are over-engineering this aspect of the design and an interviewer could push back on this

  • @CompleteAbsurdist 1 month ago +1

    For the posts DB, what's your opinion on using MongoDB? In the end, the post data is almost always text. Wouldn't a document-based DB be suitable instead of Cassandra?

    • @jordanhasnolife5163 1 month ago

      I don't really think it's JSON text, so as long as it isn't too long (e.g. 140 characters) I don't think the document format would make a huge difference, but hard to say!

  • @mcee311 7 months ago +2

    For write conflicts, what if a user follows then unfollows? I guess last write wins in this situation?

    • @jordanhasnolife5163 7 months ago +3

      Yeah basically, it's an edge case but ideally shouldn't happen too often. Fortunately if we screw up a following it's not the end of the world.

  • @Prakhar1405 4 months ago +1

    What happens when a Flink node goes down during caching? Will this cached data be stored in S3 as well? How will we recover from this?

    • @jordanhasnolife5163 4 months ago

      If a Flink node goes down, we can bring up a new node, restore the state of the failed node from its previous checkpoint in S3, and then replay the Kafka queues beginning from the checkpoint barriers associated with that S3 checkpoint. I'd probably recommend watching the video that I made about Flink!

  • @brandonwl 7 months ago +3

    What is the point of having Flink? I don't see checkpointing being a benefit, because there is no state needed between each user-follower message?

    • @jordanhasnolife5163 7 months ago +2

      Good point! I suppose it's not completely necessary from a functionality perspective, though from a correctness perspective Flink does help us ensure that each message will be handled at least once, which is good for keeping our tables in sync.

  • @truptijoshi2535 1 month ago +1

    Hi Jordan, what is the difference between the user-verified-following cache and the popular caches?

    • @jordanhasnolife5163 1 month ago

      One contains posts from verified users, the other contains which users are verified

  • @scbs2007 5 months ago +1

    "Kafka - partition by follower Id": do you think it is pragmatic to have tons of partitions in Kafka? Should we not instead have created a limited set of partitions and mapped users onto that set? Or did you mean the latter?

    • @jordanhasnolife5163 5 months ago

      When I say "partition by follower Id", what I really mean is "use a partitioning function that uses the range of hashes on follower id". Agree with your concern otherwise!
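
      A minimal sketch of that partitioning function (the partition count is arbitrary):

        import hashlib

        NUM_PARTITIONS = 64  # a fixed, modest number of Kafka partitions

        def kafka_partition(follower_id: int) -> int:
            # Hash first, then reduce onto the bounded partition set: millions of
            # users map onto 64 partitions, and a given user always lands in the same one.
            digest = hashlib.md5(str(follower_id).encode()).digest()
            return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS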

  • @hung291100 5 months ago +1

    So in conclusion, write conflicts won't be a problem if the data doesn't usually get updated?

    • @jordanhasnolife5163 5 months ago

      What are you saying this in reference to? I'd say that write conflicts aren't a problem if we don't have writes that overwrite one another.

  • @koeber99 5 months ago +1

    Instagram + Twitter + Facebook + Reddit (part 1)
    How would your design change for each of the services? Instagram and Facebook have more photos and videos, therefore a CDN and S3 would be involved. But are there other potential changes to be made?
    In regard to using Kafka to process posts: while sharding by userID improves message locality and simplifies processing for specific users, it does not enforce message order within each user's stream. I think this will be OK for human users; however, if there is an automated service sending tweets one after the other, the ordering will not be correct in the news feed!

    • @jordanhasnolife5163 5 months ago +1

      1) Yeah I wouldn't really change anything here, just supply the s3 url in the content of the tweet
      2) You can perform ordering in the cache itself via something like a sorted list on the timestamp of each post. Inserts will be a little bit slower, but who cares?

    • @koeber99 5 months ago

      @@jordanhasnolife5163 cool, thanks... whenever you get a chance, can you look at part 2 of my questions? Thanks!

  • @Dhrumeel 4 months ago +1

    Why use Kafka for one side of this and Flink for the other? Aren't both of them able to perform stream processing just as well?

    • @jordanhasnolife5163 4 months ago

      Nope, they're different - Kafka just carries the messages, and Flink consumes them, storing them in memory if need be to process them statefully

    • @Dhrumeel 3 months ago +1

      @@jordanhasnolife5163 I guess I'm asking about Kafka Streams, since it seems to give us the ability to perform stream processing within Kafka itself (and not have to set up/manage separate clusters to run Flink)?

    • @jordanhasnolife5163 3 months ago

      @@Dhrumeel Ah I see what you're saying - yeah that's a viable option too!

  • @Abhishek-nh2dw 2 months ago +2

    What do you use to take notes... which software and device?

  • @ViralLordGaming 4 months ago +1

    CDC from the user-follower table will only fire when there is a change in data, right? How are we using that?

    • @jordanhasnolife5163 4 months ago

      I'm not entirely sure what you mean by this question, but yes: the idea is that we take the following-relationship changes and cache them in Flink via change data capture to Kafka. This way, when a post comes in, we already have the follower/following relationships cached.

    • @kippadlock 1 month ago

      @@jordanhasnolife5163 Do we cache the entire user-followers table in Flink? Is there a limit to how much data Flink can handle? Is it OK to duplicate all these tables in Flink?

  • @seemapandey2020 1 month ago +1

    Thanks.
    @jordan With CDC, does it mean that the entire user-follower mapping would be available in Kafka, all the time?
    Generally my view was that CDC is for 'change' capture, so it's good for incremental change processing, and the 'change' stream of user-follower mapping updates produced via the introduced Kafka producer would be marked completed once processed via Flink, and eventually flushed out.
    Though here it's being recommended as a replacement for persistence modelling. I need help building a mental model of this - why so? Otherwise the approach never feels like an intuitive extension.
    How does Kafka get re-populated across deployments, and after any re-boots following a failure?
    Also, I appreciate your fresh thinking & detailed approach to the solution - is it actually implemented for any similar use-case in industry? Otherwise it's limited to theoretical discussion

    • @jordanhasnolife5163 1 month ago

      I'm not sure I understand your question. Once the data gets to Flink and makes it into an S3 snapshot, it can be cleared from Kafka. At that point, we'll always have the state available to us.

    • @seemapandey2020 1 month ago +1

      @@jordanhasnolife5163 My view is that the stream processing of the runtime upstream 'change' from the follower service, with Flink as the change consumer, would also be flushed from the S3 snapshot eventually, once its change processing is complete.
      Would the entire user-following mapping be persisted and maintained on Flink even after the 'change' processing is complete? Otherwise, how does it serve queries on the user-following mapping?

    • @jordanhasnolife5163 1 month ago

      @@seemapandey2020 You're correct, I'm advising you cache the entire thing

  • @joonlee3863 2 months ago +1

    This may sound like a dumb question, but when designing the schema for the User-Follower table, you mentioned the reason you chose not to update both User-Follower and User-Following DBs directly is to avoid 2PC, since it'd be a distributed write.
    But what if we put both the User-Follower and User-Following tables in the same DB? What other reasons are there besides there being no good way to partition the DB without screwing up the other table (distributed query)?

    • @jordanhasnolife5163 2 months ago +1

      You just named the best reason haha!

    • @joonlee3863 2 months ago

      @@jordanhasnolife5163 thanks! Was wondering if there were other good reasons why you separated the 2 tables into separate DBs

  • @AnkitGaurav-qt1hs 4 months ago +1

    I'm not able to understand how using CDC does the work of a 2-phase commit

    • @jordanhasnolife5163 4 months ago

      Two-phase commit is good when we need our data in both places to be updated at the same time.
      CDC allows us to just put the data into Kafka (if the database goes down we can't, but that would have stopped 2PC too). Then, using something like Flink, we can guarantee that the message will be processed at least once successfully, and we keep trying until we move the data to the other place it needs to go.
      If we only care about eventual consistency, this works nicely.
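
      A minimal at-least-once consumer sketch with kafka-python (topic and group names are assumed; apply_to_following_table stands in for the derived-table write):

        import json
        from kafka import KafkaConsumer

        class TransientDbError(Exception):
            pass

        def apply_to_following_table(event):
            """Hypothetical idempotent write into the derived table."""
            print("applied", event)

        consumer = KafkaConsumer(
            "follows-cdc",                 # CDC topic name assumed
            group_id="follows-derived-table",
            enable_auto_commit=False,      # commit only after a successful apply
            value_deserializer=json.loads,
        )

        for msg in consumer:
            while True:
                try:
                    apply_to_following_table(msg.value)
                    break
                except TransientDbError:
                    continue               # retry: the event can only be re-run, not lost
            consumer.commit()              # offsets advance only past applied events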

  • @vorandrew 6 months ago +1

    User following task - make a 'user|follower' table storage model. Read model = redis :followers and :following sets, and update them async (flink)
    Feed - totally disagree. You make an individual feed per user - yes. But you don't clone posts... you make an array of post_ids ONLY, order/filter them by time, importance, whatever... Fetch posts on the client side by id

    • @jordanhasnolife5163 6 months ago

      I may have to double disagree with you back, since that fetch of posts by id can go to any number of partitions

    • @vorandrew 6 months ago

      @@jordanhasnolife5163 so what? Use the same Cassandra db...
      saving full posts will give overhead = avg 100 posts in feed * (timestamp (8) + post (140)) etc...

  • @priyanka971990 2 months ago +1

    Can you do a video on google/outlook calendar?

    • @jordanhasnolife5163 2 months ago

      Sure - what do you envision as the challenging part of this problem?

    • @priyanka971990 2 months ago

      @@jordanhasnolife5163 Recurring events, updating recurring events, storage for future dates, checking non-overlapping slots between a bunch of people (a team). Also, love all your videos, discovered them just when I thought I needed to re-read DDIA. Your videos made it easy to not go through the book again.

    • @priyanka971990 2 months ago

      @@jordanhasnolife5163 Scheduling recurring events for the future (6 months to a year), updating some of them, and storing them efficiently. Discovering overlapping events/slots (already booked) for multiple people on teams while booking future events for the team. Also, thanks for the videos, they are super binge-able.

  • @John-nhoJ 7 months ago +1

    Disagree with the scatter-gather for followers/following. Index both fields, shard on the user_id of the follower. It's not a linear scan if you use both indices.

    • @jordanhasnolife5163 7 months ago +2

      You can't index on both fields on the same table, unless you mean an inner sort, which doesn't help us get all followers for a given user without a linear scan
      Feel free to elaborate

    • @John-nhoJ 7 months ago +1

      @@jordanhasnolife5163 Table - follower
      follower_user_id: PK, idx (shard key)
      followed_user_id: FK (user.id), idx
      get_followers_by_user_id(user_id) will have to do a scatter-gather over the replicas of the shards, but you don't have demands for strong consistency, and most places prevent you from paginating deeply into someone's followers. E.g. no way Twitter will show you all of Taylor Swift's followers.

    • @John-nhoJ 7 months ago +2

      @jordanhasnolife5163 why no response? Cowering in fear?

  • @techlifewithmohsin6142 24 days ago +1

    Do we really need the complex CDC approach? A database like DDB with a secondary index can be used, where user_id is the partition key and follower_id is the secondary index. This solves both followers and following in the same table.

    • @jordanhasnolife5163 24 days ago

      How do you deal with different partitioning schemas? DDB has global indexes, but then you need two-phase commit.

    • @techlifewithmohsin6142 24 days ago

      @@jordanhasnolife5163 when we add a relationship entry in DDB it would be a single transaction; there is no two-phase commit

    • @techlifewithmohsin6142 24 days ago +1

      @@jordanhasnolife5163 the idea here would be to use an LSI, given that for this use case we don't need a GSI, so the two-phase commit (which happens internally with a shadow table) would be avoided. It can possibly lead to a hot partition if we're querying a particular user more often, but then our cache would come into play. So I think DDB can further simplify this. A good trade-off to consider

    • @jordanhasnolife5163 24 days ago +1

      @@techlifewithmohsin6142 I don't think I agree that a local secondary index would be sufficient here. If I want to find all followers of user x and all people that user x follows, how would I partition the table so that I can efficiently do this without duplicating a ton of data?

    • @techlifewithmohsin6142 23 days ago

      @@jordanhasnolife5163 yeah, now I'm realizing we would need a GSI, and with it the two-phase commit.

  • @kunalsinghal3558 15 days ago +1

    I think your user-verified-following cache is wrong, because you have sharded the Flink node on the basis of followerId.
    Suppose we have [1: verified, 2: notVerified] from the user service,
    and from the follower service, user -> followers as [1: {2, 3}]. Because you have partitioned on followerId, you can have "2 is following 1" on one partition and "3 is following 1" on a second partition. You don't have the information on the Flink node whether 1 is verified or not; we only have whether 2 is verified and whether 3 is verified, because we partitioned by followerId. We need the information that 1 is verified on the same node where 2 and 3 are.
    This can be resolved if we shard by userId instead of followerId. Is my understanding correct or am I missing something? Thanks for your efforts

    • @jordanhasnolife5163 15 days ago

      Probably just a typo on my part. I just want to quickly figure out who I follow that is verified, so yeah that seems reasonable

    • @kunalsinghal3558 14 days ago +1

      @@jordanhasnolife5163 Thank you so much for your efforts. Loving your playlist !

  • @DavidWoodMusic 3 months ago +1

    Man confesses death by blood shit less than 69 seconds into the video
    Based

  • @anilsaikrishnadevarasetty9870 1 month ago +1

    Jordan, could you share your notes please?

    • @jordanhasnolife5163 1 month ago

      Hey Anil - I'm planning on it, but this time around I'll likely build a website to do so. This is something that I hope to be able to take on within the next few weeks.

    • @anilsaikrishnadevarasetty9870 1 month ago

      @@jordanhasnolife5163 Thanks. I really enjoy your videos. Keep doing this kind of stuff. Looks like you like Flink a lot :P

  • @gourabsarker9552 7 months ago +1

    Sir how much do you earn as a software engineer? Plz reply. Thanks a lot.

    • @jordanhasnolife5163 7 months ago

      C'mon man you know I can't do that, one day I can be more candid

  • @MultiCckk 26 days ago +1

    takes me 3 hours to understand and complete a 45 min video rip😂

  • @gourabsarker9552 7 months ago +1

    Sir do you earn 160k dollars a year as a software engineer? Plz reply. Thanks a lot.

    • @jordanhasnolife5163 7 months ago +1

      No, I do not earn exactly 160k as a software engineer.

  • @soumik76 3 months ago +1

    Hi Jordan,
    We had a CDC from the user-following table to update the User Verified Following Cache (discussed around the 25th minute, when you shared the advantages of derived data).
    I don't see that in the final diagram. Instead, in the final diagram we are capturing CDC from the User Followers DB to update the User Verified Cache.
    Did I get that right?

    • @jordanhasnolife5163 3 months ago +1

      Yeah, I think it probably does make more sense from the user-following table, in the sense that when I make a query to the news feed service I care about who I follow that is verified. So probably less data manipulation to do it from that table.

    • @soumik76 3 months ago +1

      From a database perspective, don't you think that following has a causal dependency? Let's say I follow someone and it goes to one leader, and then I want to unfollow him, and this write goes to another leader where the follow update hasn't arrived yet.

    • @jordanhasnolife5163 3 months ago

      @@soumik76 Yeah, there is a causal dependency, but as long as we shard all of the operations properly the events get handled by the same Flink consumer

  • @SatinderSingh71 7 months ago +1

    When doing capacity estimation, just use rough calculations instead: 200 bytes * 365, and just round the 365 to 300. Makes it easier for us weebs to estimate on the fly

  • @smahns 7 months ago +1

    i know its big...

  • @yiannig7347 4 months ago +1

    DBAs will hate you for all that cdc lol

  • @ar4angelkz 7 months ago +3

    First

  • @Rohit-hs8wp 6 months ago +1

    I have a question on the user-follow DB choice. You said write conflicts will not be an issue (why so?).
    Suppose I have 3 nodes, and I am using quorum reads and writes (R=2, W=2). Suppose user 1 follows users 2, 3, 4 in very quick succession.
    "User1 follows User2" goes to Node 1, Node 2. "User1 follows User3" goes to Node 2, Node 3. "User1 follows User4" goes to Node 1, Node 3.
    Node 1 -> (User1, User2), (User1, User4); Node 2 -> (User1, User2), (User1, User3); Node 3 -> (User1, User3), (User1, User4).
    Now since Cassandra uses LWW, one of these follow relationships would be lost based on timestamp.
    We could have mitigated this issue if we had used Riak, maintaining User_Follow(user, set of follower_ids) with a set CRDT, or User_Follow(user_id, follower_id) using version vectors and storing siblings in the face of concurrent writes.
    Please comment with your thoughts. (Thank you for the videos, learning a lot from them)

    • @jordanhasnolife5163 6 months ago

      Hey the reason why the above should be fine is that those writes should be different rows, so they should never conflict with one another in the first place.
      Eventually, anti entropy will take place and then we'll sync up as expected.

    • @Rohit-hs8wp 6 months ago

      @@jordanhasnolife5163 Yes you are right. Thank you for the reply.

  • @gourabsarker9552 7 months ago +1

    Sir do you earn 120k dollars a year as a software engineer? Plz reply. Thanks a lot.