Find the distance between friends/connections - LinkedIn System Design
- Published 5 Jul 2024
- Have you noticed a small icon on the right of LinkedIn profiles?
It tells you how closely you are related to a user (1st, 2nd or 3rd-degree connection).
These distances are calculated using graph algorithms. Specifically, LinkedIn uses Bi-directional BFS.
This is a two-way algorithm where a search is started from both the source and the destination and terminated where the two frontiers meet. The total distance traveled from the source plus the distance traveled from the destination is the degree of the connection.
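The two-way search described above can be sketched in a few lines. This is an illustrative in-memory sketch, not LinkedIn's implementation: `degree_of_connection` and `max_degree` are made-up names, and the graph is assumed to be a plain adjacency map.

```python
def degree_of_connection(graph, source, target, max_degree=3):
    """Bi-directional BFS: expand a frontier from each end; the degree
    is the sum of layers expanded on both sides when the frontiers meet."""
    if source == target:
        return 0
    frontier_s, frontier_t = {source}, {target}
    visited_s, visited_t = {source}, {target}
    dist_s = dist_t = 0
    while frontier_s and frontier_t and dist_s + dist_t < max_degree:
        # always grow the smaller frontier to keep the search balanced
        if len(frontier_s) <= len(frontier_t):
            frontier, visited, other = frontier_s, visited_s, visited_t
            from_source = True
        else:
            frontier, visited, other = frontier_t, visited_t, visited_s
            from_source = False
        nxt = set()
        for node in frontier:
            for nbr in graph.get(node, ()):
                if nbr in other:          # the two searches met
                    return dist_s + dist_t + 1
                if nbr not in visited:
                    visited.add(nbr)
                    nxt.add(nbr)
        if from_source:
            frontier_s, dist_s = nxt, dist_s + 1
        else:
            frontier_t, dist_t = nxt, dist_t + 1
    return None  # farther than max_degree
```

On a chain A-B-C-D this reports degrees 1, 2, and 3 from A, and gives up beyond `max_degree` instead of traversing the whole graph.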
But at scale, this solution brings challenges. Let's limit ourselves to finding 3rd-degree connections:
1. Every query needs a graph traversal on our social network.
2. With an average of n connections per user, a naive BFS to depth 3 touches O(n^3) profiles. A bi-directional BFS needs O(n^2) for expanding the two frontiers + O(n^2) for merging the results.
3. To avoid performing these traversals repeatedly, we cache the second-degree connections of users. This saves the first O(n^2) term on subsequent queries.
Great, but how do we store all this data in a single cache?
Well, we can't. LinkedIn is forced to scale out, with multiple cache nodes storing second-degree connections. User-connection data is sharded across the caches by user_id.
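Sharding by user_id can be sketched with a stable hash (a sketch only; `shard_for` is a made-up helper, and real deployments typically use consistent hashing so shards don't all move when a node joins or leaves):

```python
import hashlib

def shard_for(user_id: str, num_shards: int = 16) -> int:
    """Map a user_id deterministically to a cache shard. Python's
    built-in hash() is randomized per process, so use a digest."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The same user_id always lands on the same shard, so a query for a user's cached connections knows exactly which node to ask.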
But what if a machine crashes? All queries to the shards hosted by that machine will fail.
To avoid this, we replicate shards into different machines. Even if one of the machines fails, the shard can be found on another machine.
And here we meet our main villain: High Latency.
We don't want to hit every machine that holds one of our shards. We want to hit the SMALLEST possible number of machines that together contain all of them.
For the mathematically inclined, this is a set-cover problem. LinkedIn uses a modified greedy algorithm for it to cut its compute cost in half!
The basic idea is to prefer machines that host many of your 2nd-degree connections, and avoid those that host only a few.
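The greedy idea can be sketched as follows. This is the textbook greedy set-cover heuristic, not LinkedIn's exact modified version; `machines_to_query` and `machine_hosts` are hypothetical names.

```python
def machines_to_query(needed_users, machine_hosts):
    """needed_users: ids whose cached connection sets we must read.
    machine_hosts: machine -> set of user ids it hosts (with
    replication, a user id appears under several machines).
    Greedily pick the machine covering the most uncovered users."""
    uncovered = set(needed_users)
    chosen = []
    while uncovered:
        best = max(machine_hosts,
                   key=lambda m: len(machine_hosts[m] & uncovered))
        gain = machine_hosts[best] & uncovered
        if not gain:
            raise ValueError("some users are hosted on no machine")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

For example, with machines hosting {1,2,3}, {3,4}, and {4,5}, fetching users {1,2,3,4} needs only two machines: the greedy pass takes the {1,2,3} host first, then one host of 4.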
Less effort, more work done!
00:00 Introduction
00:35 Capacity Estimation
01:20 Bruteforce SQL query
04:00 Bidirectional BFS
06:37 Caching connection data
10:00 Potential databases
11:11 Redundancy for fault tolerance
12:12 Cold starts
13:48 Removing connections
14:37 Replicated Caches
16:03 Ideal replication factor?
17:25 Summary
Here is the paper: www.usenix.org/system/files/c...
References:
Sharding: • What is DATABASE SHARD...
System Design at InterviewReady: interviewready.io/
Designing Data-Intensive Applications Book: amzn.to/3SyNAOy
You can follow me on:
Github: github.com/InterviewReady/sys...
Twitter: / gkcs_
#SystemDesign #InterviewReady #SocialNetworks
The video description is also worth reading :D
Also, try this free system design question at InterviewReady.
interviewready.io/learn/system-design-course/google-docs-collaborative-editor-design/complete-the-design---google-docs
Cheers!
Loved this, thanks for sharing.
Hey, I think this is more like a People You May Know (PYMK) machine-learning system design problem. We normally approach these problems with similarity vectors and the plethora of similarity-search techniques in vector space, so instead of doing O(n^2), it's cut down to computing vector similarity searches over just the 1st-degree connections, and then ranking/re-ranking them by score to surface to the main user.
I agree that if you want to find the distance between two users, you'll have to use something like bi-directional BFS. I just don't see a use case for storing these user-to-user distances when we can use vector representations for similarity searches.
Computing People You May Know is very critical for social media platforms like Twitter and Instagram, where you can use any number of attributes to compute a user's vector embedding and then do the similarity search.
But asking for referrals, or finding a set of people in a company you're interested in, is a deterministic problem, and bi-directional BFS will do the job perfectly.
But for recommending PYMK, even LinkedIn will use the approach that you've described.
LinkedIn has a 3rd+ degree category, so IMO they only identify 1st- and 2nd-degree connections and classify the rest as 3rd+. Identifying 2nd-degree connections should be doable in O(n), the way you specified. Also, 1st-degree connections seem to be an eventually consistent design: search results didn't update immediately after I accepted a connection request. BTW, thanks for the explanation; traversing from both ends was new to me.
In fact, they store the second-degree connections of all active users, and can tell whether a third-degree connection exists by doing an intersection 😁
Can you cover the system design of multiplayer online games?
Hey, do you do one-on-one prep for system design interviews (particularly for Stripe)? If not Stripe, then a similar company like PayPal.
How can I set something like that up? I have a system design interview soon and I would like to prep with you.
Still didn't get how the bi-directional BFS is efficient or how it works. Are we going to apply an intersection (A's 1st-degree set with B's 1st-degree set — but how do we compare user by user?) on every user on every search? I feel that traversing the graph from both directions on every search would be super expensive even though it's bi-directional. A lot of granular details weren't covered, like what we store in the cache.
Probably because 2*n^2 is better than n^3?
@Gauvravsen Thanks for the wonderful video. Can you please cover eBay auction system design?
NGL, I checked my slack when I heard that notification sound.
Hahaha!