Find the distance between friends/connections - LinkedIn System Design with

Поділитися
Вставка
  • Опубліковано 5 лип 2024
  • Have you noticed a small icon on the right of LinkedIn profiles?
    It tells you how closely you are related to a user (1st, 2nd or 3rd-degree connection).
    These distances are calculated using graph algorithms. Specifically, LinkedIn uses Bi-directional BFS.
    This is a two-way algorithm where a search is started from both source and destination, and terminated midway. The total distance traveled from source+destination is the degree of the connection.
    But at scale, this solution brings challenges. Let's limit ourselves to finding 3rd-degree connections:
    1. Every query needs a graph traversal on our social network.
    2. A naive BFS would take O(n^3) for search. A bi-directional BFS needs O(n^2) for searching users + O(n^2) for merging results.
    3. To avoid performing these queries repeatedly, we cache the second-degree connections of users. This saves up on the first O(n^2) factor for subsequent queries.
    Great, but how do we store all this data in a single cache?
    Well, we can't. LinkedIn is forced to scale out, with multiple cache nodes storing second-degree connections. User-Connection data is sharded into caches according to user_id.
    But what if a machine crashes? All queries to the shards hosted by that machine will fail.
    To avoid this, we replicate shards into different machines. Even if one of the machines fails, the shard can be found on another machine.
    And here we meet our main villain: High Latency.
    We don't want to hit all the machines with our shards. We want to hit the SMALLEST possible number of machines, that contain all our shards.
    For the mathematically inclined, this is a set-cover problem. LinkedIn uses a modified greedy version of the problem to reduce its compute costs in half!
    The basic idea is to find machines that host most of the users who are your 2nd-degree connections, and avoid those that have few 2nd-degree connections.
    Less effort, more work done!
    00:00 Introduction
    00:35 Capacity Estimation
    01:20 Bruteforce SQL query
    04:00 Bidirectional BFS
    06:37 Caching connection data
    10:00 Potential databases
    11:11 Redundancy for fault tolerance
    12:12 Cold starts
    13:48 Removing connections
    14:37 Replicated Caches
    16:03 Ideal replication factor?
    17:25 Summary
    Here is the paper: www.usenix.org/system/files/c...
    References:
    Sharding: • What is DATABASE SHARD...
    System Design at InterviewReady: interviewready.io/
    Designing Data-Intensive Applications Book: amzn.to/3SyNAOy
    You can follow me on:
    Github: github.com/InterviewReady/sys...
    Twitter: / gkcs_
    #SystemDesign #InterviewReady #SocialNetworks

КОМЕНТАРІ • 14

  • @gkcs
    @gkcs  2 місяці тому +3

    The video description is also worth reading :D
    Also, try this free system design question at InterviewReady.
    interviewready.io/learn/system-design-course/google-docs-collaborative-editor-design/complete-the-design---google-docs
    Cheers!

  • @antkin608
    @antkin608 4 дні тому

    Loved this, thanks for sharing.

  • @crane71
    @crane71 2 місяці тому +2

    Hey, I think this is more like a People You May Know (PYMK) Machine Learning System Design problem, we normally use similarity vectors to approach this problems with the plethora of similarity searches in vector space, so instead of doing o(n^2), it's cut down into computing similarity vector searches for just connections of 1st degree. And then rank / re rank them using scores to push to the main user.
    I agree if you want to find the distance between two users, you'll have to use something like bi directional BFS. Just that I don't see a use case for why we should store this user - user distances, when we can do it using vector representations for similarity searches.

    • @indavarapuaneesh2871
      @indavarapuaneesh2871 2 місяці тому +1

      Computing People You May Know: is very critical to social media platforms like Twitter, Instagram where you can n number of attribute to compute the vector embeddings for a user and then doing the similarity search.
      But for asking referrals or finding a set of people in a company you're interested is a deterministic problem and bi-directional bfs will do the job perfectly.
      But for recommending PYMK, even Linkedin will use the approach that you've described.

  • @ronakkalyani5035
    @ronakkalyani5035 2 місяці тому +1

    LinkedIn has a 3rd+ degree connection, so IMO, they are just identifying 1st and 2nd-degree connections and the rest of the connections they are classifying as 3rd+. For Identifying 2nd degree connections we should be able to do it in O(n), the way you specified. Also, for 1st-degree connections, it seems an eventually consistent design. Search results didn't update as soon after accepting a connection request. btw, Thanks for the explanation, traversing both ways was new.

    • @gkcs
      @gkcs  2 місяці тому

      In fact, they store second degree connections of all active users, and can tell if a third degree connection exists by doing an intersection 😁

  • @pulkitmittal584
    @pulkitmittal584 2 місяці тому +1

    Can you cover system design of multiplayer online games

  • @Bob_tygabk
    @Bob_tygabk Місяць тому

    Hey do u do one one one prep for system design interview (particularly for stripe) if not stripe then similar company like PayPal
    How can I set up something like that ? I have a system design interview soon and I will like to prep with u

  • @Prepaccount-fq6cr
    @Prepaccount-fq6cr 2 місяці тому

    Still did not get how the birectional efficient and works: Are we going apply intersection {A->ist degree B's degree but how are we going to do comparsion user by user } on every user on every search, i felt every time graph traversal on both direction degree would be super expensive eventhough its bi-directional. Still lot granuler details were not covered like what we store in cache.

    • @rentedunicorn
      @rentedunicorn 2 місяці тому +1

      probably because 2*n^2 is better than n^3 ?

  • @premkumar-wt8lu
    @premkumar-wt8lu Місяць тому

    @Gauvravsen Thanks for the wonderful video. Can you please ebay auction system design

  • @priyanshsingh6162
    @priyanshsingh6162 2 місяці тому

    NGL, I checked my slack when I heard that notification sound.

    • @gkcs
      @gkcs  2 місяці тому

      Hahaha!