Twitter Search/ElasticSearch Design Deep Dive with Google SWE! | Systems Design Interview Question 8

  • Published 15 Oct 2024

COMMENTS • 56

  • @Snehilw
    @Snehilw 2 years ago +5

    Amazing content as always, dude. Love how in-depth you go in all of your videos! My favorite channel of all by far! Have recommended this to several friends.

  • @AAASHIVAM
    @AAASHIVAM 15 days ago +1

    Hey Jordan, question here:
    1. Can we have hierarchical shards, i.e. a sharding strategy that segregates nodes by search term? E.g. aa -> shard1, ab -> shard2, abc to abz -> shard3, ac -> shard4, and so on.
    2. With this strategy we would be writing document IDs for terms to particular shards.
    3. In case of any hotspotting we can split a shard further.
    4. For extremely popular terms like trump, we can have the term span multiple shards.
    5. We can keep track of the shards using a gossip protocol or ZooKeeper.
    From what I understood, in our current solution we are doing a scatter-gather, where we query each partition and aggregate. If we have hundreds of partitions, then we have to filter out a lot of results. I understand that we can do it in parallel, but if we are only interested in the top 2 results, then we will end up fetching the top 2 results locally from every partition and then filtering them out.

    • @jordanhasnolife5163
      @jordanhasnolife5163  13 days ago

      1) Yes you can
      2) Agreed, but you have to be smart about how this is done. If you think of Twitter, you have a shit ton of tweets with the word "the" in them - how do you smartly partition those?
      3) See #2
      4) See #2
      5) Sure
      Agree with your general sentiment, and agree that it theoretically makes more sense. In practice, I believe the majority of search terms that get used are extremely popular anyway, so just keeping a local index of recently published tweets, with some scoring of each local document, works well. (Keep in mind relative scoring is tough as well when doing global partitioning, because you need access to the whole document to do TF-IDF on a particular search term, meaning the document itself needs to be sent to potentially hundreds of shards on write.)
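
To make that TF-IDF point concrete, here is a minimal sketch in plain Python (the formula and the corpus counts are illustrative, not Lucene's actual scoring): scoring a term against one document needs the term's frequency within that document, the document's length, and the corpus-wide document frequency, which is exactly the per-document state a purely term-partitioned shard wouldn't have locally.

```python
import math

def tf_idf(term: str, doc_tokens: list[str], num_docs: int, docs_with_term: int) -> float:
    """Score one (term, document) pair: term frequency in the doc times
    inverse document frequency of the term across the corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)     # needs the whole document
    idf = math.log(num_docs / (1 + docs_with_term))   # needs corpus-wide stats
    return tf * idf

doc = "the gigachad posted the tweet".split()
print(tf_idf("the", doc, num_docs=1_000_000, docs_with_term=900_000))   # common term, low score
print(tf_idf("gigachad", doc, num_docs=1_000_000, docs_with_term=500))  # rare term, high score
```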

  • @Ms452123
    @Ms452123 2 years ago +12

    Sshhh, man's been hiding the gun show this whole time. Giga Chad on the low.

  • @cc-to2jn
    @cc-to2jn 1 year ago +2

    dude, you along with neetcode are my go-to. Great content and clear explanations.

  • @mickeyp1291
    @mickeyp1291 9 months ago +1

    As always, great videos - at 18:00 I'd forgotten all that stuff about ES's caching, so thanks for the reminder; gonna reread that part in the ES docs. Great job knowing about Lucene - most of my applicants have no clue about ES, and definitely not that Lucene is not a DB but a search engine (hate the JSON syntax, but what can you do). Again, super fun to listen to your vids and watch this content.

  • @shivamsinha642
    @shivamsinha642 2 years ago +2

    liked solely for the description

  • @yashagarwal8249
    @yashagarwal8249 7 months ago +2

    Will the Search Service pull the actual documents from the DB once it receives the document IDs from the cache/search index?

  • @RandomShowerThoughts
    @RandomShowerThoughts 1 year ago +1

    16:00 exactly, I was thinking the same thing. Typically you write to the source of truth and use a queue to send it out to the various locations.

    • @mickeyp1291
      @mickeyp1291 9 months ago

      Today you assume the queue is the source of truth, then you spill into S3.

  • @SwapnilSuhane
    @SwapnilSuhane 4 months ago +1

    Great depth of core search design discussion, with a bit of comedy ;)

  • @RandomShowerThoughts
    @RandomShowerThoughts 1 year ago +1

    16:00 we can also use Debezium (for certain databases), which would write to Kafka, and we can listen on that topic.

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 year ago +1

      I'll have to look into this! Haven't had the privilege of using Kafka during my career, so I hadn't heard of Debezium.

    • @RandomShowerThoughts
      @RandomShowerThoughts 1 year ago +1

      @@jordanhasnolife5163 it's pretty cool, I used it at my last company. We used Debezium to capture changes from the database via the WAL; it would then write to a Kafka topic and we could read off it.
      The one downside here is that it writes all the messages into a single topic and a single partition to ensure ordering. So the approach you mentioned of writing directly to Kafka would allow us to write to multiple partitions if needed (allowing more parallelization).
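
A sketch of the pipeline described above, assuming the kafka-python and elasticsearch client libraries; the topic name, index name, and the shape of the change event are hypothetical (real Debezium envelopes carry a schema plus before/after row images and an op type):

```python
import json
from kafka import KafkaConsumer          # kafka-python
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x

consumer = KafkaConsumer(
    "tweets.cdc",                        # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for message in consumer:
    change = message.value
    tweet = change.get("after")          # row image after the change
    if tweet is not None:                # skip deletes/tombstones in this sketch
        es.index(index="tweets", id=tweet["tweet_id"], document=tweet)
```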

  • @kyabia2333
    @kyabia2333 1 year ago +1

    amazing, very helpful

  • @kamalsmusic
    @kamalsmusic 2 years ago +3

    If we use the local index (meaning each node stores term -> [doc IDs] and multiple nodes can reference the same term), does this mean we need to query all the nodes to answer a search query? How do we know which nodes have the term we're interested in if we're not partitioning by term?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 years ago

      Yes, you have to query them all and aggregate. It's unfortunate, but there's typically too much data to shard by term as opposed to document.

    • @axings1
      @axings1 1 year ago

      @@jordanhasnolife5163 could we first partition by term, then further partition into multiple shards if a single term has too much data?
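
A toy version of the scatter-gather being discussed in this thread (the LocalPartition class and its index layout are made up for illustration): every shard gets queried, each returns only its local top-k for the term, and the aggregator merges those into a global top-k.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class LocalPartition:
    """A document-partitioned shard holding a local term -> [(score, doc_id)] index."""
    def __init__(self, index: dict[str, list[tuple[float, str]]]):
        self.index = index

    def top_k(self, term: str, k: int) -> list[tuple[float, str]]:
        return heapq.nlargest(k, self.index.get(term, []))

def scatter_gather(partitions, term, k):
    # No partition can be skipped: any shard may hold matching documents.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda p: p.top_k(term, k), partitions))
    # Merge the per-shard top-k lists into a single global top-k.
    return heapq.nlargest(k, (hit for hits in local for hit in hits))

shards = [
    LocalPartition({"gigachad": [(0.9, "t1"), (0.4, "t2")]}),
    LocalPartition({"gigachad": [(0.7, "t3")]}),
]
print(scatter_gather(shards, "gigachad", 2))   # [(0.9, 't1'), (0.7, 't3')]
```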

  • @neethijoe
    @neethijoe 5 months ago +1

    Don't we need a parser/lexer service between Kafka and the search index that parses the tweets and hashes them to the correct partitions of the search index?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Something like Elasticsearch will do this for us, which is why I don't explicitly include it.
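
For intuition, here is roughly what that built-in analysis step does (a crude stand-in for Elasticsearch's standard analyzer, with a made-up stopword list), plus the document-based routing this design uses:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "in"}   # tiny illustrative list

def analyze(tweet_text: str) -> list[str]:
    """Lowercase, split into word tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", tweet_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def route(doc_id: int, num_partitions: int) -> int:
    """Route by document id, not by term (every term of a tweet lands on one shard)."""
    return doc_id % num_partitions

print(analyze("The GigaChad posted ANOTHER tweet!"))  # ['gigachad', 'posted', 'another', 'tweet']
print(route(doc_id=123456, num_partitions=8))         # 0
```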

  • @FarhanKhan-wu3fq
    @FarhanKhan-wu3fq 1 year ago +2

    Did you really just go "NOPQRS" to figure out what comes after P?

  • @lv0320
    @lv0320 29 days ago +1

    Is it recommended to partition the search index by term or by the tweet_id/user_id?

    • @jordanhasnolife5163
      @jordanhasnolife5163  29 days ago

      I don't think by term is realistic at Twitter's scale to be honest

    • @lv0320
      @lv0320 29 days ago +1

      @@jordanhasnolife5163 I see, so when a user searches for a term, the request would hit all partitions to get subset results and then they would be aggregated?

    • @jordanhasnolife5163
      @jordanhasnolife5163  28 days ago

      @@lv0320 Ideally, there should be some partitions that correspond to recent data. That way you don't have to hit all of them.
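
One common way to get that "recent data" property is time-bucketed partitions; this sketch (the naming scheme is entirely hypothetical) shows how a query for recent tweets would fan out to only the last few buckets instead of every partition:

```python
from datetime import datetime, timedelta, timezone

def recent_partitions(days_back: int) -> list[str]:
    """One partition per UTC day, e.g. tweets-2024-10-15; a search over
    recent tweets only needs to hit these instead of every partition."""
    today = datetime.now(timezone.utc).date()
    return [f"tweets-{today - timedelta(days=d):%Y-%m-%d}" for d in range(days_back)]

print(recent_partitions(3))   # the three most recent daily partitions
```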

  • @AmolGautam
    @AmolGautam 7 months ago +1

    Thanks giga bro

  • @idobleicher
    @idobleicher 1 year ago +1

    I liked your videos, new sub!

  • @anupamdey4893
    @anupamdey4893 1 year ago +1

    Love your content ! Keep up the good work !!

  • @maxmanzhos8411
    @maxmanzhos8411 7 months ago +1

    Wrote a long comment about how a posting list (the documents containing a term) is implemented as a skip list + encoding, per Lucene99PostingsFormat in the apache/lucene GitHub repo, as I was wondering why we can't use a similar idea for follower/following list storage in the news feed problem (from System Design 2). But it's only viable if you either store the data in Lucene (I guess no one does that with this purpose in mind) or have full control over the DB code, so that you can do such advanced customization over a column (also not practical).
    nice guns

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago +1

      Interesting, haven't heard of that data structure but would agree that it may be an overoptimization.
      Thanks, I work hard on the guns haha
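
For the curious, the core idea behind those skip-list posting lists is easy to show: when intersecting two sorted doc-ID lists, a skip pointer lets you hop several entries at once instead of advancing one at a time. Real Lucene (Lucene99PostingsFormat) adds block encoding and multi-level skips; this sketch is only the intuition.

```python
def intersect(a: list[int], b: list[int], skip: int = 4) -> list[int]:
    """Intersect two sorted posting lists, hopping `skip` entries when safe."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # Follow the "skip pointer" only if it doesn't overshoot the target.
            i += skip if i + skip < len(a) and a[i + skip] <= b[j] else 1
        else:
            j += skip if j + skip < len(b) and b[j + skip] <= a[i] else 1
    return out

print(intersect(list(range(0, 100, 2)), [8, 41, 64, 90]))   # [8, 64, 90]
```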

  • @neek6327
    @neek6327 2 years ago +1

    Hey man, qq. I was wondering if you thought it would be important in an interview to mention how we know which machine holds which partition? I was thinking maybe we could have a distributed search/index service that maintains the mapping between partition -> machine, and that mapping could be made consistent across the "search/index service" nodes via a consensus algo or maybe ZK. Does this make sense at all, or am I missing something? Maybe it's the local secondary indexes that take care of the problem I'm describing and I just don't understand 🤷‍♂️

    • @neek6327
      @neek6327 2 years ago

      Like rather than relying on local index, if we knew which machine held which partition couldn’t we just go directly to the correct shard and perform a binary search?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 years ago +2

      Yes, you would use ZooKeeper or a gossip protocol to keep track of which docs are held on which partition. Though this shouldn't really matter here, since we have to query each partition anyway.

    • @neek6327
      @neek6327 2 years ago +1

      Hmm sorry, maybe this is going over my head. Why is it that we need to query each partition if we know exactly the partition that contains the word we’re looking for?
      Like say someone is searching the word “gigachad” and we know that machine 1 holds the partition range with that word in it. Couldn’t we go directly to machine 1 and perform a binary search there rather than querying all the shards?
      Maybe my understanding is off?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 years ago +2

      @@neek6327 We aren't partitioning that way here - we're partitioning by groups of document IDs, not by term. While in theory partitioning by term is optimal, the reality is that there are often too many document IDs associated with one term to fit on a given machine, and as a result we really have no choice but to use local indexes over groups of documents.

    • @neek6327
      @neek6327 2 years ago +1

      Got it, that makes sense. Thanks 🙏

  • @raj_kundalia
    @raj_kundalia 7 months ago +1

    thank you!

  • @DileepBC-r2x
    @DileepBC-r2x 10 months ago +2

    I don't understand most of the things, but thanks for the video.

  • @eudaimonian9473
    @eudaimonian9473 2 years ago +1

    Gigachad42 in da house

  • @RandomShowerThoughts
    @RandomShowerThoughts 1 year ago +1

    Grokking the system design sucks at this question ngl, searched for a solution right after reading it

  • @art4eigen93
    @art4eigen93 2 years ago +2

    Interviewee: API design is going to be pretty tiny
    Interviewer: How tiny?
    Interviewee: You know....

  • @RandomShowerThoughts
    @RandomShowerThoughts 1 year ago +1

    00:40 lmaooooo the "day in my life as a software engineer" videos are cringey af