Web Crawler - System Design Interview Question

  • Published Nov 4, 2024

COMMENTS • 12

  • @games-are-for-losers
    @games-are-for-losers 8 months ago +7

    The YouTube algorithm has picked up your channel. Really good content.

  • @LouisDuran
    @LouisDuran 6 months ago +1

    I like that these are short and sweet. It shouldn't take an hour to explain TinyURL or web crawler. Thanks!

  • @ChimiChanga1337
    @ChimiChanga1337 8 months ago +1

    Excellent! Could you also talk about what kind of network protocols will be used for the services to talk to each other?

  • @sayantanscs
    @sayantanscs 1 month ago

    Is this really a good use case for Bloom filters? They have false positives, which means they can report a URL as visited when it actually is not (assuming we keep a set of visited URLs), so roughly 0.1% to 1% of URLs would never be visited!
    Since crawling is a continuous process, a workaround might be to ensure the values in the Bloom filter change with every run, so a URL missed on one run is not automatically missed on the next (see the sketch below).
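
    A minimal sketch of that workaround, assuming a hand-rolled URL-seen Bloom filter whose hash positions depend on a per-run salt; the class and parameter names below are illustrative, not from the video:

      import hashlib
      import math

      class BloomFilter:
          """Approximate 'URL seen' set: false positives possible, false negatives not."""

          def __init__(self, expected_items: int, fp_rate: float, salt: str = ""):
              # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
              self.size = max(8, int(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
              self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
              self.bits = bytearray(self.size // 8 + 1)
              self.salt = salt  # changing this between crawl runs changes the hash positions

          def _positions(self, url: str):
              for i in range(self.num_hashes):
                  digest = hashlib.sha256(f"{self.salt}:{i}:{url}".encode()).digest()
                  yield int.from_bytes(digest[:8], "big") % self.size

          def add(self, url: str) -> None:
              for pos in self._positions(url):
                  self.bits[pos // 8] |= 1 << (pos % 8)

          def might_contain(self, url: str) -> bool:
              return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

      # Rebuilding the filter with a new salt each run means a URL that collided
      # (false positive) in one run is unlikely to collide the same way in the
      # next run, so it is not permanently skipped.
      seen = BloomFilter(expected_items=1_000_000, fp_rate=0.01, salt="run-2024-11-04")
      if not seen.might_contain("https://example.com/page"):
          seen.add("https://example.com/page")  # enqueue the URL for fetching here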

  • @rajaryanvishwakarma8915
    @rajaryanvishwakarma8915 8 months ago +1

    Great video man

  • @LearningNewThings0407
    @LearningNewThings0407 6 months ago +1

    Is it "Font queue prioritizer" or "Front queue prioritizer"?

  • @jjlee4883
    @jjlee4883 8 months ago

    Awesome video. Would it make sense for the URL Seen Detector and URL Filter to come after the HTML parser step?

    • @TechPrepYT
      @TechPrepYT  8 months ago

      Thanks for the comment! You would want duplicate content detection to occur directly after the HTML parser, because we don't want to process the same data and extract the same URLs from the same page twice; that's why the URL Seen Detector and URL Filter happen later on in the system. Hope this makes sense!
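
      A minimal sketch of that stage ordering, with hypothetical stand-ins for every component (not the video's actual design or code):

        from dataclasses import dataclass, field
        from typing import Callable

        @dataclass
        class CrawlPipeline:
            fetch: Callable[[str], str]                  # downloader
            parse: Callable[[str], list[str]]            # HTML parser -> extracted links
            is_duplicate_content: Callable[[str], bool]  # content dedup, directly after the parser
            url_filter: Callable[[str], bool]            # URL Filter (scheme, robots, blocklists, ...)
            url_seen: set[str] = field(default_factory=set)    # URL Seen Detector
            frontier: list[str] = field(default_factory=list)  # URL frontier

            def step(self, url: str) -> None:
                html = self.fetch(url)

                # Duplicate-content check right after fetching/parsing, so the same
                # page body is never re-processed and its links never re-extracted.
                if self.is_duplicate_content(html):
                    return

                for link in self.parse(html):
                    # URL Filter and URL Seen Detector come later, once per extracted link.
                    if self.url_filter(link) and link not in self.url_seen:
                        self.url_seen.add(link)
                        self.frontier.append(link)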

  • @WINDSORONFIRE
    @WINDSORONFIRE 4 months ago

    How does the design of a web crawler not include geo-located servers, etc.?

  • @dibll
    @dibll 8 months ago

    During the duplicate detection step, how is the Content Cache being used? Could someone please explain?
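
    One common reading (an assumption, not confirmed in the video) is that the Content Cache stores checksums of recently processed page bodies, so a page whose content was already seen, even under a different URL, can be skipped; a minimal sketch with made-up names:

      import hashlib
      from collections import OrderedDict

      class ContentCache:
          """Bounded map of content checksum -> first URL seen with that body."""

          def __init__(self, max_entries: int = 1_000_000):
              self._entries: OrderedDict[str, str] = OrderedDict()
              self._max = max_entries

          def seen_before(self, url: str, body: str) -> bool:
              key = hashlib.sha256(body.encode()).hexdigest()
              if key in self._entries:
                  self._entries.move_to_end(key)        # keep recently hit entries warm
                  return True                           # duplicate content -> skip this page
              self._entries[key] = url
              if len(self._entries) > self._max:
                  self._entries.popitem(last=False)     # evict the oldest entry
              return False

      cache = ContentCache()
      cache.seen_before("https://example.com/a", "<html>same body</html>")  # False: first time
      cache.seen_before("https://example.com/b", "<html>same body</html>")  # True: duplicate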