Design a Basic Search Engine (Google or Bing) | System Design Interview Prep

  • Published Jun 6, 2024
  • Visit Our Website: interviewpen.com/?...
    Join Our Discord (24/7 help): / discord
    Join Our Newsletter - The Blueprint: theblueprint.dev/subscribe
    Like & Subscribe: / @interviewpen
    This is an example of a full video available on interviewpen.com. Check out our website to find more premium content like this!
    Problem Statement:
    Provide a design overview of a basic search engine. Your search engine system must support the following:
    - *Retrieval:* The search engine should display a list of relevant web pages in response to a user query. The results should include the page title, URL, and a brief summary.
    - *Indexing:* The system should be able to crawl and index web pages from the Internet. The indexing process should store metadata about the web pages, such as their URL, title, and a brief summary.
    - *Scalability:* The system should be designed to handle a large number of queries and indexed web pages, ensuring that response times remain low as the search engine scales.
    Finer concerns such as query processing & page ranking can be briefly addressed, but are not mandatory.
    Table of Contents:
    0:00 - Requirements
    0:20 - How Search Works
    1:57 - API: Accepting Search Queries
    2:16 - Database: Storing Site Metadata
    4:19 - Database Demands
    4:51 - Page BLOB Store
    5:17 - Database Sharding
    6:10 - Global Index
    6:33 - Text Index
    7:09 - The System Thus Far
    7:52 - Crawling
    9:06 - robots.txt Cache
    9:24 - Crawler Demands
    10:31 - The System So Far
    11:04 - URL Frontier: Priority
    11:39 - URL Frontier: Politeness
    12:01 - Naive URL Frontier
    12:31 - Multiple Queues
    13:35 - Solving for Politeness
    15:51 - URL Frontier: Recap
    16:16 - URL Frontier Demands
    17:24 - Full Design Review
    17:49 - Extensions
    19:10 - Visit interviewpen.com
    Socials:
    Twitter: / interviewpen
    Twitter (The Blueprint): / theblueprintdev
    LinkedIn: / interviewpen
    Website: interviewpen.com/?...

COMMENTS • 300

  • @interviewpen
    @interviewpen  1 year ago +21

    Thanks for watching! Visit interviewpen.com/? for more great Data Structures & Algorithms + System Design content 🧎

  • @prathamshenoy9840
    @prathamshenoy9840 1 year ago +85

    needless to say.... your channel will grow superbly. this video was RECOMMENDED by youtube

    • @interviewpen
      @interviewpen  1 year ago +7

      Thanks for the kind words! Yeah we will be posting a lot more & hope to create more quality stuff.

    • @newman6492
      @newman6492 1 year ago +2

      Yes.

  • @vladd3172
    @vladd3172 11 months ago +6

    Clean, clear, efficient. ❤
    I’d love to see more videos like this from you!

    • @interviewpen
      @interviewpen  11 months ago

      Will do, thanks for watching!

  • @carlboneri7772
    @carlboneri7772 1 year ago +31

    One of the best walkthroughs I've ever seen, regardless of the topic or technical depth. Superb work man.

    • @interviewpen
      @interviewpen  1 year ago

      thanks for commenting & the nice words - more videos coming soon!

  • @lucasoliveira-xs5yh
    @lucasoliveira-xs5yh 1 year ago +30

    Awesome content! I liked seeing data structures (such as queues and heaps) used in practice; simple examples are good in the beginning, but they are not enough over time. Keep this up, this channel is a real hidden gem.

  • @williefr
    @williefr 1 year ago +14

    I really enjoyed the video! Thank you guys for taking your time and posting it, it was very entertaining and educational. Best regards

  • @dmitrydmitriev2554
    @dmitrydmitriev2554 1 year ago +4

    Greetings, I just came to YouTube to watch a video about SQL optimization and your channel was recommended. I started to watch this video, and it is amazing: the way you explain is brilliant and outstanding. Very clear, full of information, not boring from being too obvious, not difficult from being too sophisticated and convoluted - a golden mean.
    Thank you!

    • @interviewpen
      @interviewpen  1 year ago

      thanks for the kind words! and thanks for watching - more coming

  • @raghuboyapati7311
    @raghuboyapati7311 1 year ago +4

    This channel is gonna explode. The content is just too good. Thank you.

    • @interviewpen
      @interviewpen  1 year ago

      Thanks for watching - we'll be posting a lot more!

  • @JM_utube
    @JM_utube 10 months ago +1

    I really appreciate this video! Information was clear and concise. Levels of depth are perfect for the viewer to be able to continue educating themselves about any of the topics mentioned here. Thank you so much

    • @interviewpen
      @interviewpen  10 months ago

      Sure - thanks for watching!

  • @artemvolsh387
    @artemvolsh387 1 year ago +16

    Channel currently hugely underrated, material is just delicious, especially for those who seek examples of complex system schemes.
    Love it.

    • @interviewpen
      @interviewpen  1 year ago +1

      Thanks! We have more coming - production starts this week!

    • @artemvolsh387
      @artemvolsh387 1 year ago

      @@interviewpen Great to hear!

  • @SlaHu.
    @SlaHu. 10 months ago

    woow loved it
    By 1:48 I was in love, because you laid out every small detail required plus the planning. This makes understanding a lot easier than jumping directly into code and explaining on the go.

  • @BraisonsCrece
    @BraisonsCrece 1 year ago +2

    keep it going!
    High quality content and a very solid platform! Without a doubt, I will buy the subscription soon and start learning!
    a hug from a new Spanish subscriber

    • @interviewpen
      @interviewpen  1 year ago +1

      cool! Thanks for watching. Let us know in Discord if u need any help.

  • @juanitoMint
    @juanitoMint 3 months ago

    Really appreciate the back-of-the-envelope calculations in between!
    Great work!

    • @interviewpen
      @interviewpen  3 months ago +1

      Thanks, glad you enjoyed it!

  • @MrRetroboyish
    @MrRetroboyish 9 months ago +3

    Only a 1/3 of the way through and already one of the best I've seen. Focused, logical leaps from topic to topic, minimal digressions. Keep it up

    • @interviewpen
      @interviewpen  9 months ago

      Thanks! and thanks for watching

  • @marko3808
    @marko3808 1 year ago +2

    This is amazing! I honestly can't wait to look into your other videos!

    • @interviewpen
      @interviewpen  1 year ago +1

      More videos coming! Thanks for watching.

    • @marko3808
      @marko3808 1 year ago +1

      @@interviewpen eagerly waiting!

  • @strawberriesandcream2863
    @strawberriesandcream2863 6 months ago

    amazing video, thanks 👏👏 I like how you guys dig deep into complex aspects of every system that some other content just glosses over

  • @johnny_silverhand
    @johnny_silverhand 1 year ago +3

    Exceptional way of explaining things , I'm subscribed to you guys now

  • @linonator
    @linonator 1 year ago +2

    Wooooah!!!! This is what I needed in my life 😢. I’m now complete

  • @marwanezzat2637
    @marwanezzat2637 1 year ago +31

    Dude, your content quality is superior, Keep going.
    Yesterday you had 145 subscribers and now you have 245 i am so happy for you.

  • @greed7513
    @greed7513 1 year ago +38

    lmao if this was the interview question I'd just not do it.. but I'm not there yet

    • @interviewpen
      @interviewpen  1 year ago +6

      we'll get u there 🧎🧎 be brave

    • @theuniverse2268
      @theuniverse2268 1 year ago

      ​@@interviewpen only slaves need to do this
      It's not worth it 🤷‍♂️ If you know how to build a search engine you're already the top 1% of the human population just make your own company and forget about a job lol

  • @chandrasekharmandapalli9181
    @chandrasekharmandapalli9181 1 year ago +1

    Great work buddy....very detailed explanation... cheers

  • @s8x.
    @s8x. 1 year ago +1

    Wow, this is information all for free. Thank you for making this video

  • @henrythomas7112
    @henrythomas7112 3 months ago

    I extremely like the video, man. Very helpful and informative. Thank you very much. It is presented so well too. Great, positive work.

  • @andydataguy
    @andydataguy 1 year ago +29

    Your course looks great. I love that you have a teaching assistant and the explaining styles are awesome. Will try it out! Only thing is I really wish you supported Rust 🦀🙏🏾

    • @interviewpen
      @interviewpen  1 year ago +5

      Thanks! We can add language support in under an hour. (from the engineering angle) We can push changes in a day. Just let us know in Discord.

    • @timSquash
      @timSquash 1 year ago

      Yeah, I've just started learning Rust. It's such a cool language.

  • @maharshiguin7813
    @maharshiguin7813 1 year ago +2

    Great video, really like your way of explaining stuff.

  • @frankguo1748
    @frankguo1748 3 months ago

    Really clear, concise and efficient explanation and narrative. 👍

  • @sinnloses746
    @sinnloses746 9 months ago

    Second Video I watch from you. It’s so good. thank you

  • @Roshen_Nair
    @Roshen_Nair 11 months ago +2

    Loved the video! A video I'd love to see in the future is system design for video streaming applications e.g. YouTube, Netflix.

    • @interviewpen
      @interviewpen  11 months ago +1

      Will do - thanks for watching

  • @gmanonDominicana
    @gmanonDominicana 11 months ago +2

    I was looking for something like this for a while. This content is worth the time spent.

  • @yipmong
    @yipmong 25 days ago

    I am impressed, you really deserved my sub❤

  • @pankaj.pilkhwal
    @pankaj.pilkhwal 1 year ago +1

    really wow!!!!!!! amazing content.

  • @rockosaji9400
    @rockosaji9400 1 year ago +2

    Wow...Super impressed

    • @interviewpen
      @interviewpen  1 year ago

      Thanks! A lot more coming! We will be posting consistently.

  • @Sgene9
    @Sgene9 1 year ago +1

    This was amazing. Now I want to try build a search engine!

  • @govardhannarayan3907
    @govardhannarayan3907 10 months ago

    Great video..
    Keep it up folks.

  • @ekanshmishra4517
    @ekanshmishra4517 1 year ago +4

    Never saw such a difficult problem explained so easily❤️ subscribed instantly
    Love from India❤

  • @christhornham
    @christhornham 10 months ago

    Outstanding! Thank you!

  • @premparihar
    @premparihar 1 year ago +1

    The video is really awesome and helpful ❤.

  • @notenlish
    @notenlish 17 days ago

    Great video man, wish I had found this before

  • @sperpflerperberg8147
    @sperpflerperberg8147 1 year ago +3

    This channel is amazing

  • @syn3rman65
    @syn3rman65 1 year ago +1

    Holy shit I'm glad I found this before you've blown up 🙌

  • @user-mm4mv9cb2w
    @user-mm4mv9cb2w 9 months ago

    It was an incredibly detailed explanation

  • @MuscleTeamOfficial
    @MuscleTeamOfficial 1 year ago +2

    This is high quality content.

  • @FranciscoGomez-tw1ii
    @FranciscoGomez-tw1ii 1 year ago +2

    Amazing!!!

  • @eazypeazy8559
    @eazypeazy8559 1 year ago +1

    cool guide, thanks

    • @interviewpen
      @interviewpen  1 year ago

      sure - thanks for watching, more videos coming

  • @FeyroozeCode
    @FeyroozeCode 7 months ago

    Very Simple and Good

  • @khuntasaurus88
    @khuntasaurus88 1 year ago +1

    Well that's an instant sub!!

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 9 months ago

    This is great content. Regarding shingles, that takes a LOT to implement - lots of space and lots of CPU to compare them. The idea of the personalized recommendations is a huge success Google has and is surely difficult to implement considering the entire search, rank (personalize) and retrieve has to be done in a second.

    • @interviewpen
      @interviewpen  9 months ago

      Thanks! You're exactly right--Google has built an incredibly impressive system :)

  • @Vinod_Kumar827
    @Vinod_Kumar827 10 months ago

    Very nicely explained

  • @basharatwani3948
    @basharatwani3948 8 months ago

    Thank you for sharing; good content and good work. I suggest starting with the core functional and non-functional requirements, then the capacity planning numbers and the reads/writes per second needed to support the core functional needs. Otherwise it seems we go straight into the solution, which is OK, but some may want to know how to think through the ambiguity and the problem space first and have a conversation with the interviewer about what we want to do. Maybe also consider covering copyright issues when extracting and rendering HTML, a dedupe service and Bloom filter, how nested cyclic loops in a site will be handled, caching strategy, etc.

    • @interviewpen
      @interviewpen  8 months ago

      Thanks for watching. You're right, addressing the requirements ahead of time is very important in this process, and our more recent videos tend to be better about that :)

  • @dibll
    @dibll 1 year ago +1

    Informative video! Very nicely explained. Could you pls do one on distributed key/value stores?

    • @interviewpen
      @interviewpen  1 year ago

      thanks for watching - yes that's in our backlog

  • @dave6012
    @dave6012 9 months ago

    Dang, I never thought I could understand this whole process. I typically wrote off most of the implementation details as a black box, but this seems halfway approachable.
    Has me thinking a lot about single page applications, and how the crawlers handle them. A similar type of video would be awesome if you had it.

    • @interviewpen
      @interviewpen  8 months ago

      Glad you liked it! Yes, SPAs are notoriously hard to optimize for crawlers. However, strategies like static rendering and routing can make SPAs look more like typical websites to a crawler. I'm not an SEO expert though :)

    • @dave6012
      @dave6012 8 months ago

      @@interviewpen haha I appreciate the legal disclaimer

  • @CertificationTerminal
    @CertificationTerminal 9 months ago

    Awesome!

  • @mus_g117
    @mus_g117 8 months ago

    nice content thank you

  • @SaveCount-bh8tp
    @SaveCount-bh8tp 6 days ago

    Your Channel is very good

  • @theprovego2934
    @theprovego2934 8 months ago

    2:00 This is how to make an ad, good job!

  • @danielghani3903
    @danielghani3903 1 year ago +1

    Thank you, ma'am

  • @VermeilChan
    @VermeilChan 1 year ago +1

    The amount of time u put in this video is crazy 😭
    Keep it up 😼😼

  • @maksym7703
    @maksym7703 1 year ago +2

    Man, this is such good content. Who are you personally, by the way?)

    • @interviewpen
      @interviewpen  1 year ago +1

      The instructor is named Bobby - I am Benyam, I do our Data Structures & Algorithms. Thanks for watching.

  • @andrewkamoha4666
    @andrewkamoha4666 1 year ago +1

    Piece of cake !!!

  • @dombat44
    @dombat44 7 months ago

    Great content, yours are the best system design interview mocks I've seen on here. Could you do one on a RSS feed website?

    • @interviewpen
      @interviewpen  7 months ago

      Thanks! Sure, we'll add it to the backlog :)

  • @jeromeeusebius
    @jeromeeusebius 1 month ago

    Thank you for sharing the great design prep video. What tool or combination of tools/software is used to create the figures (with the black background)? Thanks

    • @interviewpen
      @interviewpen  1 month ago

      Thanks for watching! We use GoodNotes on an iPad.

  • @savanpatel4938
    @savanpatel4938 1 year ago +1

    awesome

    • @interviewpen
      @interviewpen  1 year ago

      thanks for watching - more videos coming soon!

  • @ahmad-ali14
    @ahmad-ali14 1 year ago +1

    Thanks

  • @langtuyetvuanh1999
    @langtuyetvuanh1999 1 year ago +1

    Great video, but can I ask: can we use Elasticsearch instead? I'm not a professor, but I see a lot of systems using Elasticsearch to optimize their query performance.

    • @interviewpen
      @interviewpen  1 year ago +1

      Glad you liked it! ElasticSearch actually uses a very similar data structure to the "text index" we described, and this could certainly be swapped out for our database in this system. It's just about tradeoffs between ease of use in a managed service and flexibility.
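
The "text index" mentioned in this reply can be pictured as an inverted index: a map from each word to the pages that contain it, along with how often the word occurs on each page. Below is a minimal in-memory sketch. The `TextIndex` class, its method names, the regex tokenizer, and the example URLs are all assumptions made for illustration; the video does not specify an implementation.

```python
import re
from collections import Counter, defaultdict


class TextIndex:
    """Toy inverted index: word -> {url: frequency}. All names are illustrative."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_page(self, url, content):
        # Count how often each lowercase word occurs in the page content.
        counts = Counter(re.findall(r"[a-z0-9]+", content.lower()))
        for word, freq in counts.items():
            self.postings[word][url] = freq

    def search(self, word, limit=10):
        # Return URLs containing the word, highest frequency first
        # (ties broken alphabetically so results are deterministic).
        hits = self.postings.get(word.lower(), {})
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        return [url for url, _ in ranked[:limit]]


index = TextIndex()
index.add_page("a.example/cats", "Cat facts: a cat is a small cat.")
index.add_page("b.example/pets", "A cat and a dog.")
results = index.search("cat")
```

Here `results` lists the page that mentions "cat" three times ahead of the page that mentions it once. A production system would partition the postings by word across many nodes rather than hold them in one dictionary.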

  • @rembautimes8808
    @rembautimes8808 2 months ago

    One application of this solution is for horizon risk scanning. The use case is that a large multinational corporation wants to have an idea of new risks which are emerging and adopting this approach allows them to have a traceability back to the web source. Of course they won’t be crawling 100M pages but maybe 100k pages.

    • @interviewpen
      @interviewpen  2 months ago

      Interesting! Thanks for watching :)

  • @edgararrizon5736
    @edgararrizon5736 7 months ago

    what are you using to draw on and the software to make this? i find it super helpful and would like to make my own videos using it, thank you

    • @interviewpen
      @interviewpen  7 months ago

      Cool, we're using GoodNotes on an iPad. Thanks!

  • @moacir8663
    @moacir8663 1 year ago +2

    I'd like to watch a deeper explanation of how to search for data in a sharded database like the one you explained.

    • @interviewpen
      @interviewpen  1 year ago +2

      we'll cover sharding in-depth soon! thanks for watching!

    • @moacir8663
      @moacir8663 1 year ago

      @@interviewpen I'm looking forward to watching it.

  • @satyamkumaryadav1560
    @satyamkumaryadav1560 1 year ago +3

    Which app are you using for writing?
    BTW quality content 👌🏿

  • @ShueFig
    @ShueFig 1 year ago +1

    recognised the B2B SWE voice :)

  • @amigos786
    @amigos786 1 year ago +1

    Hey awesome video. Just subd. What is the app you are using in ipad for this?

  • @wayneisthebestable
    @wayneisthebestable 8 months ago

    Great video, but I'm curious: is it really necessary to sort by the frequency of a word in the URL?
    I think most well-designed URLs won't have a keyword like "cat" appear more than once.
    Also, if there's both "cat" and "dog" in a URL, should I have two records for that URL?

    • @interviewpen
      @interviewpen  8 months ago

      No, we're searching the content of the pages here, not the url. Thanks for watching!

  • @nathantablang2705
    @nathantablang2705 1 year ago +2

    oh my goodddddddddeddd

  • @chenhaofeng4842
    @chenhaofeng4842 3 months ago

    Really appreciate it. I have several questions about the politeness part. If there are 10k hosts, are we supposed to have 10k queues for politeness? Say one host has only 3 URLs: after all 3 URLs are visited, are we supposed to delete the idle queue? And each time we have a new host, are we supposed to create a new queue?

    • @interviewpen
      @interviewpen  2 months ago +1

      Yep, we'd need one queue for each host. There'd probably be far more than 10k in fact! Of course, these would simply be logical partitions residing on a far smaller set of physical machines. We would need to add a queue when a host is visited for the first time (this would be trivial since a queue is just a logical abstraction), but we probably wouldn't need to worry about deleting since we'll keep re-crawling hosts. Hope that helps!

  • @NitinVarmaManthena
    @NitinVarmaManthena 2 months ago

    What software do you use for the UI for the workflow and to highlight pen?

    • @interviewpen
      @interviewpen  2 months ago

      We use GoodNotes on an iPad. Thanks!

  • @ShubhamSharma-lp1ng
    @ShubhamSharma-lp1ng 1 year ago +1

    woooooow

    • @interviewpen
      @interviewpen  1 year ago

      thanks for watching - more videos coming soon!

  • @TarrenHassman
    @TarrenHassman 8 months ago +1

    Also important to remember that search engines are moving to Vector databases with machine learning matrixes

  • @cankuter
    @cankuter 5 months ago

    Very nice walkthrough, appreciate the effort. I have a question though, maybe a stupid one. I didn't quite get whether "heap" means the heap data structure or the heap as a general memory space, as it is called in Java. If it's the data structure, wouldn't it be very inefficient to search for the correct pointer for the politeness queue you are looking for? From your explanation I am inferring that this heap is more like a memory space and works more like a hash map. Is this correct?

    • @interviewpen
      @interviewpen  5 months ago

      We did mean the heap data structure--this works very efficiently here since the earliest timestamp will always be at the top of the heap. The heap just tells us which politeness queue to look at next; no searching necessary. Thanks!
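
The heap described in this reply can be sketched as follows. This is a simplified illustration rather than the video's implementation: the class name `PolitenessSelector`, its methods, and the fixed `delay` argument are assumptions made for the example. The heap stores `(ready_time, host)` pairs, so the host that may be contacted soonest is always at the root and no searching is needed.

```python
import heapq
from collections import deque


class PolitenessSelector:
    """Sketch: one FIFO queue per host plus a min-heap of (ready_time, host)."""

    def __init__(self):
        self.queues = {}  # host -> deque of URLs waiting to be crawled
        self.heap = []    # (earliest time the host may be contacted, host)

    def add_url(self, host, url, now):
        if host not in self.queues:
            # First URL for this host: create its queue and register the host.
            self.queues[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def next_url(self, now, delay):
        # The heap root is always the host with the earliest ready time.
        while self.heap and self.heap[0][0] <= now:
            _, host = heapq.heappop(self.heap)
            queue = self.queues[host]
            if not queue:
                # Nothing pending for this host; add_url re-registers it later.
                del self.queues[host]
                continue
            # Do not contact this host again before now + delay.
            heapq.heappush(self.heap, (now + delay, host))
            return queue.popleft()
        return None  # no host may be crawled politely right now


selector = PolitenessSelector()
selector.add_url("a.example", "a.example/1", now=0)
selector.add_url("a.example", "a.example/2", now=0)
selector.add_url("b.example", "b.example/1", now=0)
first = selector.next_url(now=0, delay=10)
second = selector.next_url(now=0, delay=10)
third = selector.next_url(now=0, delay=10)
```

In this run `first` comes from `a.example`, `second` from `b.example`, and `third` is `None` because both hosts are still inside their delay window. Each selection costs O(log n) in the number of hosts.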

  • @dibll
    @dibll 1 year ago +3

    Could someone pls explain what text and hash indexes are? Are they separate DBs storing partial information compared to the main DB or something else? Thanks!

    • @interviewpen
      @interviewpen  1 year ago

      You're exactly right. You can think of global indexes as a copy of the database but organized onto nodes differently, and the records generally only include enough data to be able to look up the corresponding record in the primary.

  • @yourlogarithm8607
    @yourlogarithm8607 9 months ago

    Could you explain something I'm confused about at 13:35? When the router selects an element from the priority queue, it adds it to the politeness queue. By doing that, wouldn't we lose the initial prioritization, given that the politeness queues are sorted just by domain?

    • @interviewpen
      @interviewpen  9 months ago

      Sure. The router uses a weighted random algorithm to select a priority queue, so the higher priority queues are more likely to be selected. This ensures that higher priority pages are crawled more frequently, regardless of what politeness queue they end up in. Thanks!
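
One way to picture the weighted random selection described in this reply is the sketch below. The function name `pick_priority_queue`, the two-queue setup, and the 9:1 weights are illustrative assumptions; the reply does not give concrete weights.

```python
import random
from collections import deque


def pick_priority_queue(queues, weights, rng=random):
    """Return the index of a non-empty queue, favoring heavier weights."""
    candidates = [i for i, queue in enumerate(queues) if queue]
    if not candidates:
        return None  # every priority queue is empty
    candidate_weights = [weights[i] for i in candidates]
    # random.choices draws with probability proportional to the weights.
    return rng.choices(candidates, weights=candidate_weights, k=1)[0]


# Queue 0 is high priority, queue 1 is low priority (hypothetical URLs).
queues = [deque(["news.example/"]), deque(["blog.example/old-post"])]
weights = [9, 1]
chosen = pick_priority_queue(queues, weights)
```

Because the draw is random rather than strict, low-priority queues are still served occasionally and cannot starve, while high-priority pages are picked far more often.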

  • @renanmonteirobarbosa8129
    @renanmonteirobarbosa8129 10 months ago

    Only if it was this simple hahahahaha But it is really cool to see the thought process and the basic mind map

  • @nikitaluparev6478
    @nikitaluparev6478 3 months ago

    While explaining the schema, you mentioned a hash as a way to make sure something is unique. Can you explain in detail how a hash helps with that?

    • @interviewpen
      @interviewpen  3 months ago

      Sure--hashing a large piece of data (such as a webpage) yields a far shorter, fixed-length string that, for practical purposes, uniquely represents that data and can be stored in a database. By checking if this hash already exists in our database, we can effectively check if the webpage has already been seen without having to compare the page content against petabytes of other pages.
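
The duplicate check described in this reply might look like the sketch below. SHA-256 is an assumed choice (the video does not name a hash function), and the `SeenPages` class with its in-memory set is a hypothetical stand-in for the hash index in the database.

```python
import hashlib


class SeenPages:
    """Sketch of content deduplication by hash; a set stands in for the database."""

    def __init__(self):
        self.hashes = set()

    def is_new(self, content: bytes) -> bool:
        # A SHA-256 digest is 64 hex characters regardless of page size.
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.hashes:
            return False  # identical content was already stored
        self.hashes.add(digest)
        return True


seen = SeenPages()
first_visit = seen.is_new(b"<html>hello</html>")
second_visit = seen.is_new(b"<html>hello</html>")
```

Note that an exact hash only catches byte-identical pages; a page that differs by a single character gets a different digest.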

  • @dzuchun
    @dzuchun 1 year ago +1

    I'm having trouble finding the shingles technique the author mentioned close to the end. Can anyone give some sort of reference?

    • @interviewpen
      @interviewpen  1 year ago

      Thanks for watching! It's a bit math heavy but here's a reference for shingling: nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
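
For readers who want a feel for shingling before reading the reference, here is a small sketch: each document becomes the set of its overlapping k-word sequences (shingles), and two documents are compared by the Jaccard similarity of those sets. The function names, the choice of k = 3, and the whitespace tokenizer are assumptions for illustration; large-scale systems usually compare compact sketches of the shingle sets instead of the full sets.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(page_a), shingles(page_b))
```

The two example sentences differ by one word, so they share 4 of 10 distinct shingles and `similarity` comes out at 0.4. A crawler could treat pages above some chosen threshold as near-duplicates.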

  • @user-ok1up9sw5b
    @user-ok1up9sw5b 8 months ago

    kindly make a video systems design for algorithms

  • @scottthornton4220
    @scottthornton4220 4 months ago

    Love the video but I'm perplexed as to why you want to store the site contents. I figure that you would just scrape it for word frequencies for matching later to queries?

    • @interviewpen
      @interviewpen  4 months ago +1

      Good question--we store the site contents so we don't have to scrape them again later if we want to change our algorithms. Google does this too! Thanks for watching.

  • @tofahub
    @tofahub 1 year ago +7

    How does sorting by frequency give us the most popular results? The frequency is the number of times the word occurs at that specific URL. A word may appear on that page many times simply because it is a common word, which doesn't make the page the most popular search result.

    • @interviewpen
      @interviewpen  1 year ago +6

      You're completely right! Google uses the PageRank algorithm in addition to a more advanced index to handle that--we glossed over this for our "basic" search engine since it's more of an algorithms problem than a system design one. Regardless, there's some cool infrastructure that goes into calculating PageRank at scale so that's certainly something to look into if you're curious. Thanks for watching!

    • @esm2000
      @esm2000 1 year ago +5

      ironically sorting by frequency was the original implementation of the page rank algorithm, long before it became more advanced

    • @H3llsHero
      @H3llsHero 1 year ago

      You can look up tf-idf (term frequency-inverse document frequency) to learn more about how common "filler" words are filtered out in a basic search engine.
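
As a rough illustration of tf-idf, the sketch below scores each word by its frequency within a document multiplied by the logarithm of how rare it is across documents. This is one common variant among several; the function name `tf_idf` and the two tiny example documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: list of words}. Returns {doc_id: {word: score}}."""
    n = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        scores[doc_id] = {
            # term frequency * inverse document frequency
            word: (count / len(words)) * math.log(n / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


docs = {
    "d1": ["the", "cat", "sat"],
    "d2": ["the", "dog", "ran"],
}
scores = tf_idf(docs)
```

Since "the" appears in every document, its inverse document frequency is log(1) = 0 and it contributes nothing to ranking, while "cat" and "dog" keep a positive score.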

  • @TungLe-mm7eo
    @TungLe-mm7eo 6 months ago

    what is the tool you are using for presentation? thank you

    • @interviewpen
      @interviewpen  6 months ago

      We're using GoodNotes on an iPad. Thanks!

  • @andrecorreia8568
    @andrecorreia8568 8 months ago

    Thanks, great video but I have 1 comment. You are saying that you are going to cache the robots.txt file. How does Google system then know that the robots.txt was updated? From what you mentioned, you always take it from cache as long as it is there but you didn't mention cache invalidation.

    • @interviewpen
      @interviewpen  8 months ago

      Thanks for watching! Really good point--in this system it’s not critical for the robots.txt to be constantly up to date, but there definitely should be some TTL set in the cache to make sure the data is re-fetched periodically.
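
The TTL idea in this reply could be sketched as below. The `RobotsCache` class, the injected `fetch` and `clock` callables, and the 24-hour default TTL are assumptions for illustration; the reply only says that some TTL should exist.

```python
import time


class RobotsCache:
    """Sketch of a robots.txt cache whose entries expire after a TTL."""

    def __init__(self, fetch, ttl_seconds=86400, clock=time.monotonic):
        self.fetch = fetch        # callable: host -> robots.txt text
        self.ttl = ttl_seconds    # assumed default of 24 hours
        self.clock = clock        # injectable so expiry can be tested
        self.entries = {}         # host -> (expires_at, robots_txt)

    def get(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]       # still fresh: serve from cache
        body = self.fetch(host)   # missing or expired: fetch again
        self.entries[host] = (now + self.ttl, body)
        return body


fetch_log = []
fake_time = [0.0]


def fake_fetch(host):
    # Stand-in for an HTTP request to https://<host>/robots.txt.
    fetch_log.append(host)
    return "User-agent: *\nDisallow: /private"


cache = RobotsCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_time[0])
cache.get("a.example")
cache.get("a.example")          # served from cache, no second fetch
fetches_before_expiry = len(fetch_log)
fake_time[0] = 61.0
cache.get("a.example")          # TTL elapsed, fetched again
fetches_after_expiry = len(fetch_log)
```

The trade-off is the usual one for caches: a longer TTL means fewer fetches but a longer window in which a changed robots.txt is ignored.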

  • @shs4293
    @shs4293 1 year ago +1

    Instead of sharding right off the bat, you could use partitioning. Sharding should be the final resort.

    • @interviewpen
      @interviewpen  1 year ago +1

      Good point, but 31TB of metadata is a lot to store on one node so it's necessary in this case to scale horizontally. Our query patterns work very nicely here (always single-record reads/writes by a unique key), so it shouldn't be a problem. Thanks for watching!
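
The reason single-record access by a unique key shards so well is that the key alone determines which node holds the record. A minimal sketch follows, assuming the page URL is the shard key and using simple hash-modulo placement; `shard_for` and `ShardedStore` are hypothetical names, and a real deployment might prefer consistent hashing so that adding shards moves less data.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key (assumed here to be the page URL) to a shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy stand-in for the sharded metadata database: one dict per shard."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, url, record):
        self.shards[shard_for(url, len(self.shards))][url] = record

    def get(self, url):
        # A lookup touches exactly one shard, never all of them.
        return self.shards[shard_for(url, len(self.shards))].get(url)


store = ShardedStore(num_shards=4)
store.put("a.example/", {"title": "A"})
record = store.get("a.example/")
```

Queries that cannot be answered from the key, such as searching by word, are what the separate global indexes are for.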

  • @youknowkbbaby
    @youknowkbbaby 1 year ago +1

    If I struggled with basic math word problems like dimensional analysis, can I do this?

    • @interviewpen
      @interviewpen  1 year ago

      Sure, it's just problem solving and thinking about the solution from different angles. Keep watching, we'll get you there!

  • @tirthdoshi1337
    @tirthdoshi1337 6 months ago

    Can someone explain how the priority queue really works for choosing the next element in the queue? Is it like a min priority queue, where the top element has the minimum time, and we compare the current time with that minimum time, process the element, and then multiply the rendering time by 10 and put it back into the queue and the priority queue? In that case, if 2 elements have the same time in the priority queue, how do we choose which one to pick?

    • @interviewpen
      @interviewpen  6 months ago

      Yep you got it right, we’re looking for the earliest timestamp. If two elements have the same timestamp, it doesn’t matter which one we pick. Thanks!

  • @kayeshparvez
    @kayeshparvez 11 months ago

    are we going to remove the blob after creating hash index and word index ?

    • @interviewpen
      @interviewpen  11 months ago

      It depends on the requirements of the system, but in this case we'll keep the BLOB around. This is helpful since there's so much overhead involved in scraping sites--for example if we decided to change our indexing algorithm, we could do so from the saved BLOBs without having to re-crawl every page. Google does this too--in fact you can view Google's copy of a page by clicking the "cached" link on a search result. Thanks for watching!

  • @darkwoodmovies
    @darkwoodmovies 7 months ago +1

    The fact that when you crunch the numbers, the metadata is only

  • @eikodunn
    @eikodunn 1 year ago +1

    ⭐️⭐️⭐️⭐️⭐️

    • @interviewpen
      @interviewpen  1 year ago +2

      thanks for watching! more videos coming

  • @rushio8673
    @rushio8673 9 months ago

    Please explain how the prioritizer works here

    • @interviewpen
      @interviewpen  9 months ago

      Sure. There's a number of algorithms we could implement here, but the general idea is to analyze the page and how frequently it changes to determine how frequently to crawl it. The prioritizer will take in all the data and insert the page into the correct queue based on its calculated priority. Thanks for watching!

  • @Tony-dp1rl
    @Tony-dp1rl 7 months ago

    Seems like a huge amount of complexity to avoid crawlers hitting the same URL. I would take the approach that they will rarely select the same URL anyway, so just have at it and wear that occasional doubling up for the massive speed increase it gives you on the 99% case - especially given the huge number of URLs something like google must be crawling.

    • @interviewpen
      @interviewpen  7 months ago

      When the crawler discovers a new site, it's pretty likely that several pages on that site would line up close together in the URL frontier. At the scale of thousands of crawlers, we'd basically be DDOSing every new site! But you're absolutely right that it's an important tradeoff to consider. Thanks!

  • @qingrex
    @qingrex 1 year ago +1

    🎉Great video🎉May I try to up a Chinese CC? It s useful to someone under me❤

    • @interviewpen
      @interviewpen  1 year ago

      sure - what is an email we can use to add you as a CC moderator?

  • @yuganderkrishansingh3733
    @yuganderkrishansingh3733 1 year ago +2

    I don't think the schema design for the query pattern "Search for a word" is included. The video says there is a text index but I don't see "word" or "frequency" at ua-cam.com/video/0LTXCcVRQi0/v-deo.html
    I think the schema needs to include these so that the index automatically creates a table on top of them.
    Also, regarding the Router routing URLs to the correct queue: it's mentioned that if there is no queue corresponding to the domain then the URL will be added to an "empty" queue. But then what about updating the Heap and Selector?
    Also, the mapping of a domain to a queue has to be stored somewhere, most likely in a Redis cache, as it seems to change a lot when a queue becomes empty.

    • @interviewpen
      @interviewpen  1 year ago +2

      1. The "site content" field in the schema should hold the full text of the site, so words and their associated frequencies can be calculated when records are added/updated, and this data is what propagates to the text index.
      2. Yep, when a new host is added to the second set of queues, the router is responsible for adding that host to the heap so the selector knows about it.
      3. The host-to-queue mapping would be stored in the router, that way the router is able to quickly check which queue the next URL should be added to. It's worth noting that the router is low-traffic enough (

    • @yuganderkrishansingh3733
      @yuganderkrishansingh3733 1 year ago

      ​@@interviewpen for the point 1, you mentioned that the word and frequency is calculated when a record is added or updated. But then also it needs the corresponding attributes so that it can be added to Databased when record is added or updated.
      As per timestamp 3:32 the schema doesn't contain word or frequency. Am I missing something? It might be something dumb apologies.

  • @CanRau
    @CanRau 5 months ago

    Is there some kind of open dataset to get the database going without having to crawl the whole web from 0?

    • @interviewpen
      @interviewpen  5 months ago +1

      There is! Check out www.commoncrawl.org/ (just one example)

    • @CanRau
      @CanRau 4 months ago

      ​@@interviewpenooooh that's incredible thank you so much 🙏🥰

  • @bestsagittarius7925
    @bestsagittarius7925 10 months ago

    Wonderful and practical reference for me!

  • @69k_gold
    @69k_gold 1 year ago +3

    Bro developed Google Search in 19 minutes