System Design distributed web crawler to crawl Billions of web pages | web crawler system design

  • Published 17 Nov 2024

COMMENTS • 200

  • @tabishnaqvi5748
    @tabishnaqvi5748 19 days ago

    I'm not even here to study for an interview, just watching this for my passion project. Very helpful.

  • @Sarah-il5dr
    @Sarah-il5dr 4 years ago +24

    Guys, please like the video. As an engineer I know how much hard work goes into a video like this. This is my go-to system design resource. Great work!

    • @Sarah-il5dr
      @Sarah-il5dr 4 years ago +1

      Goutam Singh well, some of the audience might be “engineer-to-be”😉

    • @ghisskartadchoo3618
      @ghisskartadchoo3618 1 year ago

      Fake comments

  • @arvindaaswani1303
    @arvindaaswani1303 4 years ago +51

    Awesome explanation. As an engineer I know how much hard work goes on behind the scenes. Really appreciate it 👏

  • @sumonmal009
    @sumonmal009 3 years ago +22

    estimation 5:30
    HLD 6:33
    queue manage 25:30
    update and duplicate handle 33:40
    Sim hash 39:26
    storage 42:00

  • @petar55555
    @petar55555 2 years ago +12

    Great in-detail system design. The only part I would probably skip is the heap (each queue is already tied to a thread/worker), as it looks more like a bottleneck and serves only as a timer to slow down the crawling for politeness, which can be done in different ways.

    • @aarushjuneja6640
      @aarushjuneja6640 2 years ago +1

      I was also thinking along the same lines.

    • @NANDINIGOEL
      @NANDINIGOEL 10 months ago

      Think of it this way: URLs are queued first by priority (front queues) and then by host (back queues), but once they are split by host you no longer know which host to handle first; the priority is lost. So the heap is filled with the head element of each back queue, and URLs are downloaded in priority order while politeness is preserved. "Merge k sorted arrays" is a good pointer for this. There is no point in locking a thread to each queue, because then priority holds only within a queue, not across all of them. Say one host has all the priority-100 URLs and the others have priorities 1-99; that host should not monopolize the crawl unless we implement something like a nice call to raise priority and avoid starvation through aging.
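
A minimal sketch of the heap-based selector this thread is debating, in the spirit of merging k sorted queues. The per-host queues, the 6-second delay, and all names are illustrative assumptions, not the video's exact design:

```python
import heapq
import time
from collections import deque

# One FIFO back queue per host (illustrative contents).
back_queues = {
    "a.com": deque(["a.com/1", "a.com/2"]),
    "b.com": deque(["b.com/1"]),
}
POLITENESS_DELAY = 6.0  # assumed seconds between hits to the same host

# Heap entries: (next_allowed_fetch_time, host); earliest-ready host on top.
heap = [(0.0, host) for host in back_queues]
heapq.heapify(heap)

def next_url():
    """Pop the host that may be crawled soonest; the heap doubles as the
    politeness timer the parent comment mentions."""
    while heap:
        ready_at, host = heapq.heappop(heap)
        delay = ready_at - time.time()
        if delay > 0:
            time.sleep(delay)
        if back_queues[host]:
            url = back_queues[host].popleft()
            # Re-arm the host with its next allowed fetch time.
            heapq.heappush(heap, (time.time() + POLITENESS_DELAY, host))
            return url
    return None  # all queues drained
```

Keying the heap by next-allowed-fetch time is what enforces politeness globally; a secondary priority key could break ties across hosts, which is the point the reply above makes.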

  • @adamhughes9938
    @adamhughes9938 4 years ago +117

    Makes me sad that this dude crams so much amazing content into these videos and gets 42k views, but the dumbest 10-second videos get millions of views...
    I wish YouTube had a notion of content score and quality.

    • @junjiechen7341
      @junjiechen7341 3 years ago

      ikr! too much going on to be fully appreciated in his vids.

    • @warriorgeneral2735
      @warriorgeneral2735 3 years ago +7

      Hey it totally depends on what people are interested in...

  • @TheMdaliazhar
    @TheMdaliazhar 4 months ago

    Thanks for this. Most detailed design. No other YouTuber explained exactly how the URL Frontier works.

  • @monikaa8230
    @monikaa8230 3 years ago +5

    I have a suggestion to include two things in your videos which will definitely help:
    1. QPS Calculation
    2. Sharding key when we are planning to shard the DB

  • @ksenthu
    @ksenthu 4 years ago +13

    The most detailed and clear crawler design content I've seen. Thanks for doing this. It would be great if you could also clarify how the data transitions happen between the various services, such as the Extractor, Duplicate Detection, URL filter and Loader.

  • @PeterParker-vn2hv
    @PeterParker-vn2hv 2 years ago +1

    Narendra, thank you for this excellent video. Much appreciated.

  • @iitgupta2010
    @iitgupta2010 5 years ago +8

    Finally you've started building the video in an actual flow; that's really great and will really help viewers understand and build actual knowledge of SD. Great, bro.

  • @jessica-mx5pw
    @jessica-mx5pw 1 year ago +1

    Thank you for the video! This was by far the most helpful system design video walkthrough I've seen. I've been struggling a lot with system design. Thank you for putting this together!

  • @manojbgm
    @manojbgm 3 years ago +1

    Awesome and knowledgeable. Thank you for the video.

  • @prathashukla6596
    @prathashukla6596 4 years ago +1

    Awesome explanation of all the high-level components. Good job.

  • @theFifthMountain123
    @theFifthMountain123 7 months ago

    Had to watch multiple times to understand everything in the video. Thanks for the awesome explanation!

  • @anastasianaumko923
    @anastasianaumko923 1 year ago +1

    Thank you for this elaborate design, great work!

  • @aleeshaali7180
    @aleeshaali7180 3 years ago

    Best channel I came across for learning about system design. Thank you and keep it up.
    Kudos for the wonderful work!!!

  • @venjan21
    @venjan21 4 years ago

    Generally I don't post comments, but this is one of the best system designs (in detail) I have ever seen. It has rekindled my thought process on how to approach a system design question.

  • @howellPan
    @howellPan 5 years ago +15

    Great content.. appreciate the details and thoroughness!

  • @lambdaboss5528
    @lambdaboss5528 1 year ago +1

    Superb video, very helpful. Thank you.

  • @akshaymonga
    @akshaymonga 2 years ago +1

    Very nice and detailed video, thank you sir!

  • @alokuttamshukla
    @alokuttamshukla 5 years ago +10

    Thank you so much for these efforts. I mean, a 45-minute video is no joke, with so much to grasp.

    • @TechDummiesNarendraL
      @TechDummiesNarendraL  5 years ago

      I am trying to make it short, but failed to do so

    • @alokuttamshukla
      @alokuttamshukla 5 years ago

      @@TechDummiesNarendraL No, I am in no way complaining at all. I loved it. I am so thankful to you for this.

    • @TechDummiesNarendraL
      @TechDummiesNarendraL  5 years ago

      @@alokuttamshukla thanks

    • @readingsteiner6061
      @readingsteiner6061 4 years ago +4

      @@TechDummiesNarendraL In his Lettres Provinciales, the French philosopher and mathematician Blaise Pascal famously wrote: "I would have written a shorter letter, but I did not have the time." : )
      Buddy, you're awesome. Keep up the good work. Wish you the best.

    • @JM_utube
      @JM_utube 4 years ago

      After watching a lot of system design videos, I really had to accept that this level of detail is NOT EXPECTED in an interview. I really stressed myself out trying to ask so many clarifying questions and cover every single aspect of a system in a 45-minute block. This is not expected. Remember: these videos are edited, shortened, rehearsed, and practiced. Trust me when I say set a lower bar for yourself for interviews LOL
      thanks!!!

  • @utsavkapoor6069
    @utsavkapoor6069 10 months ago +1

    Great explanation, man. Loved your videos. Why have you stopped making these? Hope to see you back soon!!

  • @iitgupta2010
    @iitgupta2010 5 years ago +15

    I really, really appreciate your effort, bro; whoever asks me, I always suggest your name first. There are a few others, like gkcs, but if you ask me, nothing compares to your design skills. You really talk about the things that matter. This is something I have not found even in paid courses. In one word: awesome.
    You should have a lot of subscribers. They will come soon.

  • @pryansh_
    @pryansh_ 2 years ago +2

    very informative, thanks

  • @ajaypremshankar
    @ajaypremshankar 5 years ago +2

    It's not easy to make such in-depth content-rich video. Thank you Narendra :)

  • @IdoKleinman
    @IdoKleinman 3 years ago

    Good stuff! Thank you. One suggestion, for the next video, keep the information text slides on screen for more than 300ms...

  • @CODFactory
    @CODFactory 2 years ago

    a) Why not use a graph DB instead of Bigtable or anything else?
    b) Why do those back-of-the-envelope calculations, like 6 PB, when we never used the numbers and never showed that the design will handle that amount of data?
    c) We definitely should talk about how to make it distributed, since one crawler cannot crawl everything. How are we going to make sure that multiple crawlers are not crawling the same things? (One common approach is sketched below.)
    d) How are we going to store these documents in different DBs, and what kind of sharding are we going to use?
    I think those are some important things to talk about, especially when giving interviews.
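
A minimal sketch of one standard answer to (c): statically partition the URL space across crawler nodes by hashing the host, so no two nodes ever fetch the same site. The fleet size and helper names are assumptions for illustration:

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 16  # assumed fleet size

def owner(url: str) -> int:
    """Index of the crawler node responsible for this URL's host."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

# Every node applies the same function: a node enqueues only URLs it owns
# and forwards the rest, so a host is never crawled by two nodes at once.
assert owner("https://example.com/a") == owner("https://example.com/b")
```

In practice a consistent-hash ring would replace the plain modulus, so that adding or removing a node reshuffles only a small fraction of hosts.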

  • @aliaksandrsheliutsin2374
    @aliaksandrsheliutsin2374 1 year ago

    Just have to say that it's amazing content. Keep it up, Narendra!

  • @karupakulasampath
    @karupakulasampath 5 years ago +3

    Making things clear and easy, thanks for your effort. I really appreciate it.

  • @adrianliu2817
    @adrianliu2817 5 years ago +3

    You are the best! Enjoyed all of your system design videos!

  • @ragdoll2324
    @ragdoll2324 4 years ago

    Very detailed discussion. Thanks for making this video.

  • @neoli8110
    @neoli8110 4 years ago +9

    Why do you need a heap? It sounds like a bottleneck right there. Why can't the back-queue selector use LB-style round robin to select a queue and remove an item from it?
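
A minimal sketch of the round-robin alternative this comment proposes, with politeness enforced by a per-host cooldown instead of a global heap. The queues, delay, and names are illustrative assumptions:

```python
import time
from collections import deque

back_queues = {
    "a.com": deque(["a.com/1", "a.com/2"]),
    "b.com": deque(["b.com/1"]),
}
last_fetch = {host: 0.0 for host in back_queues}
rotation = deque(back_queues)   # round-robin order over hosts
POLITENESS_DELAY = 6.0          # assumed per-host cooldown in seconds

def next_url_round_robin():
    """One full rotation over hosts; skip hosts still in cooldown."""
    for _ in range(len(rotation)):
        host = rotation[0]
        rotation.rotate(-1)     # move host to the back of the order
        q = back_queues[host]
        if q and time.time() - last_fetch[host] >= POLITENESS_DELAY:
            last_fetch[host] = time.time()
            return q.popleft()
    return None                 # nothing ready; caller sleeps and retries
```

The trade-off against the heap: round robin is simpler and has no central bottleneck, but it cannot jump straight to the host whose cooldown expires soonest, so workers may scan through many not-yet-ready hosts.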

  • @ShabnamKhan-cj4zc
    @ShabnamKhan-cj4zc 3 years ago

    Thanks a lot for explaining all the modules in a simple manner. Your channel is the place where one can stop and learn everything the easy way. Thanks a ton, and keep doing this great work.

  • @impossible7434
    @impossible7434 3 years ago

    such an amazing explanation, thank you very much, keep up the good work

  • @JM_utube
    @JM_utube 4 years ago +1

    Thank you so much for posting! I love your videos.
    I just got asked this in a Facebook interview and I wish I had seen this video beforehand.

  • @aashnavaid6918
    @aashnavaid6918 2 years ago

    amazing video thank you so very much sir!!!

  • @SkyCityInc
    @SkyCityInc 2 years ago

    This was a really, really excellent overview, thank you for putting this video together!

  • @augustoclaro
    @augustoclaro 3 years ago

    I have watched this video so many times in the past year that I'm almost quoting every word you say

  • @pinkylover911
    @pinkylover911 2 years ago

    A lot of great effort has been put into your videos, thanks

  • @keshavKumar-le4df
    @keshavKumar-le4df 7 months ago

    Nice explanation.

  • @manmohanakash4222
    @manmohanakash4222 3 years ago

    This is the kind of teammate I would like to work with. So much content. Thanks for sharing.

  • @meetpatel5054
    @meetpatel5054 4 months ago

    Instead of coupling back queues with threads, I would say have more threads for priority URLs and fewer for the others.
    For this to work, we can handle politeness at the front queues, by putting the subsequent URLs for a host into lower-priority queues.

  • @wellingtonrafaelbarrosamor4260
    @wellingtonrafaelbarrosamor4260 3 years ago

    Awesome didactics

  • @StormcastMarine
    @StormcastMarine 1 year ago

    Thanks a lot for the video mate, really useful

  • @vedant9173
    @vedant9173 3 years ago

    Sir, thank you so much for these great lessons

  • @chickentikkasauce1301
    @chickentikkasauce1301 5 years ago +2

    The heap is an implementation detail. I'm being nitpicky (this is a great video), but just some thoughts: why does timestamp-based priority even matter in this system? You didn't mention that. It could be because you don't want certain queues to get starved. A simpler approach might be to process each queue round robin, and only mention the priority queue to your interviewer if they nudge you in that direction, or if you want to slowly build up to it to discuss trade-offs. If each back queue has a priority, then just call out that we want a priority queue. You could say the back queues have the same priority, but maybe other back queues, dedicated to URLs we expect to be updated at a faster rate, have higher priority. But then you need a solution to the problem of the lower-priority queues getting starved.

    • @psn999100
      @psn999100 4 years ago +1

      Great explanation, yes.
      What I gather is that the "URL Frontier" essentially implements:
      1. Priority selection -> front queues
      2. Politeness guarantee -> back queues
      The main issue we are looking at is how to pick the next URL from the "URL frontier" microservice to be sent to a thread for processing.
      As you said, we could use a round-robin method where all back queues get picked from equally, or a "weighted" method, i.e. a priority-queue-based solution, to make sure the hottest websites get crawled at smaller/tighter time intervals.
      I think it's always better to give the simplest approach first (i.e. just draw a black box tagged "Queue Selection") and deep dive later if the interviewer wishes to. There is a saying in the system design world: "KISS" == Keep It Simple and Stupid. It's unlikely that you would run your interviewer out of questions, so it's better to nudge the interviewer in your direction of thinking by giving out ever so slight hints, so that they start asking the questions you already have the answers to.
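
A minimal sketch of the two-tier frontier structure summarized above: front queues bucket by priority, back queues bucket by host so politeness can be enforced per host. The queue counts and routing helpers are illustrative assumptions:

```python
from collections import deque
from urllib.parse import urlparse

NUM_FRONT = 4                      # assumed number of priority levels
front_queues = [deque() for _ in range(NUM_FRONT)]
back_queues = {}                   # one FIFO per host

def enqueue(url, priority):
    """Front-queue router: bucket by priority (0 = highest)."""
    front_queues[min(priority, NUM_FRONT - 1)].append(url)

def route_one():
    """Back-queue router: drain the highest-priority front queue first,
    then bucket the URL by host so one host maps to one back queue."""
    for q in front_queues:
        if q:
            url = q.popleft()
            host = urlparse(url).netloc.lower()
            back_queues.setdefault(host, deque()).append(url)
            return url
    return None
```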

  • @elachichai
    @elachichai 3 years ago

    Definitely helpful ! Appreciate it Narendra!

  • @iitgupta2010
    @iitgupta2010 5 years ago +3

    I crawled one word from this video, "basically", and inverted-indexed it... lol (don't have that much time 😝)
    Great video as always

  • @t4ruvk107
    @t4ruvk107 4 years ago

    Thanks for your time, effort, and content.

  • @puravshah2342
    @puravshah2342 5 years ago +6

    Hi Naren, thanks for the awesome video. Can you also make a video on designing a distributed scheduling system?

  • @rishabhnitc
    @rishabhnitc 2 years ago

    As always, excellent. Just remove the music at the 46-second mark :)

  • @SimranGupta-pz7nw
    @SimranGupta-pz7nw 3 years ago

    Thank you so much for the beautiful explanation :)

  • @heller166
    @heller166 3 years ago

    This is going to be a lot of help for my distributed systems course :). Thanks for all the hard work.

  • @hlibpylypets1333
    @hlibpylypets1333 3 years ago

    Very detailed explanation - best ever :)

  • @nazmavazid9141
    @nazmavazid9141 2 years ago

    Very very nice sir

  • @priyakishan
    @priyakishan 3 years ago

    Great video.
    How many queues/worker threads can be created on a single machine for the HTTP fetcher/renderer? Say I have 100 machines for that; how many other machines are needed for the remaining components?
    URL frontier: how many queues can be created on a single machine?
    Redis: how many machines are needed?

  • @shreyasns1
    @shreyasns1 2 years ago +1

    @Narendra, thanks for the video and the detailed explanation. Could you also add links to the white papers you mentioned to the video description? This would help us dive deeper into the concepts. Thanks again.

  • @roooooot9545
    @roooooot9545 4 years ago

    Great work

  • @Akashkumar-md6rg
    @Akashkumar-md6rg 4 years ago

    Thank you, sir, for such great content.
    Your videos are the most practical and interesting way to learn CS.
    You have made me your fan, sir...
    I really appreciate your hard work. Keep going.🙌🙌

  • @theranajayant
    @theranajayant 5 years ago

    Hey Narendra, quite an interesting topic you have chosen, and it's interesting to learn about. You are curating really good and valuable content.

  • @ramesh4joylife
    @ramesh4joylife 1 year ago

    It would have helped much more if you had gone through this entire thing with an example crawl of a site at scale.

  • @harishaseri
    @harishaseri 5 years ago +1

    Best explanation. Thank you so much, Naren.

  • @jpnr8
    @jpnr8 3 years ago

    For the back queues we can use Kafka topics. Kafka maintains order, and the number of consumers can be mapped to the topic count... we can eliminate the heap.
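
A minimal sketch of this Kafka idea using the kafka-python client. The one-topic-per-host layout, topic names, and broker address are illustrative assumptions (and, as another thread below notes, one topic per host does not scale to millions of hosts):

```python
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def enqueue(host: str, url: str):
    # One topic per host preserves per-host ordering, which keeps
    # politeness simple: a host's URLs are consumed strictly in sequence.
    producer.send(f"backq.{host}", url.encode())

# Each worker subscribes to its share of host topics; Kafka's consumer
# groups replace the heap as the mechanism assigning queues to workers.
consumer = KafkaConsumer(
    "backq.a.com",
    bootstrap_servers="localhost:9092",
    group_id="crawler-workers",
)
for msg in consumer:
    url = msg.value.decode()
    # fetch(url), then sleep the politeness delay before the next poll
```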

  • @kartik-agarwal
    @kartik-agarwal 3 years ago

    Kudos

  • @rhythmPhil
    @rhythmPhil 5 years ago +2

    Thanks for your work. This was really interesting.

  • @gouravkhanijoe1059
    @gouravkhanijoe1059 3 years ago

    Nice

  • @Wei-up2jn
    @Wei-up2jn 3 years ago +4

    Great content! One question I have in mind is why we want to use one queue per host. Is it because the HTTP connection overhead of connecting to different hosts back and forth is high? But in reality the URLs coming from the front queues might be mixed across different hosts, e.g. a.com/a, b.com, a.com/c; in that case we still have to connect back and forth (assuming we only have one back queue), unless we can guarantee that all URLs from the same host arrive at the back-queue router together.
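
A minimal sketch of the connection-reuse argument behind one-queue-per-host: keeping one requests.Session per host holds the TCP/TLS connection open, so draining a per-host queue amortizes the handshake cost. Names are illustrative:

```python
import requests
from urllib.parse import urlparse

sessions = {}  # host -> requests.Session with keep-alive connections

def fetch(url: str) -> bytes:
    host = urlparse(url).netloc
    # Reuse the session (and thus the open connection) for this host.
    sess = sessions.setdefault(host, requests.Session())
    return sess.get(url, timeout=10).content

# Draining a per-host back queue hits the same session repeatedly, so only
# the first request pays the TCP/TLS handshake; mixed-host traffic would
# bounce between sessions and reopen connections far more often.
```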

  • @apurvasharma2853
    @apurvasharma2853 4 years ago

    Excellent explanation!

  • @jyotir124
    @jyotir124 2 years ago +1

    Thanks a lot for the content. One question I have: why do we require a DNS resolver? Why is a specific IP required? Why can't we just render the page based on the domain name? Could you please help me understand?

  • @jamess5330
    @jamess5330 2 years ago

    Narendra, awesome video for system design! Would you like to host mock interview sessions at Meetapro?

  • @Imkflow
    @Imkflow 3 years ago

    Thanks for the work on this, very helpful. Quick note: I think if every processor needs to receive the same message, what you need is a topic instead of a queue.

  • @sayantanray9595
    @sayantanray9595 4 years ago

    Helpful and detailed!!!

  • @w.maximilliandejohnsonbour725
    @w.maximilliandejohnsonbour725 5 years ago

    Nice info...!!!!!.

  • @vishalraut20
    @vishalraut20 4 years ago +2

    What is the purpose of Redis? If we are pushing the entries into the queue, what is the need for a cache?

  • @mtsmithtube
    @mtsmithtube 2 years ago

    @16:38 "make it a standard convention of converting it to a lowercase" - careful, because URLs are case sensitive. Maybe your duplicate detector should do a case-insensitive compare, but you don't want to lose the original case when saving URLs.
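
A minimal sketch of the fix this comment implies: per RFC 3986 the scheme and host are case-insensitive while the path may not be, so lowercase only those parts when building the dedup key, and keep the original URL for fetching and storage. Helper names are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_for_dedup(url: str) -> str:
    """Canonical key for duplicate detection; never used as the fetch URL."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),  # case-insensitive per RFC 3986
        parts.netloc.lower(),  # case-insensitive per RFC 3986
        parts.path,            # case-SENSITIVE: left untouched
        parts.query,
        "",                    # drop the fragment; it never reaches the server
    ))

assert canonical_for_dedup("HTTP://Example.COM/Path") == "http://example.com/Path"
```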

  • @parupatimadhukarreddy6972
    @parupatimadhukarreddy6972 5 years ago

    Hi Narendra,
    I am basically a software developer who mainly deals with JavaScript technologies. I saw these videos on distributed systems on your channel, and it is interesting to get to know the architectural side of the web; even newbies are able to understand the conceptual part of the subject. I appreciate your efforts. What technologies or tools do I need to learn, or start with, to get to know more about distributed systems? Thank you.

  • @yeniaraserol
    @yeniaraserol 5 years ago +1

    Dude, great job! I am glad you got a new mic; the sound quality is much better. Thank you! Would you consider remaking all the videos with this new sound? I will watch all the ads, promise 😃

  • @iitgupta2010
    @iitgupta2010 5 years ago

    I think we should decouple the priority-based crawler from the normal crawler; otherwise, due to the back-queue router, all low-priority URLs will starve and never get the chance to be crawled.
    We can have two or more systems responsible for crawling every minute or less (like the share market), every 5 minutes, every hour... every day or week, up to once a month.
    This way we can scale them very easily and manage them better. This also helps us build in politeness.

  • @pengli7213
    @pengli7213 3 years ago +2

    What is the implementation of the back queue? I don't think it's a Kafka queue, right? There might be too many topics. I guess it could be a key-value data structure, such as [domain_name, url, fetched(boolean)]? Each time we want to get a URL from the "back queue", we just query the key-value store and get a URL which has not been fetched?

    • @hemanthaugust7217
      @hemanthaugust7217 15 days ago

      Yes, I felt the same. There could be 100M websites; we can't have that many Kafka topics.
      So we keep {DomainName} => {URL, TimeToFetchAfter} in a KV store. It has to be a distributed KV store, as we're storing a huge amount of data. Now the challenge is that there will be multiple KV store shards (like Redis), such as 1 => {{DomainName} => {URL, TimeToFetchAfter}}, 2 => {{DomainName} => {URL, TimeToFetchAfter}}, ...
      ShardKey = func(domainName).
      It seems like a lot of work, but I don't see any other alternative.
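
A minimal sketch of the sharded KV design this reply outlines, using redis-py: per-domain URL lists plus a TimeToFetchAfter key, sharded by a hash of the domain. The shard hosts, key names, and 6-second delay are illustrative assumptions:

```python
import hashlib
import time

import redis

SHARDS = [redis.Redis(host="kv-0"), redis.Redis(host="kv-1")]  # assumed fleet

def shard_for(domain: str) -> redis.Redis:
    """ShardKey = func(domainName), as the comment suggests."""
    h = int(hashlib.md5(domain.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def enqueue(domain: str, url: str):
    shard_for(domain).rpush(f"urls:{domain}", url)

def try_dequeue(domain: str, delay: float = 6.0):
    """Return a URL only if the domain's politeness window has elapsed."""
    r = shard_for(domain)
    ready_at = float(r.get(f"after:{domain}") or 0)
    if time.time() < ready_at:
        return None                 # still in cooldown for this domain
    url = r.lpop(f"urls:{domain}")
    if url is None:
        return None
    r.set(f"after:{domain}", time.time() + delay)  # next TimeToFetchAfter
    return url.decode()
```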

  • @PiyushSingh-vx7bx
    @PiyushSingh-vx7bx 4 years ago

    Amazing explanation brother 🔥

  • @AyushRaj-so3zh
    @AyushRaj-so3zh 3 years ago

    This was GOLD !! Amazing content

  • @shreyade5000
    @shreyade5000 1 year ago +1

    Nice content, but there's a long pause at 40:31; it's distracting if you are listening with concentration. Please edit it out.

    • @Amin-wd4du
      @Amin-wd4du 2 months ago

      This was the funniest and most interesting part of the video, please don’t edit.

  • @argstutorial2916
    @argstutorial2916 3 years ago

    Very nice conceptual explanations and tool usage. You have put a lot of energy into the R&D. I hope this will help those who are seeking to develop their own systems for data processing / scraping. Great work, keep it up, man.

  • @renon3359
    @renon3359 3 years ago

    Great video man. You deserve much more subscribers.

  • @vishalmahavratayajula9658
    @vishalmahavratayajula9658 4 years ago

    Awesome video. Can't thank you enough, Narendra.

  • @dimakhamula
    @dimakhamula 7 months ago

    Great video, but I have a question. Imagine this situation: we have 100 back queues and all of them are filled with distinct domains. Later, as our URL frontier works, it pops the highest-priority URL from some front queue and routes it to a back queue, but that domain is not in any of the back queues. What do we do in this situation? Create one more queue and one more thread?

  • @ShailySaini-j6b
    @ShailySaini-j6b 4 months ago

    I have a question: at 25:30 it was mentioned that the number of back queues is the same as the number of worker threads. So is there a one-to-one mapping between back queue and worker thread as well? If so, what is the use of the heap here? Whenever a worker thread needs a new job, it will get it from its assigned back queue.

  • @ambermani1667
    @ambermani1667 4 years ago +4

    19:06 Why did we jump directly to the conclusion of using a bloom filter? Why wouldn't a distributed hash table work to know whether a site has already been crawled? It's not O(n): we can hash the URLs, shard them based on the hash, and then search for each URL in its specific shard's hash table.
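
For contrast with the exact sharded-lookup idea above, here is a minimal bloom filter sketch: O(1) membership tests in a fixed, small memory budget, at the cost of occasional false positives (a URL may look crawled when it is not; never the reverse). Sizes and hash counts are illustrative:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url: str):
        # Derive k bit positions from salted SHA-256 digests of the URL.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url: str):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url: str) -> bool:
        # False => definitely never added; True => probably added.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

seen = BloomFilter()
seen.add("https://example.com/")
assert seen.might_contain("https://example.com/")
```

The sharded hash table gives exact answers but costs a network hop and far more memory per URL; the bloom filter trades a small false-positive rate (some pages never get crawled) for keeping billions of entries in RAM.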

  • @ashish0687
    @ashish0687 5 years ago +1

    Thank you, Naren. These videos are a great source of learning. I very much appreciate the details/time/effort on your part to build the content and present/share it. If possible, can you also please make a video about geohashing (and the use case of performing geospatial searches)?

  • @kkiitian
    @kkiitian 4 years ago

    Can you post a link for asking questions?
    1. Why are dedupe and the URL extractor in parallel? If the content of the document is the same, do we need to extract the URLs on that page? If not, shouldn't these steps be in sequence, with dedupe first?
    2. If we already have one worker per domain in the frontier queues, why do we need the heap? We can simply have the thread wait(6 sec) before

  • @helikopter1231
    @helikopter1231 2 years ago

    Wow, such detail, and explained so well! Thank you so much! You actually made it sound interesting, haha. I'm not a huge fan of web stuff, but this actually made me curious.

  • @RealAbhishekSingh
    @RealAbhishekSingh 4 years ago

    wow, such great explanation, thank you :)

  • @zhaoc033
    @zhaoc033 4 years ago +6

    Not sure if this works... if there is one queue for each domain name, there will be a lot of queues. What if one queue dies? Also, why do you need a heap?

    • @vishalraut20
      @vishalraut20 4 years ago

      I was thinking the same thing. If we are keeping the values sorted, why do we need a heap?

    • @kobew1351
      @kobew1351 3 years ago

      Using a queue for each different host name doesn't look like a workable solution.

  • @dkdraipur
    @dkdraipur 5 years ago +2

    Can you please upload a video on an e-commerce website? Like how Amazon/Flipkart handle huge traffic during sales like Big Billion Day.

  • @DebasisUntouchable
    @DebasisUntouchable 5 years ago +1

    Great video! Thanks for sharing! Can you please recommend a book where I can find such great examples of system design?

  • @JosephKalash
    @JosephKalash 3 years ago +1

    To implement freshness, a crawler needs to continuously recrawl pages it has already crawled to check for content changes, but the URL loader (bloom filter) will likely always reject an already-crawled URL. So how are entries in the bloom index expired regularly, based on the same priority factor used on historical data, so that the same URLs can be recrawled? That is necessary to maintain a fairly current representation of each indexed page. (One common workaround is sketched after this thread.)

    • @aniljuneja175
      @aniljuneja175 2 years ago

      Could we not implement this logic by passing some parameter into the URL frontier and checking it in the crawl-duplicate service?
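
A minimal sketch of the generational-rotation workaround referenced above: since bloom filters cannot delete individual entries, keep one filter per recrawl window and discard the oldest window wholesale, so entries expire in bulk and aged-out URLs become eligible for recrawl. A plain set stands in for a real bloom filter to keep the sketch short; the window length is an illustrative assumption:

```python
import time

WINDOW = 7 * 24 * 3600   # assumed recrawl horizon; tune per priority class

current = set()          # "seen" filter for the live window
previous = set()         # "seen" filter for the last window
window_start = time.time()

def _maybe_rotate():
    global current, previous, window_start
    if time.time() - window_start > WINDOW:
        previous, current = current, set()   # oldest entries expire en masse
        window_start = time.time()

def seen_recently(url: str) -> bool:
    """Once a URL's last crawl ages out of both windows this returns False,
    so the frontier accepts the URL again and the page gets refreshed."""
    _maybe_rotate()
    return url in current or url in previous

def mark_crawled(url: str):
    _maybe_rotate()
    current.add(url)
```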

  • @samirhere4341
    @samirhere4341 5 years ago +1

    Great video. Keep up the good work. Can you do a system design video on Amazon Fresh/getbojo/Blue Apron/Plated/Embrace Box/TryTheWorld? The concept of how subscription and continuous recurring delivery systems work. Thank you.

  • @subee128
    @subee128 3 months ago

    Thanks

  • @АникинКирилл-з2б
    @АникинКирилл-з2б 4 years ago

    Hi, really awesome videos, thanks!