I'm not even here to study for an interview, just watching this for my passion project. Very helpful.
Guys, please like the video. As an engineer, I know how much hard work goes into a video like this. This is my go-to system design resource. Great work!
Goutam Singh well, some of the audience might be “engineer-to-be”😉
Fake comments
Awesome explanation. As an engineer I know how much hard work goes on behind the scenes. Really appreciate it 👏
Estimation 5:30
HLD 6:33
Queue management 25:30
Update and duplicate handling 33:40
SimHash 39:26
Storage 42:00
Great, detailed system design. The only part I would probably skip is the heap (each queue is already tied to a thread/worker), as it looks more like a bottleneck and serves only as a timer to slow down crawling for politeness, which can be done in different ways.
I was also thinking along the same lines.
Think of it this way: queues are priority-based first and then host-based, but once URLs are split by host you don't know which host to handle first (priority is lost). So the priority queue (heap) is filled with the first element of each back queue, and URLs are downloaded based on priority while still ensuring politeness. "Merge k sorted arrays" is a good pointer for this. There is no point in locking threads to each queue, if that is the doubt, because then priority is per queue and not across all queues. Say one host has all the priority-100 URLs and the other hosts have priorities 1-99; why should that one host be prioritized exclusively? It should not be, unless we implement something similar to a nice call to boost priority and avoid aging/starvation.
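For illustration, a minimal sketch of that merge-k-sorted-arrays idea (the hosts and priorities below are made up): the heap holds only the head of each back queue, so the next URL is always chosen by priority across all hosts rather than per queue.

```python
import heapq
from collections import deque

# Back queues: one per host, each internally ordered by priority.
# (priority, url) pairs; a higher number means higher priority.
back_queues = {
    "a.com": deque([(100, "a.com/1"), (100, "a.com/2")]),
    "b.com": deque([(99, "b.com/1"), (42, "b.com/2")]),
}

# Seed a min-heap with the head of each queue (priority negated so the
# highest-priority head pops first), exactly like merging k sorted arrays.
heap = [(-q[0][0], host) for host, q in back_queues.items()]
heapq.heapify(heap)

while heap:
    _, host = heapq.heappop(heap)
    prio, url = back_queues[host].popleft()
    print(f"fetch {url} (priority {prio})")  # politeness delay goes here
    if back_queues[host]:                    # push the queue's new head back
        heapq.heappush(heap, (-back_queues[host][0][0], host))
```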
Makes me sad that this dude crams so much amazing content into these videos and gets 42k views but the dumbest 10 second videos get millions of views...
I wish YouTube had a notion of content score and quality.
ikr! too much going on to be fully appreciated in his vids.
Hey it totally depends on what people are interested in...
Thanks for this. Most detailed design. No other YouTuber explained exactly how the URL Frontier works.
I have a suggestion to include two things in your videos which will definitely help:
1. QPS Calculation
2. Sharding key when we are planning to shard the DB
The most detailed and clear coverage of crawler design I've seen. Thanks for doing this. It would be great if you could also clarify how data transitions happen between the various services, such as the Extractor, Duplicate Detection, URL Filter, and Loader.
Narendra, thank you for this excellent video. Much appreciated.
Finally you're building the video in an actual flow; that's really great, and it will really help viewers understand and build actual knowledge of SD. Great, bro.
Thank you for the video! This was by far the most helpful system design video walkthrough I've seen. I've been struggling a lot with system design. Thank you for putting this together!
Awesome and knowledgeable. Thank you for the video.
Awesome explanation of all the high-level components. Good job.
Had to watch multiple times to understand everything in the video. Thanks for the awesome explanation!
Thank you for this elaborate design, great work!
Best channel I've come across for learning about system design. Thank you, and keep it up.
Kudos to the wonderful work!!!
Generally I don't post comments, but this is one of the best system design videos (in detail) I have ever seen. It has rekindled my thought process on how to think through a system design question.
Great content.. appreciate the details and thoroughness!
Superb video, very helpful. Thank you.
Very nice and detailed video, thank you, sir!
Thank you so much for these efforts. I mean, a 45-minute video is not a joke, with so much to grasp.
I am trying to make it short, but failed to do so.
@@TechDummiesNarendraL No, I am in no way complaining at all. I loved it. I am so thankful to you for this.
@@alokuttamshukla thanks
@@TechDummiesNarendraL In his Lettres Provinciales, the French philosopher and mathematician Blaise Pascal famously wrote: "I would have written a shorter letter, but I did not have the time." : )
Buddy you're awesome. Keep up the good work. Wish you the best.
After watching a lot of system design videos, I really had to accept that this level of detail is NOT EXPECTED in an interview. I really stressed myself out trying to ask so many clarifying questions and cover every single aspect of a system in a 45-minute block. This is not expected. Remember: these videos are edited, shortened, rehearsed, and practiced. Trust me when I say, set a lower bar for yourself for interviews LOL
thanks!!!
Great explanation, man. Loved your videos. Why have you stopped making these? Hope to see you back soon!!
I really, really appreciate your effort, bro; whoever asks me, I always suggest your name first. There are a few others, like gkcs, but if you ask me, nothing compares to your design skills. You really talk about the things that matter. This is something I have not found even in paid courses. In one word: awesome.
You should have a lot more subscribers. They will come soon.
Very informative, thanks.
It's not easy to make such in-depth content-rich video. Thank you Narendra :)
Good stuff! Thank you. One suggestion: for the next video, keep the information text slides on screen for more than 300 ms...
a) Why not use a graph DB instead of Bigtable or anything else?
b) Why do those back-of-the-envelope calculations (like 6 PB) when we never used them and never proved that the design will handle that amount of data?
c) We should definitely talk about how to make it distributed, since one crawler cannot crawl everything; how are we going to make sure that multiple crawlers are not crawling the same things? (See the sketch after this list.)
d) How are we going to store these documents in different DBs, and what kind of sharding are we going to use?
I think those are some important things to talk about, especially when giving interviews.
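For (c), one common answer, sketched here under assumptions (the node count and hash choice are made up), is to partition the domain space deterministically across crawler nodes so that no two nodes ever crawl the same host:

```python
import hashlib

# Hedged sketch: every crawler node applies the same pure function, so a
# domain always has exactly one owner. NUM_CRAWLERS is an assumption.
NUM_CRAWLERS = 16

def owner(domain: str) -> int:
    """Map a domain deterministically to one crawler node."""
    h = int(hashlib.md5(domain.encode()).hexdigest(), 16)
    return h % NUM_CRAWLERS

# A node only enqueues URLs whose domain it owns; anything else is
# forwarded to the owning node.
print(owner("example.com"))  # same answer on every node
```

Consistent hashing would be the usual refinement, so adding or removing a node reshuffles only a fraction of the domains.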
Just have to say that it's amazing content. Keep it up, Narendra!
Making things clear and easy; thanks for your effort. I really appreciate it.
You are the best! Enjoyed all of your system design videos!
Very detailed discussion. Thanks for making this video.
Why do you need a heap? It sounds like a bottleneck right there. Why can't the back-queue selector use load balancing, like round robin, to select the queue and remove an item from it?
Thanks a lot for explaining all the modules in a simple manner. Your channel is the place where one can stop and learn everything the easy way. Thanks a ton, and keep doing this great work.
Such an amazing explanation, thank you very much. Keep up the good work!
Thank you so much for posting! I love your videos.
I just got asked this in a Facebook interview, and I wish I had seen this video beforehand.
Amazing video, thank you so very much, sir!!!
This was a really, really excellent overview, thank you for putting this video together!
I have watched this video so many times in the past year that I'm almost quoting every word you say
A lot of great effort has been put into your videos, thanks
Nice explanation.
This is the kind of teammate I would like to work with. So much content. Thanks for sharing.
Instead of coupling back queues with threads, I would say have more threads for high-priority URLs and fewer for others.
For this to work, we can handle politeness at the front queues, where we put subsequent URLs into low-priority queues.
Awesome didactics.
Thanks a lot for the video mate, really useful
Sir, thank you so much for these great lessons
The heap is an implementation detail. I'm being nitpicky (this is a great video), but just some thoughts: why does timestamp-based priority even matter in this system? You didn't mention that. It could be because you don't want certain queues to get starved. A simpler approach might be to process each queue round-robin, and only mention the priority queue to your interviewer if they nudge you in that direction, or if you want to slowly build up to it to discuss trade-offs. If each back queue has a priority, then just call out that we want a priority queue. You could say back queues have the same priority, but maybe other back queues, dedicated to URLs that we expect to be updated at a faster rate, have higher priority. But then you need a solution to the problem of the lower-priority queues getting starved.
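For comparison, a sketch of that simpler round-robin alternative (the queue contents are illustrative): every back queue gets an equal turn, so no queue can starve.

```python
from itertools import cycle

# Each host's back queue is visited in turn; politeness follows from the
# turn-taking plus a per-host minimum delay in a real crawler.
back_queues = {"a.com": ["a.com/1"], "b.com": ["b.com/1", "b.com/2"]}

for host in cycle(list(back_queues)):
    if not any(back_queues.values()):
        break                      # all queues drained
    if back_queues[host]:
        url = back_queues[host].pop(0)
        print(f"fetch {url}")      # a real crawler would also enforce
                                   # a minimum per-host delay here
```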
Great explanation. Yes.
What I gather is that the "URL Frontier" essentially implements:
1. Priority selection -> Front queues
2. Politeness guarantee -> Back queues
The main issue we are looking at is how to pick the next URL from the "URL Frontier" microservice to be sent to a thread for processing.
As you said, we could do a round-robin method where all back queues get picked from in an equal fashion, or a kind of "weighted" method, i.e. a priority-queue-based solution, to make sure the hottest websites get crawled at smaller/tighter intervals.
I think it's always better to give the simplest approach first (i.e. just draw a black box tagged "Queue Selection") and deep-dive later if the interviewer wishes. There is a saying in the system design world: KISS, Keep It Simple, Stupid. It's unlikely that you would run your interviewer out of questions, so it's better to nudge the interviewer in your direction of thinking by giving out the slightest of hints, so that they start asking the questions you already have the answers to.
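If the interviewer does push past round robin, here is a hedged sketch of the "weighted" selection mentioned above (hosts and weights are made-up examples): hot sites get picked proportionally more often, yet every queue keeps a nonzero chance of being picked, which avoids outright starvation.

```python
import random
from collections import Counter

weights = {"news-site.com": 10, "blog.com": 3, "static-docs.com": 1}

def pick_back_queue() -> str:
    """Pick a back queue with probability proportional to its weight."""
    hosts = list(weights)
    return random.choices(hosts, weights=[weights[h] for h in hosts])[0]

# Sanity check: selections approach the 10:3:1 ratio over many draws.
print(Counter(pick_back_queue() for _ in range(10_000)))
```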
Definitely helpful! Appreciate it, Narendra!
I crawled a word from this video, "basically", and inverted-indexed it... lol [don't have that much time 😝]
Great video as always
Thanks for your time, efforts and content.
Hi Naren, thanks for the awesome video. Can you also make a video on designing a distributed scheduling system?
Sure
As always, excellent. Just remove the music at the 46-second mark :)
Thank you so much for the beautiful explanation :)
This is going to be a lot of help for my distributed systems course :). Thanks for all the hard work.
Very detailed explanation - best ever :)
Very very nice sir
Great video. A few questions:
HTTP fetcher/renderer: how many queues/worker threads can be created on a single machine? (If I have 100 machines for this, how many other machines are needed for the remaining components?)
URL frontier: how many queues can be created on a single machine?
Redis: how many machines are needed?
@Narendra, thanks for the video and detailed explanation. Could you also add links to the white papers you mentioned in the video description? That would help us dive deeper into the concepts. Thanks again.
Great work
Thank you, sir, for such great content!
Your videos are the most practical and interesting way to learn CS.
You made me your fan, sir...
I really appreciate your hard work. Keep going. 🙌🙌
Hey Narendra, quite an interesting topic you have chosen, and it's interesting to learn. You are curating really good and valuable content.
It would have helped much more if you had gone through this entire thing with an example crawl of a large site.
Best explanation. Thank you so much, Naren.
For the back queues we can use Kafka topics. They maintain order, and the number of consumers can be mapped to the topic count... we can eliminate the heap.
Kudos
Thanks for your work. This was really interesting.
Nice
Great content! One question I have in mind: why do we want to use one queue per host? Is it because the HTTP connection overhead of connecting to different hosts back and forth is high? But in reality the URLs coming from the front queues might be mixed across hosts, e.g. a.com/a, b.com, a.com/c; in that case we still have to connect back and forth (assuming we only have one back queue), unless we can guarantee that all URLs from the same host arrive together at the back queue router.
Excellent explanation!
Thanks a lot for the content. One question I have: why do we require a DNS resolver? Why is a specific IP required? Why can't we just render the page based on the domain name? Could you please help me understand?
Narendra, awesome video for system design! Would you like to host mock interview sessions at Meetapro?
Thanks for the work on this, very helpful. Quick note: I think if every processor needs to receive the same message, what you need is a topic instead of a queue.
Helpful and detailed!!!
Nice info...!!!
What is the purpose of Redis? If we are pushing the entries into the queue, what is the need for a cache?
@16:38 "make it a standard convention of converting it to a lowercase" - careful because URLs are case sensitive. Maybe your duplicate detector should do a case insensitive compare but you don't want to lose the original case when saving urls.
Hi Narendra,
I am basically a software developer who mainly deals with JavaScript technologies. I saw these videos on distributed systems on your channel, and it's quite interesting to learn about the architectural side of the web space; even newbies are able to understand the conceptual part of the subject. Appreciate your efforts. What are the technologies or tools that I need to learn or start with to get to know more about distributed systems? Thank you.
Dude, great job! I am glad you got a new mic. The sound quality is much better. Thank you! Would you consider remaking all your videos with this new sound? I will watch all the ads, promise 😃
:P
I think we should decouple the priority-based crawler from the normal crawler; otherwise, due to the back queue router, all low-priority crawls will starve and never get the chance to be crawled.
We can have two or more systems responsible for crawling every minute or less (like stock market sites), every 5 minutes or 1 hour... 1 day or a week, up to 1 month.
This way we can scale them very easily and manage them better. This also helps us build politeness.
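A rough sketch of those tiers (the names, intervals, and worker counts are illustrative assumptions): each tier is an independent fleet with its own recrawl interval, so slow tiers can never starve fast ones.

```python
from dataclasses import dataclass

@dataclass
class CrawlTier:
    name: str
    recrawl_interval_s: int  # how often a URL in this tier is revisited
    workers: int             # independent capacity per tier

TIERS = [  # ordered fastest to slowest
    CrawlTier("realtime", 60,          workers=200),  # e.g. market data
    CrawlTier("hourly",   3_600,       workers=50),
    CrawlTier("daily",    86_400,      workers=20),
    CrawlTier("monthly",  30 * 86_400, workers=5),
]

def tier_for(expected_change_interval_s: int) -> CrawlTier:
    """Route a URL to the slowest tier that still keeps it fresh."""
    for t in reversed(TIERS):  # try slowest (cheapest) first
        if t.recrawl_interval_s <= expected_change_interval_s:
            return t
    return TIERS[0]            # changes faster than the realtime tier

print(tier_for(4_000).name)    # -> "hourly": cheapest tier that keeps up
```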
What is the implementation of a back queue? I don't think it's a Kafka queue, right? Or there might be too many topics. I guess it could be a key-value data structure, such as [domain_name, url, fetched(boolean)]? Each time we want to get a URL from the "back queue", we just query the key-value store and get a URL which is not yet fetched?
Yes, I too felt the same. There could be 100M websites; we can't have that many Kafka topics.
So, if we keep {DomainName} => {URL, TimeToFetchAfter} in a KV store, it has to be a distributed KV store, as we're storing a huge amount of data. Now, the challenge is that there will be multiple KV store shards (like Redis), such as 1 => {{DomainName} => {URL, TimeToFetchAfter}}, 2 => {{DomainName} => {URL, TimeToFetchAfter}}, ...
ShardKey = func(domainName).
It seems like a lot of work, but I don't see any other alternative.
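A minimal sketch of that scheme (the shard count and record shape are assumptions): because the shard key is a pure function of the domain, all URLs for one host land on the same shard, which preserves per-host politeness.

```python
import hashlib
import time

NUM_SHARDS = 8
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for Redis shards

def shard_key(domain: str) -> int:
    """ShardKey = func(domainName), as described above."""
    return int(hashlib.sha1(domain.encode()).hexdigest(), 16) % NUM_SHARDS

def enqueue(domain: str, url: str, delay_s: float = 6.0) -> None:
    bucket = shards[shard_key(domain)].setdefault(domain, [])
    bucket.append({"url": url, "time_to_fetch_after": time.time() + delay_s})

enqueue("example.com", "example.com/a")
enqueue("example.com", "example.com/b")  # same shard as /a, by design
```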
Amazing explanation brother 🔥
This was GOLD!! Amazing content.
Nice content, but there's a long pause at 40:31; it's distracting if you are listening with concentration. Please edit it out.
This was the funniest and most interesting part of the video, please don’t edit.
Very nice conceptual explanations and tool utilization. You have put a lot of energy into R&D. I hope this will help those who are seeking to develop their own systems for data processing/scraping. Great work, keep it up, man.
Great video man. You deserve much more subscribers.
Awesome video. Can't thank you enough, Narendra.
Great video, but I have a question. Imagine this situation: we have 100 back queues and all of them are filled with distinct domains. Later, as our URL frontier works, it pops a URL with the highest priority from some front queue. Then it routes this URL to a back queue, but we don't have that domain in any of the back queues. What do we do in this situation? Make one more queue and one more thread?
I have a question: at 25:30 it was mentioned that the number of back queues is the same as the number of worker threads. So is there a one-to-one mapping between back queues and worker threads as well? If so, what is the use of the heap here? Whenever a worker thread needs a new job, it will get it from its assigned back queue.
19:06 Why did we jump directly to the conclusion of using a bloom filter? Why wouldn't a distributed hash table work to know whether a site has already been crawled? It's not O(n): we can hash the URLs, shard them based on the hash, then look a URL up in the specific shard's hash table.
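A sketch of that alternative (the shard count is an assumption): an exact, sharded visited set. Unlike a bloom filter it has no false positives, at the cost of storing every URL (or its hash) exactly.

```python
import hashlib

NUM_SHARDS = 4
visited = [set() for _ in range(NUM_SHARDS)]  # stand-ins for DHT nodes

def shard_of(url: str) -> int:
    return int(hashlib.sha256(url.encode()).hexdigest(), 16) % NUM_SHARDS

def seen_before(url: str) -> bool:
    """O(1) average lookup in the shard that owns this URL; marks it seen."""
    shard = visited[shard_of(url)]
    if url in shard:
        return True
    shard.add(url)
    return False

print(seen_before("a.com/x"))  # False: first visit
print(seen_before("a.com/x"))  # True: already crawled
```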
Thank you, Naren. These videos are a great source of learning. I very much appreciate the details/time/effort on your part to build the content and present/share it. If possible, can you also please make a video about geohashing (and use cases around performing geospatial searches)?
Can you post a link for asking questions?
1. Why are dedupe and the URL extractor parallel? If the content of the document is the same, do we need to extract the URLs on that page? If not, shouldn't these steps be in sequence, with dedupe first?
2. If we already have one worker per domain in the frontier queue, why do we need the heap? We could simply have the thread wait(6 sec) before fetching the next URL.
Wow, such detail, explained so well! Thank you so much! You actually made it sound interesting, haha; I'm not a huge fan of web stuff, but this actually made me curious.
wow, such great explanation, thank you :)
Not sure if this works... if there is one queue for each domain name, there will be a lot of queues. What if one queue dies? Also, why do you need a heap?
I was thinking the same thing. If we are keeping the values sorted, why do we need a heap?
Using a queue for each different hostname doesn't look like a workable solution.
Can you please upload a video on an e-commerce website? Like how Amazon/Flipkart handle huge traffic during sales like Big Billion Day.
Working on it
Great video! Thanks for sharing! Can you please recommend a book where I can find such great examples of system design?
To implement freshness, a crawler needs to continuously recrawl already-crawled pages to check for content changes, but the URL loader (bloom filter) will likely always reject an already-crawled URL. So how are entries in the bloom index expired regularly, based on the same priority factors used on the historical data mentioned, so that the same URLs can be recrawled? That is necessary to maintain a fairly current representation of each indexed page.
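One possible answer, sketched under assumptions (the bucket count and sizes are made up): since a standard bloom filter cannot delete entries, keep a small ring of time-bucketed filters. Membership is checked against all live buckets, and dropping the oldest bucket "expires" its URLs so they become eligible for recrawl.

```python
import hashlib
from collections import deque

BUCKETS, BITS, HASHES = 4, 1 << 20, 3  # e.g. one bucket per crawl epoch

def _positions(url: str):
    """The k bit positions a URL maps to."""
    for i in range(HASHES):
        h = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
        yield int(h, 16) % BITS

class RotatingBloom:
    def __init__(self):
        # Sets of set-bit positions stand in for real bit arrays.
        self.ring = deque([set() for _ in range(BUCKETS)], maxlen=BUCKETS)

    def add(self, url: str):
        self.ring[-1].update(_positions(url))  # write to newest bucket

    def __contains__(self, url: str) -> bool:
        pos = list(_positions(url))
        return any(all(p in b for p in pos) for b in self.ring)

    def rotate(self):
        self.ring.append(set())  # maxlen evicts the oldest bucket

bf = RotatingBloom()
bf.add("a.com/x")
print("a.com/x" in bf)   # True: recently crawled, reject
for _ in range(BUCKETS):
    bf.rotate()          # epochs pass...
print("a.com/x" in bf)   # False: expired, eligible for recrawl
```

The rotation frequency could then be tied to the same priority factor: high-priority tiers rotate their rings faster.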
Can we not implement this logic by passing some parameter in the URL frontier and checking it in the crawl-duplicate service?
Great video. Keep up the good work. Can you do a system design video on Amazon Fresh/getbojo/Blue Apron/Plated/embrace box/trytheworld? The concept of how subscription and continuous recurring delivery systems work. Thank you.
Thanks
Hi, really awesome videos, thanks!