@@hello_interview Feedback: really enjoyed the video! Would love it if future videos were also mostly skewed towards deep dives. Suggesting other topics to research yourself (or hash out with others in the comments) is also super valuable. Finally, calling out the anti-patterns that get regurgitated (e.g. bloom filters) is very valuable as well.
@@davidoh0905 during the deep dive Evan says that Bloom filters are commonly used in interviews because they appear in the solutions in the popular interview prep books. But the prep books don't do a great job of discussing the tradeoffs of a Bloom filter vs. more practical solutions. It's a nice theoretical solution, but in a real-world system you could do something simpler and just brute-force the problem.
Your channel is a gold mine! Thanks a ton. How to decide whether to use Kinesis data streams or SQS? Although they serve different purposes, it feels like both are good options to begin with, generally. Here, SQS ended up being a better option because of retries, DLQ support, etc. But ideally, I'd like to be able to deterministically and correctly choose the right option in the beginning itself. It'll be super helpful if you could quickly reason out in the videos (in just 1 or 2 lines) why you pick a certain offering over other seemingly similar technologies/offering!
Great content. In the deep dives, around 52:41 ("when you get a new URL you'll put it on here, it'll be undefined, and then when we actually parse it we'll update this") and 52:46 ("the real last crawl time and with the S3 link, which also would have been undefined, so that would handle that"): I think you mean that when we actually crawl and download it, we'll update it with the last crawl time and with the S3 link. Also, when you use DynamoDB the lookup will be O(1), not O(log n). Would be great if you had the DynamoDB GSI schema.
A few inputs:
- The bandwidth calculation needs to factor in uploading data to S3 as well. You will probably also compress on upload, and HTML is fairly highly compressible.
- At that rate, the system will likely not be network-throughput bound, but latency and connection-count bound. Assume each site takes 1 sec to return the page: at 10k requests per sec per node, you need 10k concurrent TCP connections, which is under the possible limit but will lead to a number of perf issues.
- Memory requirements: 10k * 2 MB = 20 GB should be enough, but all of these buffers are GC-able, meaning less reusable memory per connection.
- You will likely be better off using a smaller node type, around 50 Gbps; pushing utilization beyond that on a single node is going to be challenging and you will hit other limits.
- Another optimisation would be to do the parsing and crawling in the same process, to avoid handing off the HTML content to a separate process. You can also update the DB with all the links in one write.
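The arithmetic in the comment above is easy to sanity-check. The 1-second in-flight window is the comment's own assumption, and 1000 MB per GB is used as a rough rounding:

```python
# Sanity-checking the back-of-the-envelope numbers in the comment above.
requests_per_sec = 10_000
page_size_mb = 2
concurrency_window_s = 1  # assume each request is in flight for ~1 second

# Buffers held in memory at any instant (the "10k * 2 MB = 20 GB" figure).
in_flight_memory_gb = requests_per_sec * concurrency_window_s * page_size_mb / 1000

# Sustained download bandwidth: MB/s -> megabits/s -> Gbps.
bandwidth_gbps = requests_per_sec * page_size_mb * 8 / 1000

print(in_flight_memory_gb)  # 20.0 GB
print(bandwidth_gbps)       # 160.0 Gbps
```

The 160 Gbps figure supports the comment's point: at 2 MB per page, a single ~50 Gbps node cannot actually sustain 10k requests per second, so you hit network limits before anything else.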
Thanks for the great content as always! One quick question: for the Redis vs. global secondary index comparison, given that the data can fit on a single instance, if we use a hash-based index (not sure if Dynamo supports it, but MySQL should), then that should also be O(1), and Redis in this case would be a bit of over-engineering?
Thanks for this! As far as checking the hash @ 57:00, wouldn’t we already have the last hash since we had to retrieve that url record before we fetched the webpage because we had to go get the lastCrawlTime?
Depth should be on the Domain table instead of the URL table. URLs are unique, so per-URL depth would not increase; depth increases per domain, and having a max depth restricts us from falling into a loop trap.
I usually refrain from commenting, but this is by far the best explanation I can find for this problem statement. I work at Amazon, and using the message visibility timeout for exponential backoff is exactly what we do to add a 1-hour delay for our retryable messages. One minor practical insight: don't use the ApproximateReceiveCount attribute, because it is almost always wrong; the count goes up whenever a thread reads the message, even if it doesn't process it. I used a retry-count attribute set when putting the message on the queue, and checked whether it exceeded the retry threshold.
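For anyone curious what that looks like, here is a minimal sketch of the retry-count-attribute approach. The field names and the 5-retry threshold are illustrative, not from the video, and the actual SQS send (via boto3) is left as a comment:

```python
MAX_RETRIES = 5  # hypothetical threshold


def backoff_delay(retries: int) -> int:
    """Exponential backoff in seconds, capped at SQS's 900 s DelaySeconds max."""
    return min(900, (2 ** retries) * 60)


def next_attempt(message_body: dict):
    """Decide whether to re-enqueue a failed message, tracking retries in a
    body attribute instead of trusting ApproximateReceiveCount."""
    retries = message_body.get("retry_count", 0)
    if retries >= MAX_RETRIES:
        return False, message_body, 0  # exhausted: route to the DLQ instead
    updated = dict(message_body, retry_count=retries + 1)
    return True, updated, backoff_delay(retries)

# With boto3 (assumed), the caller would then do roughly:
#   sqs.send_message(QueueUrl=queue_url,
#                    MessageBody=json.dumps(updated),
#                    DelaySeconds=delay)
```

Keeping the count in the message body means a read that fails before processing never inflates it, which is exactly the failure mode the comment describes.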
can you post the 2nd top voted one (youtube) earlier? At least written version :) Also very interested in the stock exchange question, but I see that's further down.
Thanks for this video. It is one of the best on the internet for crawler system design. Even with full preparation, you're looking at an hour of material; how do you manage it in the 35 minutes of a 45-minute interview?
This is one of the best system design videos on the internet. Kudos to you. I would like to understand a little more about how we handle duplicate content. What if the content is 80% the same on two pages? A hash only works when the pages are exactly identical.
I am not entirely sure I agree with the trade-off discussion between Bloom-filter vs Hash(GSI). Hash collisions can occur, which means we can still receive false positives with GSI hashes.
Hash collisions will almost certainly not occur. They're so rare they're not worth designing around for a system like this, where the consequence is minor. It's a 1 in 340 undecillion chance lol
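That figure (roughly 2^128) can be sanity-checked with the standard birthday approximation; the 10 billion pages below is an illustrative crawl size, not from the video:

```python
# Birthday-bound estimate of any collision across the whole crawl.
n = 10**10      # pages hashed (illustrative)
space = 2**128  # 128-bit hash space (MD5-sized; SHA-256 is far larger)

# P(at least one collision) ~= n*(n-1) / (2 * space) for n << sqrt(space)
p_any_collision = n * (n - 1) / (2 * space)
print(p_any_collision)  # ~1.5e-19: effectively never
```

Even at web scale the probability of a single collision anywhere in the corpus is around 10^-19, which is why designing around it isn't worth the complexity.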
So sorry for being Microsoft Word, but on all of your videos THE APROACH is spelled incorrectly. Thank you so much for posting all your videos. Super helpful for all of us interviewees out there!
Thanks for sharing the SD on the web crawler. Questions: how do we handle dynamic pages, subdomains, URLs that loop back to the same URL, and URLs with query strings? What's the best approach to identify duplicates? Thanks.
So to give the right back-of-the-envelope estimate, the base knowledge needed is that an AWS instance's network capacity is 400 Gbps. I don't have this knowledge in mind; is it OK to ask or search during the interview, or is this something we should memorize?
I think it’s useful to have some basic specs as a note maybe on your desk when interviewing. But it’s also ok to ask. The intuition that caches can have up to around 100gb and dbs up to around 100TB is good intuition to have though.
Not related to this video in particular, but I have a question about partitioning. Let's say we have a DB with two columns, firstname and lastname, and we prefix the partition key (firstname) with lastname. Does that mean all rows with the same lastname will be on the same node? If yes, what happens to the firstnames? How are they arranged? Thanks
Can not thank you enough for all this valuable content. Just amazing work! Btw can you share some good resources for preparing for the system designs interview? Books, courses, engineering blogs, etc. A dedicated video would be much more helpful!
I'm certainly biased, but I think our content is some of (if not the) best out there. So I would start at www.hellointerview.com/learn/system-design/in-a-hurry/introduction. There are some useful blog posts on system design too, depending on your level, at www.hellointerview.com/blog, all written by either me or my co-founder (ex-Meta senior hiring manager).
Very nice explanation! When actually crawling the pages, it could be blocked by the website owner. Do you think we need to mention this in the interview and provide some solutions like using rotating proxies?
I gave the meta interview last week only and I was able to crack it. All thanks to you brother. The system design round went extremely well. I followed the exact same approach in all the questions and everything went really well. Keep posting the videos, these are the best content over the internet for system design.
I saw you used many AWS services during your design. Is it good practice to use specific products and their features (DLQ/SQS, GSI/DynamoDB) in the design? What if the interviewer has never used these products and has no concept of these services/features?
Depends on the company; in general, yes. But, importantly, don't just name the technology. The important part is that you understand the features and why they'd be useful. For example. Bad: "I'll use DynamoDB here." Good: "I need a DB that can XYZ. DynamoDB can do this, so I'll choose it."
I mention this at some point, I believe, when discussing the alternate approach of having a "URL Scheduler Service." They have to get back on the queue somehow, either directly or via a scheduler whose state is in the DB.
I think at 39:03 you're saying to set the visibility timeout of the message to now - crawlDelay, but the visibility timeout concept is for a queue; how are you planning to set it at the message level?
You can set them at the message level with SQS! From the docs, “Every Amazon SQS queue has the default visibility timeout setting of 30 seconds. You can change this setting for the entire queue. Typically, you should set the visibility timeout to the maximum time that it takes your application to process and delete a message from the queue. When receiving messages, you can also set a special visibility timeout for the returned messages without changing the overall queue timeout.”
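A sketch of how that per-message timeout might be computed from a domain's politeness state. The field names here are hypothetical; `change_message_visibility` is the documented boto3/SQS call:

```python
import time


def visibility_for(domain_last_crawl: float, crawl_delay: float, now=None) -> int:
    """Seconds to keep a message hidden so the domain's crawl-delay is
    respected. Returns 0 when the domain is already safe to crawl again."""
    now = time.time() if now is None else now
    remaining = (domain_last_crawl + crawl_delay) - now
    return max(0, int(remaining))

# With boto3 (assumed), after receiving the message:
#   sqs.change_message_visibility(
#       QueueUrl=queue_url,
#       ReceiptHandle=msg["ReceiptHandle"],
#       VisibilityTimeout=visibility_for(last_crawl, crawl_delay),
#   )
```

Extending the visibility timeout rather than re-sending the message keeps the URL in the queue exactly once, while deferring it until the domain's crawl-delay window has elapsed.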
Hello, I enjoyed your content a lot; I'm learning a lot from it, thanks! One question related to the design: you said around minute 52:00 that the check for whether the urlLink already exists should be done in the parser. But if this uniqueness check is not done earlier, in the crawler, then the crawler could save the same text to S3 twice for the same urlLink, right?
Wow, the amount of depth here is absolutely insane. How can you compress so much information into a 1-hour interview? I learned so much from this video that I've never seen elsewhere, and it is all presented so elegantly and naturally. The speaker speaks clearly, with no ums and ahs, and no speed-up? You must be a great engineer at work! One thing I'm a bit unsatisfied about is duplicated content. Is it even likely that we'd have completely duplicated content? Even between two different web pages, there might be just a few locations where the content differs, and that would completely break our hash approach, right? Do you know of any hash function that would allow two webpages that are mostly similar to land close together? Do you see any role for word2vec or vector storage here?
I think this is a great question! I want to attempt to answer it, but I'm no expert haha. As the goal of this particular system is to train language models, it's worth understanding whether optimizing for "similar" web pages is necessary for our top-level goal. In general, it could be helpful to prioritize learning based on chunks of text that appear in many pages. But we have to remember that connecting back to the source could also be required later, for things like citations, so we have to be a bit smart about this. TL;DR it's a can of worms, and I would try to understand its priority relative to the existing requirements of the system.
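On the earlier question about a hash where similar pages land close together: that is roughly what SimHash does for near-duplicate detection. A toy version, just to show the idea (whitespace tokenization and md5 per token are simplifications, not production-grade):

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Tiny SimHash: each token votes +1/-1 per bit of its hash; the sign of
    each bit's total becomes the fingerprint. Near-identical pages end up
    with nearby fingerprints, unlike an exact content hash."""
    votes = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Bits differing between two fingerprints: small means 'near duplicate'."""
    return bin(a ^ b).count("1")
```

A dedup pipeline would then treat pages within a few bits of Hamming distance as duplicates, which handles the "80% the same" case that an exact hash misses.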
Nice, thanks for the content. I also really appreciated the videos from the mock interview. I found that much more useful and would love to see more of those.
I was going to ask the same question there. You cannot avoid downloading by using a hash of the content 😊 You can use the hash to mark duplicates and avoid storing the text output N times, true... You also mentioned the PK lookup before going into the hash and said log(N), an obvious typo. Great content overall
I have been watching too many system design videos, and most of them throw boxes and tools at the canvas just for the sake of it. But your videos follow an interesting and pragmatic approach that someone could actually use to design a real system. Above all, I truly appreciate the framework that you are instilling in viewers' minds to tackle problems. Thanks for your efforts 🚀
Glad you find it valuable!
By far the best system design interview content I've come across - please continue making these. You are doing an invaluable service!
♥️
I don't often comment on videos, but I couldn't stop myself from commenting on yours just to say: what valuable content. Thanks a lot for all your videos!! Keep doing this.
I had an interview last Friday (June 14) and I followed your exact steps. The question was to design Ticketmaster. The Redis cache solution was the best. Thank you for these amazing videos
Nice! Hope you passed 🤞🏼
Did you get an offer?
I am so, so, so thankful for all this content.
400 Gbps NIC 😂
I think he was off by a couple orders of magnitude there 😅
Great video! I have a question, is 5k requests per second realistic? Even with the most powerful machine on EC2?
Hey, thanks for your video. I have watched all your content and I gained immense amount of knowledge.
I gave my E4 interview a week back, and my question was this (with a slight variation of the crawling being done through an app which was deployed in 10k devices).
I covered all the content which you've presented here in the same structure, and was able to dive deep into all the parts the interviewer asked.
I was expecting an offer but got rejected due to "No Hire" in the design round. In retrospect, I found some people saying that the Chord algorithm and a peer-to-peer crawler were expected. I still don't understand what caused the No Hire, because the interviewer never hinted at anything and seemed aligned throughout.
The experience was really heartbreaking. So, I just wanted to leave it out here that even though I did my best, it wasn't my day (I guess).
thanks for your videos, nonetheless
So sorry to hear that, that’s such disappointing news to receive. It’s always a toss up. Keep your head high and best of luck with future endeavors 💪
Please please keep posting more! It educates so many people and you make the world better!! :) Absolutely the best system design series!
🥲
In my opinion, one of the most important bullets of your strategy is how you minimize the initial HLD while making sure you deliver something that actually covers all the functional requirements. I find this calibration really valuable and not that easy to achieve: as a senior candidate, one can be tempted to go straight to deep dives without clearly marking that pause between HLD and deep dives.
What do you recommend to get better at this?
Bro, please don't stop posting this kind of content; I've really loved all of your videos so far.
I can relate to the kind of small, impactful problems and solutions you mention in your videos, which indirectly make a difference in interviews.
I got you!
By far the most inspiring, relevant and practical system design interview content. I found them really useful to perform strongly in my system design interviews
Awesome! Congratulations 🎊
Thank you for the great content, and congratulations on making this a go-to channel for system design. The content is refreshing and of the watch-once, never-forget kind. I'd request a video on how to approach a problem we haven't seen before: what's the best we can do, like mapping it to a related system, or reasoning logically about how the API/design would work, focusing on the problem asked?
Cool idea, we'll give that a go!
Great content as always, thank you! Some comments about the design.
1. Concurrency within a crawler is going to bring a huge performance bonus.
2. Running an async framework for network IO is much faster than using threading.
3. We can put the retry logic within the crawler to make things simpler.
4. DNS caching looks like over-engineering because DNS is already cached at multiple layers: the language runtime, the OS, the ISP, etc.
5. We're processing the HTML in another service but hashing the HTML in the crawler; that seems wrong.
Re 5: you don't want to put the same content into blob storage twice. We are IO-bound, and computing a hash (SHA) is cheap.
This is such a great example for any kind of data application that needs asynchronous processing! Widely applicable!
I've been building a web scraper on my own and using similar logic, and after a month, I see this.
I swear to god this helped me a lottttt, but honestly, it's good that I didn't see this on day 1. Otherwise, I would not have learned things on my own.
Great job, guys.
PS: I got to know about you from Jordan. Keep posting great content, both of you guys!!!
Again, the best system design interview overview I've ever come across. Please keep doing it for us!
🫡
I'm watching your videos to get prepared for my interview 4 days later, I hope I'll be able to handle it :DDD , so far the best SD videos I could ever find on youtube.
Good luck!! You got this!
One of the first things that came to mind at the beginning of this problem is dynamic webpages. Most websites don't deliver the majority of their content as plain HTML. To be honest, if I were interviewing a senior-or-above candidate, not mentioning dynamic content early on would be a red flag. I'm glad you included it at the end of your video, but I do think it is important enough to be mentioned early on.
If Kafka does not support retry out of the box, what exactly does that mean? If you don't commit, the offset doesn't move, which could potentially serve as a kind of retry, right? Also, could you compare this with a queueing service that supports retries, like SQS? A comparison of when to use Kafka vs. SQS would be really good too! Message broker vs. task queue might be their most frequent use cases, but it would be good to justify the choice in this scenario.
I could not find where it is mentioned that AWS SQS has a built-in exponential-backoff retry mechanism. Can anyone please share a link for it? Thanks a lot!
I'm on mobile, but scroll through the comments: I linked the AWS docs in response to another comment.
@@hello_interview Thanks for the reply, but I could not find it
@@hello_interview I haven't been able to find the link and I also wasn't able to find this exponential back-off feature mentioned in the SQS docs...
love your content , learned a lot, please keep updating more. ❤
Great video! I would probably add a proxy component to this system design for the part where the crawler makes the HTTP calls to fetch the HTML (maybe for the DNS lookups as well).
This is a critical part of designing a web crawler: you want to avoid making the calls through the network where the crawlers are deployed, in case all your network IP addresses get blocked. And for security reasons, you want to isolate the outgoing network calls from your instances.
To avoid batching URLs from the same domain together, can we use Kafka partitions and spread messages by hash(URL)? Since different crawlers work at different paces, it is likely they will pick up those URLs at a different time.
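For reference, keying by URL is essentially what Kafka's default partitioner does when a message key is set (Kafka uses murmur2; md5 below just stands in for any stable hash):

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stable partition choice from a message key, mirroring the idea behind
    Kafka's default key-hash partitioner."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Keying by full URL spreads one domain's pages across partitions (the goal
# in the comment above); keying by domain would instead pin each domain to a
# single consumer, which is useful when you want per-domain politeness.
```

Note that spreading a domain across partitions helps throughput but does not by itself enforce crawl-delay; some per-domain rate limiting is still needed.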
This is the first video of yours I watched and I loved it. Your pace is just right and you explain things well, so I didn't feel overwhelmed like I usually do when I watch systems design videos. Thank you!
Thank you for the effort; please keep up the good work. I'm watching your videos as if they were a Netflix series, very exciting. I was hoping you'd cover topics like: if the crawler processes a message but crashes before committing back to the queue, how would you handle that case? Is there a generic solution that can be used across different systems, rather than workarounds?
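On the generic-solution question above: queues typically redeliver in exactly that crash scenario (at-least-once delivery), and the system-agnostic answer is to make the consumer idempotent. A toy sketch, with an in-memory set standing in for a DB table or Redis key on the message id:

```python
# Idempotent consumer sketch: redelivered messages are detected and skipped,
# so a crash between "process" and "ack" is harmless.
processed_ids = set()


def handle(message_id: str, payload: str, results: list) -> bool:
    """Process a message exactly once in effect, even if delivered twice."""
    if message_id in processed_ids:
        return False  # duplicate redelivery after a crash: just ack and skip
    results.append(payload.upper())  # stand-in for the real side effect
    processed_ids.add(message_id)    # record only after the effect succeeds
    return True
```

In a web crawler this is natural anyway: writing the page keyed by URL (or content hash) makes a re-crawl of the same message an overwrite rather than a duplicate.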
Wow, wish I had found this much earlier.
Now I certainly wouldn't just go into my next interview and throw the bloom filter onto the diagram without deep thinking 😝
How would you choose the initial frontier URLs? How many should be enough?
Very well explained!
If possible please do share some tips on how one can keep up with latest technologies and develop a mindset towards such system designs.
I feel like I'm good at coding but not that great when it comes to designing architecture like this.
Basically, what I'm looking for is how one progresses from a Developer role to an Architect role.
This is awesome: a very comprehensive and clean explanation. I've learnt a lot from your videos, thanks. May I ask what tool or website you use as the whiteboard?
I watched this while waiting for my flight back home from Goa :) and completed it
💪
Excellent video! One thought: would it be possible to increase the font size a bit? Thanks so much!
I absolutely love the details you go into, and you have great presentation skills! Super admirable! You just made system design interviews easier for me
Here are my concerns: your solution is so nice, but if everyone is going to talk about the same thing during the interview, especially when one is driving the process, will it raise any red flags on the hiring committee side as they might think candidates are referring to the same sources?
This is not meant to be a script. If your plan is to regurgitate this back to an interviewer I’d recommend not doing that. Instead it’s a teaching resource to learn about process, technologies, and potential deep dives. If you get this problem, then sure, talk about some of this stuff, but also let it be a conversation with the interviewer
But is there an issue if you answer all/most of the interviewer's questions correctly? I believe it's an issue if you memorize this but can't go any further; if you can, there's nothing wrong.
@@hello_interview Yeah, makes sense. You present a good framework to structure the talking points that candidates can bring up. And I found it pretty useful. My system design question is the top-k video and I followed the key points you mentioned. My target is E5 and the interviewer just had a handful of follow-up questions (90% of the time I was talking). Eventually, I passed that round with a "strong hire". Of course, I added my points of view during the interview, but I feel like I was just taking something off the shelf.
I hope someone asks me Web Crawler question.
Why do we need a DNS server? Would it be enough to grab text from a url?
I am not able to understand the math for the number of AWS instances. Can someone explain?
Can this be asked in product architecture interview at Meta or just system design?
Should be system design not product architecture in meta world. But, you never know, some interviewers go rogue.
Finally a new update! Appreciate it!
Suggestions:
Please mention the clarifying questions to be asked for a specific problem. Even if the problem is well known, the panel still expects you to ask a few clarifying questions, especially as a senior candidate.
Also, if you can cover company specific expectations (if any) for top MAANG companies, that would be excellent.
Damn this is extremely nuanced. Some of the big-picture improvements (like adding the parsing queue) seemed kind of obvious, but then Evan would optimize it with a neat detail (e.g. including link in request so we don't have to fetch from database) that was so simple and yet hadn't occurred to me. Great series, great content, thanks so much!
Great design! I wonder why there was never a mention of doing the whole thing with spark, using offline batch jobs rather than realtime services?
I was thinking about batch as well
Interesting. You know, as many times as I've asked this, no one has ever proposed it. Off the top of my head I see no obvious reason why you couldn't get it to work, especially for just a one-off.
@@hello_interview I do crawling for a large company. Typically you would do something like the video's design when you care about data freshness. If you don't care about that, like the LLM use case, you would do a Spark-y thing where you just split the work across a bunch of workers; you can have the HTML fetching and processing parts in different stages. Your inputs can be the URLs and previously crawled pages, and you join them so that you crawl only new URLs, or recrawl URLs only after some time since their last crawl. The main disadvantage compared to your design is that you are not as fault tolerant, since you can't do much in terms of checkpointing. Also it is less fun to discuss :)
commenting for the algo. thanks for excellent and free content!
Legend 🫡
@@hello_interview Feedback: really enjoyed the video! Would love if future videos were also mostly skewed towards deep dives. Suggesting other topics to research yourself (or hash out with others in the comments) is also super valuable. Finally, calling out the anti patterns that are being regurgitated (e.g. bloom filters) is very valuable as well.
@@letsgetyucky is bloom filters a anti-pattern!? just curious!
@@davidoh0905 during the deep dive Evan says that Bloom Filters are commonly used in interviews because they appear in the solutions in the popular interview prep books. But the prep books don't do a great job of discussing the tradeoffs of a Bloom Filter vs more practical solutions. It's a nice theoretical solution, but in a real-world system you could do something simpler and just brute-force the problem.
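For anyone curious about the tradeoff being discussed, here's a minimal, stdlib-only Bloom filter sketch (the sizes and the hash-salting scheme are arbitrary choices for illustration, not anything from the video):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted hash positions per item, one shared bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k "independent" positions by salting a single hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

The tradeoff in one line: `might_contain` can say "maybe seen" for a URL that was never added, which is exactly why a plain unique index lookup is often the simpler brute-force choice when you can afford it.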
Your channel is a gold mine! Thanks a ton.
How to decide whether to use Kinesis data streams or SQS? Although they serve different purposes, it feels like both are good options to begin with, generally. Here, SQS ended up being a better option because of retries, DLQ support, etc. But ideally, I'd like to be able to deterministically and correctly choose the right option in the beginning itself.
It'll be super helpful if you could quickly reason out in the videos (in just 1 or 2 lines) why you pick a certain offering over other seemingly similar technologies/offering!
Great content.
In deep dives around 52:41 "when you get a new URL you'll put it on here it'll be undefined and then when we actually parse it we'll update this with" and 52:46
"the real last craw time and with the S3 link which also would have been undefined so that would handle that" - I think you mean -- when we actually crawl and download it, we'll update it with the last crawl time and with the S3 link.
Also, when you use Dynamo the lookup will be O(1), not O(log n). Would be great if you had the DynamoDB GSI schema.
Hope you can create videos of the write ups done by other authors on HelloInterview in the near future. Love the content. Thank you!!
A few inputs:
- The bandwidth calculation needs to factor in uploading data to S3 as well. You will probably also do some compression while uploading, and HTML data should be fairly highly compressible.
- At that rate, the system will likely not be network-throughput bound, but latency and connection-count bound. Assume each site takes 1 sec to return the web page; at 10k requests per sec per node you will need 10k concurrent TCP connections, which is under the possible limit but will lead to a number of perf issues.
- Memory requirements: 10k * 2 MB = 20 GB should be enough, but all of this is GC-able; less reusable memory per TCP connection.
- You will likely be better off using a smaller node type, around 50 Gbps; utilisation beyond that for a single node is going to be challenging and you will hit other limits.
- Another optimisation would be to have the parsing and crawling in the same process, to avoid passing the HTML content off to a separate process. You can also update the DB with all the links in one write.
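To make the arithmetic in these bullets concrete, here's a tiny sketch (the 10k req/s, 2 MB pages, and 1 s latency are the thread's illustrative numbers, not real specs; roughly double the bandwidth figure if you also count the S3 upload):

```python
def node_requirements(reqs_per_sec, page_mb, latency_s=1.0):
    """Back-of-envelope per-node figures for the bullets above."""
    # Download bandwidth: pages/sec * MB/page * 8 bits, expressed in Gbit/s.
    gbps = reqs_per_sec * page_mb * 8 / 1000
    # Little's law: concurrent in-flight requests = rate * latency,
    # i.e. how many open TCP connections the node must hold.
    open_conns = reqs_per_sec * latency_s
    # In-flight page buffers in GB (all short-lived / GC-able).
    buffer_gb = reqs_per_sec * page_mb / 1000
    return gbps, open_conns, buffer_gb
```

With the thread's numbers this gives 160 Gbps, 10k open connections, and 20 GB of buffers per node, which is why connection count and latency bite before raw network throughput does.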
thanks for the great content as always! One quick question: for the Redis vs global secondary index comparison, given the data can be stored in a single instance, if we use a hash-based index (not sure if it is supported by Dynamo, but it should be supported by MySQL), then it should also be O(1), and Redis in this case would be over-engineering a bit?
Can't we just ignore failed websites? No need to retry, since we already have millions of others to process in the frontier queue.
Product decision!
I confirm your hair and hat didn't have any negative influence in the making of this System Design video.
😂🫶
I can still see the ad here
Thanks for this! As far as checking the hash @ 57:00, wouldn’t we already have the last hash since we had to retrieve that url record before we fetched the webpage because we had to go get the lastCrawlTime?
S Tier system design content! Another exceptional video 👏
Evan your explanations are extremely amazing and the best on this channel. Hope to hear more soon.
Depth should be on Domain Table instead of URL Table. URLs would be unique, so the depth would not increase. Whereas, the depth will increase on a Domain, and having max depth will restrict us from falling into a loop-trap.
True! Might’ve mistyped/misspoke. Thanks!
@@hello_interview Your system design videos are amazing.
Another bump for the algo!
You all are the best!
I usually refrain from commenting but this is by far the best explanation I can find for this problem statement.
I work at Amazon; using the message visibility timeout for exponential backoff is exactly what we do to add a delay of 1 hour to our retryable messages. One very minor practical insight: don't rely on the approximate message receive count metric, because it is almost always incorrect; the count goes up if a thread reads the message but doesn't process it. I used a retry count attribute when putting the message on the queue and checked whether it exceeds the retry threshold.
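A rough sketch of that retry-count approach (the names and thresholds here are made up; in practice the count would ride along as an SQS message attribute, and the delay would be applied to the message via ChangeMessageVisibility):

```python
MAX_RETRIES = 5       # hypothetical threshold before dead-lettering
BASE_DELAY_S = 60     # hypothetical first-retry delay
MAX_DELAY_S = 3600    # cap at the 1-hour delay mentioned above


def backoff_delay(retry_count):
    """Exponential backoff, capped; used as the message's visibility timeout."""
    return min(BASE_DELAY_S * 2 ** retry_count, MAX_DELAY_S)


def next_step(retry_count):
    """Decide what to do with a failed message based on its own retry_count
    attribute (carried on the message, rather than trusting the queue's
    approximate receive count)."""
    if retry_count >= MAX_RETRIES:
        return ("dead_letter", 0)
    return ("retry", backoff_delay(retry_count))
```

The point of carrying the count yourself: it only increments when you deliberately re-enqueue after a real processing failure, not whenever any consumer happens to receive the message.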
Super cool and good to know! Appreciate you sharing that
Thank you for making these videos so engaging! Your eloquent and logical style of explaining the concepts makes watching these videos so much fun.
High praise! Right on :)
Just in time!!!!
can you post the 2nd top voted one (youtube) earlier? At least written version :) Also very interested in the stock exchange question, but I see that's further down.
The written coming this week or early next at the latest! Almost done :)
@@hello_interview Looking forward to it :) Love the videos btw, feel like its the only system designs I can trust for interview prep
Best explanation of bloom filters, Redis sets, and hash as a GSI.
Thanks for this video. This is one of the best web crawler system design videos on the internet. With full preparation you are going over an hour; how do you manage it in the 35 minutes of a 45-minute interview?
Yah, the hour here is because of all the fluff and teaching. This is reasonably 35 without that.
This is one of the best system design videos on the interview. Kudos to you. I would like to understand a little more about how we handle duplicate content. What if the content is 80% the same on two pages? The hash will only work when pages are exactly the same.
Yah, only exactly the same
what is the tool are you using to draw and take note , Evan?
Excalidraw
Just wondering, there is no mention of an inverted index in this crawling flow; wouldn't an inverted index help during searches?
Searches of what?
@@hello_interview I mean when a user searches for the results of a query on the search engine
good stuff!
I am not entirely sure I agree with the trade-off discussion between Bloom-filter vs Hash(GSI).
Hash collisions can occur, which means we can still receive false positives with GSI hashes.
I think it might be necessary to do a byte-by-byte check when we find a hash match, to make sure it's not just a hash collision.
Hash collisions will almost certainly not occur. They’re so rare they’re not worth designing around for a system like this, where the consequence is minor. It’s 1 in 340 undecillion chance lol
@@hello_interview I agree with you. My point was hash collision is as likely as false positive in bloom filter
There is no mention of sharding here?
I like the deep dive section
what is the Text editor you are using? I like it
Excalidraw
Best❤
So sorry for being Microsoft Word, but on all of your videos THE APROACH is spelled incorrectly. Thank you so much for posting all your videos. Super helpful for all of us interviewees out there!
🤦🏻♂️first person to notice this. Will fix next video!
Thanks for sharing the SD on web crawler.
Question: how do we handle dynamic pages / subdomains / URLs which loop back to the same URL / URLs with query strings? What is the best approach to identify duplicates?
thanks
May not totally understand the question, but you could just drop the query strings from extracted urls
So to give the right back-of-the-envelope estimation, the base knowledge required is that an AWS instance's capacity is 400 Gbps. I don't have this knowledge in mind; is it OK to ask or search during the interview, or is this something we should keep in mind?
I think it's useful to have some basic specs as a note on your desk when interviewing. But it's also OK to ask. The intuition that caches can hold up to around 100 GB and DBs up to around 100 TB is good to have, though.
Not related to this video in particular, but I have a question about partitioning. Let's say we have a DB with 2 columns, firstname and lastname. When we say we want to prefix the partition key (firstname) with lastname, does that mean all similar lastnames will be on the same node? If yes, what happens to the firstnames; how will they be arranged? Thanks
If the primary key is a composite of first and last then no; this just means that people with the same first and last name will be on the same node.
Can not thank you enough for all this valuable content. Just amazing work!
Btw can you share some good resources for preparing for the system designs interview? Books, courses, engineering blogs, etc.
A dedicated video would be much more helpful!
I'm certainly biased, but I think our content is some of (if not the) best out there. So I would start at www.hellointerview.com/learn/system-design/in-a-hurry/introduction.
Some useful blogs on system design too depending on your level which can be found at www.hellointerview.com/blog
all written by either me or my co-founder (ex meta sr. hiring manager)
Very nice explanation! When actually crawling the pages, it could be blocked by the website owner. Do you think we need to mention this in the interview and provide some solutions like using rotating proxies?
Good place for depth! Ask your interviewer :)
I gave the meta interview last week only and I was able to crack it. All thanks to you brother.
The system design round went extremely well. I followed the exact same approach in all the questions and everything went really well.
Keep posting the videos, these are the best content over the internet for system design.
Let’s go!!!! Congrats! Thrilled to hear that. Well done 👏🏼
I saw you used many AWS services during your design. Is it a good practice to use specific products and their features (dlq/SQS, GSI / dynamo db) in the design? What if the interviewer never used these products and had no concept of these services/features.
Depends on the company, in general, yes. But, importantly, don't just say the technology. This important part is that you understand the features and why they'd be useful. For example,
Bad: I'll use DynamoDB here
Good: I need a DB that can XYZ. DynamoDB can do this, so I'll choose it.
Can you also provide system design interview flow and product design interview flow for each problem?
They're mostly the same tbh. www.hellointerview.com/blog/meta-system-vs-product-design
What's the reason for not storing URLs in a database like MySQL? For retrying, just add a column like "retry times".
I mention this at some point, I believe, when discussing the alternate approach of having a "URL Scheduler Service." The URLs have to get back on the queue somehow, either directly or via a scheduler where the state is in the DB.
I think at 39:03 you are saying to set the visibility timeout of the message to now - crawlDelay, but the visibility timeout concept is for a queue, so how are you planning to set it at the message level?
You can set them at the message level with SQS! From the docs, “Every Amazon SQS queue has the default visibility timeout setting of 30 seconds. You can change this setting for the entire queue. Typically, you should set the visibility timeout to the maximum time that it takes your application to process and delete a message from the queue. When receiving messages, you can also set a special visibility timeout for the returned messages without changing the overall queue timeout.”
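A sketch of what that per-message timeout computation might look like (the function and field names are hypothetical; the returned value is what you'd pass as the VisibilityTimeout when receiving the message or calling ChangeMessageVisibility):

```python
def politeness_timeout(last_crawl_ts, crawl_delay_s, now_ts):
    """Seconds to keep the message invisible so we don't hit the domain
    again before last_crawl_ts + crawl_delay_s. Returns 0 if we may
    crawl immediately."""
    return max(0, int(last_crawl_ts + crawl_delay_s - now_ts))
```

So the queue-wide default stays untouched; each received message just gets its own timeout pushed out far enough to respect that domain's crawl delay.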
Thank you!
To avoid duplicate URLs, do we need to discuss using a cache, or is it OK to only use the database?
Same convo as the duplicate content. A cache is certainly an option. The DB index is enough imo.
Won't there be a case where the HTML is different but the hash is the same? Is that even possible?
Not worth even considering. Hash collisions are so unlikely they’re not worth discussing
I wonder if questions about the type of content we are scraping matters? i.e. ignore suspicious sites or offensive content
Valid question for interviewer!
Kafka also supports configurable exponential backoff on the producer side
Yup, that’s just to make sure the message gets on the queue, so not the same problem we’re solving here.
Great content! Keep it coming!
hello, I enjoyed your content a lot, I'm learning a lot from it, thanks!
one question related to the design: you were saying around minute 52:00 that the check that the urlLink already exists should be done in the parser. But if this uniqueness check is not done earlier in the crawler, then the crawler could save the same text in S3 twice for the same urlLink, right?
Nope! We won't add new links to the queue if they already exist. That's why we check in the parser.
@@hello_interview understood, thank you!
Wow, the amount of depth here is absolutely insane. How can you compress so much information into a 1-hour interview? I learned so much from this video that I have never seen elsewhere, and it is all presented so elegantly and naturally. The speaker speaks clearly, no ums and ahs, no speed-up? You must be a great engineer at work!
One thing that I am a bit unsatisfied about is duplicated content. Is it even likely that we actually have completely duplicated content? Even when there are two different web pages, they might have just a few locations where the content differs. That would completely break our hash approach, right?
Do you know of any hash function that would allow two webpages that are mostly similar to be close together? Do you see any role in word2vec or vector storage here?
I think this is a great question! I want to attempt to answer this, but I’m no expert haha.
As the goal of this particular system is to train language models, it’s nice to understand if optimizing for “similar” web pages is necessary for our top level goal.
In general, it could be helpful to prioritize learning based on chunks of text that appear in many pages. But we have to remember that connecting back to the source could also be required later, for things like citations. So we have to be a bit smart about this. TL;DR it's a can of worms and I would try to better understand the priority of this compared to the existing requirements of the system.
This isn't skirting the question, but it's a good step towards delivering our final solution.
Which drawing tool is this?
Excalidraw
Thank you
really good video but please stop panning uselessly :D appreciate ur work!
Example?
@@hello_interview 7:17 is the main one thnx love u
Nice, thanks for the content. I also really appreciated the videos from the mock interview. I found that much more useful and would love to see more of those.
Tougher there for privacy reasons. Requires explicit sign off from coach and candidate, but I'll see what I can do :)
why is it called frontier queue? Is this some kind of standard term?
I believe the term comes from BFS where we have a frontier of nodes and we expand the frontier as we go.
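A toy illustration of that frontier, in the classic BFS shape (purely illustrative; the adjacency function here stands in for extracting links from a fetched page):

```python
from collections import deque


def bfs(start, neighbors):
    """Expand the frontier node by node. 'seen' plays the role of the
    visited-URL set; 'frontier' plays the role of the crawler's
    frontier queue."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return order
```

The crawler's frontier queue is the same idea distributed: the set of discovered-but-not-yet-crawled URLs at the edge of the explored graph.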
thank you for more amazing content! I'll be having a mock interview using Hello Interview soon.
Sweet! Can’t wait :)
If we store the hash in the URL table in DynamoDB, how does that handle the case of copied webpages, which will have different URLs and the same HTML?
Check the hash before storing in s3 and putting on parsing queue
you need to store the hash of the page contents for the url and not the hash of the url itself.
I was going to ask the same question there: you cannot avoid downloading by using a hash of the content 😊
You can use this hash to mark duplicates and not store the text output N times, true...
You also mentioned PK lookup before going into hash and said log(N), obvious typo.
Great content overall
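To make this thread's suggestion concrete, here's a minimal dedup sketch (a plain in-memory set stands in for the hash GSI; and, per the thread, note you still have to download the page before you can hash its contents):

```python
import hashlib


def content_hash(html: str) -> str:
    # Hash the page *contents*, not the URL, so mirrored/copied
    # pages under different URLs collide on purpose.
    return hashlib.sha256(html.encode()).hexdigest()


def seen_before(html, seen_hashes):
    """Check-and-record step run after download, before writing the text
    to blob storage or enqueueing it for parsing."""
    h = content_hash(html)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```

The hash only saves the downstream storage and parsing work for exact duplicates; near-duplicates (pages that differ by a few bytes) hash differently and would need a fuzzier technique.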