@@hello_interview Thank you. Thinking about it again, it doesn’t seem to have much benefit. Heck, it will be the same thing if the SW allows the request to go to the network, unless the server is capable of sending push events (which isn’t the case here)
FYI, malicious or buggy code in the container can technically bring down your host machine (EC2 instance? in this case). One deep dive could be to use MicroVMs to run the code. (What AWS Lambda uses via Firecracker).
Yah probably wouldn’t be the end of the world. Unlikely, and the host is still isolated here using ECS or something and we’ll just bring a new one up. But good call out for sure.
A great deep dive I’ve seen - what happens when the Redis sorted set grows too large for a single node? How do we partition? How do we know which partition to write to? How do we shift set entries up or down partitions? What do we do if the redis node goes down?
Great video, a question: at 58:40, how will the user be able to call /problem/problemId/submissionId when the submission is not even posted in the DB? Because submissionIds are generated dynamically by the database, right? Until and unless they're generated and entered in the submission DB, how will the client have access to the submissionId?
Hello, thank you for an awesome video explanation. At 45:38, why would you say users can't submit two solutions back to back? They're always going to be different users.
Can you clarify on the docker container part? We need an image to instantiate the container. Is it implicit that the image will get built based on every solution code/language and run on the container?
For the leaderboard low-latency deep dive, can we not use a Spark MapReduce logic like explained in another problem, and have a separate OLAP DB to store the live leaderboard information and have the user hit that, as it is read-optimized and gives quick results back? I guess we will need to consider how frequent the competitions are in order to justify a separate DB. Would love to hear what other people say.
What do you think about using postgres triggers to build optimized tables that make querying the data simple and fast? This would simplify the architecture by removing the need for redis and the queue to update it. It wouldn't slow down the db because you can have them run after the writes are done, would be totally accurate and it would probably speed up the queries to an acceptable rate.
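To make the trigger idea above concrete, here's a rough sketch using SQLite from Python (column names and the one-point-per-passing-submission scoring are made-up assumptions; Postgres would use a PL/pgSQL function, but the shape is similar). A real version would also have to avoid double-counting repeat passes of the same problem:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id INTEGER PRIMARY KEY,
    userId TEXT, competitionId TEXT, problemId TEXT, passed INTEGER
);
CREATE TABLE leaderboard (
    userId TEXT, competitionId TEXT, points INTEGER,
    PRIMARY KEY (userId, competitionId)
);
-- after every passing submission, bump that user's points row
CREATE TRIGGER bump_points AFTER INSERT ON submissions
WHEN NEW.passed = 1
BEGIN
    INSERT OR IGNORE INTO leaderboard (userId, competitionId, points)
    VALUES (NEW.userId, NEW.competitionId, 0);
    UPDATE leaderboard SET points = points + 1
    WHERE userId = NEW.userId AND competitionId = NEW.competitionId;
END;
""")
conn.executemany(
    "INSERT INTO submissions (userId, competitionId, problemId, passed) VALUES (?,?,?,?)",
    [("A", "c1", "p1", 1), ("A", "c1", "p2", 1), ("B", "c1", "p1", 0)],
)
# B's failed run never fires the trigger, so only A has a leaderboard row
points = conn.execute("SELECT points FROM leaderboard WHERE userId = 'A'").fetchone()[0]
```

Reading the leaderboard then becomes a trivial indexed sort on the small table, at the cost of extra write work per submission.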
This channel is super helpful to all levels of developers, thank you for providing this level of guidance :) . Is there any plan to provide similar resources for ML system design?
I recently went through a full loop at Meta and had roughly ~35min to implement a solution to a system design problem similar to this one. It felt very rushed (to me based on my experience.) Do you have any advice regarding the short(er) amount of time to do this? Should we skip some of this content? Or just talk faster and make sure to write everything down as we go? Thank you for all the valuable content you share! I look forward to the next one.
These are great. You're pretty much exposing system design interviews though and raising the bar for everyone. Before it was a secret club that only a few people had access to because they had done it before. Now even junior engineers will have Google Fellow level knowledge and the usefulness of these interviews as filters will decrease. I wonder what other types of interviews they'll come up with?
With CDC between the cache and the DB, does it mean that for getting the leaderboard we need to query the DB first? How about also having a separate Leaderboard service to pull from the cache?
Is there a risk in security on the queue before parsing the submission in the docker container? What if a malicious actor wants to take down the queue? It seems to me that the reason we went with docker containers is that even if worst case some rogue code takes down the container, worst case we just error out and spin up a new one. Are there similar considerations in kafka/SQS? or for that matter in s3? it seems to me that either of these services could be compromised by naughty code
I have some questions regarding the database. Sometimes I struggle to understand which database would be better, and for me it always comes down to two factors: 1. Size 2. Query complexity. How do you decide on the DB based on those two points? Something like 'we'll deal with 100GB, so maybe NoSQL' or 'we're doing a lot of complex joins, so SQL'; and what to do if we have complex queries in a huge database?
Great video! Can you let me know why you didn't mention the consistency issue between SQS write and DB write for each submission but did mention the consistency issue between the Redis write and the DB write for each score? Is it less common or unnecessary to discuss the consistency between the message queue and DB during the interviews? I am looking forward to your reply. Thank you!
Great video! What if there are multiple submissions to a problem? This would change the query for the leaderboard. What do you think?

select userId, count(*) as passedSubmissions, MIN(submittedAt) as lastSubmissionTime
from submissions
where problemId in (
    select problem from submissions
    group by problem
    where competitionId={competition_id} AND userId={user_id} AND passed = True
    order by lastSubmissionTime ASC
)

Is the above query correct?
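For reference, the query as posted won't parse (the GROUP BY precedes the WHERE, and the subquery mixes aggregates into the filter). One way to count each solved problem at most once per user, sketched against SQLite with assumed column names rather than the video's exact schema:

```python
import sqlite3

# toy schema; column names are assumptions for illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id INTEGER PRIMARY KEY,
    userId TEXT, competitionId TEXT, problemId TEXT,
    passed INTEGER, submittedAt INTEGER
);
INSERT INTO submissions (userId, competitionId, problemId, passed, submittedAt) VALUES
  ('A', 'c1', 'p1', 1, 100),
  ('A', 'c1', 'p1', 1, 150),  -- repeat pass of p1: must not double-count
  ('A', 'c1', 'p2', 1, 200),
  ('B', 'c1', 'p1', 1, 120);
""")

# COUNT(DISTINCT problemId) counts each solved problem once per user;
# ties on problem count break by earliest last-passing-submission time
rows = conn.execute("""
    SELECT userId,
           COUNT(DISTINCT problemId) AS passedProblems,
           MAX(submittedAt)          AS lastSubmissionTime
    FROM submissions
    WHERE competitionId = ? AND passed = 1
    GROUP BY userId
    ORDER BY passedProblems DESC, lastSubmissionTime ASC
""", ("c1",)).fetchall()
```

User A's duplicate pass of p1 still counts as one problem, so A ranks first with 2 problems and B second with 1.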
Another great video! I just wanted to clarify one thing - I get that the Docker containers are running on EC2 instances. But is the worker running on the same EC2 instances and using a Docker library API to run the submission after reading from the SQS queue? And if the worker has to wait for up to 5 seconds for each submission to get the result, we'll need close to 20K workers if there are close to 100K submissions within 5 seconds towards the end of the competition?
The polling endpoint passes submissionId as a path parameter, but how would we know which submissionId to request for if the submit solution endpoint is just returning 200 status code. The submissionId would only be available once the submission record is created in the database, right?
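A tiny sketch of the flow several comments are asking about (all handler names and grader states here are invented): the submit handler returns the DB-generated id in its response body, and the client polls with that id until a terminal status:

```python
# stand-in for the submissions table; the id is the row count after insert
_db = []

def submit_solution(code: str) -> dict:
    _db.append(code)                       # the id only exists after the insert...
    return {"submissionId": len(_db)}      # ...so return it in the 2xx body

_grader_states = iter(["PENDING", "PENDING", "PASSED"])  # fake async grader

def check_submission(submission_id: int) -> str:
    return next(_grader_states)

resp = submit_solution("print('hi')")
status = "PENDING"
for _ in range(10):                        # poll until terminal status or give up
    status = check_submission(resp["submissionId"])
    if status != "PENDING":
        break
```

Returning a bare 200 with no body would leave the client with nothing to poll against, which is exactly the concern raised above.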
This is really interesting. I recently started learning system design on your channel. I would say you explain things very clearly; other places just confuse us while designing a system, and at the end of the design, when I tried to summarize it myself, it was very confusing. After going through your HLDs I am now confident, but can you also start putting up LLD as well?
Great video! I did have a small doubt on how we would handle ranking users along with tracking the time they took to solve problems, similar to LeetCode. One approach could be to use sorted sets for ranking and store problem completion times in a Redis hash for each user per contest, with problem IDs as keys. My concern is whether this would lead to the N+1 problem when fetching the leaderboard, or if Redis is fast enough that it wouldn't significantly impact latency. What do you think?
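One common workaround for the tie-breaking question - an assumption here, not something from the video - is to pack score and finish time into the single number a Redis sorted set orders by, so the leaderboard read needs no extra per-user hash lookups (and no N+1). A pure-Python sketch of the encoding:

```python
# Redis sorted sets rank by one double, so pack (score, finish time) into it:
# higher score wins; among equal scores, the earlier finisher ranks higher.
MAX_TS = 10**10  # assumed upper bound on epoch-seconds timestamps

def composite(score: int, finished_at: int) -> int:
    # score dominates; the (MAX_TS - finished_at) remainder rewards earlier finishes
    return score * MAX_TS + (MAX_TS - finished_at)

# stand-in for ZADD + ZREVRANGE: sort members descending by composite score
entries = [("alice", 300, 5000), ("bob", 300, 4000), ("carol", 200, 1000)]
ranked = [name for name, score, t in
          sorted(entries, key=lambda e: composite(e[1], e[2]), reverse=True)]
```

Caveat: Redis scores are IEEE doubles, exact only up to 2^53, so the packed value has to stay below that bound (it does with these assumed ranges, but check yours).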
I have 2 questions and would be most grateful if you could answer them. 1) Is there a reason in the TicketMaster example you drew the API Gateway component but not here? 2) Is there a reason you mentioned a CRUD Service in the TicketMaster example but not in this one?
Would love to know the suggestion of webhooks vs. polling when trying to get submission status? Sounds like the result would be equivalent and would lower the number of requests the server would receive?
Thanks for these videos, they're really super helpful. Just curious, when you say you'd want a mid-level or senior to pick up on certain things, is that always proactively or can it be in response to interviewer prompts. For example, in this one would you expect a mid level to have picked up that the get leaderboard query is slow on their own, or could they still pass if you had to nudge them and say what do you think about this query and/or how can we improve it, and they came up with a good solution as you outlined?
Should we discuss managing containers in the interview? Will users share containers or will each submission have a dedicated container? Also, will the containers remain active continuously?
Maybe it is worth introducing a queue for the results of the submission, and another microservice that would read from this queue and update the leaderboard. This way the main service is not overloaded with unneeded calculation of the scoreboard.
Thanks, very well explained!! I do have a question: why would you use competitionId as the PK for the submission table? Wouldn't that make querying a specific submission very slow? Also, SQS is in memory, right? What if that goes down? Submissions will be dropped? Thanks
yah, this was a silly mistake. The submission table PK should be id, and then we need an index on competitionId. Good callout. As for SQS, no one would wait more than 5s anyway, so if it goes down and we lose them, it is what it is. The client shows an error saying they need to resubmit. They aren't going to wait 5 minutes for a response in any case.
The SQL query would not work as is, because I can do 100 submissions for the same problem and win the competition. We would need to make sure that submissions are only counted once per problem.
Instead of having a SQL query, I think having a Leaderboard table would be good: have user_id, competition_id, and points as columns, and when a user submits all the questions, calculate the points and store them here. Since things get written once and read multiple times, fetching the points would be the better way: just pass competition_id and sort on points, and you'll have your results. Because think of 100k users trying to see the leaderboard and that complex query getting executed for each one of them.
If I'm not mistaken, AWS CloudFormation gives you the possibility to set up autoscaling rules and policies in numerous ways. It also includes a warm-up period, so I don't think a queue is needed here. It can be found in the user guide: ec2-auto-scaling-default-instance-warmup
Lots of clouds offer autoscaling, but generally speaking they are not quick to respond so you'll probably want a queue even if it's just to handle the first 30-90 seconds, unless your system can tolerate a spike in dropped requests.
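To illustrate the point about the queue absorbing the first seconds of a spike while autoscaling catches up (the sizes here are arbitrary), a bounded in-process queue can stand in for SQS:

```python
import queue

buf = queue.Queue(maxsize=1000)  # bounded buffer standing in for SQS

dropped = 0
for i in range(1200):            # a burst of 1200 submissions lands at once
    try:
        buf.put_nowait(f"submission-{i}")
    except queue.Full:
        dropped += 1             # in practice: 429 the client and ask it to retry

processed = 0                    # workers drain at their own pace once capacity catches up
while not buf.empty():
    buf.get_nowait()
    processed += 1
```

With the buffer, only the overflow beyond its capacity is rejected during the burst, instead of every request above the workers' instantaneous throughput.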
Thanks for this video and all the other ones! I love your content. One thing I failed to understand in this video is where are the Docker containers (Language code runtime services) actually running ? Looking at your diagram, it seems like they are not running on the Primary Server. But they are Docker Containers. Their images must be pulled onto some server and served there such that the Workers can actually talk to them. Please explain that if possible. Thanks again and keep rocking!
@@hello_interview Do you mean to say that the Workers are running forever as EC2 instances, and that these worker instances were initially launched with language specific docker images? But how would they auto-scale? Kubernetes orchestration perhaps? Sorry I'm struggling to understand where these docker containers are actually running; and what's the real benefit of not using Lambdas in this design and how this design is actually solving the cold-start problem of Lambda functions?
Great video as always!! Just one doubt: is it right to make competitionId the primary key for the Submission table, since it will be NULL when a submission is not associated with a competition, and multiple submissions will have the same competitionId? I might be missing something here. Please let me know your thoughts.
Good content, but I found the justification for CDC over a cron job hand-wavey. How specifically would a cron job have a higher infrastructure cost as you claimed? I also don’t think CDC is necessarily more reliable. I work at a big tech company and we’ve had three downtime events for our in-house CDC in the last two months. Also: the Kubernetes setting of “concurrencyPolicy: Forbid” would prevent the problem of too-frequent job runs you mentioned. And network isolation and no host machine system calls are the default behavior of a Docker container. The latter might not be possible at all.
I am curious what would you suggest for career development between Midlevel to Senior to Staff. I know there is a useful breakdown in terms of interview expectation, but I'm thinking more from the perspective of internal promotions. Do you have any useful tips?
Any reason for going with a primary monolithic service rather than microservices like in other SD keys from HelloInterview? Would love to see a messaging service like WhatsApp or Messenger next.
Heads up! Made a silly mistake with the primary key in the submission table. The submission table's primary key should be the ID, and then we'd want to add an index on competitionId. My bad 🫣
Sadly, a candidate wouldn't have the opportunity to correct mistakes like this afterwards, which is why I consider the interview process as highly scripted and non-representative of the candidate's true ability. You cram LeetCode, you practice a bunch of stuff, then vomit those in 45 minutes. If you're not hired, you get a bs response that has no indication of what went wrong, so, you've no idea how to do better if you interview with the same company again. Rinse and repeat.
@@abhijit-sarkar Umm what? This is wrong. You would correct it during the interview because the interviewer would bring it up. This is why it's important to consistently DISCUSS with your interviewer. It should be a back and forth conversation.
Again a brilliant job. Just a couple of nitpicks to touch on for Staff+ interviewees and for the write-up version:
- DB has to be write optimized for competition. Submission table would probably need to be on a different tech such as Cassandra. In any case, Relational or NoSQL, it probably needs to be sharded especially to take care of the write demand at the end. Best candidate for sharding is submission ID.
- Submission table PK has to be submission ID, not competition ID. You can have a secondary index on competition ID, but it's a serious error to say its primary key is competition ID.
- In the SQL query, MAX is the right aggregator, not MIN. You want each user's maximum submission timestamp, and among tied users the winner is the one whose maximum is smallest. Hence ASC order on MAX submission time :)
- Redis won't be able to handle 100K users polling near the end of the competition. So some sharding and scatter/gather is needed there too.
- You want rate limiting on the submissions too. So the API gateway needs to be configured to do that, or alternatively it can be implemented in the primary server, but that would be much more complicated.
- Finally, the problem POST should return a submission ID in the body of the response, and the client should update the URL with the submission ID (instead of a page load). This is needed because if the user refreshes the page, they'd want to continue to poll the latest submission.
These are the only nitpicks I can find. I'm just listing them for others as a reference. Overall great. Please keep these coming. I derive immense value out of these. Very very good job!
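On the rate-limiting bullet above: a minimal token-bucket sketch of what the API gateway (or the primary server) could apply per user. The rate and burst numbers are illustrative, not from the video:

```python
import time

class TokenBucket:
    """Allow `rate` requests/sec on average, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# deterministic demo with a fake clock: 1 submission/sec, burst of 2
clock = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(3)]  # third call in the same instant fails
clock[0] = 1.0                              # a second later, one token has refilled
later = bucket.allow()
```

In a real deployment the bucket state would live in the gateway or a shared store keyed by user, not in process memory.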
can we use postgres for write heavy? submissions?
@@madhuj6912 Depends on the isolation level on the DB. If it is the lowest isolation level, namely Read Uncommitted, and if it is sharded and connection-pooled well, then I suppose it is possible to achieve very high write throughput.
Note that Read Uncommitted isolation is possible because every submission is unique and the submissions will be read after the competition (i.e. no more writes).
Sharding is not really straightforward in PostgreSQL though, as it does not support it natively. You gotta use extensions or your custom logic to achieve that.
So while it is possible, it would be a PITA to handle write-heavy submissions traffic via PostgreSQL.
@@je_suis_onur Thank you for detailed explanation. Thank you so much
Related to the DB writes, the 100k submissions would actually be distributed across 1 or 2 hours most likely. So the actual writes per second will be a much more manageable number, which may not need a write-optimized DB. MySQL can still handle hundreds of writes per second and even a bit more.
And to avoid the Redis scatter/gather, you could have a multi-level cache where you cache the first few pages of the leaderboard across nodes (or even better, in memory in the service) before hitting the one with the sorted set every 5 secs.
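The multi-level cache idea can be sketched as a tiny per-node TTL cache sitting in front of the sorted-set node (the 5-second TTL and the page contents are illustrative):

```python
import time

class TtlCache:
    """Tiny per-node cache for the top leaderboard pages (ttl in seconds)."""
    def __init__(self, ttl: float, fetch, now=time.monotonic):
        self.ttl, self.fetch, self.now = ttl, fetch, now
        self.value, self.expires = None, float("-inf")

    def get(self):
        t = self.now()
        if t >= self.expires:        # expired: go to the sorted-set node once
            self.value = self.fetch()
            self.expires = t + self.ttl
        return self.value

calls = {"n": 0}
def fetch_top_page():                # stand-in for ZREVRANGE leaderboard 0 49
    calls["n"] += 1
    return [("alice", 300), ("bob", 250)]

clock = [0.0]
cache = TtlCache(ttl=5.0, fetch=fetch_top_page, now=lambda: clock[0])
cache.get()                          # miss: hits the sorted-set node
page = cache.get()                   # within 5s: served from local memory
clock[0] = 6.0
cache.get()                          # expired: refetches
```

Every service node answering from its own copy means the single sorted-set node sees one read per node per TTL instead of one per user request.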
@@TomasV247 Well if there are 100k users, you should be prepared for 100k users coming at the same time, especially near the end of the competition. I'd assume the competition is not a single question one but a multi-question one otherwise a real-time leaderboard would not make much sense.
So it is possible that near the end, a lot of people submit the last solution they're working on with the hope that it would simply work. While I agree that 100k submissions coming in the last seconds is unlikely, still, it is not so unlikely that you can ignore the possibility. MySQL can probably handle bursts all the way up to 1k if you handle the connection pooling problem. But a sustained 1k writes/sec near the end on MySQL would be quite a stretch, even with read committed isolation and no index other than the PK.
What are you going to do if that happens? People will start getting 500s and then the time runs out and they won't be able to submit anymore. That would suck big time. So you should better be prepared for it.
These are by FAR the best product architecture/system design videos out there! I also highly recommend their mock interviews as well, I did 3 of them and the feedback you get back is more helpful than anything else you’d get out there! Please keep these videos coming
Wow, hell yes! So glad to hear you’re finding everything valuable 💪
I would pay $150/year if you make videos like this every 2-3 weeks as well as deep dive of famous "swiss army" system design components
These videos are golden
So far best video I have seen on High level design which goes in iterative way and explains each bit and focuses on WHY part.
Keep doing the good stuff !!
this channel is gold standard for design interviews. the depth and transparency is awesome. thank you for all you do!
🩵
These videos are incredibly valuable and applicable not just for system design interviews but also for practical application in real world. Thank you for the excellent content!
Please keep doing this, I just graduated and I'm focusing on DS&Algos, Full Stack and System Design to pass interviews and THIS IS GOLD!!!!
Hell yah! Will do, you’ve got this 🫡
Okay wow!
This is by far the best system design video I've seen on YT. I actually was smiling and nodding my head throughout the whole video. THANK YOU SO MUCH!
So glad you like it! Checkout the others too if you like this one :) and more coming!
an interview lesson about designing an interview practice platform by a former-Meta engineer is super meta
😅
Hands down one of the best content for system design
you rock
I just understood the HelloInterview logo and my mind is blown (good video btw)
🤯 s/o upwork
This is the only channel I setup notifications for as this has been one of the most valuable resources for tech interviews. Coding is easy to practice but having these videos really shows you how to approach the problem. I consider these videos the best resource out there. Please keep it up!!!!!!!!!!!!
High praise! We’ll try not to let you down 🫡
@@hello_interview Seriously! I have read the Xu books and reviewed some other resources but this really helps me have a process which greatly helps to understand how to do these problems in an interview. Thank you so much!
@@hello_interview You guys deserve it
Please keep making more such videos
Very useful! Thank you so much! Glad I landed on this video. YT needs more like this!
Working on it! Plenty more written on the site while you wait :)
Just a nit: for many of the interviews posted on this channel the database choice doesn't matter, I would like to see more where database choice actually matters and diving deeper into them
Watch ad click aggregator if you haven’t yet. Def matters there.
@@hello_interview Thanks for the great video! Please keep posting such system design videos. One question: do we have to present the cron job solution first and then improve on it by replacing it with the cache, OR can we directly present the cache solution without even mentioning the cron job? Same question for AWS SQS: can we propose it from the beginning, or only at the end, considering the interviewee has 15+ years of experience?
I think the interesting part that was omitted would be to cover test functionality. The one that runs test cases and returns the feedback, the one that runs hidden cases on submission and submission cannot proceed but the try is stored in the DB. I would cover that one instead of competitions. But out of questions, your videos are super cool! Thank you. Hope they will help me to pass my upcoming interview
Great work buddy!
keep doing this. Loved every bit of it.
I like having the Data Flow before the High Level, it is so helpful/efficient to go about HLD after it
First off, amazing teachings and explanation. Thank you !
Follow-up question on the caching strategy for the leaderboard data - with the sorted set, since we are sorting by just the score, how and where will we sort the users based on their completion time (for scenarios where multiple users have the same score)? I'd assume this would be a common scenario, esp. if the max allowed participants is 100k. Thank you!
There is a big problem with your videos. They are so good, they have spoilt every other resource, so for any questions that you are not posting, I feel there is no good resource any where whatsoever, please keep them coming, and keep them free, they are much needed.
This channel is a boon for software developers with upcoming interviews. Just discovered it last month and used your process in one of my interviews today. It was very comfortable for me; the only problem was I had to make the interviewer understand that I would be doing the optimizations mostly in the deep dive.
I still need to get accustomed to the infrastructure side of the process, although I am sure I will get there through the resources you have.
Thanks a lot.
Amazing! Yah communication is key, regardless of which framework you choose. Good to let your interview know your plan. Good luck!
Very impressive and concise system design interview question. Much appreciated.
Great content honestly! Keep doing this please. These are really helpful and all the tradeoffs and decision making are very insightful
Glad you enjoyed!
I think it makes sense to also mention that we can use ECR to store the container images used to start more containers. The reason is that we can have multiple Python images: one Python image can have extra classes predefined, like TreeNode, while some questions only need the most basic runtime.
Great video!! As always, please keep posting more
I recently had a loop interview for staff at Oracle. I thought I did pretty well following the same format as here or Xu, but the interviewer was like, lol, I need an architecture diagram, not these kinds of boxes, component diagrams, or HLD. It really depends on the interviewer and how the day of your interview goes, since it's his expectation and his world might be a small shell. But this is amazing, keep it up. Prep is important, but luck is damn important too. I was literally waiting for the offer letter 😂
Oh yikes! Sorry to hear that. Total crapshoot sometimes for sure 😒
I've always enjoyed your videos the most!
I believe you were right in your query to select the max(submittedAt) as lastSubmissionTime because you wanted to select the submission time of the last submitted problem 🙂for each user
Guided practice is pretty interesting too. I gave it a try and it's really cool.
Great video!! Please keep sharing more videos
Great job from the team. This is by far one of the best sysdsgn resource platform on the internet.
For handling asynchronous request on the client, can one also explore the use of service workers as an alternative to websockets and good-ol polling? I think this serves as a reasonable middle ground between the other options.
Hmm, you might have better knowledge of SW than I do, in which case, certainly. Based on my understanding, I'm not sure where the benefit would be over polling: we don't want to cache here, we need to poll until the submission is complete or a time limit elapses.
@@hello_interview Thank you. Thinking about it again, it doesn't seem to have much benefit. Heck, it would be the same thing if the SW just lets the request go through to the network, unless the server is capable of sending push events (which isn't the case here).
Keep up the good work. Don't stop!!!!!
FYI, malicious or buggy code in the container can technically bring down your host machine (EC2 instance? in this case). One deep dive could be to use MicroVMs to run the code. (What AWS Lambda uses via Firecracker).
Yah probably wouldn’t be the end of the world. Unlikely, and the host is still isolated here using ECS or something and we’ll just bring a new one up. But good call out for sure.
Amazinggggggg content! This is gold content🌟
Keep em coming!! This is like crack to me and you're my dealer.
😂🤣😂
Great content man. Keep going!
Amazing content as always!
A great deep dive I’ve seen - what happens when the Redis sorted set grows too large for a single node? How do we partition? How do we know which partition to write to? How do we shift set entries up or down partitions? What do we do if the redis node goes down?
It won’t :) even with 100k each entry is a couple hundred bytes. This is less than a gig, so no stress there
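To make the estimate concrete, here's the back-of-envelope math from the reply (the per-entry size is an assumed generous figure, not a measured one):

```python
# Rough memory check for the leaderboard sorted set: 100k entries at a
# generous ~200 bytes each (member string + score + structure overhead).
entries = 100_000
bytes_per_entry = 200
total_mb = entries * bytes_per_entry / 1_000_000
# total_mb == 20.0, i.e. ~20 MB, far below a single Redis node's RAM
```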
true!!
Thanks a ton for these videos. This was great as always.
🫡💙
Great video, a question: at 58:40, how will the user be able to call /problem/problemId/submissionId when the submission isn't even posted in the DB yet? Submission IDs are generated dynamically by the database, right? Until they're generated and entered in the submission DB, how will the client have access to the submissionId?
The client can create it
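A minimal sketch of that idea, assuming the API accepts a client-minted UUID (the endpoint shape and function name are illustrative):

```python
import uuid

def new_submission_url(problem_id: str) -> tuple[str, str]:
    # The client mints the submission id itself (a UUIDv4), so it can start
    # polling GET /problem/{problemId}/{submissionId} immediately, without
    # waiting for the database to generate an id.
    submission_id = str(uuid.uuid4())
    return submission_id, f"/problem/{problem_id}/{submission_id}"
```

The server then stores whatever id the client sent; since UUIDv4 collisions are practically impossible, the client can safely poll that URL right after submitting.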
Hello, thank you for an awesome video explanation. 45:38 why would you say users can't submit two solutions back to back? They're always going to be different users.
Can you clarify on the docker container part? We need an image to instantiate the container. Is it implicit that the image will get built based on every solution code/language and run on the container?
Love the content! Could you guys make more infra-style system design videos, like 'Design S3 Storage System'?
For the leaderboard low-latency deep dive, could we not use Spark MapReduce logic, as explained in another problem, and have a separate OLAP DB to store the live leaderboard information and have users hit that, since it's read-optimized and gives quick results back? I guess we'd need to consider how frequent the competitions are in order to justify a separate DB. Would love to hear what other people say.
What do you think about using postgres triggers to build optimized tables that make querying the data simple and fast? This would simplify the architecture by removing the need for redis and the queue to update it. It wouldn't slow down the db because you can have them run after the writes are done, would be totally accurate and it would probably speed up the queries to an acceptable rate.
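If it helps, here's a rough, runnable sketch of that trigger idea, using SQLite via Python for brevity (in Postgres you'd write a PL/pgSQL trigger function instead). Note this simplified version counts a resubmission to the same problem twice; per-problem deduplication is left out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id TEXT PRIMARY KEY,
    userId TEXT, competitionId TEXT, problemId TEXT,
    passed INTEGER, submittedAt INTEGER
);
CREATE TABLE leaderboard (
    userId TEXT, competitionId TEXT,
    passedCount INTEGER, lastSubmissionTime INTEGER,
    PRIMARY KEY (userId, competitionId)
);
-- Maintain the aggregate row on every passing submission, after the write.
CREATE TRIGGER bump_leaderboard AFTER INSERT ON submissions
WHEN NEW.passed = 1
BEGIN
    UPDATE leaderboard
       SET passedCount = passedCount + 1,
           lastSubmissionTime = MAX(lastSubmissionTime, NEW.submittedAt)
     WHERE userId = NEW.userId AND competitionId = NEW.competitionId;
    INSERT INTO leaderboard (userId, competitionId, passedCount, lastSubmissionTime)
    SELECT NEW.userId, NEW.competitionId, 1, NEW.submittedAt
     WHERE NOT EXISTS (SELECT 1 FROM leaderboard
                        WHERE userId = NEW.userId AND competitionId = NEW.competitionId);
END;
""")
conn.execute("INSERT INTO submissions VALUES ('s1','u1','c1','p1',1,100)")
conn.execute("INSERT INTO submissions VALUES ('s2','u1','c1','p2',1,200)")
row = conn.execute("SELECT passedCount, lastSubmissionTime FROM leaderboard").fetchone()
# row == (2, 200)
```

Reading the leaderboard then becomes a simple indexed scan of the aggregate table instead of a full aggregation over submissions.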
These videos are of a significantly higher quality than some paid courses.
Save your money for mock interviews, we gotchu :)
Great video, thank you 🙏
This channel is super helpful to developers of all levels; thank you for providing this level of guidance :). Are there any plans to provide similar resources for ML system design?
Yep, working on it.
I recently went through a full loop at Meta and had roughly ~35min to implement a solution to a system design problem similar to this one. It felt very rushed (to me based on my experience.) Do you have any advice regarding the short(er) amount of time to do this?
Should we skip some of this content? Or just talk faster and make sure to write everything down as we go?
Thank you for all the valuable content you share! I look forward to the next one.
As always amazing
Your video is great, thanks for sharing!
These are great. You're pretty much exposing system design interviews though and raising the bar for everyone. Before it was a secret club that only a few people had access to because they had done it before. Now even junior engineers will have Google Fellow level knowledge and the usefulness of these interviews as filters will decrease. I wonder what other types of interviews they'll come up with?
Yes and no. Certainly some truth to this. But if it plays a role in forcing companies to modernize their interview process, then I'm all for it.
@@hello_interview that's interesting, I'd love to hear your thoughts on how the interview process could be modernized
With CDC between the cache and the DB, does that mean that to get the leaderboard, we need to query the DB first? How about also having a separate leaderboard service that pulls from the cache?
Is there a security risk on the queue before the submission is parsed in the Docker container? What if a malicious actor wants to take down the queue? It seems to me that the reason we went with Docker containers is that even if some rogue code takes down the container, worst case we just error out and spin up a new one. Are there similar considerations in Kafka/SQS, or for that matter in S3? It seems to me that any of these services could be compromised by naughty code.
I have some questions regarding the database. Sometimes I struggle to understand which database would be better, and it always comes down to two keys:
1. Size
2. Query complexity
How do you decide on the DB based on those two points? Something like 'we'll deal with 100GB, so maybe NoSQL' or 'we're doing a lot of complex joins, so SQL'? And what do we do if we have complex queries in a huge database?
Great video! Can you let me know why you didn't mention the consistency issue between SQS write and DB write for each submission but did mention the consistency issue between the Redis write and the DB write for each score? Is it less common or unnecessary to discuss the consistency between the message queue and DB during the interviews? I am looking forward to your reply. Thank you!
Great video! What if there are multiple submissions to a problem? That would change the leaderboard query. What do you think of this? select userId, count(*) as passedSubmissions, MIN(submittedAt) as lastSubmissionTime from submissions where problemId in (select problem from submissions group by problem where competitionId={competition_id} AND userId={user_id} AND passed = True order by lastSubmissionTime ASC )
Is the above query correct?
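Not quite as written, I think. One hedged sketch of a query that counts each problem only once, using SQLite via Python just to make it runnable; the column names follow the video's schema, and the exact tie-break semantics (time the last problem was first solved, ascending) are my assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE submissions (
    id TEXT PRIMARY KEY, userId TEXT, competitionId TEXT,
    problemId TEXT, passed INTEGER, submittedAt INTEGER)""")
conn.executemany("INSERT INTO submissions VALUES (?,?,?,?,?,?)", [
    ("s1", "u1", "c1", "p1", 1, 100),
    ("s2", "u1", "c1", "p1", 1, 150),  # resubmission: must not count twice
    ("s3", "u1", "c1", "p2", 1, 200),
    ("s4", "u2", "c1", "p1", 1, 50),
])
rows = conn.execute("""
    SELECT userId, COUNT(*) AS passedProblems, MAX(firstPassedAt) AS lastSolveTime
    FROM (
        -- collapse resubmissions: first passing attempt per (user, problem)
        SELECT userId, problemId, MIN(submittedAt) AS firstPassedAt
        FROM submissions
        WHERE competitionId = 'c1' AND passed = 1
        GROUP BY userId, problemId
    )
    GROUP BY userId
    ORDER BY passedProblems DESC, lastSolveTime ASC
""").fetchall()
# rows == [('u1', 2, 200), ('u2', 1, 50)]
```

The inner GROUP BY deduplicates per problem, and the outer MAX picks the moment each user finished their last problem, which is what the ascending tie-break sorts on.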
Another great video! I just want to clarify one thing: I get that the Docker containers are running on EC2 instances, but is the worker running on the same EC2 instances and using a Docker library API to run the submission after reading from the SQS queue? And if the worker has to wait up to 5 seconds for each submission to get the result, would we need close to 20K workers if there are close to 100K submissions within 5 seconds towards the end of the competition?
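For the sizing question, a back-of-envelope check using Little's law (concurrency = arrival rate × service time); the numbers below are the comment's assumptions, not measured figures:

```python
# Rough worker sizing for the end-of-competition burst.
submissions = 100_000
window_seconds = 5       # assumed burst window near the competition's end
service_seconds = 5      # assumed worst-case time to run one submission
arrival_rate = submissions / window_seconds          # 20,000 submissions/s
concurrent_runs = arrival_rate * service_seconds     # 100,000 runs in flight
```

If each worker handles one run at a time, that in-flight number is the slot count you'd need to keep latency flat, which is one reason the design buffers the burst in a queue instead of holding that much capacity warm.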
For the worker that processes submissions from the queue, how does it know whether there is enough capacity in the container service to run the code?
The polling endpoint passes submissionId as a path parameter, but how would we know which submissionId to request for if the submit solution endpoint is just returning 200 status code. The submissionId would only be available once the submission record is created in the database, right?
Yeah, it should be returning a Partial where at least the submissionId is present.
This is really interesting. I recently started learning system design on your channel. I'd say you explain things very clearly; elsewhere, people just confuse us while designing a system, and at the end of the design, when I tried to summarize it myself, it was very confusing. After going through your HLDs I am now confident. But could you also start putting out LLDs as well?
Great video! I did have a small doubt about how we would handle ranking users along with tracking the time they took to solve problems, similar to LeetCode. One approach could be to use sorted sets for ranking and store problem completion times in a Redis hash for each user per contest, with problem IDs as keys. My concern is whether this would lead to the N+1 problem when fetching the leaderboard, or if Redis is fast enough that it wouldn't significantly impact latency. What do you think?
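To make the N+1 concern concrete, here's a plain-Python stand-in (dicts playing the role of the sorted set and the hashes): the key is to fetch the page from the sorted set once, then batch all the per-user time lookups into a single round trip (e.g. one pipelined batch of HGETs in Redis) instead of one network call per user.

```python
# Stand-ins: `scores` for the Redis sorted set, `finish_times` for the
# per-user completion data (flattened to one dict here for simplicity).
scores = {"alice": 3, "bob": 5, "carol": 4}
finish_times = {"alice": 1200, "bob": 900, "carol": 1500}

def top_n_with_times(n: int):
    # One "ZREVRANGE" for the page of top users...
    page = sorted(scores, key=scores.get, reverse=True)[:n]
    # ...then one batched lookup for everyone on the page. In Redis this
    # would be a single pipelined round trip, so latency stays at two
    # round trips per page regardless of page size.
    return [(user, finish_times[user]) for user in page]
```

So the access pattern is N+1 only if implemented naively; pipelining keeps it at a constant number of round trips per page.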
I have 2 questions and would be most grateful if you could answer them:
1) Is there a reason you drew the API Gateway component in the TicketMaster example but not here?
2) Is there a reason you mentioned a CRUD Service in the TicketMaster example but not in this one?
What about the worker after the SQS queue? Can it auto-scale based on how many messages are in the SQS queue?
Thank you bro. Seriously.
🫡
Would love to know your take on webhooks vs. polling for getting the submission status. Sounds like the result would be equivalent, and it would lower the number of requests the server receives?
Way more overhead to manage and room for things to go wrong.
you mean websocket or webhook?
Thanks for these videos, they're really super helpful. Just curious: when you say you'd want a mid-level or senior to pick up on certain things, is that always proactive, or can it be in response to interviewer prompts? For example, in this one, would you expect a mid-level to have noticed on their own that the get-leaderboard query is slow, or could they still pass if you had to nudge them ('what do you think about this query?' and/or 'how can we improve it?') and they came up with a good solution as you outlined?
Should we discuss managing containers in the interview? Will users share containers or will each submission have a dedicated container? Also, will the containers remain active continuously?
Maybe it's worth introducing a queue for submission results, and another microservice that reads from this queue and updates the leaderboard. That way the main service isn't overloaded with unneeded scoreboard calculation.
Def reasonable if we find that it can’t handle the load.
Do you think long polling is better than fixed-rate polling here for the submission part?
No strong preference. Either works.
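For anyone comparing the two, a minimal fixed-rate polling sketch; the `get_status` callable stands in for the GET /problem/{problemId}/{submissionId} call from the video, and the interval/timeout values are illustrative:

```python
import time

def poll_submission(get_status, interval: float = 1.0, timeout: float = 10.0) -> str:
    # Poll at a fixed interval until a terminal state or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("passed", "failed", "error"):
            return status
        time.sleep(interval)
    return "timeout"
```

Long polling just moves the waiting to the server side: the client collapses to a single request with a long timeout, at the cost of the server holding the connection open.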
Thanks, very well explained!!
I do have a question: why would you use competitionId as the PK for the submission table? Wouldn't that make querying a specific submission very slow? Also, SQS is in memory, right? What if that goes down? Will submissions be dropped? Thanks
Yah, this was a silly mistake. The submission table PK should be id, and then we need an index on competitionId. Good callout. As for SQS, no one would wait more than 5s anyway, so if it goes down and we lose them, it is what it is. The client shows an error saying they need to resubmit. They aren't going to wait 5 minutes for a response in any case.
@@hello_interview got it. Thanks
The SQL query would not work as is, because I can do 100 submissions for the same problem and win the competition. We would need to make sure that submissions are only counted once per problem.
Thank you! Do you need to use REST API notation? Can you just do methods?
Depends on the company and level, but usually method signatures are fine
@hello_interview meta principle engineer
Instead of an SQL query, I think having a Leaderboard table would be good: have user_id, competition_id, and points as columns, and when a user submits all the questions, calculate the points and store them here. Since things get written once and read many times, fetching the points would be the better way; just pass competition_id and sort on points, and you'll have your results.
Think of 100k users trying to see the leaderboard and that complex query getting executed for each one of them.
If I'm not mistaken, AWS CloudFormation gives you the ability to set up auto-scaling rules and policies in numerous ways. It also includes a warm-up period, so I don't think a queue is needed here. See ec2-auto-scaling-default-instance-warmup in the EC2 user guide.
Lots of clouds offer autoscaling, but generally speaking they are not quick to respond so you'll probably want a queue even if it's just to handle the first 30-90 seconds, unless your system can tolerate a spike in dropped requests.
@@hello_interview thnx :) good to know
Thanks for this video and all the other ones! I love your content. One thing I failed to understand in this video is where are the Docker containers (Language code runtime services) actually running ? Looking at your diagram, it seems like they are not running on the Primary Server. But they are Docker Containers. Their images must be pulled onto some server and served there such that the Workers can actually talk to them. Please explain that if possible. Thanks again and keep rocking!
Something like aws ECS :)
@@hello_interview Do you mean to say that the Workers are running forever as EC2 instances, and that these worker instances were initially launched with language specific docker images? But how would they auto-scale? Kubernetes orchestration perhaps?
Sorry I'm struggling to understand where these docker containers are actually running; and what's the real benefit of not using Lambdas in this design and how this design is actually solving the cold-start problem of Lambda functions?
@@sahajarora2162 If you study ECS you might be able to answer these questions for yourself
Great video as always!! Just one doubt: is it right to make competitionId the primary key for the Submission table, given that it will be NULL when a submission is not associated with a competition, and that multiple submissions will have the same competitionId?
I might be missing something here. Please let me know your thoughts on this.
Yah, this was a silly mistake. The submission table PK should be id, and then we need an index on competitionId. Good callout.
Excellent work. How much do the videos differ from the written articles on your website?
Only slightly in general content. But the videos may have longer explanations in places and more commentary
Good content, but I found the justification for CDC over a cron job hand-wavey. How specifically would a cron job have a higher infrastructure cost as you claimed? I also don’t think CDC is necessarily more reliable. I work at a big tech company and we’ve had three downtime events for our in-house CDC in the last two months.
Also: the Kubernetes setting of “concurrencyPolicy: Forbid” would prevent the problem of too-frequent job runs you mentioned. And network isolation and no host machine system calls are the default behavior of a Docker container. The latter might not be possible at all.
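For reference, the setting mentioned lives on the CronJob spec; a minimal, illustrative fragment (names and schedule are made up):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: leaderboard-refresh        # illustrative name
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid        # skip this run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: refresh
              image: example/leaderboard-refresh:latest   # illustrative image
```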
The MAX submission time seems fine, nothing wrong there: we rank ascending on each user's last submission time for the last problem they finished.
I am curious what would you suggest for career development between Midlevel to Senior to Staff.
I know there is a useful breakdown in terms of interview expectation, but I'm thinking more from the perspective of internal promotions.
Do you have any useful tips?
Join our Q&A on Thursday, maybe we'll have time to talk about this.
@@hello_interview is this a livestream on UA-cam? Or on the Hello Interview website?
@@duncanwycliffe4002 On YT! ua-cam.com/users/live7SyaOty3rjk
But go to our website anyways, it's good :)
Where is the data flow part? And how do you choose the type of database, SQL or NoSQL?
www.hellointerview.com/learn/system-design/in-a-hurry/delivery :)
@@hello_interview Thank you so much! Can I use this process for developing an MVP, or is it just for interviews?
You are doing system design videos like how real designs are done inside Amazon - Backend SDE at Amazon.
Could you share what whiteboard tool you're using?
Excalidraw. It's really basic and free, but enough for interviews, and there aren't too many features to confuse you.
Thanks for the assist!
The board is also linked in the description.
Awesome
I call LeetCode the bane of my existence
LeetCode, Online Judge, Online Coding Competition, Bane of My Existence. Has a ring to it
Love from India
Hello Interview ❤️ India
Stopping video after 2 mins. The font is totally unreadable.
I’m sorry
@hello_interview no worries. Thanks for the acknowledgement
Yeah, deffo a dislike. How can you give a presentation with such a terrible font?
I think it is a bad design
Any reason for going with a primary monolithic service rather than microservices, like in the other SD keys from Hello Interview?
Would love to see Messaging service like whatsapp or messenger next.
Yah, nothing needed to scale independently here beyond the containers, so I opted for the single service