Get 1-on-1 coaching to ace your system design interview: igotanoffer.com/en/interview-coaching/type/system-design-interview?UA-cam&
I think that this is the best IGotAnOffer video so far. Please bring in Alex for another one - perhaps to design Google maps? Thanks.
Wow thanks yes I've already asked him to do another one, so watch this space!
@31:00 I believe the rationale for adopting SQL to maintain consistency while referencing the CAP theorem is misleading. The concept of consistency in the CAP theorem differs significantly from the consistency defined in ACID. Consistency in CAP means "every read should see the latest write", whereas consistency in ACID ensures that your DB always transitions from one valid state to another valid state.
Yeah, you are right. A follower/read replica that is consistent in terms of ACID can be in an inconsistent state in terms of CAP, simply because it hasn't yet been updated from the main/write replica.
He meant consistency in the CAP theorem.
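To make the CAP-vs-ACID distinction above concrete, here is a minimal sketch of a primary with an asynchronous read replica. The `Primary` and `Replica` classes and the `file_42.owner` key are made up for illustration; no real database works exactly like this. Each copy only ever holds valid data, yet a read from the replica can miss the latest write until replication catches up.

```python
class Primary:
    """Toy primary database: applies writes and keeps a log for replicas."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Toy read replica that replays the primary's log asynchronously."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries replayed so far

    def read(self, key):
        return self.data.get(key)

    def catch_up(self):
        # Replay every log entry that has not been applied yet.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


primary = Primary()
replica = Replica(primary)

primary.write("file_42.owner", "alice")
stale = replica.read("file_42.owner")  # replica has not replayed the write yet
replica.catch_up()
fresh = replica.read("file_42.owner")  # now matches the primary
```

Here `stale` is `None` and `fresh` is `"alice"`: the replica never held invalid data, it just lagged behind the latest write.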
Great question asked about compression at 20:55 with a well-structured answer.
This candidate has real life experience and it shows in the interview. He starts out simple and build on top of it. I love it.
Great video, thanks. One thing I'm really missing is some sort of judgement: what went well, what was not ideal. I see that the DB design wasn't really well thought out, or maybe it's just me. Sorting such things out as a conclusion to the video would be of great value to those who watch these videos!
Very useful - Simple, Clear, no hurry, flow is really good
I'm genuinely happy to discover this beautiful channel. This was very insightful. Thank you and keep sharing.
Good video! I think there is a miscalculation: the total storage for 100 million users is around 1,500 PB, not 1.5 PB.
not so important
Oh, no. It is uber important. 1.5 PB is one large storage account. 1.5 EB is a totally different scale of algorithms and data storage. @@moneychutney
Yes, I noticed that too: 100 million users at 15 GB per user = 100 * 10^6 * 15 * 10^9 bytes = 1500 * 10^15 bytes, or 1500 PB.
Not important in system design interview tbh.
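For anyone who wants to double-check the storage estimate discussed in this thread, here is the arithmetic as a short script. It assumes decimal units (1 GB = 10^9 bytes) and that every user fills the full 15 GB quota, which is the worst case rather than a realistic average.

```python
USERS = 100_000_000   # 100 million users (figure from the video)
GB_PER_USER = 15      # per-user quota in GB (figure from the video)

GB = 10**9            # decimal units: 1 GB = 10^9 bytes
PB = 10**15
EB = 10**18

total_bytes = USERS * GB_PER_USER * GB
total_pb = total_bytes / PB
total_eb = total_bytes / EB

print(f"{total_pb:,.0f} PB = {total_eb} EB")  # prints "1,500 PB = 1.5 EB"
```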
It's a great video, but I think we missed 2 important things:
1) File permissions were missing from the schema design, and they are a "must have" for any file sharing system.
2) For very large files, how can upload and download be optimized to save network bandwidth instead of just redirecting to S3?
Please take these inputs positively and keep sharing such videos.🙂
2) I think it would be too much detail to cover how to optimize the upload and download flow. Using pre-signed URLs for upload and download would be enough for this case; S3 will handle the rest.
Further, there's probably skew between read and write operations here... the designer did not explore that in the design.
Great video. The way he approaches depth shows that he is very strong
Glad you think so!
I think the main requirement of a file-sharing system is how edits are handled: an edit to a file should not sync the whole file across devices, just the data chunk that was edited. Without this requirement, it's the same as any other design with models and data floating around. Overall it was a great design interview. But one question I have across all the design interviews is about the math performed at the beginning (number of users, traffic, QPS, etc.): how is it even used?
Also wondering about the purpose of the number calculations if they're never used.
Is that bit about partitions in S3 accurate? S3 uses a key-based structure where each object is stored with a unique key. The key can include slashes ("/") to create a hierarchy, effectively mimicking a folder structure but there aren't any actual folders in S3; it's all based on the keys you assign to your objects.
So, what does he mean by splitting into more folders when they become too large?
AWS S3 (and other cloud blob storage services) is flat storage. The console UI shows folder structures, but they are just extracted from the file path (/{folder}/..).
The only thing that may suggest adding a DB field/table is storing and tracking the available folders per user to improve performance, as listing the folders directly from the SDK/API means you fetch all the blobs and then extract the folder structure from them!
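A small sketch of how a "folder" view can be derived from flat object keys, as described above. The `list_folder` helper and the sample keys are hypothetical; the sketch works on an in-memory list instead of calling any storage API. For what it's worth, S3's list API also accepts a prefix and a delimiter and returns the matching common prefixes, which is the same idea done server-side.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Return (subfolders, files) directly under `prefix` from a flat key list."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything with a further delimiter belongs to a "subfolder".
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            files.append(key)
    return sorted(folders), sorted(files)


# Hypothetical object keys; there are no real folders, only key strings.
keys = [
    "alice/photos/2023/beach.jpg",
    "alice/photos/cat.png",
    "alice/notes.txt",
    "bob/report.pdf",
]

print(list_folder(keys, "alice/"))
print(list_folder(keys, "alice/photos/"))
```

Keeping a per-user folder table in the metadata DB, as suggested above, would avoid scanning keys at all for the common "show my folders" case.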
I am really not able to understand the file upload use case in a realistic manner. Initially the client sent a POST: File /filemetadata to the server. In response, the server sent a pre-validated S3 storage URL as a redirect, which the client followed to write the file directly into the S3 bucket. The S3 service then responded to the client with the file location in the bucket. Now my question is: how will the server store this file location metadata in its MySQL DB table? Will the client make one more request to POST this metadata to the server?
Seemed legit to me for the most part, but why would you use a CDN? Unless there are lots of users with certain big files that are the exact same, what benefit would a CDN provide?
Since the video mainly used AWS, I'll use CloudFront as an example so I can be specific. In addition to some security benefits, CloudFront operates over AWS's private global network, which is typically faster and more reliable than the public internet backbone.
I think a CDN can also be location-optimized. So a person living in Brazil can fetch data from a CDN node located near Brazil instead of one located in the US, making it more efficient.
A CDN is not only used for caching! It can also help defend against DDoS attacks. Also, traffic from the CDN to services like S3 or EC2 is carried over a private network or backbone, so your responses are faster.
What about chunking files for upload and using fingerprinting for data integrity? I think that was totally missed. And what about the sync service? You have offloaded it to notifications and the client.
If the data is compressed on the client side, I wonder who will divide the data into blocks and send it block by block. Sending blocks has the advantage of enabling deduplication. Thoughts?
Maybe it makes sense to chunk on the client side itself. Once the compressed chunks are uploaded to storage, the metadata DB can be updated.
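A rough sketch of the client-side chunking and fingerprinting idea from this thread. It assumes fixed-size chunks, SHA-256 as the fingerprint, and a plain dict standing in for block storage; the function names are made up, and compression is left out to keep it short. A chunk is only "sent" when its fingerprint is not already stored, so an edit that touches one chunk uploads one chunk.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the demo is easy to follow


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprint(chunk):
    # Content hash: identical chunks always get the same fingerprint.
    return hashlib.sha256(chunk).hexdigest()


def upload(data, store):
    """Store only unseen chunks; return (manifest, number of chunks sent)."""
    manifest, sent = [], 0
    for chunk in split_into_chunks(data):
        digest = fingerprint(chunk)
        if digest not in store:
            store[digest] = chunk
            sent += 1
        manifest.append(digest)
    return manifest, sent


def download(manifest, store):
    """Rebuild a file from its ordered list of chunk fingerprints."""
    return b"".join(store[digest] for digest in manifest)


store = {}  # hypothetical stand-in for block storage
v1, sent_v1 = upload(b"aaaabbbbccccdddd", store)
v2, sent_v2 = upload(b"aaaabbbbXXXXdddd", store)  # one chunk edited
```

The first upload sends all 4 chunks, the second sends only the edited one. The manifest (the ordered list of fingerprints) is what would go into the metadata DB, and the same fingerprints can be re-checked after download for integrity.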
Excellent presentation. Did you also need a file/folder listing feature that the client launches at startup? Also, if the storage is hierarchical like Dropbox, you need the ability to create folders.
Indeed a great video. Everything from the rough calculations to being communicative with the customer was great 🎉
Well, from how I see it, metadata is more suitable for a NoSQL setup,
because then the file structure can be stored in the metadata as lightweight storage for syncing new devices of the user,
and the JSON structure makes it possible to store a nested folder structure.
So we have the tables: user_details (personal details + meta), user_login, user_folders, and file_meta (with versions).
Also, NoSQL is fine as we do not have large volumes of concurrent requests updating the same keys,
so locking and transaction atomicity are not an issue.
Also, reads are easily handled.
It is not safe to store AWS credentials on the frontend (client/browser) for direct S3 uploads. We can use the S3 multipart API, but it has to be done through the API server. Correct?
You will not be storing AWS creds on the client machine. He mentioned that the client gets a signed URL for S3 via the API server, and that URL has a TTL.
@@resistancet8 the question is about uploads. You're right about the download, though.
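To illustrate why a signed URL with a TTL keeps credentials off the client, here is a toy sketch of the idea. This is not AWS Signature Version 4 and not how S3 builds its URLs: the HMAC scheme, `SECRET_KEY`, the `storage.example.com` host and both function names are assumptions made up for the example. The point is only that the client receives a URL that is valid for one object and a limited time, while the secret stays server-side.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"server-side-secret"  # hypothetical; never sent to the client


def _sign(method, path, expires):
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_upload_url(object_key, ttl_seconds, now=None):
    """API server side: issue a URL that allows one PUT until it expires."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    path = f"/{object_key}"
    query = urlencode({"expires": expires, "signature": _sign("PUT", path, expires)})
    return f"https://storage.example.com{path}?{query}"


def verify_upload_url(url, now=None):
    """Storage side: accept the upload only if the URL is unexpired and untampered."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    return hmac.compare_digest(signature, _sign("PUT", parsed.path, expires))


url = make_upload_url("user-1/report.pdf", ttl_seconds=900, now=1_000)
```

With these inputs the URL verifies at `now=1_500`, fails at `now=2_000` (past the TTL), and fails if the object key in the URL is changed.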
I just wonder, if those calculations were not done then would there be any change in the design presented?
Maybe the best system design tutorial I've ever seen.
Wow, thanks!
It is not OK for the client to tell the API server that the upload is done. The client may be unable to report the finished upload due to a network outage, so your metadata database will then be out of sync. You will want to keep that logic on the server.
Agreed x1000
And how exactly would that be implemented? I am curious
@@adityasanthosh702
Trigger some job (or another async workload) that does the upload and reports to the app server via some pub/sub mechanism.
You can also report the status to the client in the meantime.
What’s the drawing tool used in this video?
it's Figma
Great video
I don't quite understand the workflow. Shouldn't the server itself interact with S3 to put the data in?
In S3 there’s the concept of presigned URLs. The API server creates an “upload URL” for S3 and returns it to the client, and the client uploads directly to S3. This cuts out the middleman and consequently the bandwidth on the API server (along with enabling things like resumable uploads).
Another interesting flaw in the design is that he’s making the client self-report the status of the upload; if that request fails, we end up with “zombie” files. A more robust pattern would be to have a Lambda function run on S3 upload success. Another important thing is to add debouncing to notifications; otherwise, if the user uploads 10 small files, they will receive 10 notifications instead of just 1.
@@zaneturner5376 this comes with a big trade-off on resuming uploads. If the upload time is longer than the TTL, the file will get deleted, which means the user has to restart the upload from the beginning. Race conditions can also occur if the file is being deleted as you are self-reporting “finished upload”. In this case the server is going to think the file exists while it points to a nonexistent file in S3. This race condition is unlikely since it requires precise timing, but given the number of users/uploads, it’s bound to happen. Self-reporting is almost always a bad idea except in some rare situations.
@@zaneturner5376 man, you can see that any TTL can be broken (think of a 3G user uploading a 1TB file as an example; the main non-functional requirement was *trust* and ease of use). You can argue all day that a TTL is the right way, but the tradeoffs are just not worth it. Rare race conditions are always hit at a scale of millions of users and tens of millions of uploads. Also remember an engineer’s best rule, Murphy’s law: anything that can go wrong will go wrong.
Lambdas are extremely cheap to run when not consuming any significant RAM/compute, e.g. doing a task like an API call. Plus, using them would be the only way to guarantee not hitting an edge case in uploads (a 0% probability of a problem happening is significantly better than a 0.0001% probability when dealing with large-scale systems where people pay money to have their data in your hands).
Decisions like this are the ones that separate a reliable engineer from a flimsy one. The devil is in the details.
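For reference, a minimal sketch of what the "function runs on S3 upload success" pattern from this thread could look like. The handler reads the bucket name, object key and size from an S3 event notification record; `FILE_METADATA` is a hypothetical in-memory stand-in for the metadata database, and the function name and sample event are made up. Object keys arrive URL-encoded in these events, hence the `unquote_plus`.

```python
from urllib.parse import unquote_plus

# Hypothetical stand-in for the metadata database.
FILE_METADATA = {}


def handle_s3_event(event, context=None):
    """Mark each newly created object as uploaded in the metadata store."""
    recorded = []
    for record in event.get("Records", []):
        # Ignore anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        FILE_METADATA[(bucket, key)] = {
            "size": record["s3"]["object"].get("size"),
            "status": "uploaded",
        }
        recorded.append(key)
    return recorded


# Trimmed-down sample event with only the fields the handler reads.
sample_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "user-files"},
            "object": {"key": "user-1/my+report.pdf", "size": 1024},
        },
    }]
}

print(handle_s3_event(sample_event))
```

This way the metadata update is driven by the storage layer reporting success rather than by the client, so a client that drops off the network right after uploading does not leave the database out of sync.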
where is the cache on client side?
what tool are they using for design?
Some data inconsistency issues between DB and Queue.
For 100M users, each with 15 GB of storage space, shouldn't the total storage be 1.5 exabytes? Explanation: 100,000,000 * 15 GB / (1000 * 1000) = 1,500 PB = 1.5 EB.
Yeah, I calculated the same thing. That is why the back-of-the-napkin/envelope math is dangerous; I skip it altogether, since if you get it wrong it is a big fail. What's obvious is that this system must scale horizontally and distribute the load. Who cares how many servers are required? It's not even in the scope of the interview.
21:00 amazing question
Can you send the Figma template, please?
Great video!!!
I think putting the login/auth on the client side is not recommended for security reasons. It was one of the points my interviewer didn't seem happy about... and since that was for the security team, I think it cost me the offer.
always have an impression that the interviewer is trying really hard not to fall asleep 😂
Best video so far, still "not hired".
Which app is used to draw the diagram?
It's the free version of Figma; it's called "FigJam", I think. Very fun to use!
I think you missed out on synchronization.
I think notification part does the sync part?
Don't we need to handle the scenario in which S3 is not there? Having S3 in place directly neglects how upload and download actually work behind the scenes, and that's the problem Dropbox or any file management service solved. So I believe the discussion should have been about that part, instead of directly using S3 and abstracting away the entire upload and download path.
Couple of notes:
- back-of-the-envelope calculations were not utilised at all
- the notifications part was not covered at all
- the interviewee is pretty much into AWS stuff and goes too much into its specifics
- the upload endpoint should return a file id plus some additional metadata about the file, probably also a 201 response
- also, the API design didn't really correspond to what he described in the end. If we upload directly to S3, the flow should be as follows:
a) client calls the API server to get an upload link
b) client uploads the file directly to S3 using the link
c) once the upload is finished, the client gets some file id, which it sends to another API server endpoint to record that the file was actually uploaded. Or maybe there is a way for S3 itself to notify the API server, idk about that.
- there should be an endpoint to get info about all the files in the cloud
- compression on web or mobile is probably not a great idea; compressing a 10 GB file will eat a ton of battery
Overall, I'd say this system design lacks quite some depth.
10 GB won't be compressed in its entirety. It will be split into chunks, and client processes/threads compress the chunks in parallel.
Regarding notifying the server when a client has finished downloading/uploading, I do not think that info needs to be saved by the server. The server's job is to store the latest files and provide links when clients request them. If a client's file download is unsuccessful, the server can assume that it's the responsibility of the client to request new changes.
Same with conflict resolution. Instead of the server resolving conflicts, it can ask clients "hey, some other client changed the same file. Which one is the latest?"
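As a rough illustration of the "split into chunks and compress in parallel" point in this reply, here is a small sketch using only the standard library. The 1 MiB chunk size, the worker count, the choice of zlib and the function names are all assumptions for the example; a real client would stream from disk rather than hold the whole file in memory.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_chunks(data, chunk_size=1024 * 1024, workers=4):
    """Compress each chunk independently, several chunks at a time."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks(compressed):
    return b"".join(zlib.decompress(chunk) for chunk in compressed)


payload = b"dropbox " * 500_000  # 4,000,000 bytes of very compressible data
compressed = compress_chunks(payload)
restored = decompress_chunks(compressed)
```

Because every chunk is compressed on its own, a failed or interrupted upload only needs to redo the affected chunks, not the whole file. Threads are used here for brevity; a process pool is an alternative if compression turns out to be CPU-bound on the client.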
Alex is fantastic in this video. The interviewer looks like he wants no part of being in this video though.
There is no database on the high-level diagram. Then, in drill-down, Alex jumps to designing DB schema.
I believe the math is incorrect. You must take 100M users * 15 GB to get the total, which is 1,500 PB.
No handling of the limit requirements... not more than 10GB per file, no more than 15GB per user... lacking, too high level
For 100 million users, should it not be 1.5 exabytes?
Exactly, it is 1,500 PB, not 1.5 PB.
I didn't really like how he just drops random stuff on the diagram and leaves it be. I like my diagrams clean, precise and organized.
He looks like Kevin Spacey
all these videos are the same: load balancer with a server behind it and a database. this has no value in the real world, too high level, too basic. a junior developer could come up with this shit after watching 1 udemy course. horribly useless and misleading channel.
This guy is definitely an experienced engineer, but he didn’t prepare very well for this kind of interview; maybe he was a little nervous during it. He tries to use a lot of terms like S3 to sound professional but loses many details on how to design the solution from scratch, such as handling big files. This is a question about designing Dropbox, not a use case for S3. This performance would be rejected for any senior position.
The interviewers blinking is fkn insane. Dude has issues