I haven't even finished the video, and I have to thank you. This has to be the best explanation on chat messages.
one of the best ones on chat services
Thanks for the video! There is a small error in the BOSH and long-polling explanation: BOSH and long polling are the same thing, and what you describe as long polling is regular polling.
yeah ... exactly, the explanation for long polling is wrong in the video.
yup
Tushar, you are a good teacher. Easy to understand. Thank you.
This is the best video to date on chat services. Thank you
Very helpful tutorial. Covers quite a few aspects of the problem. One tutorial I would like is how to estimate the server, request, and memory capacity needed for a design question.
way better than grokking system design and some other chat messaging system design videos! Thank you so much!
Question - Is there a reason why we cannot collapse the conversation table and unread table into one? It would seem to me that we could simply keep a status flag and a timestamp in the conversation table to indicate whether/when a message has been read. That would eliminate the need to read from both the unread and conversation tables when the system tries to resend messages. I can see that one pro of keeping them separate is that it keeps the # of entries in the unread table at a minimum and therefore improves query performance. However, I think this may be offset by the fact that to retrieve an unread message, we would have to query both the unread table and the conversation table. Just a thought.
you really make design and coding simple, very nicely explained.
The video is really useful. I liked it. However, I'd suggest a small change. Instead of having two separate tables for unread and read messages, I'd just create a column ReadFlag in the Conversation table. The boolean value in the column will indicate if the message was read or unread.
It could have been done that way for performance reasons.
Great explanation. It all just fit together in my mind. The teaching style is so nice that everything just slips into the brain.
I am going to use Cassandra because this is my favorite database. Great.. :)
Some very interesting designs there Tushar. Nice! I have some questions and suggestions though.
How can N2 talk to N3 when there is no web socket connection between them? How does N2 know to send N3 a message? Is the communication between N2 and N3 handled at the application level? I think there should be an MQ implementation here, because a server shouldn't be used to send messages to other servers. Do you think it can scale well?
Yes, servers will talk to each other using an MQ.
Sorry for the multiple posts, but I am working on a similar design and I have run into use cases that these did not address. In many cases, the same user A may be connected to our service using two different clients, e.g. desktop and phone. The service needs to handle such scenarios by adding a deviceId or similar to Redis. Also, you need to echo any message you receive to all devices of the users in the conversation: a "hi" sent by user A to user B should go to all devices of user A and user B. This way, a message typed by user A on desktop will show up on his phone as a sent message as well.
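A minimal sketch of this multi-device fan-out, assuming Redis tracks a set of device ids per user (the key names and the send_to_device() helper are hypothetical, not from the video):

```python
import redis

r = redis.Redis()

def register_device(user_id, device_id, server):
    """Track every open connection for a user, one entry per device."""
    r.sadd(f"devices:{user_id}", device_id)
    r.hset(f"device_server:{user_id}", device_id, server)

def fan_out(sender, recipient, message):
    """Echo the message to every device of both participants, so the
    sender's other devices also show the message as sent."""
    for user_id in (sender, recipient):
        for device_id in r.smembers(f"devices:{user_id}"):
            server = r.hget(f"device_server:{user_id}", device_id)
            send_to_device(server, device_id, message)

def send_to_device(server, device_id, message):
    # Placeholder: forward to the node holding this device's websocket.
    ...
```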
Very nice, wonderfully explained. Thanks a lot for sharing it Tushar.
One alternative thought: do you think there could be a bottleneck when a node gets a message from the client, as it is supposed to do two things?
a. Write the message into the DB.
b. Send the message to the right node (the one currently handling the recipient client).
Could this be an issue at a scale like FB Messenger's?
Instead of directly handling the incoming message, what if Node1 just writes the incoming message to an enterprise stream (say Kafka)? The alternative workflow would be
UserA -> Node1 -> ChatStream -> [Node2 picks up because it was listening to this specific topic] Node2 -> UserB
There would be a separate service altogether which picks up the messages and writes them into the Conversation table; similarly, Node2 picks up the message as it was listening to this specific topic and sends it to the recipient client (using WebSocket).
Node1 and Node2 will just communicate with the client and with the ChatStream, nothing else.
It would be good to know your thoughts on this alternative approach.
Note: I can already see the complexity of deciding the topic, because if we create a topic for every user that would be billions of topics; hence we have to group active users or find a better way to create a smaller number of topics, so the stream doesn't become insane (one such scheme is sketched below).
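A rough sketch of the producer side of one such grouping scheme, assuming users are hashed into a fixed number of shard topics (the topic count, names, and crc32 scheme are illustrative assumptions, not from the video):

```python
import zlib
from kafka import KafkaProducer  # kafka-python

NUM_TOPICS = 1024  # fixed shard count instead of one topic per user
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def shard_topic(user_id: str) -> str:
    # Stable hash so every producer maps a user to the same shard topic.
    return f"chat-{zlib.crc32(user_id.encode()) % NUM_TOPICS}"

def publish(recipient: str, text: str) -> None:
    # Keying by recipient keeps one user's messages in one partition,
    # preserving per-recipient ordering within the shard.
    producer.send(shard_topic(recipient), key=recipient.encode(), value=text.encode())

publish("userB", "hi")  # Node1 writes; whichever node serves userB consumes
producer.flush()
```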
It's a fair point. But remember you can add as many nodes (N1, N2) as you like. It's totally horizontally scalable in that sense.
I was about to comment the same approach of using a distributed queue like Kafka or AWS SQS.
I think this design is more in sync with the recommendations for scalability, as we should refrain from allowing two nodes to talk to each other. When node2 is trying to send the message to node3, node3 could have died, the customer could have left and reconnected a websocket somewhere else (potentially moving to node1), etc. Using a chat stream of sorts makes this easy to manage.
Completely agree with you. The approach in the video is not scalable.
@@haseebshahid5148 This solution in the video is highly scalable. A message queue like Kafka would not work for two reasons: (1) messages would be delivered unordered (unless each user has their own shard, which for 1bn users is impossible), and (2) communication would be async, so if user1 wrote a message to user2 there would be some delay (perhaps minuscule, but perhaps not) before user2 sees it. I don't see anything wrong with nodes talking to each other. If N2 needs to communicate with N1 and N1 goes down, it can either return an error to the user (failed to send) or re-query Redis to find the new server managing the user and retry.
Interesting one, you are a good teacher
Very nice
Hey Tushar, first of all thank you very much for these videos, they're invaluable. I have an interview in the coming month that I know for a fact will feature a systems design question, and your series has helped me shape my thinking about design.
Also, if you're looking for further ideas, I think something involving geo-spatial data like Uber would be a fun video to do.
+Netherblood I'm glad this is helping, and believe me, that question is on my list.
very good sir, thank you
Hey Tushar, love your videos man. Important thing: as per this design, we cannot send a notification. (I know notifications are not part of our features yet, but they could be thought of as an important feature to add as the project evolves.)
Explanation:
-> We are tying together "online" and "socket being open", so the websocket is open only when you are online.
-> I think even if your app is closed and you are not online, your websocket should be open.
At 12:50, user A tries to send a message to user C, but user C is not online and can't get a notification. He will only receive the message when he opens the app and a socket connection is established.
Great video, a bonus for the links too.
One more system design question request - design a system where Alexa orders a list of products on Amazon.
The idea of storing sessions is really cool.
Thanks for the video.
BOSH is the same as HTTP long polling; I think your explanation of long polling is incorrect.
You explained polling for long polling.
:)
In the case of WebSockets, in order to route the request to the same machine, load balancers will have to keep a user-machine map in memory. Or they can query Redis, but this needs to be done every single time, even for ping operations, and that would be really heavy!
Some thoughts: did you consider how to keep multiple different devices in sync? Is the device storing state? What if someone has 10 years of messages in the DB and then gets a new phone? How does the DB deal with that read request? What about storms of abnormal queries, like New Year's Eve, when everyone is activating conversations that are a year old and skipping the cache? When does data get tombstoned from Redis so that you can determine that a user is no longer online? Immediately after failed heartbeats? Or is there a job that does it every N hours based on the last heartbeat?
I think a better approach is:
when UserA wants to talk to UserB, A initiates a conversation using a separate Conversation Service which records the participants and returns a Conversation Id.
Then this conversation Id can be used for Redis Pub/Sub.
The servers of both user A and user B subscribe there and start listening.
In this case it doesn't matter at all where each client is connected, and we don't have to keep that weird server-names table.
User A sends a message -> it is stored in the DB and added to Pub/Sub.
The server of user B reads the message from Pub/Sub and sends it to user B.
If the server of user B goes offline, the subscription is lost, same as the user connection. That's fine.
Upon reconnect, the user will send us their last offset and we will deliver unread messages and subscribe to Pub/Sub again.
In a similar way we can think about push notifications.
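A minimal sketch of this Pub/Sub flow with redis-py (the channel naming and the storage/delivery helpers are hypothetical):

```python
import redis

r = redis.Redis()

def send_message(conversation_id, message):
    store_in_db(conversation_id, message)          # persist first, so nothing
    r.publish(f"conv:{conversation_id}", message)  # is lost if nobody listens

def listen(conversation_id):
    """Runs on whichever server currently holds user B's websocket."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"conv:{conversation_id}")
    for event in pubsub.listen():
        if event["type"] == "message":
            deliver_over_websocket(event["data"])

def store_in_db(conversation_id, message):
    ...  # placeholder: the Cassandra write

def deliver_over_websocket(data):
    ...  # placeholder: push down the open socket
```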
Thanks for the great explanation. Question - in the picture-sending flow (23:30), if UserA sends an encrypted picture to N2, how would N2 get a thumbnail from it without looking into it?
It could be another blob URL, or a byte array. It's a design choice.
Images are not encrypted!
@@karthikmucheli7930 If a small image is base64 encoded, couldn't it be encrypted then?
This one deserves a thumb up!
Great job. One of the best design videos I've seen.
Best chat system design I've seen so far! Thank you @Tushar. Would it be possible for you to design how the Facebook notification system works?
Hello Tushar, great explanation. Can you please explain a bit more about the communication between the nodes? Like when N2 receives info from user A, how does it talk to N3? Since all these nodes are behind the load balancer, these are all multiple instances of the same application. Could you detail the internal design in this area?
This video incorrectly describes a design which is highly coupled between the servers. In reality, a sessions microservice is responsible for sending the messages to those servers, which just hold TCP connections. These servers' only job is to send information to the clients they are currently holding. Evidently this is a cleaner architecture which removes the coupling. No systems are designed where it is important that one server talks to another server which does the "same thing".
A question here at 11:43: how will server 2 send data to server 3 or vice versa? Does it require another socket connection between these servers to send/receive a message, or something else?
This is sooooo much better than Grokking
Hi Tushar, it's a great video, thanks, but I have one question: why do we need unread and read tables? Couldn't we add a new column (status) along with the text and save the current status of the message (read or unread)?
Great to see you doing these system design questions!! Keep them coming :)
definitely
Very nice explanation
Video is great no doubt !!
I have a small doubt regarding the user and conversation tables, i.e. from 17:00. I think if user_1 (server_1) becomes friends with user_2 (server_2) and vice versa, the corresponding servers should create respective entries with a unique conversation_id, i.e. columns like
(user | friend | conversation_id)
Entries Like
user_1 | user_2 | 1001
user_2 | user_1 | 1001
Please correct me if there is any trade-off or general contention that may occur.
Tushar, your videos are awesome.
Thanks for the good work, Tushar. One clarification: it is not clear from the video what the definition of a conversation is, and thus of conversationID.
Hi @tushar - please tell me what the partition keys and clustering keys in the Cassandra tables will be.
@Tushar, why can't we use a flag in the conversation table to denote sent/unread/read?
Do we really need read and unread tables here?
same question here
We can
You most certainly can
Thank you Tushar - super interesting talk. One thing I was puzzled with: how can I do a search if the messages are encrypted? Per Facebook, they do not have the applications' private keys.
Messages would be decrypted at our end
Hi Tushar, thanks for the very informative content. My question is: why did you choose Cassandra? Is there any specific use case that Cassandra handles efficiently compared to other databases?
+Rajat Khandelwal great question. I should have clarified that. First, it's distributed and scales well. Second, it's amazing for storing time-series data like we have in this case. It's easy to do a query like "give me everything for a given conversation id since time t5".
Tushar Roy - Coding Made Simple Thanks 😃
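A minimal sketch of that kind of query with the DataStax Python driver (the keyspace, table, and column names are assumptions for illustration):

```python
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("chat")

# Partitioning by conversation_id and clustering by time means this is a
# single-partition scan that returns rows already sorted by time.
t5 = datetime.datetime(2017, 1, 1)
rows = session.execute(
    "SELECT sent_at, sender, body FROM conversation "
    "WHERE conversation_id = %s AND sent_at > %s",
    (1001, t5),
)
for row in rows:
    print(row.sent_at, row.sender, row.body)
```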
Hi @tushar, I just have a problem with chat search: since the chat is encrypted, how does search work on encrypted chat on the server side?
Nice point. Did you get an answer for this?
Thanks for the video, very informative. In your introduction to system design video you mentioned API design. I was hoping that in your next video you would first define the API supported by the design before going into the details.
Hi @Tushar, thanks for clarifying the steps in thinking through the system design process - very helpful for interview prep. You did mention right at the outset that you would not be going into the security aspects of the design... are you planning to make new videos on this topic? I'm wondering: do interviewers still avoid such complicated issues as security during interviews? I'm concerned there are so many layers to consider - between middleware components, in transit, user/device level, coordination of auth/login/session-init, mobile vs web app, ubiquitous handoff between devices/browsers in continuing/rejoining chat sessions, plus storage of session details in cache and sync with Cassandra - there are so many complicated requirements these days... would appreciate your thoughts and advice on how to streamline all these scenarios into a presentation that fits interview time constraints.
Hi Tushar,
At 17:40, you said conversation id is a foreign key in the conversation table and a primary key in the user table? Is that right? And are you writing duplicates into three conversation tables?
One question on the unread table: deletes may be too expensive. We could easily maintain a last-online timestamp for every user, and read from Cassandra everything since that timestamp once the user comes online. Also, I think you can partition received and sent messages in the database, meaning every message is replicated to both the sending and the receiving user. Then searches should be faster.
@TusharRoy -> many aspects are left open. Can you explain:
1) How will N1 communicate with N2 (basically, communication between instances of the same server)?
2) For communicating with offline users, how about considering notifications?
This design seems like it would have a problem with broadcasting chat messages to multiple devices that a user is logged in with. Redis Pub/Sub would be a good solution for broadcasting messages in real-time.
Thanks for the video. I have a doubt at 09:25: when N2 dies and N3 takes over, the session will be invalidated (remember we are using sticky sessions here). The user will be logged out. Can we handle this by storing session info in Redis instead of using sticky sessions?
I guess there would be session replication to handle the case you described. I am just presuming.
I think Cassandra or a NoSQL database is the only way to go. If we have, say, 500 million users and 20 messages from each user per day, that's 10 billion messages a day, which would destroy a standard Postgres database. Cassandra provides replication and sharding pretty much built in, so Cassandra, IMO, is the best choice, which is what you picked.
Isn't the distributed-systems part very critical here? How is the DB going to be sharded, and how do we maintain consistency and speed from the DB? How does the system need to scale out? What you are doing (and I am learning a lot) is showing how the system works, like your algorithm tutorials, which I really appreciate.
Hi Tushar! Thanks a lot for your videos! What about this case:
1. UserA initiates conversation by sending ‘hi’.
2. New item is created in Conversation table.
3. N2 looks at redis and finds N3 to send a message.
4. N2 goes down.
As I see it, in this case we have a lost message. Isn't it better to put a record into the unread table every time, and have node N3 move it to the read table as soon as it is received? (The same workflow as with an offline user.)
Correct!!! As he mentioned there might be many race conditions, but this is an initial design to get started off!!!
Hey!!! I believe that he said if UserB is not online, the message will go into Unread table and when the user comes online, it looks for message in unread table and deletes it...
I think the node can go down even while writing to the DB; it's better to pile messages into a queue first.
Very nice video. Can you please clarify how user B "tells" user A that a message is read? Also, should there be a column in the conversation table to capture the read time, so that this info is available for historical reasons?
Complex but very informative. I was thinking the same about storing a timestamp to know when a user has unread messages or not.
The video is very helpful. Could you please make a video on 1) the system design of a stock exchange, and 2) the system design of payment gateways like PayU? Thanks.
superb video man. hats off to you
Good explanation. Searching chat is something I'd like to know more about: suppose there are a ton of messages from a lot of users in general; bringing them all to the client, decrypting them, and then searching on the client is not efficient.
Excellent video. Loved it.
Thank you Sir. This is a great explanation for beginners like me.
I think we could discuss more about the concurrency and fault-tolerance aspects. It would be great to cover how a chat system is essentially a multi-leader distributed system, with each chat thread being like a table updated concurrently by two or more chat participants. Concurrency for the Redis cache / Cassandra would inevitably bring up things like partitioning/sharding strategies for the database.
There could be more discussion on blocking users and monetisation aspects, including push notifications.
The recently implemented feature of businesses using the Messenger platform to push ads and reach consumers is very innovative. In these cases there are broadcast messages, and they can be dealt with in different ways.
Chat bots are one more interesting thing coming up with Messenger.
Overall I believe this video is very good and interesting, and is helping me a lot to understand how to tackle system design interviews. Thanks a lot for keeping these videos coming..!
+Anish Chakraborty good feedback. Cassandra does inherent partitioning based on the partition key, which in this case is user ids and conversation id. The reason I stayed away from that was to keep things somewhat simple. Your observations are very accurate and you are ready to take this to the next level.
How is BOSH different from the HTTP long polling that you outlined? Based on your description, they seem the same. Can you elaborate more on this?
Hey @Tushar, how about this approach, where we maintain only one table called the conversations table, with one field being the state (READ, UNREAD, etc.)? For example, let's say we have a table with the following schema: (sender, receiver, encrypted msg, key, time, state, ...). When someone opens a new chat window, the server makes a call to the DB and selects all conversations between the two users where time > t, or maybe keeps an upper bound on the number of messages, sorted by time. Do you see any bottlenecks here?
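A minimal sketch of this single-table variant (schema and names are assumptions; the point is that marking a message read becomes an in-place update rather than a move between two tables):

```python
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("chat")

session.execute("""
    CREATE TABLE IF NOT EXISTS conversations (
        conversation_id bigint,
        sent_at timestamp,
        sender text,
        receiver text,
        encrypted_msg blob,
        state text,                           -- 'READ' or 'UNREAD'
        PRIMARY KEY (conversation_id, sent_at)
    )
""")

# Recipient acknowledges a message: flip its state in place. One caveat: a
# query like "all UNREAD rows for a user" would need a secondary index or a
# separate table, which is the cost the split-table design avoids.
session.execute(
    "UPDATE conversations SET state = 'READ' "
    "WHERE conversation_id = %s AND sent_at = %s",
    (1001, datetime.datetime(2017, 1, 1, 12, 30)),
)
```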
I really like your videos, @Tushar, and they are very comprehensive. Had one suggestion on the messaging design though: can you look into the design aspect of using Kafka or another message service, possibly hosted in a cloud, etc.?
This guy knew about Signal in 2017; we are using it in 2021.
Great Video, we want more people like you.
Awesome. Can you please do more system design?
Thank You Tushar Sir and you are the hope for people like me.
Excellent, Tushar... becoming a hard-core fan of yours :P
BTW: How do N2 and N3 talk to each other? RabbitMQ/Kafka?
+Netherblood either TCP or HTTP
N2, N3 (app-serving nodes) should never talk to each other, as those instances of an app can be in any transition state (going down, coming up, overloaded, etc.); they should act purely as stateless nodes, while state and other required data is maintained in Redis or a Kafka/RabbitMQ kind of pub-sub mechanism.
If N* maintains the connections for users, they are not stateless nodes, and they could build TCP connections to each other. But I think when we have 1000+ servers, we do not let servers connect to each other directly. A proxy is needed in each datacenter.
If you use a queue, how will you make sure the message goes to the target server and is not pulled by the wrong server? For example, in Kafka you can't create a topic for each user. There are billions of users.
@@takshakc This is a very valid comment. Tushar, I loved the video, but the design is lacking a task-queue system for how N2 and N3 will talk to each other and follow up on their jobs in a queue... please let me know your thoughts.
Using websockets, similar to sticky sessions, is less desirable for such a high-traffic system. HTTP long polling is less sexy but can be horizontally scaled much better since we can make it stateless. Each node exposes a webhook for other nodes to post to: "hey, right now UserB is connecting to you and there is indeed a message for UserB." The UserB-to-node addressing can be done by Redis like you proposed here.
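A minimal sketch of that webhook handoff, assuming a hypothetical /deliver endpoint on each node and a Redis key mapping users to nodes:

```python
import redis
import requests

r = redis.Redis()

def forward_message(to_user: str, message: str) -> None:
    # Redis tells us which node currently serves the recipient...
    node = r.get(f"user_node:{to_user}")
    if node is None:
        store_unread(to_user, message)  # placeholder: recipient is offline
        return
    # ...and we hand the message over with a plain HTTP POST, so the sending
    # node never needs a persistent connection to its peers.
    requests.post(
        f"http://{node.decode()}/deliver",
        json={"to": to_user, "message": message},
        timeout=2,
    )

def store_unread(user, message):
    ...  # placeholder: write to the unread table
```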
Great video Tushar, but I have a question about the search mechanism. If we're turning all old messages into blobs, I don't think it's possible to go and search through them afterwards.
It's very doable. Suppose you have 100 MB of data, which is a lot of messages. It takes about 2 s to download that much data within a DC (roughly 50 MB/s), plus another 100 ms for searching and so on. At worst, for most users, results will come back in about 3 seconds. Considering how infrequently search is done, I think that's a reasonable time. To improve performance even more, we could start streaming search results as they are found, without waiting for the entire blob to be downloaded.
Hi. The long polling that you described is more like periodic polling. A long poll needs to be held open by the server.
Thanks Tushar for coming up with this new design series. If possible, can you also make a video on designing a stock application and an Uber-like service? I have seen these questions asked at some of the tech giants in the SF Bay Area.
really good one.
Hm, is it just me, or is allowing servers to directly communicate with each other (like N2 talking to N3) not the best design practice?
What would be some alternatives in this case? Utilizing Kafka and then having the other nodes get the data if they're subscribed to that topic? I feel like coming up with the topic name would be difficult in that situation.
Yes, I have the same doubt. The problem here is N2 may die at some point, and we never know whether N2 received the message or not. I think it's better to persist the message and its status (unread) to the database first, and then have N3 asynchronously find unread messages for its users and send them.
@@TheRafaelnadalrock I think the solution is to make sure user B always reaches the same group of servers, i.e. N3(x-y). A kind of sharding. Now messages can be published via a broker (Kafka, as you mentioned).
@@vikrant133 I was thinking the same, but instead of Kafka use Redis pub/sub: publish to a channel keyed by the receiver's user id, and the subscriber listening on that channel updates the chat.
How can we use XMPP in this? Does it make implementations of features like online/available/typing easier?
It does. But different XMPP server implementations have different segments of the XMPP protocol implemented, so not all XMPP servers support the same features. Then again, I haven't seen XMPP implementations in 5 years, so things might be different now.
Thanks for the video, Tushar. One question: you say the Redis heartbeat time tells when people were last online (could be one week ago?). So if userA wants to send a message to userC, it will see a record in the cache. How does the server know userC is offline, given there will just be an old timestamp of when he was last online?
If the user is online, the heartbeat time would be a recent one, i.e. within the configured heartbeat interval.
Tushar thanks for the Video. It’s awesome as always.
One query: on what basis do we remove user info from Redis?
Good question. There are a few options.
1) Never remove the user. As there is a limited number of users, like 1 billion, we can keep them there permanently if we want.
2) TTL (time-to-live) based removal. Remove the entry if the user has not heartbeated in 1 hour. Very easy to do on a cache like Redis.
3) Let the Redis LRU cache do its removal. It will remove the least-recently-used entries first, which are basically the users with the oldest heartbeats.
As per the current behavior in WhatsApp, by default we can go with option 1: when user A goes offline, user B should still be able to see the last-seen time. If the user has not subscribed to the last-seen option, then remove with option 2 (sketched below).
Also, we need to enable cache writes to disk in case any of the Redis nodes goes down.
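A minimal sketch of option 2, the TTL-based removal (the key layout, value format, and thresholds are assumptions). Each heartbeat re-arms the expiry, so a key that still exists means "heartbeated within the hour", while a missing key means the user can be treated as long offline:

```python
import time
import redis

r = redis.Redis()
HEARTBEAT_TTL = 3600  # seconds: drop users that haven't heartbeated in 1 hour

def on_heartbeat(user_id: str, server: str) -> None:
    # SETEX stores the serving node plus timestamp and (re)arms the expiry.
    r.setex(f"presence:{user_id}", HEARTBEAT_TTL, f"{server}:{int(time.time())}")

def is_online(user_id: str, max_age: int = 30) -> bool:
    value = r.get(f"presence:{user_id}")
    if value is None:
        return False  # key expired: no heartbeat for over an hour
    # Hypothetical "server:timestamp" value format; compare against a much
    # shorter window to decide online vs. merely not-yet-expired.
    _, ts = value.decode().rsplit(":", 1)
    return time.time() - int(ts) < max_age
```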
Nice video
Thanks for the good video! Do you think RethinkDB could be a good option as well?
I did not get the encryption part: how is the encryption key shared between 2 users?
What happens if one of N1/N2/N3 crashes? How do we incorporate resiliency here?
nice job, keep making more. they are very helpful.
Good explanation of the various parts, but I would ask 2 things here.
1. N1, N2, N3 are too tightly coupled; what is the communication protocol here between the servers N1, N2, N3?
2. How does the user get the data from Cassandra? It is not clear here.
@tushar But the purpose of load balancing is defeated if we go for sticky sessions.
Great walkthrough. Would've liked to see where Cassandra fits in.
the tables he mentioned are in cassandra only
There is no need to store fromUser in the Conversation table because you have 2 columns, user1 and user2 (which should act as sender and receiver). Am I wrong?
But how will you know who has sent the message and who is the receiver? Coz there will be an entry in the user table where user1 is B and user2 is A to indicate the friends of user B. So in this case you will need a from_user in the conversation table just to keep a tab on who is sending the messages.
@Tushar What is the row key for the tables: Conversation, Unread & Read?
Hi Tushar, great video thanks for doing this! In your example, for the conversation table, do you mean Cassandra would partition based on conversationID as partition key? Meaning the conversation table is shared and would be accessed by all nodes.
Any conversation will be shown to all users who are in that conversation. I presume this means no user has their own conversation table. Does the conversation table need FromUser and ToUser columns? If not, where do you keep a mapping of a user and all the conversations they are part of? Also, I presume the unread/read tables are partitioned on the recipient user key?
Is the conversation table kept in Cassandra? Isn't the read costly, impacting the speed?
Can't you store it in Redis as a hash key-value?
I see lots of people suggesting the use of messaging infrastructure for communicating between N1 and N2. Thinking in this direction, let's say all nodes communicate via a messaging infrastructure. Just wondering: when N1 publishes into a messaging topic something like {"message": "hello", Destination: userB},
now since all the nodes N1...Nn would have subscribed to the topic, will the logic involve all these nodes receiving the message, checking whether they have a socket connection open with userB's device, and ignoring the message if they don't... (in reality only one node should really get the message)?
This might cause all the different nodes (N1 to Nn) to unnecessarily process the message when only one will take action... would this be a bottleneck?
I thought applying a topic filter saying inform only "if Destination = userB" might help avoid sending the message to the subscriber nodes that don't deal with userB, but how can we create a dynamic filter, as we won't know about this userB at design time?
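One possible answer, sketched under the same assumptions as the producer sketch earlier: hash users into a fixed set of shard topics, and have each node subscribe only to the shards covering the users it currently serves, so most nodes never receive userB's messages at all:

```python
import zlib
from kafka import KafkaConsumer  # kafka-python

NUM_TOPICS = 1024

def shard_topic(user_id: str) -> str:
    # Same stable hash as the producer side, so both agree on the shard.
    return f"chat-{zlib.crc32(user_id.encode()) % NUM_TOPICS}"

# A node holding sockets for these users subscribes to just their shard
# topics instead of one global topic.
my_users = {"userB", "userF"}
consumer = KafkaConsumer(
    *{shard_topic(u) for u in my_users},
    bootstrap_servers="localhost:9092",
)

for record in consumer:
    # A few users share each shard, so we still check the message key (set to
    # the recipient id by the producer), but only a handful of nodes share any
    # given shard rather than all N nodes.
    recipient = record.key.decode() if record.key else None
    if recipient in my_users:
        print("deliver", record.value, "to", recipient)
```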
For the unread table, don't you think you need conversation id to get the message?
Hi Tushar, thanks for the wonderful video. I think tables like unread and read should also contain the message id, so you can tell whether a specific message has been read or not.
Redis: How did you arrive at not more than 10 hosts? What is the memory footprint per user?
Are these all simple tables in the in-memory cache Redis? Can they be called relational tables?
Thanks for the great video, Tushar!
A load balancer will only play a role before the initial HTTP GET (Upgrade) request is made, i.e. before the one and only TCP connection involved in the WebSocket connection is established between the two communication endpoints. Thereafter, the TCP connection stays established and cannot be "redirected" by a network device in between.
Great video! You mentioned only the encryption key is sent and will be reused for the reply; then how would encryption and decryption happen on the FE (device)? I imagine it must be persisted somewhere to be looked up by encryption key?
Tushar, thanks for this awesome video. However, I have a few queries:
1. How can group chat be implemented?
2. How can we use more than one instance of Redis for availability purposes?
3. The server name you stored in Redis: is this for scalability of the websocket connections?
Thanks.
1) In the User table you can have a userId, groupId combination instead of a userA, userB combination. It also stores the conversation_id and encryption key. In addition, we can have a group table for membership info about the group. Does that make sense?
2) Redis Cluster comes out of the box with all that. Nothing special needs to be done. redis.io/topics/cluster-tutorial
3) The server info is to find which server is managing the client.
Sure, it's very helpful, appreciate the help!
Makes sense.