Falling in love with your Content Arpit - Breaking complex topics intro smaller chunks and explaining it like explaining to a beginner is your strength. Keep bringing such content!
feeling proud of myself after watching this i implemented a similar thing a few days back on dynamodb, i found the dynamodb docs helpful where they suggested use of global secondary index for filtering
Dynamodb is eventual consistent and not strongly consistent right? If they needed strong consistency they would want to shift to postgres or something right?
a small doubt, user_id_gsi is stored with user_id when placing an order, what if there are 2 orders from the same user (maybe from different devices or even the same), won't GSI have 2 duplicate entries even if the orders are different
We can have multiple orders for same index and it’s not a primary key… It’s just to index the data… when you fetch ongoing orders of that user you need to fetch both the orders and through with this index it can fetch those 2 quickly.
AFAIK If we are not able to process an SQS message, it goes to DLQ. If SQS is down DLQ will also be down and we will not be able to publish the message there. Great video btw!
Read the blog and they have mentioned same, "When the producer fails, we will store the message in an Amazon Simple Queue Service (SQS) and retry. If the retry also fails, it will be moved to the SQS dead letter queue (DLQ), to be consumed at a later time.". So I think we will not be able to do anything if both Kafka and SQS are down (It might be a very rare event though)
@@architbhatiacodes So, we reduced the impact of missing data state but not zeroed. How would you handle the messages that further change state but the previous state is unapplied bcoz it is in DLQ
Amazing content. Quick question: Updating the table based on timestamp will not be reliable right - in a distributed system we cannot rely on system clocks for ordering messages.
One question: If they are using SQS with DLQ(100% SLA guaranteed by AWS) in data ingestion, what could be the reason of using Kafka in the first place? Why can't they just use SQS(with DLQ) only?
I think Kafka maintains order when compared to queue and only when Kafka is down we will be using SQS… SQS doesn’t maintain order that’s why we have two edge cases to handle upserts and updates with fresh time stamps
There are standard and fifo queue in sqs. Standard have high throughput. Also DLQ is not for the purpose of fallback of Primary queue, instead that is for the consumer that if message fails to successfully consumed X number of times. Then it should be moved to DLQ. I think the explanation for what if SQS is down is not correct.
Hi Arpit, just wanted to notify, I didn't find the blog link of "Grab" in your description. I might have missed it, could you please send the link again if possible.
For Grab SQS is just a fallback option in case Kafka goes down. Given that it charges on demand you won't pay anything if Kafka never goes down. Also, SQS and Kafka don't serve the "exact" same purpose. SQS, or any msgs queue offer better elasticity than msg stream like Kafka. You can scale up and down consumers easily while concurrency in Kafka is limited by the number of partition.
@@AsliEngineering I quite didn't get this part. I can see both kafka and sqs will be working in same method, why would someone need to use both kafka and sqs and why not just sqs (with DLQ)?
@@prerakchoksi2379 SQS plus DLQ would also work but it would limit that one message/event can be processed once. Kafka can have multiple consumer groups allowing you to do multiple things on same message/event.
Thanks very much for the video. This is really helpful in understanding how Grab can handle both large and spiky requests coming during rush hour. I just wonder in our company case, we also need to use the historical data to validate the promotion of customers based on their order history. For example, one promotion is only applicable for first time customer (simplest case). In that case, do we need to use the analytical data to calculate this?
How would they get to know just by looking by the former timestamp that its not the latest one? (1. newer update hasn't came yet. 2. would they query transactional db just for that?)
My qa team checks the orders page(use analytics query which is eventual consistent) immediately after placing order and say bug 😂 I observe this for aws payments page. Myself felt confused as the payments page take 60 seconds to show the payment revord itself...
Thanks for the informative video Arpit. I have a doubt on handling out of order messages at the end of your video. While deciding upon update #1 to be processed first over update #2 using timestamp, how does the consumer know that the message is the oldest one over the other as consumer #1 may have update #1 and consume #2 may have update #2? I would think of versioning the data and make sure the data we update is the next available version of the one present in the analytical database. Is this approach correct?
I got the similar question in a recent interview, question: consider a order mgmt system, with multiple instances same order service responsible for handling updates of orders(on a MySQL db). now three updates arrive in a sequence at t1 < t2 < t3 timestamps, at three different order service instances. Now how do we ensure the updates u1, U2 and u3 are applied in the same sequential order. any thoughts
@@srinish1993 maybe can we send the timestamp along with the data? so that when it is processed by the consumer, it can create SQL query to update the data WHERE updated_at < data.timestamp and id=data.order_id.
After some reading, I discovered that updating data based on timestamps is not a reliable method since there are inconsistencies in the system clock. A more effective approach is to utilize end-to-end partitioning. In this method, all messages related to a specific partition key are written to the same kafka partition. As long as the producer sends these messages in order to kafka, kafka will maintain their order within the partition, although ordering is not maintained across different partitions. This specific partition is then consumed by only a single consumer instance, ensuring that related events are process by the same consumer. for example, suppose we have two messages: t1( create profile 'A' ), t2( update profile 'A' ) The same consumer will receive and process t1 and t2 sequentially. This approach ensures that order is maintained in event-driven architecture. And this approach can also handle concurrency.
@@javeedbasha6088 What about comparing version of the order document ? Because if we keep version then we can avoid the need of kafka , simple SQS will work
Hi Arpit, 1 quick question. So when we do upsert, 1. operation type whether it would be insert / update is always decided based on availability of primary key in database? 2. when we are doing upsert, we always need to provide all mandatory parameters in SQL query?
1. You can define what is the update field/fields in your query or ORM layer. Doesn't need to be primary key. 2. While doing upsert, always provide ALL the parameters, even those that are not mandatory but have a non-null value, because every field will be overwritten.
@@dhruvagarwal7854 Hi Dhruv Thank you for clarifying. So if I understood correctly, upsert can be decided based on any field but that field should be unique in nature. Is this correct? 2nd point I am clear now. it makes sense to provide all parameters as every parameter is going to be overwritten and we won't want any parameter to be lost.
Just a small doubt. As transactional queries are handled synchronously, won't there be any issue while handling huge number(millions) of synchronous writes to db during peak traffic hours.Like there is a possibility that the DB servers can choke while handling them right? BTW loving watching ur videos, great content!
@@AsliEngineering For transaction queries we will be hitting the db server directly right as they need to be synchronous? You said there won't be any messaging queue involved in such cases.Crct me if I am wrong
Thanks for the content Arpit. Have a small doubt, how are we handling the huge spikes? Does dyanmo hot key partioning + lean gsi index do the job? I am assuming the peak duration will last for sometime since order delivery isn't just done in a 10 minute window. So at peak time even the index would start piling up. Would you say using dyanmo is the cost effective solution here? [I am assuming team wanted a cloud native solution and cost effectiveness involved calculating maintenance cost for in house solution]
DDB is cost effective as well as great at balancing the load. It can easily handle a huge load given how it limits the queries to near KV use case. Also, indexing will not be hefty because the index is lean.
Hi Arpit, First of all thanks for the awesome video. I am wondering how they are able to support Strong Consistency over GSI. They classify Get ongoing orders by passenger id query pattern as transactional but they are leveraging GSI to support this query pattern and GSI is eventually consistent and when we are talking about a high TPS system serving millions of transactions per day the error rate scenarios becomes very high in number.
It was okay for them if data was eventually consistent. Grab only needed strong consistency (if any) for this main PK which is orders and not while getting all orders for a pax. So, it was totally fine for them to relax the constraint here. Also practically, the replication lag in GSI is not too high. Assume a lag of 5 seconds (relatively high, AWS says it would be between 0.5 to 2.5 that too across geographies), still after the user has created an order, the user is very likely to remain on order page to live track it or minimize the app. How many of them would even go to the orders page to see all orders. Even if they do, it will take them atleast 2 second to tap that icon and reach the screen. Enough time gap for the updates to catch-up. So, even when GSIs are eventually consistent it is totally okay UX.
@@AsliEngineering Yeah even that is what i was thinking that their use-case might not be of a strong consistency but their article got me confused where they mentioned otherwise. It makes sense if it is purely required from UX perspective.
One pain point of dynamodb is handling pagination during filter and search. It skips the record if it doesn't match the query criteria and we have to run a recursive loop to meet the page limit. Ex. You are running a query on 1000 records with page limit 10 and filter status for in-active users and there are only 2 in-active records and first record is in first row and 2nd record is at 1000th row, now you have to run 100 query iteration in order to get just 2 records As per my understanding this is the biggest disaster on dynamodb. Does anyone have any solutions here? Hi Arpith have you come across this limitation in dynmodb?
There are standard and fifo queue in sqs. Standard have high throughput. Also DLQ is not for the purpose of fallback of Primary queue, instead that is for the consumer that if message fails to successfully consumed X number of times. Then it should be moved to DLQ. I think the explanation for what if SQS is down is not correct.
Generally, I spend my weekends for learning something new and your content helped me a lot, thanks arpit sir 🫡 🔥 I was totally amazed by the Transaction DB part 😎
Falling in love with your Content Arpit - Breaking complex topics intro smaller chunks and explaining it like explaining to a beginner is your strength. Keep bringing such content!
feeling proud of myself after watching this i implemented a similar thing a few days back on dynamodb, i found the dynamodb docs helpful where they suggested use of global secondary index for filtering
seriously your videos are so informative for a software engineer, it is really a gold mine for us please continuing making such amazing videos.
Amazing Arpit. Easy and powerful!
Thank you!
Fantastic design by Grab.. really loved it.. and most importantly thank you for presenting it in such a simplified way. Love your content
Glad you found it interesting and helpful 🙌
Thanks Arpit fr explaining such a brilliant architecture!!
Liking this video half way. Just brilliant explanation. Thanks
Simple and great explanation! Thanks
Implementing GSI and updating the data in OLAP was amazing.
Brilliant architecture. Thanks for explaining.
Amazing analysis...mza aa gya 😃
Awsome from Grab, Now I can use the same concept in interview and tell this will works
Great explanation Arpit💯
Awesome explanation of the concept.
One question Arpit, Since GSI is eventually consistent, would we get a consistent view of Ongoing orders at any point in time?
Seems exciting
Dynamodb is eventual consistent and not strongly consistent right? If they needed strong consistency they would want to shift to postgres or something right?
Thank you so much for this Banger on 1st, 2023.❤
where is the grab doc ??
a small doubt,
user_id_gsi is stored with user_id when placing an order, what if there are 2 orders from the same user (maybe from different devices or even the same), won't GSI have 2 duplicate entries even if the orders are different
We can have multiple orders for same index and it’s not a primary key… It’s just to index the data… when you fetch ongoing orders of that user you need to fetch both the orders and through with this index it can fetch those 2 quickly.
Order Ids are always unique for each user
No practical application for demo ?
Great 👍
Great video. What if DLQ on aws down?
AFAIK If we are not able to process an SQS message, it goes to DLQ. If SQS is down DLQ will also be down and we will not be able to publish the message there.
Great video btw!
Read the blog and they have mentioned same, "When the producer fails, we will store the message in an Amazon Simple Queue Service (SQS) and retry. If the retry also fails, it will be moved to the SQS dead letter queue (DLQ), to be consumed at a later time.".
So I think we will not be able to do anything if both Kafka and SQS are down (It might be a very rare event though)
@@architbhatiacodes So, we reduced the impact of missing data state but not zeroed.
How would you handle the messages that further change state but the previous state is unapplied bcoz it is in DLQ
Amazing content.
Quick question: Updating the table based on timestamp will not be reliable right - in a distributed system we cannot rely on system clocks for ordering messages.
Can you give an example where this scenario could occur? Unable to understand this.
One question: If they are using SQS with DLQ(100% SLA guaranteed by AWS) in data ingestion, what could be the reason of using Kafka in the first place? Why can't they just use SQS(with DLQ) only?
could be due to cost. kafka is open source.
I think Kafka maintains order when compared to queue and only when Kafka is down we will be using SQS… SQS doesn’t maintain order that’s why we have two edge cases to handle upserts and updates with fresh time stamps
Sqs maintains order but Kafka provides higher write and read throughput.
@@AsliEngineering thanks for correcting… the architecture explanation is great…
There are standard and fifo queue in sqs. Standard have high throughput.
Also DLQ is not for the purpose of fallback of Primary queue, instead that is for the consumer that if message fails to successfully consumed X number of times. Then it should be moved to DLQ.
I think the explanation for what if SQS is down is not correct.
Where will the query goes if the user wants to see his last 10 or n orders? Analytics Db ? as they have removed entries from gsi on ddb?
nice explanation! But how orders svc writes in database and in kafka is it async for both or sync?
Hi Arpit, just wanted to notify, I didn't find the blog link of "Grab" in your description. I might have missed it, could you please send the link again if possible.
You can find the blog right at the bottom of the description and in the i-card at the top right of the video.
Why do they use both kafka and SQS if both serve the same purpose, they could use SQS with a dead letter queue only
For Grab SQS is just a fallback option in case Kafka goes down. Given that it charges on demand you won't pay anything if Kafka never goes down.
Also, SQS and Kafka don't serve the "exact" same purpose. SQS, or any msgs queue offer better elasticity than msg stream like Kafka. You can scale up and down consumers easily while concurrency in Kafka is limited by the number of partition.
please make video on what kind of database to use in what situation that will be very helpful thanks
I cover that in my course, hence cannot put out a video on it. I hope you understand the conflict, but thanks for suggesting.
Why SQS and kafka both ? Why can't only SQS with DLQ for high availability ?
Because the usecase was of a message stream and not a message queue.
Hence Kafka being preferred. SQS is just a fall back to ensure no loss of events.
@@AsliEngineering I quite didn't get this part. I can see both kafka and sqs will be working in same method, why would someone need to use both kafka and sqs and why not just sqs (with DLQ)?
@@prerakchoksi2379 SQS plus DLQ would also work but it would limit that one message/event can be processed once.
Kafka can have multiple consumer groups allowing you to do multiple things on same message/event.
Thanks very much for the video. This is really helpful in understanding how Grab can handle both large and spiky requests coming during rush hour. I just wonder in our company case, we also need to use the historical data to validate the promotion of customers based on their order history. For example, one promotion is only applicable for first time customer (simplest case). In that case, do we need to use the analytical data to calculate this?
if aws gurantees that dlq will be available if sqs isn't, why do we need kafka? Can't we directly use sqs for the insertions?
Extensibility of usecase. You can consume Kafka message and reprocess then to drive something else, if not anything than just archive for audit.
How would they get to know just by looking by the former timestamp that its not the latest one? (1. newer update hasn't came yet. 2. would they query transactional db just for that?)
no. You just discard the updates you receive with older timestamp. No need to query anything.
My qa team checks the orders page(use analytics query which is eventual consistent) immediately after placing order and say bug 😂
I observe this for aws payments page. Myself felt confused as the payments page take 60 seconds to show the payment revord itself...
Thanks for the informative video Arpit. I have a doubt on handling out of order messages at the end of your video. While deciding upon update #1 to be processed first over update #2 using timestamp, how does the consumer know that the message is the oldest one over the other as consumer #1 may have update #1 and consume #2 may have update #2? I would think of versioning the data and make sure the data we update is the next available version of the one present in the analytical database. Is this approach correct?
I got the similar question in a recent interview,
question:
consider a order mgmt system, with multiple instances same order service responsible for handling updates of orders(on a MySQL db). now three updates arrive in a sequence at t1 < t2 < t3 timestamps, at three different order service instances.
Now how do we ensure the updates u1, U2 and u3 are applied in the same sequential order.
any thoughts
@@srinish1993 maybe can we send the timestamp along with the data? so that when it is processed by the consumer, it can create SQL query to update the data WHERE updated_at < data.timestamp and id=data.order_id.
@@javeedbasha6088 yeah correct, usually its a good practice to send event_timestamp field in the kafka msgs, to decide the order of msgs.
After some reading, I discovered that updating data based on timestamps is not a reliable method since there are inconsistencies in the system clock.
A more effective approach is to utilize end-to-end partitioning. In this method, all messages related to a specific partition key are written to the same kafka partition. As long as the producer sends these messages in order to kafka, kafka will maintain their order within the partition, although ordering is not maintained across different partitions. This specific partition is then consumed by only a single consumer instance, ensuring that related events are process by the same consumer.
for example, suppose we have two messages: t1( create profile 'A' ), t2( update profile 'A' )
The same consumer will receive and process t1 and t2 sequentially. This approach ensures that order is maintained in event-driven architecture. And this approach can also handle concurrency.
@@javeedbasha6088 What about comparing version of the order document ?
Because if we keep version then we can avoid the need of kafka , simple SQS will work
What if we use CDC (Dynamo Streams) to achieve the same?
instead of order service pushing the data to kafka?
Yo. What if instead of timestamp difference, we just use versioning + counters ?
maintaining version / vector clocks is a pain and in most cases an overkill. TS works well for the majority of workloads.
The analytics DB can be a write only, right?
What's the use of write if we never read?
Hi Arpit, 1 quick question.
So when we do upsert,
1. operation type whether it would be insert / update is always decided based on availability of primary key in database?
2. when we are doing upsert, we always need to provide all mandatory parameters in SQL query?
1. You can define what is the update field/fields in your query or ORM layer. Doesn't need to be primary key.
2. While doing upsert, always provide ALL the parameters, even those that are not mandatory but have a non-null value, because every field will be overwritten.
@@dhruvagarwal7854 Hi Dhruv
Thank you for clarifying.
So if I understood correctly, upsert can be decided based on any field but that field should be unique in nature. Is this correct?
2nd point I am clear now. it makes sense to provide all parameters as every parameter is going to be overwritten and we won't want any parameter to be lost.
Just a small doubt.
As transactional queries are handled synchronously, won't there be any issue while handling huge number(millions) of synchronous writes to db during peak traffic hours.Like there is a possibility that the DB servers can choke while handling them right? BTW loving watching ur videos, great content!
I believe this will be handled by dynamo DB itself by creating partitions internally when the tiers are hot
Yes. They would slowdown. Hence consumers will slow their consumption
@@AsliEngineering For transaction queries we will be hitting the db server directly right as they need to be synchronous? You said there won't be any messaging queue involved in such cases.Crct me if I am wrong
@@imdsk28 Is it true for the RDS as well?? I don't think soo but not sure
@@yuvarajyuvi9691 not completely sure… need to deep dive
Thanks for the content Arpit. Have a small doubt, how are we handling the huge spikes? Does dyanmo hot key partioning + lean gsi index do the job? I am assuming the peak duration will last for sometime since order delivery isn't just done in a 10 minute window. So at peak time even the index would start piling up.
Would you say using dyanmo is the cost effective solution here? [I am assuming team wanted a cloud native solution and cost effectiveness involved calculating maintenance cost for in house solution]
DDB is cost effective as well as great at balancing the load.
It can easily handle a huge load given how it limits the queries to near KV use case.
Also, indexing will not be hefty because the index is lean.
Hi Arpit, First of all thanks for the awesome video.
I am wondering how they are able to support Strong Consistency over GSI. They classify Get ongoing orders by passenger id query pattern as transactional but they are leveraging GSI to support this query pattern and GSI is eventually consistent and when we are talking about a high TPS system serving millions of transactions per day the error rate scenarios becomes very high in number.
It was okay for them if data was eventually consistent. Grab only needed strong consistency (if any) for this main PK which is orders and not while getting all orders for a pax. So, it was totally fine for them to relax the constraint here.
Also practically, the replication lag in GSI is not too high. Assume a lag of 5 seconds (relatively high, AWS says it would be between 0.5 to 2.5 that too across geographies), still after the user has created an order, the user is very likely to remain on order page to live track it or minimize the app. How many of them would even go to the orders page to see all orders.
Even if they do, it will take them atleast 2 second to tap that icon and reach the screen. Enough time gap for the updates to catch-up. So, even when GSIs are eventually consistent it is totally okay UX.
@@AsliEngineering Yeah even that is what i was thinking that their use-case might not be of a strong consistency but their article got me confused where they mentioned otherwise. It makes sense if it is purely required from UX perspective.
One pain point of dynamodb is handling pagination during filter and search. It skips the record if it doesn't match the query criteria and we have to run a recursive loop to meet the page limit.
Ex. You are running a query on 1000 records with page limit 10 and filter status for in-active users and there are only 2 in-active records and first record is in first row and 2nd record is at 1000th row, now you have to run 100 query iteration in order to get just 2 records
As per my understanding this is the biggest disaster on dynamodb.
Does anyone have any solutions here?
Hi Arpith have you come across this limitation in dynmodb?
Yes. DDB is not meant for such queries, hence should not be used for such cases, unless you can manipulate indexes (LSI and GSI).
There are standard and fifo queue in sqs. Standard have high throughput.
Also DLQ is not for the purpose of fallback of Primary queue, instead that is for the consumer that if message fails to successfully consumed X number of times. Then it should be moved to DLQ.
I think the explanation for what if SQS is down is not correct.
Yes ur crt, if consumer is unable consume even after retries, then it moves it to DLQ.
Generally, I spend my weekends for learning something new and your content helped me a lot, thanks arpit sir 🫡 🔥 I was totally amazed by the Transaction DB part 😎