Great video Arpit!!
One more step is missing after cutover: before we prune old records from the source database, we either need to stop the Kafka consumer listening to the delete CDC events or add a filter in the consumer to ignore those records, otherwise we may end up applying those deletes on the target database as well :)
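A minimal sketch of that filter, assuming unwrapped Debezium-style change events on Kafka (kafka-python), a placeholder topic name, and a hypothetical migrated_shop_ids set maintained by the prune job:

```python
import json
from kafka import KafkaConsumer  # kafka-python

# hypothetical: shops whose old-shard rows the prune job is currently deleting
migrated_shop_ids = {2}

consumer = KafkaConsumer(
    "shop_db.orders",  # placeholder CDC topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,  # tolerate tombstones
)

for message in consumer:
    event = message.value
    if event is None:
        continue
    # Debezium marks deletes with op == "d"; "before" carries the deleted row
    if event.get("op") == "d" and (event.get("before") or {}).get("shop_id") in migrated_shop_ids:
        continue  # skip: this delete is just the prune job cleaning up the old shard
    # apply_to_target(event)  # hypothetical: everything else still flows to the target DB
```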
Hahhahaha 😂😂😂 true. We should definitely switch off the CDC once everything has caught up 😀😀
It happened with me once and I was wondering where the data went???? 😂😂😂😂😂
@@AsliEngineering 😅
Another approach could be that instead of moving the shop which was making the shard hot, we move the other shops' rows to rebalance the load!
Yeah, this can be a good option only if there are only a few cold shards to be moved to a different pod. Otherwise, it adds huge operational complexity.
Very well explained👏🏼. Thank you Arpit.
Great notes and content. Awesome bro. Keep rocking
Great video Arpit. Thanks 🙏🏻
Very well explained
Hello @arpit, thanks for the great video. Won't they already have a replica/data warehouse storing the data in case of some failure? If yes, can't we do a bulk load from there instead of reading from the current DB?
Wondering what would happen during a Black Friday sale, when multiple stores (or shops) will be hot. Additionally, even a very small downtime (half a minute) would be a big business loss.
I would assume they over-provision the databases in advance to avoid noisy-neighbour problems.
Good stuff, but shouldn't this be done in a phased manner, like first stopping reads from DB1 (writes still going on) and then stopping writes, or having some sort of phased treatment/dial-up?
Yes, you're right. It should be something along similar lines. However, if we assume reads are re-routed later too, there would be a few seconds of read misses that should be taken care of by the application retrying.
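A minimal sketch of that application-side retry, with exponential backoff and jitter; read_shop is a hypothetical call that may fail for a few seconds while traffic is being re-routed:

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=0.1):
    """Retry an operation that may briefly fail while traffic is being re-routed."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            # exponential backoff with a little jitter so retries don't pile up
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

# usage: wrap the read (or write) that may miss during the cutover window
# order = with_retries(lambda: read_shop(shop_id=2))  # read_shop is hypothetical
```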
What if the lag never dies, i.e. there are always new writes to the old DB?
Lag has to wear off; it is a function of CPU utilisation.
There is a possibility that lag keeps increasing because of an underlying hardware failure, and in that case you create a new replica from a recent snapshot, which requires minimal time to catch up.
@@AsliEngineering I understand the data pump would transfer the data from the old db to the new one. However, NGINX is still sending the requests for shop 2 to the old db. So there seems to be no guarantee that there wouldn't be any writes to the old db. Or is there something I am completely missing?
@Swagato Chatterjee writes are cut off from the old db for some time through a knob. Covered in the video.
You take a small downtime (not really a downtime, but a glitch with retries) and stop accepting writes for that shop altogether so that the other DB can catch up.
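A rough sketch of that sequence, assuming the target is attached as a regular MySQL replica of the source so its lag is visible via SHOW SLAVE STATUS; disable_writes, flip_route and enable_writes are hypothetical knobs in the application/router layer:

```python
import time

import pymysql
from pymysql.cursors import DictCursor

def wait_for_catch_up(target_cfg, poll_interval=1.0):
    """Block until the target (attached as a MySQL replica) has drained its replication lag."""
    conn = pymysql.connect(cursorclass=DictCursor, **target_cfg)
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
            # Seconds_Behind_Master is None if replication is broken; keep polling in that case
            if status and status["Seconds_Behind_Master"] == 0:
                return
            time.sleep(poll_interval)
    finally:
        conn.close()

# hypothetical cutover driver for one shop:
# disable_writes(shop_id=2)              # the "knob": writes fail fast, clients retry
# wait_for_catch_up(target_db_config)    # let the new DB apply the remaining binlog
# flip_route(shop_id=2, to_pod="pod-y")  # point the router/NGINX at the new pod
# enable_writes(shop_id=2)               # writes resume against the new DB
```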
@@AsliEngineering thanks it makes sense now.
I absolutely love the content of this channel. Thank you, Arpit.
Are there any other channels out there that post similar content about engineering challenges and how they are solved? If so, please comment them. Thanks in advance.
Hi Arpit, great video, I really appreciate your efforts...
I have a few doubts; it would be great if you or anyone here could clarify them.
1. If we move a particular shop (which was making the shard hot) from one shard to another, how will that solve the issue? I mean, wouldn't it just make the other shard (where we copied the data) hot?
2. If we batch copy data from the hot shard, wouldn't that increase the load on the shard and possibly bring it down? I don't know exactly how much load the batch copying would add...
3. What would be the ideal threshold of load on a shard (e.g. 70-80% resource utilisation) after which we should think about moving data? Or is it when we see a large difference in resource utilisation among the shards?
A hot partition problem is subjective and is often related to database internals.
- It involves dealing with any kind of mandatory operation that can't wait any longer, say backup, compaction/VACUUM, write repair, or read repair. These operations add extra IO and networking on top of already busy CPU cores, and following that the DB goes into an imbalanced state.
- Shifting shards involves redistributing the data volume as well as offloading the mentioned operations to other cores/NUMA sockets.
- So rectifying the hot partition problem is more about rectifying the imbalance.
- Don't rely on a database's own balancing unless its shard-balancing mechanism is identical to the application's needs.
From the DB's perspective, only replicas can suffer from hot partition problems. However, this is not true for many applications, hence Shopify has an application-space driver. "Batch copying" is a lot more economical, especially when you have more control over the driver that triggers the batch operation (see the sketch below), in contrast with DB tooling, which can only offer some amount of abstract control.
"There is no threshold until you start having one", i.e. an application-space driver for shard migration would work off application-specific load patterns.
Firstly, thanks Arpit bhaiya for explaining it with such ease!!
I also wanted to ask about the second point which Sagar Nikam mentioned: won't it put more load on the initial shard when we copy data from it to another one?
1. The non-hot shops are being moved to reduce the load on the pod X DB, which has a higher load due to shop M.
2. This is done via the binlog: since it is MySQL, all transactions are logged in a "file" which can be used to replicate the data to different databases or messaging systems (see Debezium, and the sketch below).
3. That really depends on the use case and the impact of the load on the system, which very much comes down to trying out different configurations (and, as mentioned in the video, it's driven by the analytics team).
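For anyone curious what reading that binlog looks like, a minimal sketch using the python-mysql-replication package (connection settings, server_id and the print handling are placeholders); Debezium does essentially this and publishes each change event to Kafka:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

# placeholder connection settings for the source MySQL primary
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,        # must be unique among the primary's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,        # keep tailing as new transactions are committed
    resume_stream=True,   # continue from the current position instead of the start
)

# every committed row change shows up here in commit order
for event in stream:
    for row in event.rows:
        # in a real pipeline this would be applied to the target DB or published to Kafka
        print(type(event).__name__, event.table, row)
```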
@@dharins1636 ohh thanks, didn't know about Debezium
Hello @arpit, I see you are talking only about the data, but what about indexes? Does Shopify not use indexes on those tables? I don't think that could be the case.
Indexes are implicitly managed by the database.
@@AsliEngineering So you are saying the database takes care of creating the required indexes while we batch copy into the new database. Right?
What if some row gets updated after we have copied that row?
The binlog would take care of that
The binlog is serial, hence it tracks all the changes in the order they were made :)
@@dharins1636 okay so we have to replicate that entire query on new db. Is that right?
@@tesla1772 Yes, we copy that binlog up to the current change to the new replica, and new requests are queued up to be applied on the new replica only AFTER the binlog. Because if we sent new database requests (insert, delete) before applying the binlog, data integrity would be lost.
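A toy illustration of why that ordering matters, with the binlog backlog and the held-back new writes modelled as in-memory queues (all names and values here are made up):

```python
from collections import deque

# changes captured in the binlog while the batch copy was running
binlog_backlog = deque(["UPDATE orders SET total = 90 WHERE id = 1"])
# new requests for the shop, held back during the cutover window
incoming_writes = deque(["UPDATE orders SET total = 120 WHERE id = 1"])

applied = []

def apply(change):
    # stand-in for executing the change against the new replica
    applied.append(change)

# 1. drain the binlog backlog first, so the new replica reflects every old change...
while binlog_backlog:
    apply(binlog_backlog.popleft())

# 2. ...and only then release the queued writes. Reversing these two steps could
#    apply the newer write (total=120) before the older binlog change (total=90),
#    leaving the stale value as the final state and breaking data integrity.
while incoming_writes:
    apply(incoming_writes.popleft())

print(applied)  # the stale binlog update lands first, the fresh write wins
```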
I have one question. Won't they already have replicas? Can't they copy from the replicas instead of putting load on the master?
A replica will have some implicit replication lag (assuming replication is done asynchronously, as it usually is for a heavy-traffic production use case).
In that case, it is pretty hard to say when the old DB and the new DB are similar enough to start the cutover. Hence, doing it from the primary makes enabling that cutover much more deterministic.