Great video Arpit!!
One more step is missing after cutover: before we prune old records from the source database, we either need to stop the Kafka consumer listening to the delete CDC events or add a filter in the consumer to ignore those records, otherwise we may end up applying those deletes on the target database as well :)
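A minimal sketch of that filter, assuming unwrapped Debezium-style change events on Kafka (kafka-python), a placeholder topic name, and a hypothetical migrated_shop_ids set maintained by the prune job:

```python
import json
from kafka import KafkaConsumer  # kafka-python

# hypothetical: shops whose old-shard rows the prune job is currently deleting
migrated_shop_ids = {2}

consumer = KafkaConsumer(
    "shop_db.orders",  # placeholder CDC topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,  # tolerate tombstones
)

for message in consumer:
    event = message.value
    if event is None:
        continue
    # Debezium marks deletes with op == "d"; "before" carries the deleted row
    if event.get("op") == "d" and (event.get("before") or {}).get("shop_id") in migrated_shop_ids:
        continue  # skip: this delete is just the prune job cleaning up the old shard
    # apply_to_target(event)  # hypothetical: everything else still flows to the target DB
```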
Hahhahaha 😂😂😂 true. We should definitely switch off the CDC once everything has caught up 😀😀
It happened with me once and I was wondering where the data went???? 😂😂😂😂😂
@@AsliEngineering 😅
Another approach could be that instead of moving the shop which was making the shard hot, we move the other shops' rows to rebalance the load!
Yeah, this can be a good option only if there are only a few cold shards to be moved to a different pod. Otherwise, it adds huge operational complexity.
Very well explained👏🏼. Thank you Arpit.
Great notes and content. Awesome bro. Keep rocking
Great video Arpit. Thanks 🙏🏻
Very well explained
Hello @arpit, thanks for the great video. Won't they already have a replica/data warehouse storing the data in case of some failure? If yes, can't we do a bulk load from there instead of reading from the current DB?
Wondering what would happen during a Black Friday sale, when multiple stores (or shops) will be hot. Additionally, even a very small downtime (half a minute) would be a big business loss.
I would assume they over-provision the databases in advance to avoid noisy-neighbour problems.
Good stuff, but shouldn't this be done in a phased manner, like first stopping reads from DB1 (writes still going on) and then stopping writes, or having some sort of phased treatment/dial-up?
Yes, you're right. It should be something along similar lines. However, if we assume reads are re-routed later too, there would be a few seconds of read misses that should be taken care of by the application retrying.
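A minimal sketch of that application-side retry, with exponential backoff and jitter; read_shop is a hypothetical call that may fail for a few seconds while traffic is being re-routed:

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=0.1):
    """Retry an operation that may briefly fail while traffic is being re-routed."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            # exponential backoff with a little jitter so retries don't pile up
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

# usage: wrap the read (or write) that may miss during the cutover window
# order = with_retries(lambda: read_shop(shop_id=2))  # read_shop is hypothetical
```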
What if the lag never dies, i.e. there are always new writes to the old DB?
Lag has to wear off; it is a function of CPU utilisation.
There is a possibility that lag keeps increasing because of an underlying hardware failure, and in that case you create a new replica from a recent snapshot, which requires minimal time to catch up.
@@AsliEngineering I understand the data pump would transfer the data from the old db to the new one. However, NGINX is still sending the requests for shop 2 to the old db. So there seems to be no guarantee that there wouldn't be any writes to the old db. Or is there something I am completely missing?
@Swagato Chatterjee writes are cut off from the old db for some time through a knob. Covered in the video.
You take a small downtime (not really a downtime, but a glitch with retries) and stop accepting writes for that shop altogether so that the other DB can catch up.
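A rough sketch of that sequence, assuming the target is attached as a regular MySQL replica of the source so its lag is visible via SHOW SLAVE STATUS; disable_writes, flip_route and enable_writes are hypothetical knobs in the application/router layer:

```python
import time

import pymysql
from pymysql.cursors import DictCursor

def wait_for_catch_up(target_cfg, poll_interval=1.0):
    """Block until the target (attached as a MySQL replica) has drained its replication lag."""
    conn = pymysql.connect(cursorclass=DictCursor, **target_cfg)
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
            # Seconds_Behind_Master is None if replication is broken; keep polling in that case
            if status and status["Seconds_Behind_Master"] == 0:
                return
            time.sleep(poll_interval)
    finally:
        conn.close()

# hypothetical cutover driver for one shop:
# disable_writes(shop_id=2)              # the "knob": writes fail fast, clients retry
# wait_for_catch_up(target_db_config)    # let the new DB apply the remaining binlog
# flip_route(shop_id=2, to_pod="pod-y")  # point the router/NGINX at the new pod
# enable_writes(shop_id=2)               # writes resume against the new DB
```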
@@AsliEngineering thanks it makes sense now.
I absolutely love the content of this channel. Thank you, Arpit.
Are there any other channels out there that post similar content about engineering challenges and how they are solved? If so, please comment them. Thanks in advance.
Hi Arpit, great video, I really appreciate your efforts...
I have a few doubts; it would be great if you or anyone here could clarify them.
1. If we move a particular shop (which was making the shard hot) from one shard to another, how will that solve the issue? I mean, wouldn't it just make the other shard (where we copied the data) hot?
2. If we batch copy data from the hot shard, wouldn't that increase the load on the shard and possibly bring it down? I don't know exactly how much load the batch copying would add...
3. What would be the ideal threshold of load on a shard (e.g. 70-80% resource utilisation) after which we should think about moving data? Or is it when we see a large difference in resource utilisation among the shards?
A hot partition problem is subjective and is often related to database internals.
- It involves dealing with any kind of mandatory operation that can't wait any longer, say backup, compaction/VACUUM, write repair, or read repair. These operations add extra IO and networking on top of already busy CPU cores, and following that the DB goes into an imbalanced state.
- Shifting shards involves redistributing the data volume as well as offloading the mentioned operations to other cores/NUMA sockets.
- So rectifying the hot partition problem is more about rectifying the imbalance.
- Don't rely on a database's own balancing unless its shard-balancing mechanism is identical to the application's needs.
From the DB's perspective, only replicas can suffer from hot partition problems. However, this is not true for many applications, hence Shopify has an application-space driver. "Batch copying" is a lot more economical, especially when you have more control over the driver that triggers the batch operation (see the sketch below), in contrast with DB tooling, which can only offer some amount of abstract control.
"There is no threshold until you start having one", i.e. an application-space driver for shard migration would work off application-specific load patterns.
Firstly, thanks Arpit bhaiya for explaining it with such ease!!
I also wanted to ask about the second point which Sagar Nikam mentioned: won't it put more load on the initial shard when we copy data from it to another one?
1. The non-hot shops are being moved to reduce the load on the pod X DB, which has a higher load due to shop M.
2. This is done via the binlog: since it is MySQL, all transactions are logged in a "file" which can be used to replicate the data to different databases or messaging systems (see Debezium, and the sketch below).
3. That really depends on the use case and the impact of the load on the system, which very much comes down to trying out different configurations (and, as mentioned in the video, it's driven by the analytics team).
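For anyone curious what reading that binlog looks like, a minimal sketch using the python-mysql-replication package (connection settings, server_id and the print handling are placeholders); Debezium does essentially this and publishes each change event to Kafka:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

# placeholder connection settings for the source MySQL primary
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,        # must be unique among the primary's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,        # keep tailing as new transactions are committed
    resume_stream=True,   # continue from the current position instead of the start
)

# every committed row change shows up here in commit order
for event in stream:
    for row in event.rows:
        # in a real pipeline this would be applied to the target DB or published to Kafka
        print(type(event).__name__, event.table, row)
```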
@@dharins1636 ohh thanks, didn't know about Debezium
Hello @arpit, I see you are talking only about the data, but what about indexes? Does Shopify not use indexes on those tables? I don't think that could be the case.
Indexes are implicitly managed by the database.
@@AsliEngineering So you are saying the database takes care of creating the required indexes while we batch copy into the new database. Right?
What if some row gets updated after we have copied that row?
The binlog would take care of that
The binlog is serial, hence it tracks all the changes in the order they were made :)
@@dharins1636 okay so we have to replicate that entire query on new db. Is that right?
@@tesla1772 Yes, we copy that binlog up to the current change to the new replica, and new requests are queued up to be applied on the new replica only AFTER the binlog. Because if we sent new database requests (insert, delete) before applying the binlog, data integrity would be lost.
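A toy illustration of why that ordering matters, with the binlog backlog and the held-back new writes modelled as in-memory queues (all names and values here are made up):

```python
from collections import deque

# changes captured in the binlog while the batch copy was running
binlog_backlog = deque(["UPDATE orders SET total = 90 WHERE id = 1"])
# new requests for the shop, held back during the cutover window
incoming_writes = deque(["UPDATE orders SET total = 120 WHERE id = 1"])

applied = []

def apply(change):
    # stand-in for executing the change against the new replica
    applied.append(change)

# 1. drain the binlog backlog first, so the new replica reflects every old change...
while binlog_backlog:
    apply(binlog_backlog.popleft())

# 2. ...and only then release the queued writes. Reversing these two steps could
#    apply the newer write (total=120) before the older binlog change (total=90),
#    leaving the stale value as the final state and breaking data integrity.
while incoming_writes:
    apply(incoming_writes.popleft())

print(applied)  # the stale binlog update lands first, the fresh write wins
```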
I have one question. Won't they already have replicas? Can't they copy from the replicas instead of putting load on the master?
A replica will have some implicit replication lag (assuming replication is done asynchronously, as it usually is for a heavy-traffic production use case).
In that case, it is pretty hard to say when the old DB and the new DB are similar enough to start the cutover. Hence, doing it from the primary makes enabling that cutover much more deterministic.