When engineering teams start saying "We assume that...", it is often a sign that they don't know the fallacies of distributed systems and are therefore building software that contains exactly those flaws. Nice and crisp summary!
The most concise, crisp, to-the-point video on this topic to date.
The dual write problem is part of the larger topic of distributed transaction handling, which is a very complex process with a lot of moving parts. In a microservices architecture, where services operate independently of each other, distributed transaction management is replaced by a series of local transactions known as the saga pattern.
In this video, the presenter illustrates the issue that arises when implementing a saga with choreography. The presenter did not mention 2PC (Two-Phase Commit), which is an alternative option. It is less preferred and less viable because it requires XA-compliant transactional resources to participate. The patterns mentioned in the video favour an eventual consistency approach.
I have definitely been working through this problem but never had the right name for it. Thankfully I'd arrived at one of the approaches for solving it - but it's definitely a hairy problem that requires forethought!
I've never written a microservice (whatever that even is). But having done hardware design and embedded realtime software for the last 30 years, this stuff is just another spin on the same universal problems. Very familiar in general terms.
A difficult problem, which I find difficult to explain to colleagues. Thank you for such a clear and instructive explanation!
@@ConfluentDeveloperRelations I'd like to believe I had success explaining it to colleagues. I also used a lot of visual diagrams, which pinpoint almost every potential failure occurrence (inter-process failure, message publication failure, message acknowledgement failure, ...). It helped a lot in conveying why it is impossible to implement an atomic operation spanning multiple different processes.
Great vid! We ran into all the same issues in 2017 when designing and architecting my first microservices platform. We used Kafka with Node and the Listen to Yourself pattern, and we were quite successful with it! It seems like the path of least resistance and complexity.
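For anyone curious, a minimal sketch of how Listen to Yourself can look with the plain Kafka Java client (topic name, configs, and the applyToDatabase step are illustrative assumptions, not details from the video): the request path only publishes the event, and the same service consumes its own topic to update its state, so there is no dual write in the request path.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ListenToYourselfSketch {

    // Request path: only publish the event; do NOT touch the database here.
    static void handleCommand(KafkaProducer<String, String> producer, String orderId, String payload) {
        producer.send(new ProducerRecord<>("orders", orderId, payload));
        producer.flush(); // a single write, so no dual-write window
    }

    // Background path: the same service consumes its own topic and updates its state.
    // Assumes enable.auto.commit=false so offsets are committed only after the update.
    static void listenToYourself(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("orders"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                applyToDatabase(record.key(), record.value()); // must be idempotent: redelivery is possible
            }
            consumer.commitSync();
        }
    }

    static void applyToDatabase(String key, String value) {
        // placeholder for the local state update (e.g. a JDBC upsert keyed by the event id)
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            handleCommand(producer, "order-42", "{\"status\":\"CREATED\"}");
        }
    }
}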
This is a very descriptive video. I have a question about the message relay layer, where the outbox table is read and events are forwarded. I had implemented the Polling Publisher pattern specific to my needs. What are the disadvantages compared with CDC (Debezium)?
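For context, roughly what my polling publisher looks like, heavily simplified (table, topic, and connection details are placeholders; it assumes a JDBC outbox table and the plain Kafka Java client):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OutboxPoller {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret")) {
            while (true) {
                // Read unpublished outbox rows in insertion order.
                try (PreparedStatement select = db.prepareStatement(
                        "SELECT id, payload FROM outbox WHERE published = false ORDER BY id LIMIT 100");
                     ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        // Block until Kafka acknowledges, then mark the row as published.
                        producer.send(new ProducerRecord<>("orders", String.valueOf(id), rs.getString("payload"))).get();
                        try (PreparedStatement update = db.prepareStatement(
                                "UPDATE outbox SET published = true WHERE id = ?")) {
                            update.setLong(1, id);
                            update.executeUpdate();
                        }
                    }
                }
                Thread.sleep(1000); // polling interval: extra query load and latency vs log-based CDC
            }
        }
    }
}

The polling interval and the repeated table scans are the obvious trade-off versus log-based CDC like Debezium, which tails the transaction log instead of querying the table.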
@@ConfluentDeveloperRelations Won't that also mean failures are hidden? We were excited about CDC, and then I spoke to a trusted ex-colleague who said: kiss goodbye to domain events if change/CRUD events are your source. I think they may be right, as not all systems support CDC; it also leaves holes in a system that likely looks green at all times.
Great video and really appreciate your discussions in the comments!
3:41 certainly could be worth exploring further as a separate topic! Especially if derived events end up being mishandled, overlooked, or redirected to a DLQ, as this could reintroduce the dual write problem.
@@ConfluentDeveloperRelations Thanks Wade, and I really appreciate that! In addition, I'm interested in exploring graceful failure management in event-driven architecture, but I haven't found a convincing solution or resource.
Typically, retries are the first solution considered, but they can lead to problems. With Kafka, for example, continually retrying an event on the same topic may block offset progression and cause consumer lag. Redirecting events to a retry topic can also present resilience issues, like potential failures when posting the retry event. (Unfortunately, unlike a message queue that has retries designed in by nature, Kafka's offset progression mechanism introduces its own trade-offs here.)
Often, after multiple retries, events are moved to a Dead Letter Queue (DLQ) or tracked via metrics. However, I've noticed that events in the DLQ tend to be neglected indefinitely, as developers often ignore them in the first place. This can lead to unrecognized issues that become increasingly difficult to address over time.
One practical application we've observed involves using CDC to synchronize data from a NoSQL database to Elasticsearch. If an operation like a deletion fails and isn't addressed further after entering the DLQ, it leads to discrepancies between the two data sources. This situation becomes difficult to rectify without performing another full synchronization.
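To make the retry/DLQ routing concrete, a rough sketch of the kind of consumer-side router I mean (the topic names and the "attempts" header are just conventions I made up for the example, not anything standard):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class RetryAndDlqRouter {
    static final int MAX_ATTEMPTS = 3;

    static void route(ConsumerRecord<String, String> record, KafkaProducer<String, String> producer) {
        try {
            process(record.value());
        } catch (Exception e) {
            int attempts = readAttempts(record) + 1;
            // Park the event instead of blocking offset progression on the main topic.
            String target = attempts >= MAX_ATTEMPTS ? "orders.dlq" : "orders.retry";
            ProducerRecord<String, String> out = new ProducerRecord<>(target, record.key(), record.value());
            out.headers().add("attempts", Integer.toString(attempts).getBytes(StandardCharsets.UTF_8));
            producer.send(out); // this send can itself fail, which is the resilience gap mentioned above
        }
    }

    static int readAttempts(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("attempts");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    static void process(String payload) throws Exception {
        // business logic; throws on transient or permanent failures
    }
}

And the DLQ topic still needs an owner and an alert, otherwise the neglect problem above just moves somewhere quieter.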
Great video. Something though is wrong with the levels on your mic, it's crackling on louder sounds much more than it should.
Brilliant! Thanks a lot! Based on my experience, it's much harder to explain this to stakeholders in order to get additional resources for a proper implementation. Especially in a cloud environment. Components are reliable - they say... It will never happen - they say...
Great video, thanks
IBM WebSphere + IBM MQ have supported container-managed transactions for a very long time, where a DB write + MQ send can become part of one transaction, or a consume + DB write + send can become part of one transaction, and all of it can be rolled back together.
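For readers who haven't seen it, this is roughly the shape of that in code (a hedged sketch using Jakarta EE annotations; the JNDI names are made up, and it only works because both the datasource and the MQ connection factory are XA resources that the container enlists in one JTA transaction - which is exactly what Kafka doesn't participate in):

import jakarta.annotation.Resource;
import jakarta.ejb.EJBException;
import jakarta.ejb.Stateless;
import jakarta.jms.ConnectionFactory;
import jakarta.jms.JMSContext;
import jakarta.jms.Queue;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

@Stateless // container-managed transaction: REQUIRED by default
public class OrderService {

    @Resource(lookup = "jdbc/OrdersXA")   // XA datasource configured in the app server
    private DataSource dataSource;

    @Resource(lookup = "jms/OrdersXACF")  // XA-capable MQ connection factory
    private ConnectionFactory connectionFactory;

    @Resource(lookup = "jms/OrderEvents")
    private Queue orderEvents;

    public void placeOrder(String orderId, String payload) {
        // Both operations enlist in the same JTA transaction; the container
        // coordinates a two-phase commit across the database and the broker.
        try (Connection db = dataSource.getConnection();
             PreparedStatement insert = db.prepareStatement("INSERT INTO orders(id, payload) VALUES (?, ?)")) {
            insert.setString(1, orderId);
            insert.setString(2, payload);
            insert.executeUpdate();
        } catch (SQLException e) {
            throw new EJBException(e); // unchecked, so the container marks the transaction for rollback
        }
        try (JMSContext jms = connectionFactory.createContext()) {
            jms.createProducer().send(orderEvents, payload); // JMSRuntimeException here also rolls everything back
        }
    }
}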
This was great 👍🏻 thank you!
Hi, what if we enable transactions in the Kafka producer and defer the e-mail logic to a Kafka consumer that is set to the read_committed isolation level?
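For reference, a minimal sketch of that setup with the Kafka Java client (topic names and configs are placeholders). The caveat, as I understand it, is that the Kafka transaction only spans the Kafka writes; a database write would still sit outside it, so this alone doesn't remove the dual write:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("transactional.id", "order-service-1"); // enables transactions for this producer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"CREATED\"}"));
                producer.send(new ProducerRecord<>("emails", "order-42", "{\"to\":\"user@example.com\"}"));
                producer.commitTransaction(); // only now do read_committed consumers see either record
            } catch (Exception e) {
                producer.abortTransaction(); // neither record becomes visible to read_committed consumers
                throw e;
            }
        }
        // The e-mail consumer would set isolation.level=read_committed so it
        // never acts on records from aborted transactions.
    }
}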
Concise and very useful
Can we use the saga pattern in this case? If something goes wrong, we prepare compensations and execute them on all previously executed steps in the workflow...
The saga pattern can be used for small-scale systems, but sagas are difficult to scale. If you are working with a large-scale app, then you would want to opt for the patterns explained in the video.
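For anyone unfamiliar, the compensation idea looks roughly like this (an orchestration-style sketch with made-up step names; note that in a real saga, publishing the event or message that triggers the next step is itself where the dual write problem reappears):

import java.util.ArrayDeque;
import java.util.Deque;

public class SagaSketch {

    interface Step {
        void execute();
        void compensate(); // undo the local transaction this step performed
    }

    // Run steps in order; on failure, compensate the completed ones in reverse.
    static void runSaga(Step... steps) {
        Deque<Step> completed = new ArrayDeque<>();
        try {
            for (Step step : steps) {
                step.execute();
                completed.push(step);
            }
        } catch (RuntimeException failure) {
            while (!completed.isEmpty()) {
                completed.pop().compensate();
            }
            throw failure;
        }
    }

    public static void main(String[] args) {
        runSaga(
            new Step() { public void execute() { System.out.println("reserve inventory"); }
                         public void compensate() { System.out.println("release inventory"); } },
            new Step() { public void execute() { System.out.println("charge payment"); }
                         public void compensate() { System.out.println("refund payment"); } }
        );
    }
}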
Wade, don't you think "independent" should be used instead of "dependent" when mentioning that the two writes should be separated? The scanner simply checks for new events periodically, is independent of the state writer, and doesn't really care about what's being stored before emitting it.
This is amazing 🙌
Learnt something new, thank you!
Hi, I'm a little confused by the labelling of the diagrams here. Is the command handler supposed to be the microservice? I noticed the label was changed midway through the video (3:24), which added to the confusion. If the command handler were a microservice, are you saying that if one operation were to fail, the second one may not be executed, leading to inconsistencies? So instead, the solution you showed is to create another microservice called AsyncProcess whose sole purpose is to periodically check that the first operation has completed successfully (like with a cron job)?
Dumb question: if the database write command was handled and rolled back, and the command is wrapped in exception handling, then technically there shouldn't be an event emitted, which would render your solution redundant. Am I missing something?
Regarding 2:36, what about using Apache Camel? In that scenario, you have a transacted route that writes to the database and then emits an event to a Kafka topic. That would ensure both writes are wrapped in a transaction.
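To make that concrete, a hedged sketch of such a route (the endpoint URIs are illustrative, and it assumes a transaction manager plus the camel-sql and camel-kafka components are configured). As far as I know the Kafka endpoint is not an XA resource, so transacted() only really protects the JDBC step:

import org.apache.camel.builder.RouteBuilder;

// A Camel route fragment; it needs to be added to a CamelContext that has a
// transaction manager and the camel-sql / camel-kafka components on the classpath.
public class OrderRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:placeOrder")
            .transacted()                                   // wraps the exchange in a Spring/JTA transaction
            .to("sql:INSERT INTO orders(id, payload) VALUES (:#id, :#payload)")
            .to("kafka:orders?brokers=localhost:9092");     // NOT enlisted in the transaction: if the send
                                                            // succeeds but the commit then fails (or the
                                                            // process crashes), the event is out with no
                                                            // matching DB row
    }
}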
@@ConfluentDeveloperRelations Seems similar to Microsoft's DTC, which, at least as advertised, allows cross-process transactions (e.g., with WCF). (I tried to use it long ago and failed, so I'm unsure.)
Is DDB Streams CDC or transactional outbox?
thanks
Write a monolith.
Many messaging systems are XA compliant and as such support transactions that span multiple systems. As far as I know, Kafka does not support XA, but many others do...
Is there any issue with CDC that tails the replication log on, say, Postgres? How do you do failovers in a way that gives the best availability of the CDC processor while replication slots are stopped and started?
Why is there no solution that confirms the status? I.e., add the status in the DB and, after the event, confirm the status.
@ConfluentDevXTeam Wade, you are smart and your answer is fully correct. However, my solution looks more like a blockchain of statuses. Here's how it works:
1. Status Writing in Queue: Initially, a status is written in a queue, for example, OP1.
2. Event Sent to Kafka: When the event reaches Kafka, a new status is appended to the status in the queue, turning it into OP1_R1.
3. Service Processing: After the service completes its work, the status is updated to reflect the outcome: in the case of a failure, it becomes OP1_R1_F1, and in the case of success, OP1_R1_S1.
4. Status Display: When needing to show the user the status of the request, we convey success by showing OP1_R1_S1.
Solution Implementation: This solution can be implemented using queue status in a distributed cache, trace, or event log.
Final Status and Database Update: The final status can be taken from the queue and saved in the database. Even if this update fails, you can retry from the queue again.
This approach, akin to a "blockchain of statuses," ensures that every status update is traceable and recoverable, addressing the concern of confirming status updates and overcoming potential failures in updating the database. By leveraging a distributed queue and Kafka, we can ensure that the status is both distributed and consistent across the system, allowing for retries in case of failure and ensuring no status confirmation is lost.
I understand the concerns you raised regarding the potential for losing previous events with rapid updates, and this is where the design of the event log and the distributed cache plays a crucial role. By carefully managing the queue and ensuring that each status update is atomic and traceable, we mitigate the risk of lost updates. This approach offers resilience and reliability in tracking the status of each operation within the system, aligning with the principles of Event Sourcing and the Transactional Outbox pattern to ensure comprehensive event handling and status tracking.
Thank you for sparking this engaging discussion, Wade. Your insights are invaluable, and they've helped clarify the robustness and potential of this solution in addressing the dual write problem and ensuring consistent status updates.
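If it helps, a tiny sketch of how I picture the status chain (purely illustrative; in practice the chain would live in a distributed cache or queue rather than an in-memory map):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusChain {
    // operation id -> chain of statuses, e.g. "OP1" -> "OP1_R1_S1"
    private final Map<String, String> chain = new ConcurrentHashMap<>();

    public void start(String opId) {
        chain.put(opId, opId);                        // 1. status written when the operation begins
    }

    public void append(String opId, String status) { // 2./3. each hop appends its outcome
        chain.computeIfPresent(opId, (id, current) -> current + "_" + status);
    }

    public String current(String opId) {             // 4. what we show the user / persist at the end
        return chain.get(opId);
    }

    public static void main(String[] args) {
        StatusChain statuses = new StatusChain();
        statuses.start("OP1");
        statuses.append("OP1", "R1");   // received by Kafka
        statuses.append("OP1", "S1");   // service processed successfully
        System.out.println(statuses.current("OP1")); // OP1_R1_S1
    }
}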
Great video
I would solve this problem by making the transaction idempotent and relying on Kafka's dead letter replaying mechanism. When dealing with asynchronous environments, it's crucial to make everything idempotent, especially database transactions.
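To make the idempotent-writer side concrete, a minimal sketch using a processed_events table keyed by event id (Postgres ON CONFLICT syntax; table and column names are just illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWriter {

    // Apply an event at most once per event id, even if it is redelivered or replayed from a DLQ.
    static void apply(Connection db, String eventId, String orderId, String status) throws Exception {
        db.setAutoCommit(false);
        try (PreparedStatement dedupe = db.prepareStatement(
                 // primary key on event_id; the conflict clause makes replays a no-op
                 "INSERT INTO processed_events(event_id) VALUES (?) ON CONFLICT (event_id) DO NOTHING");
             PreparedStatement update = db.prepareStatement(
                 "UPDATE orders SET status = ? WHERE id = ?")) {
            dedupe.setString(1, eventId);
            if (dedupe.executeUpdate() == 1) {   // 0 rows inserted means we've already seen this event
                update.setString(1, status);
                update.setString(2, orderId);
                update.executeUpdate();
            }
            db.commit();                         // dedupe record and state change commit atomically
        } catch (Exception e) {
            db.rollback();
            throw e;
        } finally {
            db.setAutoCommit(true);
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret")) {
            apply(db, "evt-123", "order-42", "SHIPPED");
            apply(db, "evt-123", "order-42", "SHIPPED"); // replay: no effect
        }
    }
}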
@@ConfluentDeveloperRelations Thanks Wade! Very relevant counterpoints! I'm going to study the patterns you described. Your video is excellent, by the way. I've just subscribed to your channel.
Idempotency: Idempotent Writer
I immediately thought of a WAL.
Great! Then how does the Change Data Capture work? If the CDC reads data from a DB and publishes events, how does it know which data it already read?
Edit: if this problem can be solved completely without using a distributed transaction, does that mean that every problem that needs a distributed transaction can be solved in a different way?
Databases have hooks that can be used, e.g. an AfterInsert hook will trigger when a new row is added.
@@dandogamer Are you sure that's how CDC works? If so, then CDC can only detect rows added to a table after CDC is deployed?
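For what it's worth, log-based CDC tools like Debezium don't rely on triggers: they read the database's transaction log and track how far they have read (an offset/LSN), typically after an initial snapshot of existing rows. A tiny sketch of the underlying mechanism on Postgres, assuming a logical replication slot has already been created and the connection has replication privileges:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Assumes a logical replication slot was created once, e.g.:
//   SELECT pg_create_logical_replication_slot('app_slot', 'test_decoding');
public class ReplicationSlotPeek {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret");
             PreparedStatement changes = db.prepareStatement(
                 // "get" (as opposed to "peek") advances the slot's confirmed position,
                 // which is how the reader knows what it has already consumed
                 "SELECT lsn, data FROM pg_logical_slot_get_changes('app_slot', NULL, NULL)");
             ResultSet rs = changes.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("lsn") + " " + rs.getString("data"));
            }
        }
    }
}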
Can't we just do this in the microservice, where we only emit the event once the changes are successfully committed to the database, and handle emit failures via some feedback loop?
Great video. Learned a lot
Was just coding a side project with a Redis cache and Postgres and ran into a similar issue.
Why was it important to mention the Redis cache and Postgres? Nice that you ran into a similar issue. Do you want to talk about it?
@@ConfluentDeveloperRelations For me it is surprising that new wordings like „dual write“ arise for things which were already discussed in different contexts several decades ago. In the context of the dual write problem, a Petri net could be used to design a pattern where concurrent write operations are properly synchronized to avoid conflicts. This could involve, for example, introducing places and transitions that represent locks or other synchronization mechanisms. I am OK with the new wordings if they help to tackle the problems.
Gem spotted 💎
But why can't we emit the event after the state is saved? I mean, we are dealing with a single microservice. Can anyone please explain?
@@ConfluentDeveloperRelations Awesome example. While watching your video I was thinking, "why doesn't he use a transaction to execute the business logic and, after it succeeds, emit the event?" - thanks for the clarification.
That edge case comes down to partition tolerance from the CAP theorem, am I right?
Do you plan to make a video about the CAP theorem with a real-life example?
Here we see segregation of responsibilities to achieve that.
Although this is also achievable within the same microservice, by making sure you have a record in the database that the event has been sent downstream after you have emitted it.
In case of failure, the event might be emitted twice, but as you said, the downstream system's checks can handle it.
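A small sketch of that ordering, just to spell it out (illustrative table and topic names). Because the "sent" marker is written only after the publish is acknowledged, a crash in between re-emits the event on restart - which is why the downstream de-duplication you mention matters:

import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EmitThenMarkSent {

    // Emit first, then record the fact in the same service's database.
    static void publish(Connection db, KafkaProducer<String, String> producer,
                        long eventId, String payload) throws Exception {
        producer.send(new ProducerRecord<>("orders", String.valueOf(eventId), payload)).get(); // wait for the broker ack
        try (PreparedStatement mark = db.prepareStatement(
                "UPDATE pending_events SET sent = true WHERE id = ?")) {
            mark.setLong(1, eventId);
            mark.executeUpdate();
        }
        // Crash between the send and the update => the row still looks unsent
        // and is emitted again later: at-least-once, deduplicated downstream.
    }
}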
I'm no microservices expert, but if that's a problem for microservices, then microservices suck. On our server, no event is triggered unless the database write was successful, so it can never happen that we send out an event and the change is not reflected in the database. Nothing is easier than ensuring that a certain piece of code is executed after a database write succeeds, and only if it succeeds. And if the event cannot be emitted, well, that's the same issue as shown at 3:40. Then we can also retry at a later time. Of course that may fail again, and it may fail forever, but the video doesn't provide a solution for this problem either.