How to handle message retries & failures in event-driven systems? Handling retries with Kafka?

  • Published 6 Sep 2024
  • How to handle message retries & failures in event-driven systems?
    Make sure to watch • Apache Kafka: Keeping ... (Keeping the order of events when retrying due to failure) after this one
    Event-driven architecture is great when your services are running and processing without a problem, but handling failures can be hard.
    How do you handle retries in Apache Kafka?
    #eventdrivenarchitecture #danieltammadge #ApacheKafka
    -
    I use www.lucidchart... for my diagrams & www.flaticon.com where I use my pro subscription to find images for my content

COMMENTS • 51

  • @IsabelPalomar
    @IsabelPalomar 1 year ago +1

    Great video! I really like your conclusion and final comments. I have been working with Kafka a lot this year and event driven is definitely complex.

    • @Danieltammadge
      @Danieltammadge  1 year ago

      Thank you for taking the time to comment. Glad you liked it

  • @ricardo.fontanelli
    @ricardo.fontanelli 3 years ago +4

    Great video. I would just add one small thing to the retry mechanism: think about event order! Do you really want to consume event 5 after consuming event 7? In many cases, if you already consumed event 7, for example, to update an entity copy in a microservice, all you need to do is discard event 5. To do so, you need to record the id/offset of the last event successfully processed.

    • @Danieltammadge
      @Danieltammadge  3 years ago +2

      I'm glad you liked it.
      When you use the terms “entity copy” and “microservices”, I’m assuming you are looking at this from a change data capture perspective, where the order is, of course, important, as you are looking to maintain a local copy of data using the events. In that use case, one could ignore the failed event if a later event for the same entity has already been processed.
      Or, if you cannot ignore the earlier event, then trigger different logic or a process to remedy the out-of-sync data.
      In specific solutions where requirements have meant the processor needed to adhere to exactly-once processing, or where events could be processed out of order, we implemented a processing log table against which guard checks are performed.
      Hopefully, Ricardo, I have understood your point. Let me know if I have misunderstood anything.
      And thank you for taking the time to point this out, as it is important and shows that no one solution fits all; we need to understand our requirements and design the “least-worst.”
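      A minimal sketch of the guard-check idea described in this thread, assuming Java and hypothetical names; a real implementation would back the “processing log” with a database table rather than an in-memory map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard check against a "processing log": before handling an event,
// compare its version/offset with the last one successfully processed for the
// same entity, and skip events that are stale or already applied.
public class ProcessingLogGuard {

    private final Map<String, Long> lastProcessedVersion = new ConcurrentHashMap<>();

    /** Returns true if the event should be processed, false if it is stale or a duplicate. */
    public boolean shouldProcess(String entityId, long eventVersion) {
        Long last = lastProcessedVersion.get(entityId);
        return last == null || eventVersion > last;
    }

    /** Record the event as successfully processed (the "processing log" write). */
    public void markProcessed(String entityId, long eventVersion) {
        lastProcessedVersion.merge(entityId, eventVersion, Math::max);
    }

    public static void main(String[] args) {
        ProcessingLogGuard guard = new ProcessingLogGuard();
        guard.markProcessed("order-42", 7);                       // event 7 already applied
        System.out.println(guard.shouldProcess("order-42", 5));   // false: discard stale event 5
        System.out.println(guard.shouldProcess("order-42", 8));   // true: newer event
    }
}
```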

    • @Danieltammadge
      @Danieltammadge  2 years ago +1

      I’ve uploaded part 2 to this video where I describe an approach which keeps the order of events
      ua-cam.com/video/FO2ptQNQKhM/v-deo.html

  • @MrBillJDavis
    @MrBillJDavis 11 months ago +1

    This is great, thank you. It would be really helpful to talk about what issues might lead to a message getting retried and how that might dictate deciding on X number of retry topics.

    • @Danieltammadge
      @Danieltammadge  6 months ago

      Thanks
      Here are some common reasons why a message might get retried (a small routing sketch follows after this list):
      1. Transient Failures: If an event fails due to a transient issue (e.g., a temporary network failure, a dependent service being momentarily unavailable), retrying the event after a delay might result in successful processing. Moving the event to a retry topic allows the system to handle it separately without blocking the processing of new events.
      2. Rate Limiting and Backpressure: External systems or APIs might enforce rate limits, and surpassing these limits can result in failed event processing. Publishing failed events to a retry topic enables you to implement backoff strategies and control the rate at which you attempt to reprocess these events.
      3. Resource Contention: If processing fails due to resource contention (e.g., database locks, high CPU utilization), moving events to a retry topic allows the system to alleviate immediate pressure and retry processing later, possibly under more favorable conditions.
      4. Error Isolation and Analysis: Moving failed events to a separate topic makes it easier to isolate and analyze errors without disrupting the flow of successfully processed events. This separation facilitates monitoring, debugging, and fixing issues specific to the failed events.
      5. Prioritization of Events: In some scenarios, certain events might be more critical than others. If an event fails but does not immediately need to be retried (due to lower priority), it can be moved to a retry topic, allowing higher-priority events to be processed without delay.
      6. Maintaining Event Order: If the order of events is crucial, and a failed event needs to be processed before subsequent events, retrying the event while continuing to process others might violate the order. By using a retry topic, you can control the order of reprocessing to ensure that events are handled in the intended sequence.
      7. Handling Poison Messages: Some events might repeatedly fail processing due to being malformed or due to an issue that cannot be resolved immediately (poison messages). Moving these events to a separate topic prevents them from repeatedly causing failures in the main processing flow and allows for special handling or manual intervention.
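      A minimal sketch of how a consumer might route failures, assuming Java and hypothetical topic names; which exceptions count as transient (points 1–3) versus non-retryable or poison (point 7) is an assumption made purely for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical router: transient failures go to a retry topic, repeated or
// non-retryable failures ("poison messages") go to a dead-letter topic.
public class FailureRouter {

    private static final String RETRY_TOPIC = "orders-retry";
    private static final String DLQ_TOPIC = "orders-dlq";
    private static final int MAX_ATTEMPTS = 3;

    private final KafkaProducer<String, String> producer;

    public FailureRouter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void routeFailure(String key, String value, Exception error, int attempt) {
        // Assumed classification: I/O and timeout errors are transient and worth retrying.
        boolean retryable = error instanceof java.io.IOException
                || error instanceof java.util.concurrent.TimeoutException;
        String target = (retryable && attempt < MAX_ATTEMPTS) ? RETRY_TOPIC : DLQ_TOPIC;
        producer.send(new ProducerRecord<>(target, key, value));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            FailureRouter router = new FailureRouter(producer);
            router.routeFailure("order-42", "{...}", new java.io.IOException("connection reset"), 1);
        }
    }
}
```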

  • @doganaysahin9770
    @doganaysahin9770 2 years ago +3

    I agree. You can use this kind of implementation. But you should also be careful when you retry, because you can lose the order and end up with stale data.
    I have a question: how can you handle an exception that occurs when you try to send to the retry topic?

    • @Danieltammadge
      @Danieltammadge  2 years ago +1

      Thank you for taking the time to watch and comment. You are right. The approach shown here would not ensure that events are processed in the correct order.
      To preserve the order of changes to a particular business object, you would need to hold any events relating to an object that has an earlier event pending retry and successful processing in a holding area, and only process the later events after the initially failed event has been reprocessed.
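      A minimal, in-memory sketch of such a holding area, with hypothetical names; a production version would need durable storage and a way to expire entries:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical "holding area": while an earlier event for an entity is pending retry,
// later events for the same entity are parked and only released (in arrival order)
// once the failed event has been reprocessed successfully.
public class HoldingArea {

    private final Map<String, Deque<String>> heldEvents = new HashMap<>();

    /** Returns true if the event must be held because an earlier one is still pending retry. */
    public boolean holdIfBlocked(String entityId, String event) {
        Deque<String> queue = heldEvents.get(entityId);
        if (queue == null) {
            return false; // nothing pending for this entity: process normally
        }
        queue.addLast(event);
        return true;
    }

    /** Mark an entity as blocked when one of its events is sent for retry. */
    public void block(String entityId) {
        heldEvents.putIfAbsent(entityId, new ArrayDeque<>());
    }

    /** Release held events (in arrival order) once the failed event is reprocessed. */
    public Deque<String> release(String entityId) {
        Deque<String> queue = heldEvents.remove(entityId);
        return queue != null ? queue : new ArrayDeque<>();
    }

    public static void main(String[] args) {
        HoldingArea area = new HoldingArea();
        area.block("order-42");                          // event 5 failed, entity blocked
        area.holdIfBlocked("order-42", "event-6");       // event 6 parked
        System.out.println(area.release("order-42"));    // event 5 reprocessed: [event-6]
    }
}
```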

    • @Danieltammadge
      @Danieltammadge  2 years ago +1

      Please check out my latest video in response to your question, where I go into detail on how to keep the order of events
      ua-cam.com/video/FO2ptQNQKhM/v-deo.html

  • @StephenTD
    @StephenTD 2 years ago +1

    Awesome, it cleared up my questions around how to handle retries using an event streaming platform like Apache Kafka, and thank you for part 2, where you went into how to keep ordering. Again, amazing videos!!!!

    • @Danieltammadge
      @Danieltammadge  2 years ago

      Thank you, Danny, for taking the time to watch my videos and to write a comment on each one. And I’m glad that you found them helpful.

  • @abhishekbajpai1208
    @abhishekbajpai1208 1 month ago +1

    Good explanation.

  • @eduardleroux9550
    @eduardleroux9550 1 year ago +2

    Great video! Would love to get your take on using Kafka vs AWS SNS / SQS.
    It would be great if Kafka had a built-in retry mechanism (one that does not require additional topics), and once that fails the message is moved to a DLQ.

    • @Danieltammadge
      @Danieltammadge  1 year ago +1

      Great suggestion, Eduard! I am currently working on a video with my take, so stay tuned. Thank you for taking the time to watch and comment. And apologies for taking so long to reply.

    • @eduardleroux9550
      @eduardleroux9550 1 year ago +2

      ​@@Danieltammadge No worries mate, life happens! Looking forward to it, and thanks for posting awesome content and sharing the knowledge!

  • @cuongnguyenmanh4554
    @cuongnguyenmanh4554 2 years ago +2

    Thanks for sharing 👍. I have a question about the waiting time in the retry topic: how do you configure it? Thanks.

    • @Danieltammadge
      @Danieltammadge  2 years ago +1

      Hope you found it helpful.
      Let’s say you have a consumer subscribing to a retry topic.
      And for the messages in this topic you want to wait 5 minutes since publishing to reprocess the messages.
      What you would do is take advantage of Consumption Flow Control, which allows you to manually control the flow (kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html).
      So you do the following steps (a minimal sketch follows after the list):
      1. Consume messages
      2. Check whether the time since the message’s published timestamp is greater than the retry interval. If yes (> 5 mins), process it; if not (< 5 mins), continue to the next step
      3. Pause consumer
      4. Wait the time required
      5. Resume consumer (in some cases, you may need to close and start the consumer after resuming)
      Note: you cannot pause processing without pausing the consumer, or Kafka may think the client is in a faulted state and reassign its partitions to another consumer.
      Also, remember the next message will always have been published later, so if you are still waiting for the message at index 5, then index 6 will have an even longer time to wait.
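      A minimal sketch of these steps with the Java KafkaConsumer API; the topic name, group id and the 5-minute delay are assumptions, and offset and error handling are stripped down for brevity:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch of the pause/wait/resume flow described in the steps above.
public class DelayedRetryConsumer {

    private static final Duration RETRY_DELAY = Duration.ofMinutes(5);

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-retry-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders-retry"));
            while (true) {
                // 1. Consume messages
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // 2. Has the retry interval passed since the message was published?
                    long waited = System.currentTimeMillis() - record.timestamp();
                    long remaining = RETRY_DELAY.toMillis() - waited;
                    if (remaining > 0) {
                        // 3. Pause the assigned partitions so subsequent polls return no new records
                        consumer.pause(consumer.assignment());
                        // 4. Wait the remaining time, but keep polling so the group does not evict us
                        long resumeAt = System.currentTimeMillis() + remaining;
                        while (System.currentTimeMillis() < resumeAt) {
                            consumer.poll(Duration.ofSeconds(1));
                        }
                        // 5. Resume consumption
                        consumer.resume(consumer.assignment());
                    }
                    reprocess(record);
                }
                consumer.commitSync();
            }
        }
    }

    private static void reprocess(ConsumerRecord<String, String> record) {
        System.out.printf("Retrying %s=%s%n", record.key(), record.value());
    }
}
```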

  • @amseager
    @amseager 2 years ago +4

    Really wanted to implement a monolith after all of that lol

  • @kevinding0218
    @kevinding0218 1 year ago +2

    Great video! I'm interested in the design and would like to dive in a little. We usually have a different retry schedule for the 2nd/3rd topic; for example, we want the 2nd retry after 5 mins and the 3rd retry after 10 mins. But Kafka doesn't support a delay queue, so how should the producer publish a 2nd/3rd retry event so that it is executed after the scheduled waiting time?

    • @Danieltammadge
      @Danieltammadge  1 year ago +1

      Thank you for taking the time to comment. Hopefully the following will help
      danieltammadge.com/2023/02/delaying-apache-kafka-retry-consuming/

    • @Danieltammadge
      @Danieltammadge  1 year ago +1

      Try this link. It looks like it got corrupted when I copied kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
      The waiting logic is in the downstream retry consumer, which consumes the retry topic.
      When the upstream event processor needs to retry, the processor should publish events without delay.
      When the retry consumer consumes the retry topic and retrieves messages, the consumer must check whether 5 minutes have passed since the upstream processor published the event.
      If 5 minutes have not passed, then the consumer needs to pause the consumer group.
      And set a timer in the service to resume the consumer group after x seconds or minutes.
      Regarding consumer lag, you would expect roughly a 5-minute consumer lag. If the lag shows the consumer is processing events in less than 5 minutes, the retry consumer is not waiting the designated time.
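      A minimal sketch of the publishing side, assuming Java and hypothetical topic names and header key: the upstream processor publishes to the next retry topic immediately and records the attempt number, and each retry topic's consumer then applies its own waiting time (5 min, 10 min, ...) as described above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical retry publisher: events are published without delay; the delay
// happens on the consuming side of each retry topic. After the last retry topic,
// events fall through to a dead-letter topic.
public class RetryPublisher {

    private static final String[] RETRY_TOPICS = {"orders-retry-5m", "orders-retry-10m"};
    private static final String DLQ_TOPIC = "orders-dlq";

    private final KafkaProducer<String, String> producer;

    public RetryPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void publishRetry(String key, String value, int attempt) {
        String topic = attempt < RETRY_TOPICS.length ? RETRY_TOPICS[attempt] : DLQ_TOPIC;
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        // Record the attempt number in a header so downstream consumers know where they are.
        record.headers().add("retry-attempt",
                String.valueOf(attempt + 1).getBytes(StandardCharsets.UTF_8));
        producer.send(record); // published immediately; the record timestamp drives the wait
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            new RetryPublisher(producer).publishRetry("order-42", "{...}", 0); // first retry -> 5m topic
        }
    }
}
```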

    • @kevinding0218
      @kevinding0218 1 year ago +1

      @@Danieltammadge Thank you so much! That makes the process much clearer!

  • @tibi536
    @tibi536 2 years ago +3

    Nicely explained - I really liked the presentation :)

  • @Danieltammadge
    @Danieltammadge  2 years ago +2

    Part 2 is uploaded, so after you watch this one be sure to check it out. The link is at the end of the video.

  • @user-xj3ds6ho9f
    @user-xj3ds6ho9f 1 year ago +1

    Thank you for sharing @Daniel

  • @kristinaribena1654
    @kristinaribena1654 2 years ago +1

    Awesome. Thanks for posting

  • @rajapattanayak
    @rajapattanayak 1 year ago +1

    Hi Daniel, great video indeed. I have a question: how can we manage an unhandled exception? If we handle the exception, then we can send it to the retry topic.

    • @Danieltammadge
      @Danieltammadge  1 year ago

      Hi, I’m not sure I understand your question. Could you maybe rephrase…

  • @jincyv7386
    @jincyv7386 2 years ago +1

    Hi, how can we handle a persistent error on the producer side with Spring Cloud Stream?

    • @Danieltammadge
      @Danieltammadge  1 year ago

      Not sure I understand your comment.
      But if your system is not ensuring at-least-once publishing, I would recommend watching ua-cam.com/video/yUmzJ7mP3Iw/v-deo.html

  • @kristinaribena1654
    @kristinaribena1654 2 years ago +2

    Great video

  • @musicmania6214
    @musicmania6214 3 years ago +2

    Great video👏

  • @manideepkumar959
    @manideepkumar959 1 year ago +1

    If hands-on examples were also included it would have been better; I can’t get the most out of it.

    • @Danieltammadge
      @Danieltammadge  1 year ago

      Hopefully the following will help danieltammadge.com/2023/02/delaying-apache-kafka-retry-consuming/
      Thanks for watching and taking the time to comment

  • @xinyuzhang
    @xinyuzhang 6 months ago

    Thank you!!!!!!

  • @saritakumar1039
    @saritakumar1039 3 years ago +1

    Can you please share some code for retries?

    • @Danieltammadge
      @Danieltammadge  3 years ago

      I don't have code to share.
      But you need to look at pausing the consumer kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
      And then waiting until the message should be processed, and then unpausing by polling again.
      Using Google and the term “Kafka consumer pausing” should get you to what you need... or at a minimum provide you with the building blocks.

  • @chessmaster856
    @chessmaster856 1 year ago +1

    Any code, or only this? Anybody can write code, but only some can talk.

    • @Danieltammadge
      @Danieltammadge  1 year ago

      ChessMaster, thank you for taking the time to comment.
      Quick question: is your comment a question?

    • @chessmaster856
      @chessmaster856 1 year ago

      @@Danieltammadge Yes. Can you provide some code/configuration examples about how many error scenarios need to be handled in a message queue?

  • @tejashwinihampannavar8398
    @tejashwinihampannavar8398 2 years ago +1

    Thank you, Sir 🙏

    • @Danieltammadge
      @Danieltammadge  2 years ago

      Glad you found it helpful. Please ask any questions you have.

  • @amarnathcherukuri3076
    @amarnathcherukuri3076 2 years ago +1

    Great video