Andrew Potts: Distributed Systems: de-duplication and idempotency

When working on distributed systems, there are two important concepts to keep in mind: deduplication and idempotency. The two are sometimes confused, but there are subtle differences between them.

Background Context
In our e-commerce system, we receive orders submitted from Checkout and manage them through a lifecycle of Payment Authorisation, Fraud Check and Billing through to Dispatch. We employ an asynchronous message-driven architecture using a combination of technologies such as NServiceBus, MSMQ and Azure Service Bus. We use a combination of workflow choreography and orchestration. In this model, lightweight handlers receive a command or subscribe to an event, do some work, and publish a resulting message. These are chained together with other handlers to complete the overall workflow.

At-Least-Once Messaging
Messaging systems typically offer at-least-once messaging. That means you are guaranteed to get a message at least once, but under certain circumstances, such as failure modes, you may get the same message twice (perhaps on different, independent threads of execution).

Avoiding doing certain work twice
However, some work items should not be repeated: you should only charge a customer once or perform a fraud check once. When our handler receives a PaymentAccepted event, it performs a Fraud Check with a third-party service. This costs money, and calling it repeatedly will affect a customer's fraud score, so it should only be done once per order.

Deduplication
Some people employ deduplication to solve this. For example, Azure Service Bus gives you the option to deduplicate messages by setting a time window (say 10 minutes). Within that window, Azure Service Bus checks every message ID and, if it sees that the message has already been processed, prevents another handler from processing it. This is possible because ultimately the broker architecture revolves around a single SQL database (limited to a region).
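
To make the time-window idea concrete, here is a minimal in-memory sketch. It is not how Azure Service Bus implements duplicate detection; the DedupWindow class, its accept method and the injectable clock are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict


class DedupWindow:
    """Remembers message IDs for a fixed period and rejects repeats seen within it.

    Hypothetical illustration only; a real broker keeps this state durably.
    """

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        self._seen: Dict[str, float] = {}  # message ID -> time first seen

    def accept(self, message_id: str) -> bool:
        """Return True if the message should be delivered, False if it is a duplicate."""
        now = self.clock()
        # Forget IDs whose window has expired.
        self._seen = {
            mid: seen_at
            for mid, seen_at in self._seen.items()
            if now - seen_at < self.window_seconds
        }
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True


# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
window = DedupWindow(window_seconds=600, clock=lambda: fake_now[0])
print(window.accept("order-42"))  # first delivery
fake_now[0] = 60.0
print(window.accept("order-42"))  # repeat inside the 10 minute window
fake_now[0] = 700.0
print(window.accept("order-42"))  # repeat after the window has expired
```

Note that a duplicate arriving after the window has expired is delivered again, and a rejected duplicate produces no output at all.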

However, deduplication does not solve all of the problems. Idempotency is very important too.

Idempotency
If we are truly idempotent then a handler should “always produce the same output given the same input”.

Ideally, a handler would be fully idempotent. Suppose the work is to subscribe to a NewCustomerOrder event, set a new customer flag to true, and publish a NewCustomerUpdated event. We should be able to publish the NewCustomerOrder event ten times, and the outcome will always be the flag being set to true and a NewCustomerUpdated event being published.
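
A minimal sketch of such a naturally idempotent handler follows. The CustomerStore class, the handle_new_customer_order function and the in-memory published list are assumptions made for illustration; they stand in for a real database and message bus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (event name, customer ID)


@dataclass
class CustomerStore:
    """Hypothetical stand-in for wherever the new customer flag is persisted."""
    new_customer_flags: Dict[str, bool] = field(default_factory=dict)


def handle_new_customer_order(customer_id: str, store: CustomerStore,
                              published: List[Event]) -> None:
    # Setting the flag to True is repeatable: the state is the same after 1 or 10 calls.
    store.new_customer_flags[customer_id] = True
    # Always publish the outcome, even when the flag was already set.
    published.append(("NewCustomerUpdated", customer_id))


store = CustomerStore()
published: List[Event] = []
for _ in range(10):
    handle_new_customer_order("customer-7", store, published)

print(store.new_customer_flags)  # the flag is set once, to True
print(len(published))            # one NewCustomerUpdated per input event
```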

However, in the Fraud Check scenario, we should subscribe to the PaymentAccepted event, check history to determine whether to perform a Fraud Check, and always publish the result, normally FraudCheckPassed.
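
The Fraud Check handler described above could be sketched as follows. The history dictionary, the make_payment_accepted_handler factory and the fake fraud service are all hypothetical; in a real system the history lookup and write would need to be durable and safe under concurrent delivery, which this sketch does not attempt.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (event name, order ID)


def make_payment_accepted_handler(
    call_fraud_service: Callable[[str], str],
    history: Dict[str, str],
    published: List[Event],
) -> Callable[[str], None]:
    """Build a handler that calls the fraud service at most once per order."""

    def handle_payment_accepted(order_id: str) -> None:
        outcome = history.get(order_id)
        if outcome is None:
            # First time we see this order: do the expensive external call.
            outcome = call_fraud_service(order_id)
            history[order_id] = outcome
        # Duplicate or replay: skip the call but still publish the recorded outcome.
        published.append((outcome, order_id))

    return handle_payment_accepted


service_calls: List[str] = []


def fake_fraud_service(order_id: str) -> str:
    """Stand-in for the third-party service; records each call it receives."""
    service_calls.append(order_id)
    return "FraudCheckPassed"


history: Dict[str, str] = {}
published: List[Event] = []
handler = make_payment_accepted_handler(fake_fraud_service, history, published)
for _ in range(3):
    handler("order-42")

print(len(service_calls))  # the third party was called once
print(published)           # FraudCheckPassed was published for every input
```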

Idempotency supports Replays

This is important for replays. We had a problem where our system experienced a series of failures that caused messages to be lost at various stages of the workflow.

This is shown below:

It would have been very useful to go to the leftmost part of the workflow and replay the message: this would have resulted in the downstream handlers responding and repeating their work as required. However, many handlers employed deduplication: they simply swallowed the input event, did no work, and published no outcome. That meant later stages of the workflow were not exercised and the system had to be recovered stage by stage. It required a lot of manual effort, was time-consuming and was prone to error.

If each handler had been idempotent, it would have received the input event, chosen whether to do work or not, and published the output event.
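
The difference under replay can be sketched with a two-stage workflow. This is a simplified illustration, not our production code: the stage functions, the republish_on_duplicate switch and the OrderBilled event name are hypothetical. The scenario assumes the fraud stage already ran but its FraudCheckPassed message was lost, so billing never happened.

```python
from typing import Dict, List, Set


def run_fraud_stage(order_id: str, history: Dict[str, str],
                    republish_on_duplicate: bool) -> List[str]:
    """Return the events this stage publishes for a PaymentAccepted message."""
    if order_id in history:
        # Duplicate or replay: either swallow it or republish the last outcome.
        return [history[order_id]] if republish_on_duplicate else []
    history[order_id] = "FraudCheckPassed"  # stand-in for the real check
    return [history[order_id]]


def run_billing_stage(order_id: str, incoming: List[str],
                      billed: Set[str]) -> List[str]:
    """Bill at most once per order, but publish an outcome for every input."""
    published = []
    for event in incoming:
        if event == "FraudCheckPassed":
            if order_id not in billed:
                billed.add(order_id)  # charge the customer only once
            published.append("OrderBilled")
    return published


def replay(order_id: str, history: Dict[str, str], billed: Set[str],
           republish_on_duplicate: bool) -> List[str]:
    """Replay PaymentAccepted from the leftmost stage and return the final events."""
    fraud_events = run_fraud_stage(order_id, history, republish_on_duplicate)
    return run_billing_stage(order_id, fraud_events, billed)


# The fraud check is already recorded, but billing never received its outcome.
history = {"order-1": "FraudCheckPassed"}
billed: Set[str] = set()

print(replay("order-1", history, billed, republish_on_duplicate=False))  # swallowed
print(replay("order-1", history, billed, republish_on_duplicate=True))   # recovered
```

With the swallowing behaviour, the replay stops at the first stage and billing has to be recovered by hand. With republishing, one replay from the left completes the workflow without repeating the fraud check.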

What are the side effects?

So what impact could this have?

In the event of duplicated messages at the transport level (e.g. at-least-once messaging), we could end up with more load on the system. If one of the leftmost handlers received a duplicate, it would propagate to the handlers on its right. However, I believe this is a small price worth paying for the enhanced supportability. Furthermore, Azure Service Bus, being a broker with message-level locking, will prevent this except in handler failure scenarios.

If we are using Event Sourcing and we duplicate or replay an event, our event streams will record the fact that FraudCheckPassed was published twice. Arguably, though, this is a good thing. It is a true reflection of history, which is what the event stream should be. It is much better to have two publishes recorded than to have someone manually hacking the event stream to fix problems.

In our current system, we enforce deduplication at the transport layer using Azure Service Bus configuration. However, I believe this should be changed anyway: it prevents partitioned queues, it prevents us from changing the transport layer, and in any case deduplication and idempotency should be enforced at the handler level.

Deduplication

Note that idempotency and deduplication are separate concerns (although often mixed!). In the case of a Fraud Check or a call to a Payment Processor for billing, it is important that these calls are not repeated. If we receive a duplicate or replayed message, we don’t call these external systems again. Rather, we just republish the last outcome.

Closing argument

Our APIs are idempotent, so why aren’t our handlers? When an API receives the same input request, it returns the same output. This is important if, for example, the client enters a tunnel while issuing an HTTP request, loses the response, and has to retry.

Why aren’t our handlers behaving in this manner?

Andrew Potts

Monday 17 October 2016
