Hazelcast: Will the item added listener get triggered across all nodes when I add to and clear an ISet in the same function?

I want the item added listener to be triggered across all nodes when I add items, make an async call in the listener, and clear the ISet after the listener has been triggered.
According to the EntryProcessor logic, each query is executed as an entry processor, and I'm presuming it consistently triggers the item added listener across all nodes. My initial tests also point to the same behavior. But since I'm dealing with production data, I want to be 100% sure that the item added listener gets triggered on all the nodes, even though I clear the ISet a moment later on one of the nodes.
Kindly point me to documentation if you know of any, or please share your experience if you have faced a similar situation.

For the requirement "I want to be 100% sure that item added listener gets triggered in all the nodes", I'd suggest ReliableTopic. Also make sure your topic listener implementations use the ReliableMessageListener interface: it has additional methods for storing the sequence number (storeSequence()) and for retrieving it (retrieveInitialSequence()), so you can, for example, store this information locally on the client. Each client then listens to events based on the sequence id, which means that if it disconnects for some reason, it can resume from the latest event sequence id once the connection is restored.
http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Reliable_Topic.html
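The resume-from-sequence idea can be illustrated without Hazelcast at all. The sketch below is library-free and is not the Hazelcast API: `InMemoryTopic`, `ResumableListener` and `deliverFrom` are hypothetical names, and only the two method names `storeSequence` and `retrieveInitialSequence` echo the interface mentioned above.

```javascript
// Library-free sketch of sequence-based resumption (NOT the Hazelcast API).
class InMemoryTopic {
  constructor() {
    this.log = []; // the array index doubles as the sequence number
  }
  publish(message) {
    this.log.push(message);
  }
  // Deliver every message with sequence >= fromSequence to the listener.
  deliverFrom(fromSequence, listener) {
    for (let seq = fromSequence; seq < this.log.length; seq++) {
      listener.onMessage(this.log[seq]);
      listener.storeSequence(seq);
    }
  }
}

class ResumableListener {
  constructor() {
    this.lastSequence = -1; // hypothetical local storage of the last seen sequence
    this.received = [];
  }
  onMessage(message) {
    this.received.push(message);
  }
  storeSequence(seq) {
    this.lastSequence = seq;
  }
  retrieveInitialSequence() {
    return this.lastSequence + 1;
  }
}

const topic = new InMemoryTopic();
const listener = new ResumableListener();

topic.publish('item-1');
topic.deliverFrom(listener.retrieveInitialSequence(), listener);

// The listener is "disconnected" while two more messages are published.
topic.publish('item-2');
topic.publish('item-3');

// On reconnect it resumes after the last stored sequence.
topic.deliverFrom(listener.retrieveInitialSequence(), listener);
console.log(listener.received);
```

Because the listener remembers the last sequence it handled, the second delivery starts at the first message it missed, so nothing is skipped or repeated.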

Related

Event Sourcing Refactoring

I've been studying DDD for a while, and stumbled into design patterns like CQRS and Event Sourcing (ES). These patterns can help achieve some concepts of DDD with less effort.
In the architecture exemplified below, the aggregates know how to handle the commands and events related to themselves. In other words, the Event Handlers and Command Handlers are the Aggregates.
Then, I’ve started modeling one sample Domain just to understand how the implementation would follow the business logic. For this question here is my domain (It’s based on this):
I know this is a badly modeled example, but I’m using it just for illustration.
So, using ES, at the end of the operation, we would save all the events (Green arrows) into the event store (if there were no Exceptions), each event into its given Event Stream (Aggregate Type + Aggregate Id):
Everything seems right so far. So if we want to rebuild the internal state of an instance of any of these Aggregates, we only have to new it up (new()) and apply all the events saved in its respective Event Stream in the correct order.
My question is related to changes in the model, because software development is a process where we never stop learning about our domain, and we always come up with new ideas. So, let’s analyze some change scenarios:
Change Scenario 1:
Let's pretend that now, if the Reservation Aggregate finds that the seat is not available, it should send an event (Seat not reserved), and this event should be handled by a new Aggregate that stores all the people whose seats were not reserved:
Assuming the old system already handled the initial command (Place order) correctly, and saved all the events to their respective event streams:
When we want to rebuild the internal state of an instance of any of these Aggregates, we only have to new it up (new()) and apply all the events saved in its respective Event Stream in the correct order. (Nothing changed.) The only difference is that the new use case didn’t exist back in the old model.
Change Scenario 2:
Let’s pretend that now, when the payment is accepted, we handle this event (Payment Accepted) in a new Aggregate (Finance Aggregate) and not in the Order Aggregate anymore, and that it sends a new event (Payment Received) to the Order Aggregate. I know this scenario is not well structured, but something like this could happen.
Assuming the old system already handled the initial command (Place order) correctly, and saved all the events to their respective event streams:
When we want to rebuild the internal state of an instance of any of these Aggregates, we have a problem when applying the events from the Aggregate's Event Stream to itself:
Now, the Order no longer knows how to handle the Payment Accepted event.
Problems
So, as the examples showed, whenever a system change results in an event being handled by a different event handler (Aggregate), there are some major problems, because we cannot rebuild the internal state anymore.
So, this problem can have some solutions:
Possible Solution
When an event is not handled by the aggregate in whose Event Stream it is stored, we can find the new handler, create a new instance, and send the event to it. But to keep the internal state correct, we need the last event (Payment Received) to be handled by the Order Aggregate. So, we let it dispatch the event (and possibly commands):
This solution can have some problems. Let’s imagine that a new command (Place Order) arrives, and it has to create this order instance and save the new state. Now we would have:
In gray are the events that were already saved in the last call, when the system had not yet gone through model changes.
We can see that a new Event Stream is created for the new aggregate (Finance W). And we can see that Event Streams are append-only, so the Payment Accepted event in the Order Y Event Stream is still there.
The first Payment Accepted event in Finance W Event Stream is the one that was supposed to be handled by the Order but had to find a new handler.
The Yellow payment received event in Order’s Event Stream is the event that was generated by the new handler of the Payment Accepted when the Payment Accepted event from the Order’s Event Stream was handled by the Finance.
All the other Green Events are new events that were generated by handling the Place Order Command in the new model.
Problem With the Solution
The next time the aggregate needs to be rebuilt, there will be a Payment Accepted event in the stream (because it is append-only), and it will again call the new handler. But this has already been done, and the Payment Received event has already been saved to the stream. So it is not necessary to go through this again; we could ignore this event and continue.
Question
So, my question is: how can we handle model changes that impact who handles each event? How can we rebuild the internal state of an Aggregate after a change like this?
Will we need to build some Event Stream migration that moves the events from one stream to the new schema (one or more streams), just like we would in a relational database?
Will we never be allowed to remove a handler, so that we can only add new handlers? This would lead to an unmanageable system…
You got almost everything right, except one thing: Aggregates should not handle events from other Aggregates. It's like a non-event-sourced Aggregate sharing a table with another Aggregate: they should not.
In event-driven DDD, Aggregates are the system's building blocks that receive Commands (things that express intent) and return Events (things that have happened). For every Command type there must exist one and only one Aggregate type that handles it. Before executing a Command, the Aggregate is fed all of its own previously emitted Events; that is, every Event that was emitted in the past by this Aggregate instance is applied to this Aggregate instance, in chronological order.
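A minimal sketch of that cycle, assuming a hypothetical `OrderAggregate` with invented event and command names (`OrderPlaced`, `OrderConfirmed`, `ConfirmOrder`); it is only meant to show the shape of "apply own events, then handle one command", not a real framework.

```javascript
// Minimal event-sourced aggregate: state is rebuilt only from its own events.
class OrderAggregate {
  constructor() {
    this.status = 'new';
  }
  // Apply one previously emitted event to the in-memory state.
  apply(event) {
    switch (event.type) {
      case 'OrderPlaced':
        this.status = 'placed';
        break;
      case 'OrderConfirmed':
        this.status = 'confirmed';
        break;
      default:
        throw new Error(`Unknown event type: ${event.type}`);
    }
  }
  // Handle a command: validate against current state and return new events.
  handle(command) {
    if (command.type === 'ConfirmOrder') {
      if (this.status !== 'placed') {
        throw new Error('Only a placed order can be confirmed');
      }
      return [{ type: 'OrderConfirmed' }];
    }
    throw new Error(`Unknown command type: ${command.type}`);
  }
}

// Rebuild: new it up, then apply the stream in chronological order.
function rehydrate(stream) {
  const aggregate = new OrderAggregate();
  for (const event of stream) aggregate.apply(event);
  return aggregate;
}

const stream = [{ type: 'OrderPlaced' }];
const orderAggregate = rehydrate(stream);
const newEvents = orderAggregate.handle({ type: 'ConfirmOrder' });
stream.push(...newEvents); // the stream is append-only
console.log(rehydrate(stream).status);
```

Note that `handle` does not mutate state itself; the new state only appears once the returned events are appended and replayed, which keeps the stream the single source of truth.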
So, if you want to correctly model your system, you are not allowed to send events from one Aggregate as events to another Aggregate (a different type or instance).
If you need to model business processes that involve multiple Aggregates, the correct way of doing it is by using a Saga/Process manager. This is a different component. It is the opposite of an Aggregate.
It receives Events emitted by Aggregates and sends Commands to other Aggregates.
In simplest cases, a Saga manager simply takes properties from one Event and creates+populates a Command with those properties. Then it sends the Command to the destination Aggregate.
In more complicated cases, the Saga waits for multiple Events and when all are received only then it creates and sends a Command.
The Saga may also deduplicate or reorder events.
In your case, a Saga could be Sale, whose purpose would be to coordinate the entire sales process, from ordering to product dispatching.
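As a rough sketch of such a Saga, assuming hypothetical event names (`SeatsReserved`, `PaymentAccepted`), a hypothetical `ConfirmOrder` command, and a `sendCommand` callback standing in for whatever command bus is used:

```javascript
// Hypothetical process manager: consumes events, emits commands,
// and never touches aggregate state directly.
class SaleSaga {
  constructor(sendCommand) {
    this.sendCommand = sendCommand;
    this.seen = new Map(); // orderId -> Set of event types received so far
  }
  onEvent(event) {
    const types = this.seen.get(event.orderId) || new Set();
    if (types.has(event.type)) return; // deduplicate repeated deliveries
    types.add(event.type);
    this.seen.set(event.orderId, types);
    // Wait until both events have arrived, in any order, then send one command.
    if (types.has('SeatsReserved') && types.has('PaymentAccepted')) {
      this.sendCommand({ type: 'ConfirmOrder', orderId: event.orderId });
    }
  }
}

const sentCommands = [];
const saga = new SaleSaga((command) => sentCommands.push(command));
saga.onEvent({ type: 'PaymentAccepted', orderId: 'Y' });
saga.onEvent({ type: 'PaymentAccepted', orderId: 'Y' }); // duplicate, ignored
saga.onEvent({ type: 'SeatsReserved', orderId: 'Y' });
console.log(sentCommands);
```

If the business process changes, only the Saga's rules change; the Order Aggregate keeps receiving the same command.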
In conclusion, you have that problem because you have not modeled your system correctly. If your Aggregates had handled only their own Commands (and not somebody else's Events), then even if you had to create a new Saga when a new business process emerged, it would send the same Command to the same Aggregate.
Answering briefly
my question is how can we handle model changes that impact who handles each event?
Handling events is generally an easy thing to change, because the handling part is ephemeral. Events have a single writer, but they can have many readers. You just need to arrange for the plumbing to notify each subscriber of the event.
So in scenario #1, it's the PaymentAggregate that writes down the PaymentAccepted event (in its own stream), and then your plumbing notifies the OrderAggregate that the PaymentAccepted event happened, and it does the next thing in its own logic.
To change to scenario #2, we'd leave the Payment Aggregate unchanged, but we'd arrange the plumbing so that it tells the FinanceAggregate about PaymentAccepted, and that it tells the OrderAggregate about PaymentReceived.
Your pictures make it hard to see this; I think you aren't being careful to track that each change of state is stored in the stream of the aggregate that changed. Not your fault - the Microsoft picture is really awful.
In other words, your arrow #3 "Seats Reserved" isn't a SeatsReserved event, it's a Handle(SeatsReserved) command.

How to avoid concurrency on aggregates status using Rebus in a server cluster

I have a web service that uses Rebus as its service bus.
Rebus is configured as explained in this post.
The web service is load balanced across a two-server cluster.
These services are for a production environment and each production machine sends commands to save the produced quantities and/or to update its state.
In the BL I've modelled an Aggregate Root for each machine and it executes the commands emitted by the real machine. To preserve the correct status, the Aggregate needs to receive the commands in the same sequence as they were emitted, and, since there is no concurrency for that machine, that is the same order they are saved on the bus.
E.g.: machine XX sends an 'add new piece done' command and then a 'Set stop for maintenance' command. Executing these commands in sequence, you should end up with Aggregate XX in state 'Stop'; but with multiple servers/worker roles, both commands could be executed at the same time on the same version of the Aggregate. This means that, depending on who saves the aggregate first, I can have Aggregate XX in state 'Stop' or 'Producing pieces' ... which is not the same thing.
I've introduced a service bus to scale out as the number of machines grows, and for resilience (if a server fails, I only get a slowdown in processing commands).
Currently I'm using the name of the aggregate like a "topic" or "destinationAddress" with the IAdvancedApi, so the name of the aggregate is saved as the recipient in the transport. Then I've created a custom Transport class that:
1. does not remove the messages in progress but sets them to state InProgress.
2. to retrieve messages, selects only those whose recipient has no message InProgress.
I'm wondering: is this the best way to guarantee that the bus executes the commands for an aggregate in the same sequence as they arrived?
The solution would be to have some kind of locking of your aggregate root, which needs to happen at the data store level.
E.g. by using optimistic locking (probably implemented with some kind of revision number or something like that), you would be sure that you would never accidentally overwrite another node's edits.
This would allow for your aggregate to either
a) accept the changes in either order (which is generally preferable – makes your system more tolerant), or
b) reject an invalid change
If the aggregate rejects the change, this could be implemented by throwing an exception. And then, in the Rebus handler that catches this exception, you can e.g. await bus.Defer(TimeSpan.FromSeconds(5), theMessage) which will cause it to be delivered again in five seconds.
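The revision-number idea can be sketched with an in-memory store. Everything here is an assumption for illustration: `AggregateStore` and `ConcurrencyError` are hypothetical names, and the redelivery is reduced to a flag rather than a real Rebus call.

```javascript
// In-memory stand-in for a data store with optimistic locking.
class ConcurrencyError extends Error {}

class AggregateStore {
  constructor() {
    this.rows = new Map(); // id -> { revision, state }
  }
  load(id) {
    const row = this.rows.get(id) || { revision: 0, state: 'Producing' };
    return { ...row };
  }
  // Save succeeds only if nobody else has saved since we loaded.
  save(id, expectedRevision, state) {
    const current = this.rows.get(id) || { revision: 0 };
    if (current.revision !== expectedRevision) {
      throw new ConcurrencyError(`Stale revision for ${id}`);
    }
    this.rows.set(id, { revision: expectedRevision + 1, state });
  }
}

const store = new AggregateStore();
// Two workers load the same version of machine XX concurrently.
const workerA = store.load('XX');
const workerB = store.load('XX');

store.save('XX', workerA.revision, 'Stop'); // first writer wins

let deferred = false;
try {
  store.save('XX', workerB.revision, 'Producing'); // stale, so rejected
} catch (err) {
  if (!(err instanceof ConcurrencyError)) throw err;
  deferred = true; // a real handler would ask the bus to redeliver later
}
console.log(store.load('XX'), deferred);
```

On redelivery the second worker would reload the aggregate at the newer revision and decide again, so the stale write never overwrites the 'Stop' state.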
You should never rely on message order in a service bus / queuing / messaging environment.
When you do find yourself in this position you may need to re-think your design. Firstly, a service bus is most certainly not an event store and attempting to use it like one is going to lead to pain and suffering :) --- not that you are attempting this but I thought I'd throw it in there.
As for your design, in order to manage this kind of state you may want to look at a process manager. If you are not generating those commands then even this will not help.
However, given your scenario it seems as though the calls are sequential but perhaps it is just your example. In any event, as mookid8000 said, you either want to:
discard invalid changes (with the appropriate feedback),
allow any order of messages as long as they are valid,
ignore out-of-sequence messages till later.
Hope that helps...
"exactly the same sequence as they were saved on the bus"
Just... why?
Would you rely on your HTTP server logs to know which command actually reached an aggregate first? No, because that is totally unreliable, just as it is with at-least-once delivery guarantees, and it's also irrelevant.
It is your event store and/or normal persistence state that should be the source of truth when it comes to knowing the sequence of events. The order of commands shouldn't really matter.
Assuming optimistic concurrency, if the aggregate is not allowed to transition from A to C, then it should guard this invariant, and when a TransitionToStateC command hits it in the A state, the command simply gets rejected.
If on the other hand, A->C->B transitions are valid and that is the order received by your aggregate well that is what happened from the domain perspective. It really shouldn't matter which command was published first on the bus, just like it doesn't matter which user executed the command first from the UI.
"In my scenario the calls for a specific aggregate are absolutely sequential and I must guarantee that are executed in the same order"
Why are you executing them asynchronously and potentially concurrently by publishing on a bus then? What you are basically saying is that calls are sequential and cannot be processed concurrently. That means everything should be synchronous because there is no potential benefit from parallelism.
Why:
executeAsync(command1)
executeAsync(command2)
executeAsync(command3)
When you want:
execute(command1)
execute(command2)
execute(command3)
You should have a single command message and the handler of this message executes multiple commands against the aggregate. Then again, in this case I'd just create a single operation on the aggregate that performs all the transitions.

Event Sourcing with Side-Effects

I'm building a service using the familiar event sourcing pattern:
1. A request is received.
2. The aggregate's history is loaded.
3. The aggregate is rebuilt (from its history).
4. New events are prepared and the aggregate is updated in response to the incoming request from Step 1.
5. These events are written to the log, and are made available (published) to any subscribers.
In my case, Step 5 is accomplished in two parts. The events are written to the event log. A background process reads from the event log and publishes all events starting from an offset.
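The two-part Step 5 could look roughly like the sketch below. The names `eventLog`, `appendEvents` and `publishPending` are hypothetical, the log is a plain array, and the offset lives in memory, whereas a real publisher would persist it.

```javascript
// Append-only log plus a publisher that remembers how far it has published.
const eventLog = [];
const published = [];
let offset = 0; // next log position to publish (would be persisted in practice)

function appendEvents(events) {
  eventLog.push(...events);
}

// One pass of the background process: publish everything from the offset onward.
function publishPending(publish) {
  while (offset < eventLog.length) {
    publish(eventLog[offset]);
    offset += 1; // advance only after the publish call returns
  }
}

appendEvents([{ type: 'OrderPlaced' }, { type: 'OrderConfirmed' }]);
publishPending((event) => published.push(event.type));
appendEvents([{ type: 'OrderShipped' }]);
publishPending((event) => published.push(event.type));
console.log(published, offset);
```

Because the offset advances only after a successful publish, a crash between the two steps leads to a repeated publish rather than a lost one, so subscribers should tolerate duplicates.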
In some cases, I need to publish side effects in addition to events related to the aggregate. As far as the system is concerned, these are events too because they are consumed by and affect the state of other services. However, they don't affect the history of the aggregate in this service and are not needed to rebuild it.
How should I handle these in the code?
Option 1-
Don't write side-effecting events to the event log. Publish these in the main process prior to Step 5.
Option 2-
Write everything to the event log and ignore side-effecting events when the history is loaded. (These aren't part of the history!)
Option 3-
Write side-effecting events to a dummy aggregate so they are published, but never loaded.
Option 4-
?
In the first option, there may be trouble if there is a concurrency violation: if the write fails in Step 5, the side effect cannot be easily rolled back. The second option writes events that are not part of the aggregate's history; when loading in Step 2, these side-effecting events would have to be ignored. The third option feels like a hack.
Which of these seems right to you?
Name events correctly
Events are "things that happened". So if you are able to name the events that only trigger side effects in a "X happened" fashion, they become a natural part of the event history.
In my experience, this is always possible, because side-effects don't happen out of thin air. Sometimes the name becomes a bit artificial, but it is still better to name events that way than to call them e.g. "send email to that client event".
In terms of your list of alternatives, this would be option 2.
Example
Instead of calling an event "send status email to customer event", call it "status email triggered event". Of course, if there is a better name for the actual trigger, use that one :-)
Option 4 - Have some other service subscribe to the events and produce the side effects, and any additional events related to them.
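A small sketch of this option, assuming a hypothetical `NotificationService` that subscribes to published events; the event names `OrderShipped` and `StatusEmailTriggered` and the `sendEmail` callback are invented for illustration.

```javascript
// Hypothetical downstream service: reacts to published events and owns the side effects.
class NotificationService {
  constructor(sendEmail) {
    this.sendEmail = sendEmail;
    this.ownEvents = []; // events recorded in this service's own log
  }
  onEvent(event) {
    if (event.type !== 'OrderShipped') return; // not relevant to this service
    this.sendEmail(event.orderId); // the side effect itself
    this.ownEvents.push({ type: 'StatusEmailTriggered', orderId: event.orderId });
  }
}

const sentEmails = [];
const notifications = new NotificationService((orderId) => sentEmails.push(orderId));

// The publisher delivers every event; the service filters for what it needs.
[
  { type: 'OrderPlaced', orderId: 'Y' },
  { type: 'OrderShipped', orderId: 'Y' },
].forEach((event) => notifications.onEvent(event));

console.log(sentEmails, notifications.ownEvents);
```

The originating aggregate's stream stays free of side-effect records, while the subscriber keeps its own history of what it triggered.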
Events should be fine-grained.
Option 1- Don't write side-effecting events to the event log. Publish these in the main process prior to Step 5.
What if you later need this part of the history by building a new bounded context?
Option 2- Write everything to the event log and ignore side-effecting events when the history is loaded. (These aren't part of the history!)
How to ignore the effect of something which does not have any effect? :D
Option 3- Write side-effecting events to a dummy aggregate so they are published, but never loaded.
Why do you need consistency boundary around something which you will never change?
What you are talking about is the most common form of domain events, which you use to communicate with other bounded contexts. Of course you need to save them.

React Flux dispatcher vs Node.js EventEmitter - scalable?

When you use Node's EventEmitter, you subscribe to a single event. Your callback is only executed when that specific event is fired:
eventBus.on('some-event', function (data) {
  // data is specific to 'some-event'
});
In Flux, you register your store with the dispatcher, then your store gets called when every single event is dispatched. It is the job of the store to filter through every event it gets, and determine if the event is important to the store:
eventBus.register(function (data) {
  switch (data.type) {
    case 'some-event':
      // now data is specific to 'some-event'
      break;
  }
});
In this video, the presenter says:
"Stores subscribe to actions. Actually, all stores receive all actions, and that's what keeps it scalable."
Question
Why and how is sending every action to every store [presumably] more scalable than only sending actions to specific stores?
The scalability referred to here is more about scaling the codebase than scaling in terms of how fast the software is. Data in Flux systems is easy to trace because every store is registered to every action, and the actions define every app-wide event that can happen in the system. Each store can determine how it needs to update itself in response to each action, without the programmer needing to decide which stores to wire up to which actions, and in most cases, you can change or read the code for a store without needing to worry about how it affects any other store.
At some point the programmer will need to register the store. The store is very specific to the data it'll receive from the event. How exactly is looking up the data inside the store better than registering for a specific event, and having the store always expect the data it needs/cares about?
The actions in the system represent the things that can happen in a system, along with the relevant data for that event. For example:
A user logged in; comes with user profile
A user added a comment; comes with comment data, item ID it was added to
A user updated a post; comes with the post data
So, you can think about actions as the database of things the stores can know about. Any time an action is dispatched, it's sent to each store. So, at any given time, you only need to think about your data mutations a single store + action at a time.
For instance, when a post is updated, you might have a PostStore that watches for the POST_UPDATED action, and when it sees it, it will update its internal state to store off the new post. This is completely separate from any other store which may also care about the POST_UPDATED event—any other programmer from any other team working on the app can make that decision separately, with the knowledge that they are able to hook into any action in the database of actions that may take place.
Another reason this is useful and scalable in terms of the codebase is inversion of control; each store decides what actions it cares about and how to respond to each action; all the data logic is centralized in that store. This is in contrast to a pattern like MVC, where a controller is explicitly set up to call mutation methods on models, and one or more other controllers may also be calling mutation methods on the same models at the same time (or different times); the data update logic is spread through the system, and understanding the data flow requires understanding each place the model might update.
Finally, another thing to keep in mind is that registering vs. not registering is sort of a matter of semantics; it's trivial to abstract away the fact that the store receives all actions. For example, in Fluxxor, the stores have a method called bindActions that binds specific actions to specific callbacks:
this.bindActions(
  "FIRST_ACTION_TYPE", this.handleFirstActionType,
  "OTHER_ACTION_TYPE", this.handleOtherActionType
);
Even though the store receives all actions, under the hood it looks up the action type in an internal map and calls the appropriate callback on the store.
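A sketch of how such a lookup might work. This is not Fluxxor's actual implementation: `MapStore` and `onAction` are hypothetical, and only the alternating `(type, handler)` argument shape is taken from the call above.

```javascript
// Sketch of a store that receives every action but dispatches via a lookup map.
class MapStore {
  constructor() {
    this.handlers = new Map();
    this.log = [];
  }
  // Accepts alternating (actionType, handler) arguments.
  bindActions(...pairs) {
    for (let i = 0; i < pairs.length; i += 2) {
      this.handlers.set(pairs[i], pairs[i + 1].bind(this));
    }
  }
  // Called by the dispatcher for every action; unknown types are ignored.
  onAction(action) {
    const handler = this.handlers.get(action.type);
    if (handler) handler(action.payload);
  }
}

const mapStore = new MapStore();
mapStore.bindActions(
  'FIRST_ACTION_TYPE', function (payload) { this.log.push(`first:${payload}`); },
  'OTHER_ACTION_TYPE', function (payload) { this.log.push(`other:${payload}`); }
);
mapStore.onAction({ type: 'FIRST_ACTION_TYPE', payload: 1 });
mapStore.onAction({ type: 'UNRELATED', payload: 2 });
mapStore.onAction({ type: 'OTHER_ACTION_TYPE', payload: 3 });
console.log(mapStore.log);
```

The store still sees the `UNRELATED` action, but it falls through the map lookup without any handler running.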
I've been asking myself the same question, and can't see technically how registering adds much, beyond simplification. I will lay out my understanding of the system so that, hopefully, if I am wrong, I can be corrected.
TL;DR: EventEmitter and Dispatcher serve similar purposes (pub/sub) but focus their efforts on different features. Specifically, the 'waitFor' functionality (which allows one event handler to ensure that a different one has already been called) is not available with EventEmitter. Dispatcher has focused its efforts on the 'waitFor' feature.
The final result of the system is to communicate to the stores that an action has happened. Whether the store 'subscribes to all events, then filters' or 'subscribes to a specific event' (filtering at the dispatcher) should not affect the final result: data is transferred in your application. (The handler always just switches on event type and processes it; it doesn't want to operate on ALL events.)
As you said, "At some point the programmer will need to register the store." It is just a question of fidelity of subscription. I don't think that a change in fidelity has any effect on 'inversion of control', for instance.
The added (killer) feature in Facebook's Dispatcher is its ability to 'waitFor' a different store to handle the event first. The question is, does this feature require that each store has only one event handler?
Let's look at the process. When you dispatch an action on the Dispatcher, it (omitting some details):
iterates over all registered subscribers (to the dispatcher)
calls the registered callback (one per store)
the callback can call 'waitFor()' and pass a 'dispatchToken'. This internally references the callback registered by a different store, which is executed synchronously, causing the other store to receive the action and be updated first. This requires that 'waitFor()' is called before your code which handles the action.
The callback called by 'waitFor' switches on action type to execute the correct code.
the callback can now run its code, knowing that its dependencies (other stores) have already been updated.
the callback switches on the action 'type' to execute the correct code.
This seems a very simple way to allow event dependencies.
Basically, all callbacks are eventually called, but in a specific order, and each then switches to execute only the specific code. So it is as if we had only triggered a handler for the 'add-item' event on each store, in the correct order.
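The process described above can be sketched as a tiny dispatcher. This is an illustration under simplifying assumptions, not Facebook's implementation: `MiniDispatcher` is a hypothetical name, tokens are plain array indexes, and there is no detection of circular `waitFor` calls.

```javascript
// Minimal dispatcher with waitFor (a sketch, not Facebook's implementation).
class MiniDispatcher {
  constructor() {
    this.callbacks = []; // the array index is the dispatch token
    this.handled = [];
    this.currentAction = null;
  }
  register(callback) {
    this.callbacks.push(callback);
    return this.callbacks.length - 1;
  }
  dispatch(action) {
    this.currentAction = action;
    this.handled = this.callbacks.map(() => false);
    this.callbacks.forEach((_, token) => this.invoke(token));
  }
  // Run the given callbacks first, synchronously, if they have not run yet.
  waitFor(tokens) {
    tokens.forEach((token) => this.invoke(token));
  }
  invoke(token) {
    if (this.handled[token]) return;
    this.handled[token] = true;
    this.callbacks[token](this.currentAction);
  }
}

const dispatcher = new MiniDispatcher();
const callOrder = [];
let userToken; // assigned below, read only at dispatch time

dispatcher.register((action) => {
  dispatcher.waitFor([userToken]); // must come before this store's own handling
  if (action.type === 'add-item') callOrder.push('CartStore');
});
userToken = dispatcher.register((action) => {
  if (action.type === 'add-item') callOrder.push('UserStore');
});

dispatcher.dispatch({ type: 'add-item' });
console.log(callOrder);
```

Although the cart callback was registered first, the user callback runs before it, and the `handled` flags ensure each callback runs only once per dispatch.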
If subscriptions were at a callback level (not 'store' level), would this still be possible? It would mean:
Each store would register multiple callbacks to specific events, keeping a reference to their 'dispatchTokens' (same as currently)
Each callback would have its own 'dispatchToken'
The user would still 'waitFor' a specific callback, but it would be a specific handler of a specific store
The dispatcher would then only need to dispatch to the callbacks of a specific action, in the same order
Possibly the smart people at Facebook have figured out that adding the complexity of individual callbacks would actually be less performant, or possibly it is not a priority.

Windows Azure staging <--> production causing conflicts & errors on table storage

We had a terrible problem/experience yesterday when trying to swap our staging <--> production role.
Here is our setup:
We have a worker role picking up messages from the queue. These messages are processed on the role (Table Storage inserts, db selects, etc.). This can take maybe 1-3 seconds per queue message, depending on how many table storage posts it needs to make. It deletes the message when everything is finished.
Problem when swapping:
When our staging project went online, our production worker role started erroring.
When the role tried to process a queue message, it gave a constant stream of 'EntityAlreadyExists' errors. Because of these errors, queue messages weren't getting deleted. This caused the queue messages to be put back in the queue and processed again, and so on...
When looking inside these queue messages and analysing what had happened to them, we saw they had actually been processed but not deleted.
The problem wasn't over after deleting these faulty messages. New queue messages, which had not been processed before, weren't processed either, and no table storage records were added, which sounds very strange.
When we deleted both staging and production and published to production again, everything started to work just fine.
Possible problem(s)?
We have little to no idea what actually happened.
Maybe both the roles picked up the same messages and one did the post and one errored?
...???
Possible solution(s)?
We have some ideas on how to solve this 'problem'.
Make a poison message fail over system? When the dequeue count gets over X we should just delete that queue message or place it into a separate 'poisonqueue'.
Catch the EntityAlreadyExists error and just delete that queue message or put it in a separate queue.
...????
Multiple roles
I suppose we will have the same problem when putting up multiple roles?
Many thanks.
EDIT 24/02/2012 - Extra information
We actually use the GetMessage()
Every item in the queue is unique and will generate unique messages in Table Storage. A little more information about the process: a user posts something, and it has to be distributed to certain other users. The message generated by that user has a unique id (guid). This message is posted to the queue and picked up by the worker role. The message is distributed over several other tables (partitionkey -> UserId, rowkey -> some timestamp in ticks & the unique message id). So there is almost no chance the same messages will be posted in a normal situation.
The invisibility timeout COULD be a logical explanation, because some messages could be distributed to something like 10-20 tables. This means 10-20 inserts without the batch option. Can you set or extend this invisibility timeout?
Not deleting the queue message because of an exception COULD be an explanation as well, because we haven't implemented any poison message failover YET ;).
Regardless of the Staging vs. Production issue, having a mechanism that handles poison messages is critical. We've implemented an abstraction layer over Azure queues that automatically moves messages over to a poison queue once processing them has been attempted a configurable number of times.
You clearly have a fault in handling duplicate messages. The fact that your ID is unique doesn't mean that the message will not be processed twice on some occasions, such as:
The role dying with partially finished work, so the message re-appears in the queue for processing
The role crashing unexpectedly, so the message ends up back in the queue
The FC (Fabric Controller) moving your role while you don't have code to handle this situation, so the message ends up back in the queue
In all cases, you need code that handles the fact that the message will re-appear. One way is to use the DequeueCount property and check how many times the message has been removed from the queue and received for processing. Make sure you have code that handles partial processing of a message.
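A dequeue-count check could be sketched as follows. The queue here is a plain array and the names (`processQueue`, `MAX_DEQUEUE_COUNT`, the `dequeueCount` field) are illustrative stand-ins, not the Azure SDK.

```javascript
// In-memory stand-in for a queue; names are illustrative, not the Azure SDK.
const MAX_DEQUEUE_COUNT = 3;

function processQueue(queue, poisonQueue, handler) {
  while (queue.length > 0) {
    const message = queue.shift();
    message.dequeueCount += 1;
    if (message.dequeueCount > MAX_DEQUEUE_COUNT) {
      poisonQueue.push(message); // park it for inspection instead of retrying forever
      continue;
    }
    try {
      handler(message);
      // Success: the message is simply not put back (the "delete").
    } catch (err) {
      queue.push(message); // failure: it becomes visible again
    }
  }
}

const queue = [
  { id: 'ok', dequeueCount: 0 },
  { id: 'bad', dequeueCount: 0 },
];
const poisonQueue = [];
const processed = [];

processQueue(queue, poisonQueue, (message) => {
  if (message.id === 'bad') throw new Error('EntityAlreadyExists');
  processed.push(message.id);
});
console.log(processed, poisonQueue.map((m) => m.id));
```

The failing message is attempted three times and then moved aside, so it can no longer block or endlessly recycle through the main queue.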
Now what probably happened during swapping was this: when the production environment became staging and staging became production, both of them were trying to receive the same messages, so they were basically competing with each other for those messages. That is probably not bad in itself, because competing consumers is a known pattern that works. But when you killed your old production (now staging), every message that had been received for processing and wasn't finished ended up back in the queue, and your new production environment picked the message up for processing again. With no code to handle this scenario and a message that was partially processed, some records already existed in the tables, and that started causing the behavior you noticed.
There are a few possible causes:
How are you reading the queue messages? If you are doing a Peek Message then the message will still be visible to be picked up by another role instance (or your staging environment) before the message is deleted. You want to make sure you are using Get Message so the message is invisible until it can be deleted.
Is it possible that your first role crashed after doing the work for the message but prior to deleting the message? This would cause the message to become visible again and get picked up by another role instance. At that point the message will be a poison message which will cause your instances to constantly crash.
This problem almost certainly has nothing to do with Staging vs Production, but is most likely caused by having multiple instances reading from the same queue. You can probably reproduce the same problem by specifying 2 instances, or by deploying the same code to 2 different production services, or by running the code locally on your dev machine (still pointing to Azure storage) using 2 instances.
In general you do need to handle poison messages so you need to implement that logic anyways, but I would suggest getting to the root cause of this problem first, otherwise you are just going to run into a lot more problems later on.
With queues you need to code with idempotency in mind and expect and handle the ‘EntityAlreadyExists’ as a viable response.
As others have suggested, causes could be
Multiple messages in the queue with the same identifier.
Peeking at the message instead of reading it from the queue, and so not making it invisible.
Not deleting the message because an exception was thrown before you could delete it.
Taking too long to process the message, so it cannot be deleted (because the invisibility timed out) and appears again.
Without looking at the code, I am guessing that it is either the third or the fourth option that is occurring.
If you cannot detect the issue with a code review, you may consider adding time based logging and try/catch wrappers to get a better understanding.
Using queues effectively, in a multi-role environment, requires a slightly different mindset and running into such issues early is actually a blessing in disguise.
Appended 2/24
Just to clarify, modifying the invisibility timeout is not a generic solution to this type of problem. Also, note that this feature, although available in the REST API, may not be available in the queue client.
Other options involve writing to table storage asynchronously to speed up your processing time, but again this is a stopgap measure which does not really address the underlying paradigm of working with queues.
So, the bottom line is to be idempotent. You can try using the table storage upsert (update or insert) feature to avoid getting the ‘EntityAlreadyExists’ error, if that works for your code. If all you are doing is inserting new entities into Azure table storage, then the upsert should solve your problem with minimal code change.
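The difference between a plain insert and an upsert on redelivery can be shown with a small in-memory table. `FakeTable` and `EntityAlreadyExistsError` are hypothetical stand-ins for table storage, keyed by partition key and row key as in the question.

```javascript
// A Map stands in for table storage, keyed by partitionKey|rowKey.
class EntityAlreadyExistsError extends Error {}

class FakeTable {
  constructor() {
    this.rows = new Map();
  }
  insert(entity) {
    const key = `${entity.partitionKey}|${entity.rowKey}`;
    if (this.rows.has(key)) throw new EntityAlreadyExistsError(key);
    this.rows.set(key, entity);
  }
  // Insert-or-replace: safe to repeat with the same entity.
  upsert(entity) {
    this.rows.set(`${entity.partitionKey}|${entity.rowKey}`, entity);
  }
}

const table = new FakeTable();
const entity = { partitionKey: 'user-1', rowKey: 'msg-42', text: 'hello' };

// First delivery, then a redelivery of the same queue message.
table.upsert(entity);
table.upsert(entity);

let insertFailed = false;
try {
  table.insert(entity); // a plain insert would fail on redelivery
} catch (err) {
  insertFailed = err instanceof EntityAlreadyExistsError;
}
console.log(table.rows.size, insertFailed);
```

Processing the same message twice with the upsert leaves exactly one row, which is what makes the handler idempotent for pure inserts.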
If you are doing updates then it is a different ball game all together. One pattern is to pair updates with dummy inserts in the same table with the same partition key so as to error out if the update occurred previously and so skip the update. Later after the message is deleted, you can delete the dummy inserts. However, all this adds to the complexity, so it is much better to revisit the architecture of the product; for example, do you really need to insert/update into so many tables?
Without knowing what your worker role is actually doing I'm taking a guess here, but it sounds like when you have two instances of your worker role running you are getting conflicts while trying to write to an Azure table. It is likely to be because you have code that looks something like this:
var queueMessage = GetNextMessageFromQueue();
Foo myFoo = GetFooFromTableStorage(queueMessage.FooId);
if (myFoo == null)
{
    myFoo = new Foo {
        PartitionKey = queueMessage.FooId
    };
    AddFooToTableStorage(myFoo);
}
DeleteMessageFromQueue(queueMessage);
If you have two adjacent messages in the queue with the same FooId it is quite likely that you'll end up with both of the instances checking to see if the Foo exists, not finding it then trying to create it. Whichever instance is the last to try and save the item will get the "Entity already exists" error. Because it errored it never gets to the delete message part of the code and therefore it becomes visible back on the queue after a period of time.
As others have said, dealing with poison messages is a really good idea.
Update 27/02
If it's not subsequent messages (which, based on your partition/row key scheme, I would say is unlikely), then my next bet would be that it's the same message appearing back in the queue after the visibility timeout. By default, if you're using .GetMessage(), the timeout is 30 seconds. It has an overload which allows you to specify how long that time frame is. There is also the .UpdateMessage() function, which allows you to update that timeout while you're processing the message. For example, you could set the initial visibility timeout to 1 minute, then, if you're still processing the message 50 seconds later, extend it for another minute.
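The 1 minute plus extension example can be simulated with an explicit clock. `VisibilityQueue` is a hypothetical in-memory model whose method names only mirror the calls mentioned above; it is not the Azure storage client, and times are plain numbers of seconds.

```javascript
// Simulated queue with a visibility timeout and an explicit clock (seconds).
class VisibilityQueue {
  constructor() {
    this.messages = [];
  }
  add(body) {
    this.messages.push({ body, visibleAt: 0 });
  }
  // Return the first visible message and hide it for `timeout` seconds.
  getMessage(now, timeout = 30) {
    const message = this.messages.find((m) => m.visibleAt <= now);
    if (!message) return null;
    message.visibleAt = now + timeout;
    return message;
  }
  // Extend the invisibility window while work is still in progress.
  updateMessage(message, now, timeout) {
    message.visibleAt = now + timeout;
  }
  deleteMessage(message) {
    this.messages = this.messages.filter((m) => m !== message);
  }
}

const vq = new VisibilityQueue();
vq.add('distribute post');

const first = vq.getMessage(0, 60);   // hidden until t=60
const seenAt50 = vq.getMessage(50);   // still invisible to other readers
vq.updateMessage(first, 50, 60);      // still working, hide until t=110
const seenAt70 = vq.getMessage(70);   // the extension keeps it hidden
vq.deleteMessage(first);              // finished before t=110
const seenAt200 = vq.getMessage(200); // the message is gone
console.log(seenAt50, seenAt70, seenAt200);
```

Without the `updateMessage` call the message would have become visible again at t=60, and a second worker reading at t=70 would have picked it up while the first was still processing it.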
