Spring Integration Kafka slow message processing when using KafkaNativeOffsetManager - spring-integration

We've been using SI Kafka for a new project here with much success. Prior to a recent switch, we were using the KafkaTopicOffsetManager for the management of our consumers topic offset. In order to not have additional topics per consumer/topic pair and to use Burrow or lag monitoring, we decided to use the latest KafkaNativeOffsetManager that uses the native offset management provided by Kafka. After making the switch though, we noticed that the consumption of messages from the target topic was continually lagging behind. We know this didn't happen with the KafkaTopicOffsetManager as we were using it for months prior to the switch. We also ran side-by-side tests and verified that the consumption of messages was in near real time with the production of messages when using KafkaTopicOffsetManager and the KafkaNativeOffsetManager was always increasingly lagging behind. Both offset managers are using default configuration and are both committing offsets after the message is processed (auto-acknowledge).
So I really have two questions, the first not be the primary of this SO post.
First question is why would this be the case that the native offset management is slower than using a topic for offset management?
Second question is, can we configure SI kafka to not commit offsets on the successful processing of each message but rather provide a different strategy? Our thought was that maybe we shouldn't be committing offsets so frequently and should be either doing them as batch update. For example, commit offsets after successfully processing 25 messages or after 30 seconds.
Thank you

When disabling autocommit and receiving the acknowledgment header, the only thing you need to do is to invoke acknowledge() once your message has been processed. This assumes that even if you are handling the message in a different thread, you will retain a reference to the Acknowledgment instance, either as such, or as part of the original Message - or you are copying headers if you are doing transformations. But the call needs to be made by your code.
Secondly, the performance issue - it is caused by the fact that the KafkaNativeOffsetManager implementation makes a blocking, relatively more expensive call to the brokers (relative to the simply sending a message to a compacted topic, as the KafkaTopicOffsetManager does. Generally speaking doing updates after every message is expensive, and in Spring XD we mitigate that by using https://github.com/spring-projects/spring-xd/blob/master/extensions/spring-xd-extension-kafka/src/main/java/org/springframework/integration/x/kafka/WindowingOffsetManager.java, which reduces the number of effective writes. I suppose we could do something similar for Spring Integration.
To wit: comparatively, 100000 updates complete in 9.8s with the KafkaNativeOffsetManager and in 0.382s with the KafkaTopicOffsetManager, as shown by https://github.com/mbogoevici/spring-integration-kafka/blob/perftest/src/test/java/org/springframework/integration/kafka/performance/OffsetManagerPerformanceTests.java (results gathered on my own machine). Results may be skewed somehow, but still indicate a big difference. Tracing in YourKit confirms the result.

Not sure what's the problem with the KafkaNativeOffsetManager, would be great if you share some investigation on the matter, some bottleneck place in our code in the JIRA.
For the deferred offset commit I can suggest autoCommitOffset = false on the KafkaMessageDrivenChannelAdapter. Having that the sent to the channel message will be enriched with the KafkaHeaders.ACKNOWLEDGMENT header in face of DefaultAcknowledgment. It really answers to your request:
/**
* Invoked when the message for which the acknowledgment has been created has been processed.
* Calling this method implies that all the previous messages in the partition have been processed already.
*/
void acknowledge();

Related

How are the missing events replayed?

I am trying to learn more about CQRS and Event Sourcing (Event Store).
My understanding is that a message queue/bus is not normally used in this scenario - a message bus can be used to facilitate communication between Microservices, however it is not typically used specifically for CQRS. However, the way I see it at the moment - a message bus would be very useful guaranteeing that the read model is eventually in sync hence eventual consistency e.g. when the server hosting the read model database is brought back online.
I understand that eventual consistency is often acceptable with CQRS. My question is; how does the read side know it is out of sync with the write side? For example, lets say there are 2,000,000 events created in Event Store on a typical day and 1,999,050 are also written to the read store. The remaining 950 events are not written because of a software bug somewhere or because the server hosting the read model is offline for a few secondsetc. How does eventual consistency work here? How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
I have read questions on here over the last week or so, which talk about messages being replayed from event store e.g. this one: CQRS - Event replay for read side, however none talk about how this is done. Do I need to setup a scheduled task that runs once per day and replays all events that were created since the date the scheduled task last succeeded? Is there a more elegant approach?
I've used two approaches in my projects, depending on the requirements:
Synchronous, in-process Readmodels. After the events are persisted, in the same request lifetime, in the same process, the Readmodels are fed with those events. In case of a Readmodel's failure (bug or catchable error/exception) the error is logged and that Readmodel is just skipped and the next Readmodel is fed with the events and so on. Then follow the Sagas, that may generate commands that generate more events and the cycle is repeated.
I use this approach when the impact of a Readmodel's failure is acceptable by the business, when the readiness of a Readmodel's data is more important than the risk of failure. For example, they wanted the data immediately available in the UI.
The error log should be easily accessible on some admin panel so someone would look at it in case a client reports inconsistency between write/commands and read/query.
This also works if you have your Readmodels coupled to each other, i.e. one Readmodel needs data from another canonical Readmodel. Although this seems bad, it's not, it always depends. There are cases when you trade updater code/logic duplication with resilience.
Asynchronous, in-another-process readmodel updater. This is used when I use total separation of the Readmodel from the other Readmodels, when a Readmodel's failure would not bring the whole read-side down; or when a Readmodel needs another language, different from the monolith. Basically this is a microservice. When something bad happens inside a Readmodel it necessary that some authoritative higher level component is notified, i.e. an Admin is notified by email or SMS or whatever.
The Readmodel should also have a status panel, with all kinds of metrics about the events that it has processed, if there are gaps, if there are errors or warnings; it also should have a command panel where an Admin could rebuild it at any time, preferable without a system downtime.
In any approach, the Readmodels should be easily rebuildable.
How would you choose between a pull approach and a push approach? Would you use a message queue with a push (events)
I prefer the pull based approach because:
it does not use another stateful component like a message queue, another thing that must be managed, that consume resources and that can (so it will) fail
every Readmodel consumes the events at the rate it wants
every Readmodel can easily change at any moment what event types it consumes
every Readmodel can easily at any time be rebuild by requesting all the events from the beginning
there order of events is exactly the same as the source of truth because you pull from the source of truth
There are cases when I would choose a message queue:
you need the events to be available even if the Event store is not
you need competitive/paralel consumers
you don't want to track what messages you consume; as they are consumed they are removed automatically from the queue
This talk from Greg Young may help.
How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
So there are two different approaches here.
One is perhaps simpler than you expect - each time you need to rebuild a read model, just start from event 0 in the stream.
Yeah, the scale on that will eventually suck, so you won't want that to be your first strategy. But notice that it does work.
For updates with not-so-embarassing scaling properties, the usual idea is that the read model tracks meta data about stream position used to construct the previous model. Thus, the query from the read model becomes "What has happened since event #1,999,050"?
In the case of event store, the call might look something like
EventStore.ReadStreamEventsForwardAsync(stream, 1999050, 100, false)
Application doesn't know it hasn't processed some events due to a bug.
First of all, I don't understand why you assume that the number of events written on the write side must equal number of events processed by read side. Some projections may subscribe to the same event and some events may have no subscriptions on the read side.
In case of a bug in projection / infrastructure that resulted in a certain projection being invalid you might need to rebuild this projection. In most cases this would be a manual intervention that would reset the checkpoint of projection to 0 (begining of time) so the projection will pick up all events from event store from scratch and reprocess all of them again.
The event store should have a global sequence number across all events starting, say, at 1.
Each projection has a position tracking where it is along the sequence number. The projections are like logical queues.
You can clear a projection's data and reset the position back to 0 and it should be rebuilt.
In your case the projection fails for some reason, like the server going offline, at position 1,999,050 but when the server starts up again it will continue from this point.

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, perform various transformations on it, and then sends it to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turn up solutions that mostly just relax the "in-order" requirements based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js
Can operate in a HA environment (i.e. multiple instances of it running on the same message queue at once w/o inconsistencies.)
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely
2: Exactly-once delivery
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving this by resequencing them in redis, which is an ok solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server craches, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existance. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using redis seems like too much fuzz, assuming that the messages are not huge, and that losing them is ok if a process crashes, and that running them more than once, or even multiple copies of a single request at the same time is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within say 2 seconds, dispatch it again to another node randomly within 0-1 seconds. The first one responding is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and prevents a sudden spike in requests from taking down the entire cluster. Not using a random interval, and retrying after n seconds would probably cause a DOS-attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. Javascript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.

resequencer with holes on secuence

we have an ETL scenario where we use the resequencer.
Messages arrive to the flow with a sequence number that the resequencer uses it to send messages in order, but sometimes messages are discarded previously (because of data validation) and do not arrive to the resequencer. This produces holes in the sequence and resequencer stops sending messages using the default release strategy. To avoid this, we developed a new SequenceTimeoutReleaseStrategy that is a mix between default strategy and TimeoutCountSequenceSizeReleaseStrategy from SI. When a message arrives, it checks the timeout and release it if necesary.
All this worked well unless for the last messages that arrive before the timeout and have holes. This messages aren't release by the strategy. We could use a reaper but the secuence may have more than one hole in the sequence so when the resequencer release them it will stop in the first sequence break and remove the group losing the rest of the messages. So, the question is: is there a way to use the resequencer where there can be holes in the sequence?
One solution we have and want to avoid is having a scheduled tasks that removes the messages directly from the message store, but this could be a problem with concurrency and so on, so we prefer other solutions.
Any help is appreciated here
Regards
Guzman
There are two components involved; the release strategy says "something" can be released; the actual decision as to what is released is performed by the MessageGroupProcessor. In this case, a ResequencingMessageGroupProcessor.
You would need to customize that class to "skip" the hole(s).
You can't wire in a customized MGP using the <reseequencer/> namespace, you would have to wire up using <bean/> s - a ResequencingMessageHandler and a ConsumerEndpointFactoryBean.
Or use a BeanFactoryPostProcessor to change the constructor argument to your custom class.

Windows Azure staging <--> production causing conflicts & errors on table storage

We had a terrible problem/experience yesterday when trying to swap our staging <--> production role.
Here is our setup:
We have a workerrole picking up messages from the queue. These messages are processed on the role. (Table Storage inserts, db selects etc ). This can take maybe 1-3 seconds per queue message depending on how many table storage posts he needs to make. He will delete the message when everything is finished.
Problem when swapping:
When our staging project went online our production workerrole started erroring.
When the role wanted to process queue messsage it gave a constant stream of 'EntityAlreadyExists' errors. Because of these errors queue messages weren't getting deleted. This caused the queue messages to be put back in the queue and back to processing and so on....
When looking inside these queue messages and analysing what would happend with them we saw they were actually processed but not deleted.
The problem wasn't over when deleting these faulty messages. Newly queue messages weren't processed as well while these weren't processed yet and no table storage records were added, which sounds very strange.
When deleting both staging and producting and publishing to production again everything started to work just fine.
Possible problem(s)?
We have litle 2 no idea what happened actually.
Maybe both the roles picked up the same messages and one did the post and one errored?
...???
Possible solution(s)?
We have some idea's on how to solve this 'problem'.
Make a poison message fail over system? When the dequeue count gets over X we should just delete that queue message or place it into a separate 'poisonqueue'.
Catch the EntityAlreadyExists error and just delete that queue message or put it in a separate queue.
...????
Multiple roles
I suppose we will have the same problem when putting up multiple roles?
Many thanks.
EDIT 24/02/2012 - Extra information
We actually use the GetMessage()
Every item in the queue is unique and will generate unique messages in table Storage. Little more information about the process: A user posts something and will have to be distributed to certain other users. The message generate from that user will have a unique Id (guid). This message will be posted into the queue and picked up by the worker role. The message is distributed over several other tables (partitionkey -> UserId, rowkey -> Some timestamp in ticks & the unique message id. So there is almost no chance the same messages will be posted in a normal situation.
The invisibility time out COULD be a logical explanation because some messages could be distributed to like 10-20 tables. This means 10-20 insert without the batch option. Can you set or expand this invisibility time out?
Not deleting the queue message because of an exception COULD be a explanation as well because we didn't implement any poison message fail over YET ;).
Regardless of the Staging vs. Production issue, having a mechanism that handles poison messages is critical. We've implemented an abstraction layer over Azure queues that automatically moves messages over to a poison queue once they've been attempted to be processed some configurable amount of times.
You clearly have a fault on handling double messages. The fact that your ID is unique doesn't mean that the message will not be processed twice in some occasions like:
The role dying and with partially finished work, so the message will re-appear for processing in the queue
The role crashing unexpected, so the message ends up back in the queue
The FC migrating moving your role and you don't have code to handle this situation, so the message ends up back in the queue
In all cases, you need code that handles the fact that the message will re-appear. One way is to use the DequeueCount property and check how many times the message was removed from a Queue and received for processing. Make sure you have code that handles partial processing of a message.
Now what probably happened during swapping was, when the production environment became the staging and staging became production, both of them were trying to receive the same messages so they were basically competing each other fro those messages, which is probably not bad because this is a known pattern to work anyway but when you killed your old production (staging) every message that was received for processing and wasn't finished, ended up back in the Queue and your new production environment picked the message for processing again. Having no code logic to handle this scenario and a message was that partially processed, some records in the tables existed and it started causing the behavior you noticed.
There are a few possible causes:
How are you reading the queue messages? If you are doing a Peek Message then the message will still be visible to be picked up by another role instance (or your staging environment) before the message is deleted. You want to make sure you are using Get Message so the message is invisible until it can be deleted.
Is it possible that your first role crashed after doing the work for the message but prior to deleting the message? This would cause the message to become visible again and get picked up by another role instance. At that point the message will be a poison message which will cause your instances to constantly crash.
This problem almost certainly has nothing to do with Staging vs Production, but is most likely caused by having multiple instances reading from the same queue. You can probably reproduce the same problem by specifying 2 instances, or by deploying the same code to 2 different production services, or by running the code locally on your dev machine (still pointing to Azure storage) using 2 instances.
In general you do need to handle poison messages so you need to implement that logic anyways, but I would suggest getting to the root cause of this problem first, otherwise you are just going to run into a lot more problems later on.
With queues you need to code with idempotency in mind and expect and handle the ‘EntityAlreadyExists’ as a viable response.
As others have suggested, causes could be
Multiple message in the queue with the same identifier.
Are peeking for the message and not reading it form the queue and so not making them invisible.
Not deleting the message because an exception was thrown before you can delete them.
Taking too long to process the message so it cannot be deleted (because invisibility was timed out) and appears again
Without looking at the code I am guessing that it is either the 3 or 4 option that is occurring.
If you cannot detect the issue with a code review, you may consider adding time based logging and try/catch wrappers to get a better understanding.
Using queues effectively, in a multi-role environment, requires a slightly different mindset and running into such issues early is actually a blessing in disguise.
Appended 2/24
Just to clarify, modifying the invisibility time out is not a generic solution to this type of problem. Also, note that this feature although available on the REST API, may not be available on the queue client.
Other options involve writing to table storage in an asynchronous manner to speed up your processing time, but again this is a stop gap measures which does not really address the underlying paradigm of working with queues.
So, the bottom line is to be idempotent. You can try using the table storage upsert (update or insert) feature to avoid getting the ‘EntitiyAlreadyExists’ error, if that works for your code. If all you are doing is inserting new entities to azure table storage then the upsert should solve your problem with minimal code change.
If you are doing updates then it is a different ball game all together. One pattern is to pair updates with dummy inserts in the same table with the same partition key so as to error out if the update occurred previously and so skip the update. Later after the message is deleted, you can delete the dummy inserts. However, all this adds to the complexity, so it is much better to revisit the architecture of the product; for example, do you really need to insert/update into so many tables?
Without knowing what your worker role is actually doing I'm taking a guess here, but it sounds like when you have two instances of your worker role running you are getting conflicts while trying to write to an Azure table. It is likely to be because you have code that looks something like this:
var queueMessage = GetNextMessageFromQueue();
Foo myFoo = GetFooFromTableStorage(queueMessage.FooId);
if (myFoo == null)
{
myFoo = new Foo {
PartitionKey = queueMessage.FooId
};
AddFooToTableStorage(myFoo);
}
DeleteMessageFromQueue(queueMessage);
If you have two adjacent messages in the queue with the same FooId it is quite likely that you'll end up with both of the instances checking to see if the Foo exists, not finding it then trying to create it. Whichever instance is the last to try and save the item will get the "Entity already exists" error. Because it errored it never gets to the delete message part of the code and therefore it becomes visible back on the queue after a period of time.
As others have said, dealing with poison messages is a really good idea.
Update 27/02
If it's not subsequent messages (which based on your partition/row key scheme I would say it's unlikely), then my next bet would be it's the same message appearing back in the queue after the visibility timeout. By default if you're using .GetMessage() the timeout is 30 seconds. It has an overload which allows you to specify how long that time frame is. There is also the .UpdateMessage() function that allows you to update that timeout as you're processing the message. For example you could set the initial visibility to 1 minute, then if you're still processing the message 50 seconds later, extent it for another minute.

Patterns to azure idempotent operations?

anybody know patterns to design idempotent operations to azure manipulation, specially the table storage? The more common approach is generate a id operation and cache it to verify new executions, but, if I have dozen of workers processing operations this approach will be more complicated. :-))
Thank's
Ok, so you haven't provided an example, as requested by knightpfhor and codingoutloud. That said, here's one very common way to deal with idempotent operations: Push your needed actions to a Windows Azure queue. Then, regardless of the number of worker role instances you have, only one instance may work on a specific queue item at a time. When a queue message is read from the queue, it becomes invisible for the amount of time you specify.
Now: a few things can happen during processing of that message:
You complete processing after your timeout period. When you go to delete the message, you get an exception.
You realize you're running out of time, so you increase the queue message timeout (today, you must call the REST API to do this; one day it'll be included in the SDK).
Something goes wrong, causing an exception in your code before you ever get to delete the message. Eventually, the message becomes visible in the queue again (after specified invisibility timeout period).
You complete processing before the timeout and successfully delete the message.
That deals with concurrency. For idempotency, that's up to you to ensure you can repeat an operation without side-effects. For example, you calculate someone's weekly pay, queue up a print job, and store the weekly pay in a Table row. For some reason, a failure occurs and you either don't ever delete the message or your code aborts before getting an opportunity to delete the message.
Fast-forward in time, and another worker instance (or maybe even the same one) re-reads this message. At this point, you should theoretically be able to simply re-perform the needed actions. If this isn't really possible in your case, you don't have an idempotent operation. However, there are a few mechanisms at your disposal to help you work around this:
Each queue message has a DequeueCount. You can use this to determine if the queue message has been processed before and, if so, take appropriate action (maybe examine the Table row for that employee, for example).
Maybe there are stages of your processing pipeline that can't be repeated. In that case: you now have the ability to modify the queue message contents while the queue message is still invisible to others and being processed by you. So, imagine appending something like |SalaryServiceCalled . Then a bit later, appending |PrintJobQueued and so on. Now, if you have a failure in your pipeline, you can figure out where you left off, the next time you read your message.
Hope that helps. Kinda shooting in the dark here, not knowing more about what you're trying to achieve.
EDIT: I guess I should mention that I don't see the connection between idempotency and Table Storage. I think that's more of a concurrency issue, as idempotency would need to be dealt with whether using Table Storage, SQL Azure, or any other storage container.
I believe you can use Reply log storage way to solve this problem

Resources