Run Spring Integration flow concurrently for each Ftp file - spring-integration

I have a Integration flow configured using Java DSL which pulls file from Ftp server using Ftp.inboundChannelAdapter then transforms it to JobRequest, then I have a .handle() method which triggers my batch job, everything is working as per required but the process in running sequentially for each file inside the FTP folder
I added currentThreadName in my Transformer Endpoint it was printing same thread name for each file
Here is what I have tried till now
1.task executor bean
#Bean
public TaskExecutor taskExecutor(){
return new SimpleAsyncTaskExecutor("Integration");
}
2.Integration flow
#Bean
public IntegrationFlow integrationFlow(JobLaunchingGateway jobLaunchingGateway) throws IOException {
return IntegrationFlows.from(Ftp.inboundAdapter(myFtpSessionFactory)
.remoteDirectory("/bar")
.localDirectory(localDir.getFile())
,c -> c.poller(Pollers.fixedRate(1000).taskExecutor(taskExecutor()).maxMessagesPerPoll(20)))
.transform(fileMessageToJobRequest(importUserJob(step1())))
.handle(jobLaunchingGateway)
.log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
.route(JobExecution.class,j->j.getStatus().isUnsuccessful()?"jobFailedChannel":"jobSuccessfulChannel")
.get();
}
3.I also read in another SO thread that I need ExecutorChannel so I configured one but I don't know how to inject this channel into my Ftp.inboundAdapter, from logs is see that the channel is always integrationFlow.channel#0 which I guess is a DirectChannel
#Bean
public MessageChannel inputChannel() {
return new ExecutorChannel(taskExecutor());
}
I dont know what I'm missing here, or I might have not properly understood Spring Messaging System as I'm very much new to Spring and Spring-Integration
Any help is appreciated
Thanks

The ExecutorChannel you can simply inject into the flow and it is going to be applied to the SourcePollingChannelAdapter by the framework. So, having that inputChannel defined as a bean you just do this:
.channel(inputChannel())
before your .transform(fileMessageToJobRequest(importUserJob(step1()))).
See more in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/dsl.html#java-dsl-channels
On the other hand to process your files in parallel according your .taskExecutor(taskExecutor()) configuration, you just need to have a .maxMessagesPerPoll(20) as 1. The logic in the AbstractPollingEndpoint is like this:
this.taskExecutor.execute(() -> {
int count = 0;
while (this.initialized && (this.maxMessagesPerPoll <= 0 || count < this.maxMessagesPerPoll)) {
if (pollForMessage() == null) {
break;
}
count++;
}
So, we do have tasks in parallel, but only when they reach that maxMessagesPerPoll where it is 20 in your current case. There is also some explanation in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/messaging-endpoints.html#endpoint-pollingconsumer
The maxMessagesPerPoll property specifies the maximum number of messages to receive within a given poll operation. This means that the poller continues calling receive() without waiting, until either null is returned or the maximum value is reached. For example, if a poller has a ten-second interval trigger and a maxMessagesPerPoll setting of 25, and it is polling a channel that has 100 messages in its queue, all 100 messages can be retrieved within 40 seconds. It grabs 25, waits ten seconds, grabs the next 25, and so on.

Related

How to handle errors after message has been handed off to QueueChannel?

I have 10 rabbitMQ queues, called event.q.0, event.q.2, <...>, event.q.9. Each of these queues receive messages routed from event.consistent-hash exchange. I want to build a fault tolerant solution that will consume messages for a specific event in sequential manner, since ordering is important. For this I have set up a flow that listens to those queues and routes messages based on event ID to a specific worker flow. Worker flows work based on queue channels so that should guarantee the FIFO order for an event with specific ID. I have come up with with the following set up:
#Bean
public IntegrationFlow eventConsumerFlow(RabbitTemplate rabbitTemplate, Advice retryAdvice) {
return IntegrationFlows
.from(
Amqp.inboundAdapter(new SimpleMessageListenerContainer(rabbitTemplate.getConnectionFactory()))
.configureContainer(c -> c
.adviceChain(retryAdvice())
.addQueueNames(queueNames)
.prefetchCount(amqpProperties.getPreMatch().getDefinition().getQueues().getEvent().getPrefetch())
)
.messageConverter(rabbitTemplate.getMessageConverter())
)
.<Event, String>route(e -> String.format("worker-input-%d", e.getId() % numberOfWorkers))
.get();
}
private Advice deadLetterAdvice() {
return RetryInterceptorBuilder
.stateless()
.maxAttempts(3)
.recoverer(recoverer())
.backOffPolicy(backOffPolicy())
.build();
}
private ExponentialBackOffPolicy backOffPolicy() {
ExponentialBackOffPolicy backOffPolicy = new ExponentialBackOffPolicy();
backOffPolicy.setInitialInterval(1000);
backOffPolicy.setMultiplier(3.0);
backOffPolicy.setMaxInterval(15000);
return backOffPolicy;
}
private MessageRecoverer recoverer() {
return new RepublishMessageRecoverer(
rabbitTemplate,
"error.exchange.dlx"
);
}
#PostConstruct
public void init() {
for (int i = 0; i < numberOfWorkers; i++) {
flowContext.registration(workerFlow(MessageChannels.queue(String.format("worker-input-%d", i), queueCapacity).get()))
.autoStartup(false)
.id(String.format("worker-flow-%d", i))
.register();
}
}
private IntegrationFlow workerFlow(QueueChannel channel) {
return IntegrationFlows
.from(channel)
.<Object, Class<?>>route(Object::getClass, m -> m
.resolutionRequired(true)
.defaultOutputToParentFlow()
.subFlowMapping(EventOne.class, s -> s.handle(oneHandler))
.subFlowMapping(EventTwo.class, s -> s.handle(anotherHandler))
)
.get();
}
Now, when lets say an error happens in eventConsumerFlow, the retry mechanism works as expected, but when an error happens in workerFlow, the retry doesn't work anymore and the message doesn't get sent to dead letter exchange. I assume this is because once message is handed off to QueueChannel, it gets acknowledged automatically. How can I make the retry mechanism work in workerFlow as well, so that if exception happens there, it could retry a couple of times and send a message to DLX when tries are exhausted?
If you want resiliency, you shouldn't be using queue channels at all; the messages will be acknowledged immediately after the message is put in the in-memory queue;if the server crashes, those messages will be lost.
You should configure a separate adapter for each queue if you want no message loss.
That said, to answer the general question, any errors on downstream flows (including after a queue channel) will be sent to the errorChannel defined on the inbound adapter.

ActiveMQ: Dispatched queue contains more messages then prefetch size

I have prefetch size set to 1 (jms.prefetchPolicy.all=1 in url). In web console I can see that prefetch is 1 for all of my consumers. One consumer got stuck and there were 67 messages on his dispatch queue -see my screenshot
Could you help me understand how could it happen? I've read plenty of articles on this and my understanding is that Dispatch queue size should be up to prefetch size?!
I use following configuration to consume messages from queue:
ConnectionFactory getActiveMQConnectionFactory() {
// Configure the ActiveMQConnectionFactory
ActiveMQConnectionFactory activeMQConnectionFactory = new ActiveMQConnectionFactory();
activeMQConnectionFactory.setBrokerURL(brokerUrl);
activeMQConnectionFactory.setUserName(user);
activeMQConnectionFactory.setPassword(password);
activeMQConnectionFactory.setNonBlockingRedelivery(true);
// Configure the redeliver policy and the dead letter queue
RedeliveryPolicy redeliveryPolicy = new RedeliveryPolicy();
redeliveryPolicy.setInitialRedeliveryDelay(initialRedeliveryDelay);
redeliveryPolicy.setRedeliveryDelay(redeliveryDelay);
redeliveryPolicy.setUseExponentialBackOff(useExponentialBackOff);
redeliveryPolicy.setMaximumRedeliveries(maximumRedeliveries);
RedeliveryPolicyMap redeliveryPolicyMap = activeMQConnectionFactory.getRedeliveryPolicyMap();
redeliveryPolicyMap.put(new ActiveMQQueue(thumbnailQueue), redeliveryPolicy);
activeMQConnectionFactory.setRedeliveryPolicy(redeliveryPolicy);
return activeMQConnectionFactory;
}
public IntegrationFlow createThumbnailFlow(String concurrency, CreateThumbnailReceiver receiver) {
return IntegrationFlows.from(
Jms.messageDrivenChannelAdapter(
Jms.container(getActiveMQConnectionFactory(), thumbnailQueue)
.concurrency(concurrency)
.sessionTransacted(true)
.get()
))
.transform(new JsonToObjectTransformer(CreateThumbnailRequest.class, jsonObjectMapper()))
.handle(receiver)
.get();
}
The problem was cause by difference between version of broker (5.14.5) and client (5.15.3). After upgrading broker dispatched queue contains at most 2 message as expected.

Spring Integration: File polling memory consumption

I've following InboundChannelAdapter with Poller to process files every 30 seconds. The files are not large but I realize the memory consumptions keeps going up even when there's no files coming.
#Bean
#InboundChannelAdapter(value = "flowFileInChannel" ,poller = #Poller(fixedDelay ="30000", maxMessagesPerPoll = "1"))
public MessageSource<File> flowInboundFileAdapter(#Value("${integration.path}") File directory) {
FileReadingMessageSource source = new FileReadingMessageSource();
source.setDirectory(directory);
source.setFilter(flowPathFileFilter);
source.setUseWatchService(true);
source.setScanEachPoll(true);
source.setAutoCreateDirectory(false);
return source;
}
Is there an internal queue that is not cleared after each poll? How do I configure to avoid eating up memory.
After digging deeper, it looks like the below Spring IntegrationFlows which processes the data from the InboundChannelDapter is holding up the memory after each file polling. After I commenting out the middle part, the memory consumption seems stable (instead of increasing consumption). Now I'm wondering how do we force Spring IntegrationFlows to clear those Messages and Headers after they're passed through different channels (i.e. after the last channel below)
public IntegrationFlow incomingLocateFlow(){
return IntegrationFlows.from(locateIncomingChannel())
// .split("locateItemSplitter","split")
// .transform(locateItemEnrichmentTransformer)
// .transform(locateRequestTransformer)
// .aggregate(new Consumer<AggregatorSpec>() { // 32
//
// #Override
// public void accept(AggregatorSpec aggregatorSpec) {
// aggregatorSpec.processor(locateRequestProcessor, null); // 33
// }
//
// }, null)
// .transform(locateIncomingResultTransformer)
// .transform(locateExceptionReportWritingHandler)
.channel(locateIncomingCompleteChannel())
.get();
}
Indeed there is an AcceptOnceFileListFilter with the code like:
private final Queue<F> seen;
private final Set<F> seenSet = new HashSet<F>();
On each poll those internal collections are replenished with new files.
For this purpose you can consider to use FileSystemPersistentAcceptOnceFileListFilter with the persistent MetadataStore implementation to avoid memory consumption.
Also consider to use some tool to analyze the memory content. You might have something else downstream on the flowFileInChannel.
UPDATE
Since you use .aggregate() it is definitely the point where memory is consumed by default. That's because there is SimpleMessageStore to keep messages for grouping. Plus there is an option expireGroupsUponCompletion(boolean) which is false by default. Therefore even after successful releasing some info is still in the MessageStore. That's how your memory is consumed a bit from time to time.
That option is false by default to let to have logic when we discard late message for completed group. When it is true, you are able to form fresh group for the same correlationKey.
See more info about Aggregator in the Reference Manual.

Process websocket incomming messages using multiple threads in tomcat

From what I understand (please correct me if I am wrong), in tomcat incoming websocket messages are processed sequentially. Meaning that if you have 100 incoming messages in one websocket, they will be processed using only one thread one-by-one from message 1 to message 100.
But this does not work for me. I need to concurrently process incoming messages in a websocket in order to increase my websocket throughput. The messages coming in do not depend on each other hence do not need to be processed sequentially.
The question is how to configure tomcat such that it would assign multiple worker threads per websocket to process incoming messages concurrently?
Any hint is appreciated.
This is where in tomcat code that I think it is blocking per websocket connection (which makes sense):
/**
* Called when there is data in the ServletInputStream to process.
*
* #throws IOException if an I/O error occurs while processing the available
* data
*/
public void onDataAvailable() throws IOException {
synchronized (connectionReadLock) {
while (isOpen() && sis.isReady()) {
// Fill up the input buffer with as much data as we can
int read = sis.read(
inputBuffer, writePos, inputBuffer.length - writePos);
if (read == 0) {
return;
}
if (read == -1) {
throw new EOFException();
}
writePos += read;
processInputBuffer();
}
}
}
You can't configure Tomcat to do what you want. You need to write a message handler that consumes the message, passes it to an Executor (or similar for processing) and then returns.

How to parallelize an azure worker role?

I have got a Worker Role running in azure.
This worker processes a queue in which there are a large number of integers. For each integer I have to do processings quite long (from 1 second to 10 minutes according to the integer).
As this is quite time consuming, I would like to do these processings in parallel. Unfortunately, my parallelization seems to not be efficient when I test with a queue of 400 integers.
Here is my implementation :
public class WorkerRole : RoleEntryPoint {
private readonly CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
private readonly ManualResetEvent runCompleteEvent = new ManualResetEvent(false);
private readonly Manager _manager = Manager.Instance;
private static readonly LogManager logger = LogManager.Instance;
public override void Run() {
logger.Info("Worker is running");
try {
this.RunAsync(this.cancellationTokenSource.Token).Wait();
}
catch (Exception e) {
logger.Error(e, 0, "Error Run Worker: " + e);
}
finally {
this.runCompleteEvent.Set();
}
}
public override bool OnStart() {
bool result = base.OnStart();
logger.Info("Worker has been started");
return result;
}
public override void OnStop() {
logger.Info("Worker is stopping");
this.cancellationTokenSource.Cancel();
this.runCompleteEvent.WaitOne();
base.OnStop();
logger.Info("Worker has stopped");
}
private async Task RunAsync(CancellationToken cancellationToken) {
while (!cancellationToken.IsCancellationRequested) {
try {
_manager.ProcessQueue();
}
catch (Exception e) {
logger.Error(e, 0, "Error RunAsync Worker: " + e);
}
}
await Task.Delay(1000, cancellationToken);
}
}
}
And the implementation of the ProcessQueue:
public void ProcessQueue() {
try {
_queue.FetchAttributes();
int? cachedMessageCount = _queue.ApproximateMessageCount;
if (cachedMessageCount != null && cachedMessageCount > 0) {
var listEntries = new List<CloudQueueMessage>();
listEntries.AddRange(_queue.GetMessages(MAX_ENTRIES));
Parallel.ForEach(listEntries, ProcessEntry);
}
}
catch (Exception e) {
logger.Error(e, 0, "Error ProcessQueue: " + e);
}
}
And ProcessEntry
private void ProcessEntry(CloudQueueMessage entry) {
try {
int id = Convert.ToInt32(entry.AsString);
Service.GetData(id);
_queue.DeleteMessage(entry);
}
catch (Exception e) {
_queueError.AddMessage(entry);
_queue.DeleteMessage(entry);
logger.Error(e, 0, "Error ProcessEntry: " + e);
}
}
In the ProcessQueue function, I try with different values of MAX_ENTRIES: first =20 and then =2.
It seems to be slower with MAX_ENTRIES=20, but whatever the value of MAX_ENTRIES is, it seems quite slow.
My VM is a A2 medium.
I really don't know if I do the parallelization correctly ; maybe the problem comes from the worker itself (which may be it is hard to have this in parallel).
You haven't mentioned which Azure Messaging Queuing technology you are using, however for tasks where I want to process multiple messages in parallel I tend to use the Message Pump Pattern on Service Bus Queues and Subscriptions, leveraging the OnMessage() method available on both Service Bus Queue and Subscription Clients:
QueueClient OnMessage() - https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.queueclient.onmessage.aspx
SubscriptionClient OnMessage() - https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.subscriptionclient.onmessage.aspx
An overview of how this stuff works :-) - http://fabriccontroller.net/blog/posts/introducing-the-event-driven-message-programming-model-for-the-windows-azure-service-bus/
From MSDN:
When calling OnMessage(), the client starts an internal message pump
that constantly polls the queue or subscription. This message pump
consists of an infinite loop that issues a Receive() call. If the call
times out, it issues the next Receive() call.
This pattern allows you to use a delegate (or anonymous function in my preferred case) that handles the receipt of the Brokered Message instance on a separate thread on the WaWorkerHost process. In fact, to increase the level of throughput, you can specify the number of threads that the Message Pump should provide, thereby allowing you to receive and process 2, 4, 8 messages from the queue in parallel. You can additionally tell the Message Pump to automagically mark the message as complete when the delegate has successfully finished processing the message. Both the thread count and AutoComplete instructions are passed in the OnMessageOptions parameter on the overloaded method.
public override void Run()
{
var onMessageOptions = new OnMessageOptions()
{
AutoComplete = true, // Message-Pump will call Complete on messages after the callback has completed processing.
MaxConcurrentCalls = 2 // Max number of threads the Message-Pump can spawn to process messages.
};
sbQueueClient.OnMessage((brokeredMessage) =>
{
// Process the Brokered Message Instance here
}, onMessageOptions);
RunAsync(_cancellationTokenSource.Token).Wait();
}
You can still leverage the RunAsync() method to perform additional tasks on the main Worker Role thread if required.
Finally, I would also recommend that you look at scaling your Worker Role instances out to a minimum of 2 (for fault tolerance and redundancy) to increase your overall throughput. From what I have seen with multiple production deployments of this pattern, OnMessage() performs perfectly when multiple Worker Role Instances are running.
A few things to consider here:
Are your individual tasks CPU intensive? If so, parallelism may not help. However, if they are mostly waiting on data processing tasks to be processed by other resources, parallelizing is a good idea.
If parallelizing is a good idea, consider not using Parallel.ForEach for queue processing. Parallel.Foreach has two issues that prevent you from being very optimal:
The code will wait until all kicked off threads finish processing before moving on. So, if you have 5 threads that need 10 seconds each and 1 thread that needs 10 minutes, the overall processing time for Parallel.Foreach will be 10 minutes.
Even though you are assuming that all of the threads will start processing at the same time, Parallel.Foreach does not work this way. It looks at number of cores on your server and other parameters and generally only kicks off number of threads it thinks it can handle, without knowing too much about what's in those threads. So, if you have a lot of non-CPU bound threads that /can/ be kicked off at the same time without causing CPU over-utilization, default behaviour will not likely run them optimally.
How to do this optimally:
I am sure there are a ton of solutions out there, but for reference, the way we've architected it in CloudMonix (that must kick off hundreds of independent threads and complete them as fast as possible) is by using ThreadPool.QueueUserWorkItem and manually keeping track number of threads that are running.
Basically, we use a Thread-safe collection to keep track of running threads that are started by ThreadPool.QueueUserWorkItem. Once threads complete, remove them from that collection. The queue-monitoring loop is indendent of executing logic in that collection. Queue-monitoring logic gets messages from the queue if the processing collection is not full up to the limit that you find most optimal. If there is space in the collection, it tries to pickup more messages from the queue, adds them to the collection and kick-start them via ThreadPool.QueueUserWorkItem. When processing completes, it kicks off a delegate that cleans up thread from the collection.
Hope this helps and makes sense

Resources