I am trying to set up a simple application using Spring Integration. The goal is simply to use a file inbound channel adapter to monitor a directory for new files and process them as they are added. For simplicity, processing a file at the moment just means logging some output (the name of the file being processed). I do, however, want to process files in a multithreaded fashion: let's say 10 files are picked up and processed in parallel, and only once these are completed do we move on to the next 10 files.
For that I tried two different approaches; both seem to work similarly, and I want to understand the differences between using a poller or a dispatcher for something like this.
Approach #1 - Using poller
<int-file:inbound-channel-adapter id="filesIn" directory="in">
<int:poller fixed-rate="1" task-executor="executor" />
</int-file:inbound-channel-adapter>
<int:service-activator ref="moveToStage" method="move" input-channel="filesIn" />
<task:executor id="executor" pool-size="5" queue-capacity="0" rejection-policy="DISCARD" />
So here, as I understand it, the idea is that we constantly poll the directory, and as soon as a file is received it's sent to the filesIn channel, until the pool limit is reached. Then, while the pool is fully occupied, no additional files are sent, although I'm assuming the polling still continues in the background. This seems to work, but I am not sure whether setting max-messages-per-poll (to something close to the pool size) would be helpful here to decrease the polling frequency.
Approach #2 - Using dispatcher
<int-file:inbound-channel-adapter id="filesIn" directory="in">
<int:poller fixed-rate="5000" max-messages-per-poll="3" />
</int-file:inbound-channel-adapter>
<int:bridge input-channel="filesIn" output-channel="filesReady" />
<int:channel id="filesReady">
<int:dispatcher task-executor="executor"/>
</int:channel>
<int:service-activator ref="moveToStage" method="move" input-channel="filesInReady" />
<task:executor id="executor" pool-size="5" queue-capacity="0" rejection-policy="CALLER_RUNS" />
Okay, so here the poller is not using the executor, so I am assuming it polls sequentially. Each poll, 3 files should be picked up and sent to the filesReady channel, which then uses the dispatcher to pass the files on to the service activator; and because the dispatcher uses the executor, control returns immediately and the filesIn channel can send more files.
I guess my question is: am I understanding both approaches correctly, and is one better than the other?
Thanks
Yes, your understanding is correct.
Generally, I would say that polling every millisecond (and discarding the poll when the queue is full) is a waste of resources (CPU and I/O).
Also, increasing max-messages-per-poll in the first case won't help, because the whole poll runs on a single executor thread (the scheduler hands the poll off to the executor, and that one thread processes all the messages of the poll, one by one).
In the second case, since the hand-off to the executor happens during the poll (at the dispatcher channel) rather than before it, max-messages-per-poll will work as expected and the messages of one poll can be processed in parallel.
So, in general, your second implementation is best (as long as you can live with an average 2.5 second delay when a new file(s) arrives).
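If that latency matters, one hedged tweak of the second approach (the interval, pool size, and max-messages-per-poll values below are illustrative assumptions, not recommendations) is to poll more often and hand up to a pool-size worth of files to the dispatcher on each poll:
<int-file:inbound-channel-adapter id="filesIn" directory="in">
<!-- poll once per second; pick up to 10 files per poll -->
<int:poller fixed-delay="1000" max-messages-per-poll="10" />
</int-file:inbound-channel-adapter>
<int:bridge input-channel="filesIn" output-channel="filesReady" />
<int:channel id="filesReady">
<int:dispatcher task-executor="executor"/>
</int:channel>
<int:service-activator ref="moveToStage" method="move" input-channel="filesReady" />
<!-- CALLER_RUNS makes the polling thread do the work itself when all 10 pool threads are busy, which throttles intake -->
<task:executor id="executor" pool-size="10" queue-capacity="0" rejection-policy="CALLER_RUNS" />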
At some point someQueue's size starts to grow. Messages are enqueued, but they are not dequeued. The queue is consumed with <from uri="activemq:queue:someQueue?concurrentConsumers=5"/> and it seems that the parallel Processors are working fine, because data seems to be processed - further routes are triggered.
I suppose that one of the Processors, which run concurrently, is somehow stuck. How can I check what the reason is? How can I check whether it is really stuck, without adding additional watchdog/timer threads in the Processor? How can I check a processor's working time? Is there a way to find which message caused it?
After some time, when the queue is too big, the route stops processing data, and no other parallel Processors are run.
What I have done until now is just keep the thread code safe and display status with activemq:dstat. I am also thinking about attaching JPDA to Karaf to see what is happening inside, but maybe there are other, simpler methods to find out what the problem is.
<route id="someRoute" startupOrder="7">
<from uri="activemq:queue:someQueue?concurrentConsumers=5"/>
<process ref="someProcesor"/>
<choice>
<when>
<simple>${header.ProcesedSuccesfull}</simple>
<to uri="activemq:queue:otherQueue"/>
</when>
<otherwise>
<log loggingLevel="ERROR" message="error" loggerRef="myLogger"/>
</otherwise>
</choice>
</route>
You can look at the inflight repository, as well as the await manager MBean [1]. The latter can tell you how many threads are blocked, waiting for some external condition to trigger before continuing.
You can use the hawtio [2] web console to see that from a web browser, and even trigger a stuck thread to unblock, continue, and reject/fail the message; that frees the thread to keep running in your use case so it can pick up new messages.
[1] https://github.com/apache/camel/blob/master/camel-core/src/main/java/org/apache/camel/api/management/mbean/ManagedAsyncProcessorAwaitManagerMBean.java
[2] http://hawt.io/
In our application, the consumer starts polling continuously when the application loads, and therefore it sometimes impacts the execution time of one of our methods by polling in the middle of that method's execution.
A method (let's say test()) which ideally takes a few milliseconds to run in a JUnit case is now taking a few seconds to execute in the app. Therefore, I would like to skip the polling at that point in time, if possible.
In the Spring Integration docs I have seen something called PollSkipAdvice/PollSkipStrategy, which says "The PollSkipAdvice can be used to suppress (skip) a poll."
Could you please suggest whether this can be of any help in the above scenario? It would be great if you could explain using an example. Thanks.
Sample config:
<int-kafka:inbound-channel-adapter
id="kafkaInboundChannelAdapter" kafka-consumer-context-ref="consumerContext"
auto-startup="false" channel="inputFromKafka">
<int:poller fixed-delay="10" time-unit="MILLISECONDS"
max-messages-per-poll="5" />
</int-kafka:inbound-channel-adapter>
Your scenario isn't clear. Really...
We have here only one adapter with an aggressive fixed-delay of every 10 MILLISECONDS, and only for a small number of messages per poll.
Consider increasing the poll time and setting max-messages-per-poll to -1 to poll all available messages in one poll task.
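Purely as an illustration of that suggestion (the one-second delay is an assumption; tune it to your load):
<int-kafka:inbound-channel-adapter
id="kafkaInboundChannelAdapter" kafka-consumer-context-ref="consumerContext"
auto-startup="false" channel="inputFromKafka">
<!-- poll less frequently, but drain everything available on each poll -->
<int:poller fixed-delay="1" time-unit="SECONDS" max-messages-per-poll="-1" />
</int-kafka:inbound-channel-adapter>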
On the other hand, it isn't clear how your test() method is involved...
Also, consider switching to <int-kafka:message-driven-channel-adapter> for better control over messages.
Regarding PollSkipAdvice... I'm really not sure which goal you would like to reach with it...
And one more point: bear in mind that all <poller>s use the same ThreadPoolTaskScheduler, with 10 as the pool size. So maybe some other long-living task is keeping its threads busy...
This <int-kafka:inbound-channel-adapter> of yours takes only one of those threads, but every 10 millis, of course.
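On that note, if the shared scheduler itself ever becomes the bottleneck, one hedged option (the pool size of 20 is an assumption) is to define your own taskScheduler bean, which the framework then uses instead of its default 10-thread one:
<!-- "taskScheduler" is the bean name Spring Integration looks up; defining it overrides the default scheduler -->
<bean id="taskScheduler" class="org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler">
<property name="poolSize" value="20"/>
</bean>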
I am receiving data from multiple sources on different threads. I plan to pass the data to a single channel, and then a separate thread will process data from this channel. My context is as follows:
<task:executor id="singleThreadedExecutor" pool-size="1" />
<int:channel id="entryChannel">
<int:dispatcher task-executor="singleThreadedExecutor"/>
</int:channel>
<int:header-enricher input-channel="entryChannel" output-channel="processDataChannel">
<int:error-channel ref="exceptionHandlerChannel" overwrite="true" />
<int:header name="systemtime" expression="T(java.lang.System).currentTimeMillis()" />
<int:header name="nanotime" expression="T(java.lang.System).nanoTime()" />
</int:header-enricher>
I want to process data as soon as it arrives. I have concerns about what happens when data arrives much faster than it can be processed in the separate thread.
From the documentation, calling send on entryChannel should return immediately.
Does the dispatcher have an internal queuing mechanism to ensure data will be handed over to the channel? How can we ensure that data is handed over as soon as it arrives?
I am also interested in the best practice for cases in SI where we need to process data in a separate thread, as soon as it arrives.
First of all, your issue is here:
<task:executor id="singleThreadedExecutor" pool-size="1" />
Regardless of the senders, only one thread is able to take messages from the entryChannel. And it does that only when it is free and ready to do what you ask of it. But in your case it is busy processing the first message, then the second, and so on. One by one, and only one at a time, because it is a single thread.
You just need to increase the pool size (e.g. to 10) to allow messages from that channel to be distributed across several parallel threads.
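A minimal sketch of that change (the executor bean name and pool size here are assumptions):
<task:executor id="entryChannelExecutor" pool-size="10" />
<int:channel id="entryChannel">
<!-- up to 10 messages from this channel can now be processed in parallel -->
<int:dispatcher task-executor="entryChannelExecutor"/>
</int:channel>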
Regarding the second question: Back Pressure.
The task executor itself holds the tasks (that invoke the subscribed endpoint) in a queue so, yes, the caller exits immediately (as long as there's room in the queue, which is always true with your configuration).
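If you want explicit back pressure rather than an unbounded queue, a hedged option (the capacity value is an assumption) is to bound the queue and let the sender run rejected tasks itself:
<!-- when all threads are busy and the queue is full, the sending thread runs the task itself, which naturally slows producers down -->
<task:executor id="entryChannelExecutor" pool-size="10" queue-capacity="100" rejection-policy="CALLER_RUNS" />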
I came across the following in the Spring Integration reference doc:
The receiveTimeout property specifies the amount of time the poller should wait if no messages are available when it invokes the receive operation. For example, consider two options that seem similar on the surface but are actually quite different: the first has an interval trigger of 5 seconds and a receive timeout of 50 milliseconds while the second has an interval trigger of 50 milliseconds and a receive timeout of 5 seconds. The first one may receive a message up to 4950 milliseconds later than it arrived on the channel (if that message arrived immediately after one of its poll calls returned). On the other hand, the second configuration will never miss a message by more than 50 milliseconds. The difference is that the second option requires a thread to wait, but as a result it is able to respond much more quickly to arriving messages. This technique, known as long polling, can be used to emulate event-driven behavior on a polled source.
Based on my experience, the second option can cause a problem: an interval of 50 milliseconds will make the poller run every 50 millis, but if there are no messages to pick up, each thread created will wait 5 seconds for a message to appear. In those 5 seconds the poller will get executed another 100 times, potentially creating another 100 threads, and so on.
This quickly runs away.
My question is: did I misunderstand the way this all works? Because if I'm correct, I think the reference documentation should be changed, or at least a warning added.
e<bean id="store" class="org.springframework.integration.jdbc.store.JdbcChannelMessageStore">
<property name="dataSource" ref="channelServerDataSource"/>
<property name="channelMessageStoreQueryProvider" ref="queryProvider"/>
<property name="region" value="${user.name}_${channel.queue.region:default}"/>
<property name="usingIdCache" value="false"/>
</bean>
<int:transaction-synchronization-factory id="syncFactory">
<int:after-commit expression="#store.removeFromIdCache(headers.id.toString())" />
<int:after-rollback expression="#store.removeFromIdCache(headers.id.toString())"/>
</int:transaction-synchronization-factory>
<int:channel id="transacitonAsyncServiceQueue">
<int:queue message-store="store"/>
<!-- <int:queue/> -->
</int:channel>
<bean id="rxPollingTrigger" class="org.springframework.scheduling.support.PeriodicTrigger">
<constructor-arg value="500"/>
<constructor-arg value="MILLISECONDS"/>
<property name = "initialDelay" value = "30000"/>
<!-- initialDelay is important to ensure the channel doesn't start processing before the data sources have been initialised; because we
now persist transactions in the queue, at startup (restart) there may be some ready to go which would get processed before the
connection pools have been created, which happens when the servlet is first hit -->
</bean>
<int:service-activator ref="asyncChannelReceiver" method="processMessage" input-channel="transacitonAsyncServiceQueue">
<int:poller trigger="rxPollingTrigger" max-messages-per-poll="20" task-executor="taskExecutor" receive-timeout="400">
<int:transactional transaction-manager="transactionManagerAsyncChannel" />
</int:poller>
<int:request-handler-advice-chain>
<ref bean="databaseSessionContext" />
</int:request-handler-advice-chain>
</int:service-activator>
<task:executor id="taskExecutor" pool-size="100-200" queue-capacity="200" keep-alive="1" rejection-policy="CALLER_RUNS" />
My question is did I misunderstand the way this all works?
Yes, you misunderstand.
The trigger (in this case a PeriodicTrigger with an interval of 50ms) is only consulted to calculate the next poll time when the current poll exits.
There is only one poller thread running concurrently. If there is no message, the poll thread is suspended for 5 seconds; the trigger is then consulted (t.nextExecutionTime()) and the next poll is scheduled for +50ms; so, with no data a single thread will run every 5.05 seconds.
When messages are present, and you wish to have a concurrency greater than one, you would use a task executor to allow the poller thread to hand off to another thread so that the trigger is immediately consulted for the next poll time.
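To make the quoted "second option" concrete, here is a minimal sketch (the channel, bean, and method names are hypothetical):
<int:channel id="someQueueChannel">
<int:queue/>
</int:channel>
<int:service-activator ref="handler" method="handle" input-channel="someQueueChannel">
<!-- 50 ms interval trigger with a 5 second receive timeout: one scheduler thread blocks in receive(),
so an arriving message is seen within ~50 ms, while with no data that single thread wakes only every ~5 seconds -->
<int:poller fixed-rate="50" receive-timeout="5000" max-messages-per-poll="1"/>
</int:service-activator>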
Based on my experience
Please clarify "your experience" and show configuration, evidence etc.
If you have a suspected thread leak, the first step, generally, is to take a thread dump to figure out what they are all doing.
EDIT: (in response to your comments below).
There's not really a downside of CALLER_RUNS in this scenario because, although the current thread "jumps ahead" of the queued tasks, it's not like this poll has newer data than the queued tasks, it's just a poll after all. However, poller threads are a limited resource (although the limit can be changed) so long-running tasks on a poller thread are generally discouraged.
ABORT could cause some noise in the logs; an alternative is to configure a PollSkipAdvice where the advice can look at the task queue and silently ignore the current poll. In 4.2, we've added even more flexibility to the poller.
You will find many articles on the internet that say using an RDBMS as a queue is not the greatest idea; you might want to consider a JMS- or RabbitMQ-backed channel instead. If you are tied to JDBC, be sure to use the JdbcChannelMessageStore and not the JdbcMessageStore. The former is preferred for backing channels since it only uses one table; the latter has some performance issues when used to back a channel, because of contention on the message group table. See "Backing Message Channels" in the JDBC Support chapter for more information.
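If you do move off JDBC, a hedged sketch of a JMS-backed channel in place of the JDBC-backed one could look like this (the queue name is an assumption, and a bean named connectionFactory is assumed to exist):
<!-- backs the channel with a JMS queue instead of the RDBMS table -->
<int-jms:channel id="transacitonAsyncServiceQueue" queue-name="async.service.queue"/>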
I'm struggling with how to increase the number of threads that a VM transport in Mule uses. I've read the Tuning & Performance page in Mule's documentation, but something isn't clicking with me.
I have one flow that executes a JDBC call and then shoves the result set into VM queues using a foreach.
<foreach>
<vm:outbound-endpoint exchange-pattern="one-way" path="checkService"/>
</foreach>
This is picked up by another flow for processing, which consists of making an outbound HTTPS call and checking the return value.
<flow name="ExecuteTests" doc:name="ExecuteTests">
<vm:inbound-endpoint exchange-pattern="one-way" path="checkService"/>
<https:outbound-endpoint exchange-pattern="request-response"...
...etc.
</flow>
Some of those calls are quick, but some take up to 5 seconds. What I'd like is for the ExecuteTests flow to use more threads to do the processing, but I only ever see threads 02-05 in the logs. I would have expected to see the number of threads used closer to the dispatcher thread pool for the outbound HTTPS connector...which I thought defaulted to 16.
I tried the following:
<vm:connector name="vmConnector">
<receiver-threading-profile maxThreadsActive="100" maxThreadsIdle="100"/>
</vm:connector>
<flow name="ExecuteTests" doc:name="ExecuteTests">
<vm:inbound-endpoint exchange-pattern="one-way" path="checkService" connector-ref="vmConnector"/>
...etc.
but it made no difference.
Thinking that maybe the buffer for the inbound endpoint was messing with things, I tried:
<vm:connector name="vmConnector">
<receiver-threading-profile maxThreadsActive="100" maxThreadsIdle="100"/>
<vm:queue-profile maxOutstandingMessages="1"/>
</vm:connector>
but it didn't help either.
What am I missing?
From your flow it can be observed that the issue is not with the threads (I believe).
Normally the VM receiver threads are fast, and they scale up based on the number of requests coming in.
In the "ExecuteTests" flow there is an outbound HTTPS call, so it is that call that is likely the cause of the delay.
Still, if you want to increase the threads, try adding a dispatcher threading profile to the HTTPS connector, as sketched below.
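A hedged sketch of that idea (the connector name and thread counts are assumptions):
<https:connector name="httpsConnector">
<!-- threads used to dispatch outbound HTTPS requests -->
<dispatcher-threading-profile maxThreadsActive="40" maxThreadsIdle="20"/>
</https:connector>
Then reference it from the outbound endpoint with connector-ref="httpsConnector".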
Also, in order to increase the number of threads used for your flow processing, use a flow threading profile (an asynchronous processing strategy), like:
<asynchronous-processing-strategy name="executeTestsStrategy" maxThreads="40" minThreads="20" />
<flow name="ExecuteTests" doc:name="ExecuteTests" processingStrategy="executeTestsStrategy">
<vm:inbound-endpoint exchange-pattern="one-way" path="checkService"/>
<https:outbound-endpoint exchange-pattern="request-response"...
...etc.
</flow>
Hope this helps.