We have a cluster of servers that are are monitoring a shared network mount for processing EDI files. We recently added code to use the RedisMetadataStore as follows:
#Bean
public ConcurrentMetadataStore metadataStore() {
return new RedisMetadataStore(redisConnectionFactory);
}
#Bean
public FileSystemPersistentAcceptOnceFileListFilter persistentAcceptOnceFileFilter() {
return new FileSystemPersistentAcceptOnceFileListFilter(metadataStore(), "edi-file-locks");
}
#Bean
public IntegrationFlow flowInboundNetTransferFile(
#Value("${edi.incoming.directory.netTransfers}") String inboundDirectory,
#Value("${edi.incoming.age-before-ready-seconds:30}") int ageBeforeReadySeconds,
#Value("${taskExecutor.inboundFile.corePoolSize:4}") int corePoolSize,
#Qualifier("taskExecutorInboundFile") TaskExecutor taskExecutor) {
//Setup a filter to only pick up a files older than a certain age, relative to the current time. This prevents cases
//where something is writing to the file as the EDI processor is moving that file.
LastModifiedFileListFilter lastModifiedFilter = new LastModifiedFileListFilter();
lastModifiedFilter.setAge(ageBeforeReadySeconds);
return IntegrationFlows
.from(
Files
.inboundAdapter(new File(inboundDirectory))
.locker(ediDocumentLocker())
.filter(new ChainFileListFilter<File>())
.filter(new IgnoreHiddenFileListFilter())
.filter(lastModifiedFilter)
.filter(persistentAcceptOnceFileFilter()),
e -> e.poller(Pollers.fixedDelay(20000).maxMessagesPerPoll(corePoolSize).taskExecutor(taskExecutor)))
.channel(channelInboundFile())
.get();
}
This was working fine in our lower environments, however, we use a Redis cluster in our production environment and when we deployed to that environment we are encountering an exception :
org.springframework.dao.InvalidDataAccessApiUsageException: WATCH is currently not supported in cluster mode.
at org.springframework.data.redis.connection.jedis.JedisClusterConnection.watch(JedisClusterConnection.java:2450)
at org.springframework.data.redis.connection.DefaultStringRedisConnection.watch(DefaultStringRedisConnection.java:951)
at org.springframework.data.redis.core.RedisTemplate$24.doInRedis(RedisTemplate.java:885)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:204)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:166)
at org.springframework.data.redis.core.RedisTemplate.watch(RedisTemplate.java:882)
at org.springframework.data.redis.support.collections.DefaultRedisMap$2.execute(DefaultRedisMap.java:225)
at org.springframework.data.redis.support.collections.DefaultRedisMap$2.execute(DefaultRedisMap.java:221)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:226)
at org.springframework.data.redis.support.collections.DefaultRedisMap.replace(DefaultRedisMap.java:221)
at org.springframework.data.redis.support.collections.RedisProperties.replace(RedisProperties.java:238)
at org.springframework.integration.redis.metadata.RedisMetadataStore.replace(RedisMetadataStore.java:154)
at org.springframework.integration.file.filters.AbstractPersistentAcceptOnceFileListFilter.accept(AbstractPersistentAcceptOnceFileListFilter.java:83)
at org.springframework.integration.file.filters.AbstractFileListFilter.filterFiles(AbstractFileListFilter.java:40)
at org.springframework.integration.file.filters.ChainFileListFilter.filterFiles(ChainFileListFilter.java:50)
at org.springframework.integration.file.DefaultDirectoryScanner.listFiles(DefaultDirectoryScanner.java:95)
at org.springframework.integration.file.FileReadingMessageSource.scanInputDirectory(FileReadingMessageSource.java:387)
at org.springframework.integration.file.FileReadingMessageSource.receive(FileReadingMessageSource.java:366)
at org.springframework.integration.endpoint.SourcePollingChannelAdapter.receiveMessage(SourcePollingChannelAdapter.java:224)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.doPoll(AbstractPollingEndpoint.java:245)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.access$000(AbstractPollingEndpoint.java:58)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$1.call(AbstractPollingEndpoint.java:190)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$1.call(AbstractPollingEndpoint.java:186)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$Poller$1.run(AbstractPollingEndpoint.java:353)
at org.springframework.integration.util.ErrorHandlingTaskExecutor$1.run(ErrorHandlingTaskExecutor.java:55)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
org.springframework.dao.InvalidDataAccessApiUsageException: WATCH is currently not supported in cluster mode.
at org.springframework.data.redis.connection.jedis.JedisClusterConnection.watch(JedisClusterConnection.java:2450)
at org.springframework.data.redis.connection.DefaultStringRedisConnection.watch(DefaultStringRedisConnection.java:951)
at org.springframework.data.redis.core.RedisTemplate$24.doInRedis(RedisTemplate.java:885)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:204)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:166)
at org.springframework.data.redis.core.RedisTemplate.watch(RedisTemplate.java:882)
at org.springframework.data.redis.support.collections.DefaultRedisMap$2.execute(DefaultRedisMap.java:225)
at org.springframework.data.redis.support.collections.DefaultRedisMap$2.execute(DefaultRedisMap.java:221)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:226)
at org.springframework.data.redis.support.collections.DefaultRedisMap.replace(DefaultRedisMap.java:221)
at org.springframework.data.redis.support.collections.RedisProperties.replace(RedisProperties.java:238)
at org.springframework.integration.redis.metadata.RedisMetadataStore.replace(RedisMetadataStore.java:154)
at org.springframework.integration.file.filters.AbstractPersistentAcceptOnceFileListFilter.accept(AbstractPersistentAcceptOnceFileListFilter.java:83)
at org.springframework.integration.file.filters.AbstractFileListFilter.filterFiles(AbstractFileListFilter.java:40)
at org.springframework.integration.file.filters.ChainFileListFilter.filterFiles(ChainFileListFilter.java:50)
at org.springframework.integration.file.DefaultDirectoryScanner.listFiles(DefaultDirectoryScanner.java:95)
at org.springframework.integration.file.FileReadingMessageSource.scanInputDirectory(FileReadingMessageSource.java:387)
at org.springframework.integration.file.FileReadingMessageSource.receive(FileReadingMessageSource.java:366)
at org.springframework.integration.endpoint.SourcePollingChannelAdapter.receiveMessage(SourcePollingChannelAdapter.java:224)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.doPoll(AbstractPollingEndpoint.java:245)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.access$000(AbstractPollingEndpoint.java:58)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$1.call(AbstractPollingEndpoint.java:190)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$1.call(AbstractPollingEndpoint.java:186)
at org.springframework.integration.endpoint.AbstractPollingEndpoint$Poller$1.run(AbstractPollingEndpoint.java:353)
at org.springframework.integration.util.ErrorHandlingTaskExecutor$1.run(ErrorHandlingTaskExecutor.java:55)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Looking a little closer, it looks like this store is using RedisProperties as a backing store, which in turn is using features of the Redis client that are not supported with the Cluster client. Has anyone worked around this issue? Perhaps written an alternate store that does support a Redis cluster?
Related
I have a Integration flow configured using Java DSL which pulls file from Ftp server using Ftp.inboundChannelAdapter then transforms it to JobRequest, then I have a .handle() method which triggers my batch job, everything is working as per required but the process in running sequentially for each file inside the FTP folder
I added currentThreadName in my Transformer Endpoint it was printing same thread name for each file
Here is what I have tried till now
1.task executor bean
#Bean
public TaskExecutor taskExecutor(){
return new SimpleAsyncTaskExecutor("Integration");
}
2.Integration flow
#Bean
public IntegrationFlow integrationFlow(JobLaunchingGateway jobLaunchingGateway) throws IOException {
return IntegrationFlows.from(Ftp.inboundAdapter(myFtpSessionFactory)
.remoteDirectory("/bar")
.localDirectory(localDir.getFile())
,c -> c.poller(Pollers.fixedRate(1000).taskExecutor(taskExecutor()).maxMessagesPerPoll(20)))
.transform(fileMessageToJobRequest(importUserJob(step1())))
.handle(jobLaunchingGateway)
.log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
.route(JobExecution.class,j->j.getStatus().isUnsuccessful()?"jobFailedChannel":"jobSuccessfulChannel")
.get();
}
3.I also read in another SO thread that I need ExecutorChannel so I configured one but I don't know how to inject this channel into my Ftp.inboundAdapter, from logs is see that the channel is always integrationFlow.channel#0 which I guess is a DirectChannel
#Bean
public MessageChannel inputChannel() {
return new ExecutorChannel(taskExecutor());
}
I dont know what I'm missing here, or I might have not properly understood Spring Messaging System as I'm very much new to Spring and Spring-Integration
Any help is appreciated
Thanks
The ExecutorChannel you can simply inject into the flow and it is going to be applied to the SourcePollingChannelAdapter by the framework. So, having that inputChannel defined as a bean you just do this:
.channel(inputChannel())
before your .transform(fileMessageToJobRequest(importUserJob(step1()))).
See more in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/dsl.html#java-dsl-channels
On the other hand to process your files in parallel according your .taskExecutor(taskExecutor()) configuration, you just need to have a .maxMessagesPerPoll(20) as 1. The logic in the AbstractPollingEndpoint is like this:
this.taskExecutor.execute(() -> {
int count = 0;
while (this.initialized && (this.maxMessagesPerPoll <= 0 || count < this.maxMessagesPerPoll)) {
if (pollForMessage() == null) {
break;
}
count++;
}
So, we do have tasks in parallel, but only when they reach that maxMessagesPerPoll where it is 20 in your current case. There is also some explanation in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/messaging-endpoints.html#endpoint-pollingconsumer
The maxMessagesPerPoll property specifies the maximum number of messages to receive within a given poll operation. This means that the poller continues calling receive() without waiting, until either null is returned or the maximum value is reached. For example, if a poller has a ten-second interval trigger and a maxMessagesPerPoll setting of 25, and it is polling a channel that has 100 messages in its queue, all 100 messages can be retrieved within 40 seconds. It grabs 25, waits ten seconds, grabs the next 25, and so on.
I've following InboundChannelAdapter with Poller to process files every 30 seconds. The files are not large but I realize the memory consumptions keeps going up even when there's no files coming.
#Bean
#InboundChannelAdapter(value = "flowFileInChannel" ,poller = #Poller(fixedDelay ="30000", maxMessagesPerPoll = "1"))
public MessageSource<File> flowInboundFileAdapter(#Value("${integration.path}") File directory) {
FileReadingMessageSource source = new FileReadingMessageSource();
source.setDirectory(directory);
source.setFilter(flowPathFileFilter);
source.setUseWatchService(true);
source.setScanEachPoll(true);
source.setAutoCreateDirectory(false);
return source;
}
Is there an internal queue that is not cleared after each poll? How do I configure to avoid eating up memory.
After digging deeper, it looks like the below Spring IntegrationFlows which processes the data from the InboundChannelDapter is holding up the memory after each file polling. After I commenting out the middle part, the memory consumption seems stable (instead of increasing consumption). Now I'm wondering how do we force Spring IntegrationFlows to clear those Messages and Headers after they're passed through different channels (i.e. after the last channel below)
public IntegrationFlow incomingLocateFlow(){
return IntegrationFlows.from(locateIncomingChannel())
// .split("locateItemSplitter","split")
// .transform(locateItemEnrichmentTransformer)
// .transform(locateRequestTransformer)
// .aggregate(new Consumer<AggregatorSpec>() { // 32
//
// #Override
// public void accept(AggregatorSpec aggregatorSpec) {
// aggregatorSpec.processor(locateRequestProcessor, null); // 33
// }
//
// }, null)
// .transform(locateIncomingResultTransformer)
// .transform(locateExceptionReportWritingHandler)
.channel(locateIncomingCompleteChannel())
.get();
}
Indeed there is an AcceptOnceFileListFilter with the code like:
private final Queue<F> seen;
private final Set<F> seenSet = new HashSet<F>();
On each poll those internal collections are replenished with new files.
For this purpose you can consider to use FileSystemPersistentAcceptOnceFileListFilter with the persistent MetadataStore implementation to avoid memory consumption.
Also consider to use some tool to analyze the memory content. You might have something else downstream on the flowFileInChannel.
UPDATE
Since you use .aggregate() it is definitely the point where memory is consumed by default. That's because there is SimpleMessageStore to keep messages for grouping. Plus there is an option expireGroupsUponCompletion(boolean) which is false by default. Therefore even after successful releasing some info is still in the MessageStore. That's how your memory is consumed a bit from time to time.
That option is false by default to let to have logic when we discard late message for completed group. When it is true, you are able to form fresh group for the same correlationKey.
See more info about Aggregator in the Reference Manual.
We're calling SparkSQL job from Spark streaming. We're getting concurrent exception and Kafka consumer is closed error. Here is code and exception details:
Kafka consumer code
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe(SparkServiceConfParams.AIR.CONSUME_TOPICS,
sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {}
..
// publish generated json gzip to kafka
decodedStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
#Override
public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
//Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
if(!jsonRdd4DF.isEmpty()) {
//JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
AIRDataSetBean processAIRData = airMainJsonProcessor.processAIRData(json, sparkSession);
Error Details
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Finally Kafka consumer closed:
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException:
This consumer has already been closed.
This issue is resolved using the Cache or Persist option of Spark streaming. In this scenario using cache RDD is not read from Kafka again and issue is resolved. It enables the concurrent usage of stream . But please use wisely cache option.Here is code:
JavaDStream<ConsumerRecord<String, byte[]>> cache = consumerStream.cache();
In porting our code base from gridgain to ignite, I've found similar / renamed methods for most of the ignite methods. There are a few that I need to clarify though.
What is the ignite equivalent for
//Listener for asynchronous local node grid events. You can subscribe for local node grid event notifications via {#link GridEventStorageManager#addLocalEventListener
public interface GridLocalEventListener extends EventListener {}
What is the recommended way to invoke a compute future. See picture for compile failures.
Apart from that, it looks like future.listenAsync() should be future.listen()
final ProcessingTaskAdapter taskAdapter = new ProcessingTaskAdapter(task, manager, node);
ComputeTaskFuture<ProcessingJob> future = grid.cluster()
.forPredicate(this) //===> what should this be
.compute().execute(taskAdapter, job);
future.listen(new IgniteInClosure<IgniteFuture<ProcessingJob>>() {
#Override
public void apply(IgniteFuture<ProcessingJob> future) {
try {
// Need this to extract the remote exception, if one occurred
future.get();
} catch (IgniteException e) {
manager.fail(e.getCause() != null ? e.getCause() : e);
} finally {
manager.finishJob(job);
jobDistributor.distribute(taskAdapter.getSelectedNode());
}
}
There is no special class anymore, you simply use IgnitePredicate as a listener. Refer to [1] for details.
Refer to [2] for information about async support. Also note that projections were replaced with cluster groups [3] (one of your compile errors is because of that). And you're correct, listenAsync was renamed to listen.
[1] https://apacheignite.readme.io/docs/events
[2] https://apacheignite.readme.io/docs/async-support
[3] https://apacheignite.readme.io/docs/cluster-groups
I'm using Spark Streaming to process a stream by processing each partition (saving events to HBase), then ack the last event in each RDD from the driver to the receiver, so the receiver can ack it to its source in turn.
public class StreamProcessor {
final AckClient ackClient;
public StreamProcessor(AckClient ackClient) {
this.ackClient = ackClient;
}
public void process(final JavaReceiverInputDStream<Event> inputDStream)
inputDStream.foreachRDD(rdd -> {
JavaRDD<Event> lastEvents = rdd.mapPartition(events -> {
// ------ this code executes on the worker -------
// process events one by one; I don't use ackClient here
// return the event with the max delivery tag here
});
// ------ this code executes on the driver -------
Event lastEvent = .. // find event with max delivery tag across partitions
ackClient.ack(lastEvent); // use ackClient to ack last event
});
}
}
The problem here is that I get the following error (even though everything seems to work fine):
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
at org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:141)
at org.apache.spark.api.java.JavaRDD.mapPartitions(JavaRDD.scala:32)
...
Caused by: java.io.NotSerializableException: <some non-serializable object used by AckClient>
...
It seems that Spark is trying to serialize AckClient to send it to the workers, but I thought that only code inside mapPartitions is serialized/shipped to the workers, and that the code at the RDD level (i.e. inside foreachRDD but not inside mapPartitions) would not be serialized/shipped to the workers.
Can someone confirm if my thinking is correct or not? And if it is correct, should this be reported as a bug?
You are correct, this was fixed in 1.1. However, if you look at the stack trace, the cleaner that is throwing is being invoked in the mapPartitions
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
So, the problem has to do with your mapPartitions. Make sure that you aren't accidentally wrapping this, as that is a common issue