I am using Spring Batch for a data migration from XML to an Oracle database.
With single-threaded execution, the process takes 80-90 minutes to insert approximately 20K users.
I want to cut that time by more than half, but even using Multi File Resource I am not able to achieve it.
I have a single XML file to process, so I started simply by adding a task executor and making the reader synchronized, but got no gain from that.
So I split the XML into multiple XML files and want to try the Multi File Resource approach. Here is the configuration:
<batch:job id="importJob">
<batch:step id="step1Master">
<batch:partition handler="handler" partitioner="partitioner" />
</batch:step>
</batch:job>
<bean id="handler"
class="org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler">
<property name="taskExecutor" ref="taskExecutor" />
<property name="step" ref="slaveStep" />
<property name="gridSize" value="20" />
</bean>
<batch:step id="slaveStep">
<batch:tasklet transaction-manager="transactionManager"
allow-start-if-complete="true">
<batch:chunk reader="reader" writer="writer"
processor="processor" commit-interval="1000" skip-limit="1500000">
<batch:skippable-exception-classes>
<batch:include class="java.lang.Exception" />
</batch:skippable-exception-classes>
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="taskExecutor"
class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor">
<property name="corePoolSize" value="100" />
<property name="maxPoolSize" value="300" />
<property name="allowCoreThreadTimeOut" value="true" />
</bean>
<bean id="partitioner"
class="org.springframework.batch.core.partition.support.MultiResourcePartitioner"
scope="step">
<property name="keyName" value="inputFile" />
<property name="resources"
value="file:/.../*.xml" />
</bean>
<bean id="processor"
class="...Processor"
scope="step" />
<bean id="reader" class="org.springframework.batch.item.xml.StaxEventItemReader"
scope="step">
<property name="fragmentRootElementName" value="user" />
<property name="unmarshaller" ref="userDetailUnmarshaller" />
<property name="resource" value="#{stepExecutionContext[inputFile]}" />
</bean>
Each XML file contains around 1,000 users, and I am trying this with 20 files.
I kept commit-interval=1000 because each file has 1,000 records to insert into the DB.
Does the commit-interval need to be adjusted accordingly?
I am using an Oracle DB; do I need to do any pool management there?
The current Oracle DB pool configured in JBoss is:
Min Pool = 100
Max Pool = 300
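A note on the commit interval, as an observation: with commit-interval=1000 and roughly 1,000 records per file, each partition writes its whole file in one transaction and commits only once at the end. The interval does not have to match the file size; whether a smaller value (say, 100) is faster depends on the database and transaction overhead, so it is worth measuring both.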
I see logging like
17:01:50,553 DEBUG [Writer] (taskExecutor-11) [UserDetailWriter] | user added
17:01:50,683 DEBUG [Writer] (taskExecutor-15) [UserDetailWriter] | user added
17:01:51,093 DEBUG [Writer] (taskExecutor-11) [UserDetailWriter] | user added
17:01:59,795 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
17:02:00,385 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
17:02:00,385 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
It seems multiple threads are being created, but I am still not seeing any performance improvement.
Please suggest what I am doing wrong.
Go through this documentation on parallel processing:
http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html#scalabilityParallelSteps
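For reference, that section describes running independent flows in parallel with a <batch:split>, as an alternative to partitioning a single step. A minimal sketch in the style of the linked documentation (s1 to s4 stand in for your own step definitions, and taskExecutor is the pool defined above):

<batch:job id="job1">
    <!-- the two flows run in parallel on the task executor -->
    <batch:split id="split1" task-executor="taskExecutor" next="step4">
        <batch:flow>
            <batch:step id="step1" parent="s1" next="step2"/>
            <batch:step id="step2" parent="s2"/>
        </batch:flow>
        <batch:flow>
            <batch:step id="step3" parent="s3"/>
        </batch:flow>
    </batch:split>
    <!-- step4 starts only after both flows complete -->
    <batch:step id="step4" parent="s4"/>
</batch:job>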
I want Heritrix (currently version 3.4.0) to crawl site.domain/path and load all pages below that path, but also include the things needed to render those pages, like images, scripts and such.
According to https://heritrix.readthedocs.io/en/latest/glossary.html under the heading "Discovery Path", what I want are "Embedded links" (E) and maybe "Speculative embeds" (X). I do not want it to follow normal links (L) outside my path.
I have been experimenting with the rules, and my basic idea is this (the last matching rule wins, according to the docs):
accept all
reject everything outside site.domain/path
accept embedded files (images/CSS/scripts/etc.)
Crawling works fine: only pages within that path on the server are fetched, but the files needed to render the pages are not loaded.
How do I make it load the needed files as well?
Configuration in my job so far:
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule" />
<bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regexList">
<list>
<value>.*site\.domain/path/.*</value>
</list>
</property>
</bean>
<!-- HOW to accept embedded things here? -->
<!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
<!-- ...and REJECT those with suspicious repeating path-segments... -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
<!-- ...and REJECT those with more than threshold number of path-segments... -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
<!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
<bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
<!-- ...but always REJECT those with unsupported URI schemes -->
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
</bean>
</list>
</property>
</bean>
This rule accepts URIs with E or X in their Discovery Path:

<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="regex" value="(E|X)" />
</bean>
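For context, here is how the full sequence might be ordered so this rule takes effect; since the last matching rule wins, it has to come after the path-based REJECT rule (a sketch built from the configuration above):

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
        <list>
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- reject everything outside site.domain/path -->
            <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
                <property name="decision" value="REJECT"/>
                <property name="regexList">
                    <list>
                        <value>.*site\.domain/path/.*</value>
                    </list>
                </property>
            </bean>
            <!-- re-accept embedded resources; the last matching rule wins -->
            <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
                <property name="decision" value="ACCEPT"/>
                <property name="regex" value="(E|X)" />
            </bean>
        </list>
    </property>
</bean>

One caveat, as an observation rather than something from the original: the Matches*RegexDecideRule family matches the regex against the entire input, so "(E|X)" matches only a hops path that is exactly one embed hop; if embeds are discovered after several link hops, a pattern like ".*(E|X)" may be needed.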
PS: It is ironic when you spend hours on something, and then, while writing up the question and making an adjustment to it, stumble upon the solution.
I am using Spring LDAP (2.0.2.RELEASE) to interact with our AD environment. I have integrated pooling within my applicationContext.xml.
In the Java LDAP docs (section 3.4), it states:
If the LDAP provider cannot establish a connection within that period, it aborts the connection attempt
My question is: does Spring handle a retry for this connection, or is an error thrown? I know Spring uses many of the underlying JVM LDAP features, but I have yet to find anything specific in this area.
Pertinent pieces of my applicationContext:
<bean id="dirContextValidator" class="org.springframework.ldap.pool.validation.DefaultDirContextValidator" />
<bean id="exampleConnectionDetails" class="org.springframework.ldap.core.support.LdapContextSource" scope="singleton">
<property name="url" value="ldaps://ldap.example.com:636" />
<property name="userDn" value="CN=LDAP_User,DC=example,DC=com" />
<property name="password" value="superSecretPwd" />
<property name="pooled" value="false"/>
<property name="referral" value="follow"/>
</bean>
<bean id="exampleContextSource" class="org.springframework.ldap.pool.factory.PoolingContextSource">
<property name="contextSource" ref="exampleConnectionDetails" />
<property name="dirContextValidator" ref="dirContextValidator" />
<property name="testOnBorrow" value="true" />
<property name="testWhileIdle" value="true" />
<property name="maxWait" value="10000" />
<property name="whenExhaustedAction" value="0" />
<property name="minIdle" value="5" />
<property name="maxIdle" value="10" />
<property name="timeBetweenEvictionRunsMillis" value="15000" />
<property name="minEvictableIdleTimeMillis" value="30000" />
<property name="numTestsPerEvictionRun" value="7" />
</bean>
The documentation for testOnBorrow states:
testOnBorrow: The indication of whether objects will be validated before being borrowed from the pool. If the object fails to validate, it will be dropped from the pool, and an attempt to borrow another will be made.
From this I understand that Spring will attempt to borrow another connection in the event of a validation failure.
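For what it's worth, the timeout the Java LDAP docs describe is controlled by the JNDI environment property com.sun.jndi.ldap.connect.timeout, and when it expires the provider throws an exception from the connection attempt rather than retrying; the pool's testOnBorrow behavior is what gives you the retry-like replacement of dead connections. A minimal sketch of setting that timeout on the context source (the 5000 ms value is illustrative, not from the original):

<bean id="exampleConnectionDetails" class="org.springframework.ldap.core.support.LdapContextSource" scope="singleton">
    <!-- ... url/userDn/password as above ... -->
    <property name="baseEnvironmentProperties">
        <map>
            <!-- abort connection attempts that take longer than 5 seconds -->
            <entry key="com.sun.jndi.ldap.connect.timeout" value="5000"/>
        </map>
    </property>
</bean>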
I am using the S3 module to poll files from S3. It downloads the file to the local system and starts processing it. I am running this on a 3-node cluster with a module count of 1. Now let's assume a file has been downloaded to the local system from S3 and XD is processing it. If the XD node goes down, it may have processed only half the messages; when the server comes up, it will start processing the file again, so I will get duplicate messages. I am trying to switch to an idempotent pattern with a message store so that I can change the module count to 3, but this duplicate-message issue will still be there.
<int:poller fixed-delay="${fixedDelay}" default="true">
    <int:advice-chain>
        <ref bean="pollAdvise"/>
    </int:advice-chain>
</int:poller>

<bean id="pollAdvise" class="org.springframework.integration.scheduling.PollSkipAdvice">
    <constructor-arg ref="healthCheckStrategy"/>
</bean>

<bean id="healthCheckStrategy" class="ServiceHealthCheckPollSkipStrategy">
    <property name="url" value="${url}"/>
    <property name="doHealthCheck" value="${doHealthCheck}"/>
</bean>

<bean id="credentials" class="org.springframework.integration.aws.core.BasicAWSCredentials">
    <property name="accessKey" value="${accessKey}"/>
    <property name="secretKey" value="${secretKey}"/>
</bean>

<bean id="clientConfiguration" class="com.amazonaws.ClientConfiguration">
    <property name="proxyHost" value="${proxyHost}"/>
    <property name="proxyPort" value="${proxyPort}"/>
    <property name="preemptiveBasicProxyAuth" value="false"/>
</bean>

<bean id="s3Operations" class="org.springframework.integration.aws.s3.core.CustomC1AmazonS3Operations">
    <constructor-arg index="0" ref="credentials"/>
    <constructor-arg index="1" ref="clientConfiguration"/>
    <property name="awsEndpoint" value="s3.amazonaws.com"/>
    <property name="temporaryDirectory" value="${temporaryDirectory}"/>
    <property name="awsSecurityKey" value=""/>
</bean>

<!-- aws-endpoint="https://s3.amazonaws.com" -->
<int-aws:s3-inbound-channel-adapter aws-endpoint="s3.amazonaws.com"
                                    bucket="${bucket}"
                                    s3-operations="s3Operations"
                                    credentials-ref="credentials"
                                    file-name-wildcard="${fileNameWildcard}"
                                    remote-directory="${remoteDirectory}"
                                    channel="splitChannel"
                                    local-directory="${localDirectory}"
                                    accept-sub-folders="false"
                                    delete-source-files="true"
                                    archive-bucket="${archiveBucket}"
                                    archive-directory="${archiveDirectory}">
</int-aws:s3-inbound-channel-adapter>

<int-file:splitter input-channel="splitChannel" output-channel="output" markers="false" charset="UTF-8">
    <int-file:request-handler-advice-chain>
        <bean class="org.springframework.integration.handler.advice.ExpressionEvaluatingRequestHandlerAdvice">
            <property name="onSuccessExpression" value="payload.delete()"/>
        </bean>
    </int-file:request-handler-advice-chain>
</int-file:splitter>

<int:idempotent-receiver id="expressionInterceptor" endpoint="output"
                         metadata-store="redisMessageStore"
                         discard-channel="nullChannel"
                         throw-exception-on-rejection="false"
                         key-expression="payload"/>
<bean id="redisMessageStore" class="o.s.i.redis.store.RedisChannelMessageStore">
<constructor-arg ref="redisConnectionFactory"/>
</bean>
<bean id="redisConnectionFactory"
class="o.s.data.redis.connection.jedis.JedisConnectionFactory">
<property name="port" value="7379" />
</bean>
<int:channel id="output"/>
Update 2
This configuration worked for me. Thanks for your help.
<int:idempotent-receiver id="s3Interceptor" endpoint="s3splitter"
metadata-store="redisMessageStore"
discard-channel="nullChannel"
throw-exception-on-rejection="false"
key-expression="payload.name"/>
<bean id="redisMessageStore" class="org.springframework.integration.redis.metadata.RedisMetadataStore">
<constructor-arg ref="redisConnectionFactory"/>
</bean>
<bean id="redisConnectionFactory"
class="org.springframework.data.redis.connection.jedis.JedisConnectionFactory">
<property name="port" value="6379" />
</bean>
<int:bridge id="batchBridge" input-channel="bridge" output-channel="output">
</int:bridge>
<int:idempotent-receiver id="splitterInterceptor" endpoint="batchBridge"
metadata-store="redisMessageStore"
discard-channel="nullChannel"
throw-exception-on-rejection="false"
key-expression="payload"/>
<int:channel id="output"/>
I had a few doubts I wanted to clarify, to check whether I am doing this right.
1) As you can see, I have an ExpressionEvaluatingRequestHandlerAdvice to delete the file. Will the file get deleted after I read the file into Redis, or after the last record is read?
2) I explored Redis using Desktop Manager, and I see I have one MetaData as the main key. Both metadata-store keys and values (for the file and for the payload) go to the same table; is this fine, or should they be different metadata stores?
3) Can I use a hash of the payload instead of the payload as the key? Is there something like payload.hash?
It looks like a continuation of Multiple message processed, but unfortunately we don't see the <idempotent-receiver> configuration in your case.
According to your comment, it looks like you continue to use the SimpleMetadataStore, or you clean the shared one (Redis/Mongo) very often.
You should share more info on where to dig. Some logs and a DEBUG investigation would be good, too.
UPDATE
The Idempotent Receiver is exactly for an endpoint. In your config it is applied to a MessageChannel; that's why you don't achieve any proper behavior, because the MessageChannel is simply ignored by the IdempotentReceiverInterceptor.
You should add an id to your <int-file:splitter> and use that id in the endpoint attribute. Not sure it would be a good idea to use the File as the key for idempotency; the name sounds better.
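A minimal sketch of that suggestion, assuming the splitter is given the id s3splitter (this mirrors the working configuration shown under Update 2 above):

<int-file:splitter id="s3splitter" input-channel="splitChannel" output-channel="output"
                   markers="false" charset="UTF-8"/>

<int:idempotent-receiver id="s3Interceptor" endpoint="s3splitter"
                         metadata-store="redisMessageStore"
                         discard-channel="nullChannel"
                         key-expression="payload.name"/>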
UPDATE 2
If a node goes down, and let's assume a file (with a million records, possibly GBs in size) has been downloaded to an XD node and half the records have been processed when the node crashes: when the server comes up, I think we will process the same records again?
OK, I finally got your point: you have an issue with the lines already split from the file.
Also, I'd use an Idempotent Receiver for the <splitter> as well, to avoid duplicate files from S3.
To fix your use case, you should place one more endpoint between the <splitter> and the output channel: a <bridge> to skip duplicate lines with the Idempotent Receiver. That is exactly the batchBridge plus splitterInterceptor wiring shown under Update 2 above.
We are using Spring Integration in our project, and we have a requirement that if IBM MQ goes down, we must automatically reconnect to it when it comes back up. We implemented this using the recoveryInterval option of org.springframework.jms.listener.DefaultMessageListenerContainer, setting the interval at which to try to recover the MQ connection. But it is not recovering the connection after an MQ restart. Below is my existing configuration:
<jms:message-driven-channel-adapter id="adapterId" channel="raw-channel"
                                    container="messageListenerContainer" />

<bean id="messageListenerContainer" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
    <property name="connectionFactory" ref="queueCachingConnectionFactory" />
    <property name="destination" ref="queue" />
    <property name="recoveryInterval" value="60000" />
</bean>
Below is the current connection factory:

<bean id="queueCachingConnectionFactory"
      class="org.springframework.jms.connection.CachingConnectionFactory">
    <property name="targetConnectionFactory" ref="queueConnectionFactory" />
    <property name="sessionCacheSize" value="10" />
    <property name="cacheProducers" value="false" />
    <!-- <property name="reconnectOnException" value="true" /> -->
    <!-- <property name="exceptionListener" ref="MQExceptionListener"></property> -->
</bean>

<jee:jndi-lookup id="queueConnectionFactory" jndi-name="MQConnectionFactory"
                 expected-type="javax.jms.ConnectionFactory" lookup-on-startup="true"/>

<jee:jndi-lookup id="queue" jndi-name="Queue"
                 expected-type="javax.jms.Queue" lookup-on-startup="true"/>
ERROR [task-scheduler-4] LoggingHandler:145 -org.springframework.jms.IllegalStateException: MQJCA1019: The connection is closed.; nested exception is com.ibm.msg.client.jms.DetailedIllegalStateException: MQJCA1019: The connection is closed.
The application attempted to use a JMS connection after it had closed the connection.
Modify the application so that it closes the JMS connection only after it has finished using the connection.
Thanks in advance!
The DefaultMessageListenerContainer should reference the caching connection factory:
<!-- caching connection factory facade; also implements ExceptionListener -->
<bean id="cachingConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">
    <property name="targetConnectionFactory" ref="connectionFactory"/>
    <property name="sessionCacheSize" value="10"/>
    <property name="reconnectOnException" value="true"/>
</bean>

<!-- this is the Message Driven POJO (MDP) -->
<bean id="messageDrivenPOJO" class="com.redhat.gss.spring.MessageDrivenPOJO" />

<!-- the message listener container -->
<bean id="messageListener" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
    <property name="sessionTransacted" value="true"/>
    <property name="concurrentConsumers" value="1"/>
    <property name="cacheLevelName" value="CACHE_CONSUMER"/>
    <property name="receiveTimeout" value="10000"/>
    <property name="sessionAcknowledgeMode" value="2"/>
    <property name="messageListener" ref="messageDrivenPOJO"/>
    <property name="connectionFactory" ref="cachingConnectionFactory"/>
    <property name="exceptionListener" ref="cachingConnectionFactory"/>
    <property name="destination" ref="jbossQueue"/>
</bean>
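Two details in that configuration do the heavy lifting: reconnectOnException="true" tells the CachingConnectionFactory to reset its shared connection when a JMSException is reported, and because CachingConnectionFactory itself implements javax.jms.ExceptionListener, wiring it as the container's exceptionListener is what delivers those broken-connection events to it. Combined with the container's recoveryInterval, the listener can then re-establish the connection after an MQ restart instead of reusing the stale cached connection that causes the MQJCA1019 error above.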
I have set up a simple file integration that reads files from a directory and transfers them via SFTP to a remote system. It is working fine; however, I need to handle archiving after a successful transfer, and retrying up to n times on an unsuccessful transfer. Any advice on how this can be done using Spring Integration components?
<bean id="sftpSessionFactory" class="org.springframework.integration.sftp.session.DefaultSftpSessionFactory">
<property name="host" value="sftp.host"/>
<property name="port" value="22"/>
<property name="user" value="user1"/>
<property name="password" value="pass1"/>
</bean>
<file:inbound-channel-adapter channel="fileInputChannel"
directory="C:/FileServer"
prevent-duplicates="true"
filename-pattern="*.pdf">
<integration:poller fixed-rate="5000"/>
</file:inbound-channel-adapter>
<sftp:outbound-channel-adapter id="sftpOutboundAdapter"
session-factory="sftpSessionFactory"
channel="fileInputChannel"
charset="UTF-8"
remote-directory=".">
<sftp:request-handler-advice-chain>
<bean class="org.springframework.integration.handler.advice.ExpressionEvaluatingRequestHandlerAdvice">
<property name="onSuccessExpression" value="payload.renameTo('C:/succeeded/' + payload.name)"/>
<property name="successChannel" ref="afterSuccessDeleteChannel"/>
<property name="onFailureExpression" value="payload.renameTo('C:/failed/' + payload.name)"/>
<property name="failureChannel" ref="afterFailRenameChannel" />
</bean>
<bean id="retryAdvice" class="org.springframework.integration.handler.advice.RequestHandlerRetryAdvice">
<property name="retryTemplate" ref="retryTemplate"/>
</bean>
</sftp:request-handler-advice-chain>
</sftp:outbound-channel-adapter>
Thanks
See the retry-and-more sample. It shows how to use the retry advice and how to take different actions based on success/failure.
For both, you would declare the retry advice after the expression-evaluating advice, so that the expression-evaluating advice is invoked once retries are exhausted (its failure expression fires only after the final attempt fails).
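The configuration above references a retryTemplate bean that is not shown. A minimal sketch of what it might look like, assuming Spring Retry is on the classpath (the three attempts and the backoff values are illustrative, not from the original):

<bean id="retryTemplate" class="org.springframework.retry.support.RetryTemplate">
    <property name="retryPolicy">
        <!-- give up after 3 attempts in total -->
        <bean class="org.springframework.retry.policy.SimpleRetryPolicy">
            <property name="maxAttempts" value="3"/>
        </bean>
    </property>
    <property name="backOffPolicy">
        <!-- wait 1s, then 2s, then 4s between attempts -->
        <bean class="org.springframework.retry.backoff.ExponentialBackOffPolicy">
            <property name="initialInterval" value="1000"/>
            <property name="multiplier" value="2.0"/>
        </bean>
    </property>
</bean>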