Crawling rules in heritrix, how to load embedded content?

Crawling rules in heritrix, how to load embedded content? - heritrix

I want heritrix (version 3.4.0 currently) to crawl site.domain/path and load all pages below that but also include needed things to show the pages, like imgages, scripts and such.
According to https://heritrix.readthedocs.io/en/latest/glossary.html heading "Discovery Path", what I want is "Embedded links" - E and maybe Speculative embed - X. I do not want it to follow normal links - L outside my path.
I have been experimenting with the rules and my basic idea is this: (Last matching rule wins according to docs.)
accept all
reject everything outside site.domain/path
accept embedded files (images/css/script/etc)
It works fine to crawl, only pages within that path on the server but it does not load the needed files for the pages.
How to make it load the needed files as well?
Configuration in my job so far:
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule" />
<bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regexList">
<list>
<value>.*site\.domain/path/.*</value>
</list>
</property>
</bean>
<!-- HOW to accept embedded things here? -->
<!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
<!-- ...and REJECT those with suspicious repeating path-segments... -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
<!-- ...and REJECT those with more than threshold number of path-segments... -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
<!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
<bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
<!-- ...but always REJECT those with unsupported URI schemes -->
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
</bean>
</list>
</property>
</bean>

This accepts those containing E or X in the Discovery Path.
<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="regex" value="(E|X)" />
</bean>
PS
Ironic when you spend some hours on something and when you make a question and while making an adjustment on it stumble upon the solution.

Related

How to set expiry policy for ignite Cache [Inserting data using SPARK Dataframe]

I am trying to set Expiry Policy using xml config file:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Ignite configuration with all defaults and enabled p2p deployment and enabled events.
-->
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:util="http://www.springframework.org/schema/util"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/util
http://www.springframework.org/schema/util/spring-util.xsd">
<bean abstract="true" id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<!-- Set to true to enable distributed class loading for examples, default is false. -->
<property name="peerClassLoadingEnabled" value="true"/>
<!-- Enable task execution events for examples. -->
<property name="includeEventTypes">
<list>
<!--Task execution events-->
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_STARTED"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FINISHED"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_TIMEDOUT"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_SESSION_ATTR_SET"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_REDUCED"/>
<!--Cache events-->
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_OBJECT_PUT"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_OBJECT_READ"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_OBJECT_REMOVED"/>
</list>
</property>
<!-- Explicitly configure TCP discovery SPI to provide list of initial nodes. -->
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<!--
Ignite provides several options for automatic discovery that can be used
instead os static IP based discovery. For information on all options refer
to our documentation: http://apacheignite.readme.io/docs/cluster-config
-->
<!-- Uncomment static IP finder to enable static-based discovery of initial nodes. -->
<!--<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">-->
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
<property name="addresses">
<list>
<!-- In distributed environment, replace with actual host IP address. -->
<value>127.0.0.1:47500..47509</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
</bean>
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="setExpiryPolicyFactory" >
<bean class="javax.cache.expiry.CreatedExpiryPolicy">
<!---- Remove all data after 10 Seconds---->
<property name="expiryDuration" value="10000"/>
</bean>
</property>
<property name="eagerTtl" value="true"/>
</bean>
I am a bit new for Ignite. I walk through documentation but unable to find. Is any way to setExpiryPolicy into XMl?
Any help would be appreciated!

That requires a bit of Spring XML magic because there is no easy-to-use factory class for the CreatedExpiryPolicy (i.e. no CreatedExpiryPolicyFactory that you can just instantiate like a normal bean), but there is the CreatedExpiryPolicy::factoryOf that can be passed as factory-method:
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="SECONDS"/>
<constructor-arg value="10"/>
</bean>
</constructor-arg>
</bean>
</property>

Apache Ignite doesn`t support expiry policies and TTL for SQL tables.
An alternative would be to create cache with dynamically generated names based on desired time window.
For example you could append hour of the day if you need it hourly:
MY_CACHE_00
MY_CACHE_01
MY_CACHE_02
etc...
MY_CACHE_23
Then make your software write/read to the latest hour CACHE, and clear the oldest.

no bean named > 'mystoreBrandCategoryCodeValueProvider' available (hybris)

Caused by: java.util.concurrent.ExecutionException:
de.hybris.platform.solrfacetsearch.indexer.exceptions.IndexerRuntimeException:
de.hybris.platform.solrfacetsearch.indexer.exceptions.IndexerException:
Failed to index item with PK 8796431187969: No bean named
'mystoreBrandCategoryCodeValueProvider' available
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_171]
at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_171]
at de.hybris.platform.solrfacetsearch.indexer.strategies.impl.DefaultIndexerStrategy.runWorkers(DefaultIndexerStrategy.java:141)
~[solrfacetsearchserver.jar:?]
I get this error when i try to go to localhost for mystore.
My steps:
i created b2b from b2c as described on helphybris
it is working well because i can visit powertools website
I copied all impexes from powertools to mystore which is under mystoreinitialdata/import
then i went to backoffice/wcms and saw my store as url
and also i could see my catalogs on catalogs tab; product, catalog and classification. Just like powertools.
What i want is, with powertools impexes copied to mystore, i want to see powertools items under mystore.
But it gives error which i posted in the beginning.
I only copied impexes.
For example
mystore/solr.impex
has
;$solrIndexedType; color ;string;;;Refine;Alpha; 4000;true;;mystoreVariantCategoryCodeValueProvider;categoryFacetDisplayNameProvider;defaultTopValuesProvider
which i copied from powertools. But powertools has
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context.xsd">
<context:annotation-config/>
<alias alias="b2bAcceleratorCoreSystemSetup" name="powertoolsStoreSystemSetup" />
<bean id="powertoolsStoreSystemSetup" class="de.hybris.platform.powertoolsstore.setup.PowertoolsStoreSystemSetup" parent="abstractCoreSystemSetup">
<property name="powertoolsCoreDataImportService" ref="powertoolsCoreDataImportService"/>
<property name="powertoolsSampleDataImportService" ref="powertoolsSampleDataImportService"/>
</bean>
<bean id="powertoolsSampleDataImportService" class="de.hybris.platform.powertoolsstore.services.dataimport.impl.PowertoolsSampleDataImportService"
parent="sampleDataImportService">
</bean>
<bean id="powertoolsCoreDataImportService" class="de.hybris.platform.powertoolsstore.services.dataimport.impl.PowertoolsCoreDataImportService"
parent="coreDataImportService">
</bean>
<!-- Solr field value providers TEMPORARY FOR NOW SO DO NOT NEED TO DEPEND ON yb2bacceleratorcore -->
<bean id="powertoolsCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="powertoolsCategorySource"/>
</bean>
<bean id="powertoolsBrandCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="powertoolsBrandCategorySource"/>
</bean>
<bean id="powertoolsVariantCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="powertoolsVariantCategorySource"/>
</bean>
<bean id="powertoolsCategoryNameValueProvider" parent="abstractCategoryNameValueProvider">
<property name="categorySource" ref="powertoolsCategorySource"/>
</bean>
<bean id="powertoolsBrandCategoryNameValueProvider" parent="abstractCategoryNameValueProvider">
<property name="categorySource" ref="powertoolsBrandCategorySource"/>
</bean>
<bean id="powertoolsCategorySource" parent="variantCategorySource">
<property name="rootCategory" value="1"/> <!-- '1' is the root icecat category -->
</bean>
<bean id="powertoolsVariantCategorySource" parent="variantCategorySource"/>
<bean id="powertoolsBrandCategorySource" parent="defaultCategorySource">
<property name="rootCategory" value="brands"/> <!-- 'brands' is the root of the brands hierarchy -->
</bean>
<!-- Solr field value providers TEMPORARY FOR NOW SO DO NOT NEED TO DEPEND ON yb2bacceleratorcore -->
</beans>
this in powertoolsspring-xml
there is no folder as mystorestore because the directory is powertoolsstore in
<bean id="powertoolsSampleDataImportService" class="de.hybris.platform.powertoolsstore.services.dataimport.impl.PowertoolsSampleDataImportService"
parent="sampleDataImportService">
and for
class="de.hybris.platform.powertoolsstore.setup.PowertoolsStoreSystemSetup"
mystore only has
mystore/initialdata/setup/InitialDataSystemSetup.java
and for
<bean id="powertoolsSampleDataImportService" class="de.hybris.platform.powertoolsstore.services.dataimport.impl.PowertoolsSampleDataImportService"
parent="sampleDataImportService">
mystore doesnot havea services.
What should i do? I want to see localhost with items. so i thought best way is to copy from powertools?

you solr indexer cron job is searching for bean 'mystoreBrandCategoryCodeValueProvider', so this bean should be defined in your spring file, remove it if not used.
possible solutions:
1. update solr.impex : remove this bean if you are not using it and import the impex via hac or update the system and make your your impex is being imported while system update.
Check your solrIndexedType if some old filed is using this bean, remove it (via hmc)
2.Add this bean into spring file if you are using it.

Hope you have copied all Impex correctly
Make sure
Copy impex correctly in right folder path
/mystoreinitialdata/resources/mystoreinitialdata/import/sampledata/productCatalogs/mystoreProductCatalog/products-media.impex
Update powertool word reference with mystore
Point siteResource to correct path
$siteResource=jar:com.mystore.initialdata.constants.MystoreInitialDataConstants&/mystoreinitialdata/import/sampledata/productCatalogs/$productCatalog
Correct the InitialDataSystemSetup class
Like
public static final String MYSTORE = "mystore";
#SystemSetup(type = Type.PROJECT, process = Process.ALL)
public void createProjectData(final SystemSetupContext context)
{
final List<ImportData> importData = new ArrayList<ImportData>();
final ImportData mystoreImportData = new ImportData();
mystoreImportData.setProductCatalogName(MYSTORE);
mystoreImportData.setContentCatalogNames(Arrays.asList(MYSTORE));
mystoreImportData.setStoreNames(Arrays.asList(MYSTORE));
importData.add(mystoreImportData);
/* uncomment below line to test mystoreinitialdata */
getCoreDataImportService().execute(this, context, importData);
getEventService().publishEvent(new CoreDataImportedEvent(context, importData));
getSampleDataImportService().execute(this, context, importData);
getEventService().publishEvent(new SampleDataImportedEvent(context, importData));
}
Correct/Add the bean in your *core-spring.xml which you have used in your impex.
Like
<bean id="yAcceleratorInitialDataSystemSetup"
class="com.store.initialdata.setup.InitialDataSystemSetup"
parent="abstractCoreSystemSetup">
<property name="coreDataImportService" ref="coreDataImportService"/>
<property name="sampleDataImportService" ref="sampleDataImportService"/>
</bean>
<!-- Solr ValueProvider -->
<bean id="mystoreCategorySource" parent="variantCategorySource">
<property name="rootCategory" value="1" /> <!-- '1' is the root icecat category -->
</bean>
<bean id="mystoreVariantCategorySource" parent="variantCategorySource" />
<bean id="mystoreBrandCategorySource" parent="defaultCategorySource">
<property name="rootCategory" value="brands" /> <!-- 'brands' is the root of the brands hierarchy -->
</bean>
<bean id="mystoreCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="mystoreCategorySource" />
</bean>
<bean id="mystoreBrandCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="mystoreBrandCategorySource" />
</bean>
<bean id="mystoreVariantCategoryCodeValueProvider" parent="abstractCategoryCodeValueProvider">
<property name="categorySource" ref="mystoreVariantCategorySource" />
</bean>
<bean id="mystoreCategoryNameValueProvider" parent="abstractCategoryNameValueProvider">
<property name="categorySource" ref="mystoreCategorySource" />
</bean>
<bean id="mystoreBrandCategoryNameValueProvider" parent="abstractCategoryNameValueProvider">
<property name="categorySource" ref="mystoreBrandCategorySource" />
</bean>
Update your system

Update the running system!
hac > Platform > Update

Integrating Apache Cassandra with Apache Ignite

I'm trying to integrate Apache Ignite with Apache Cassandra(3.11.2) as I want to use Ignite to cache the data present in my already existing Cassandra database.
After going through the online resources, I've done the following till now:
Downloaded Apache Ignite.
Copied all the folders present in "libs/optional/" to "libs/"(I don't know which ones will be required for Cassandra).
Created 3 xmls in the config folder i.e. "cassandra-config.xml", "connection-settings.xml" and "persistance-settings.xml". Currently I'm using the same node(172.16.129.68) for both Cassandra and Ignite.
cassandra-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<!-- Cassandra connection settings -->
<import resource="connection-settings.xml" />
<!-- Persistence settings for 'cache1' -->
<bean id="cache1_persistence_settings" class="org.apache.ignite.cache.store.cassandra.persistence.KeyValuePersistenceSettings">
<constructor-arg type="org.springframework.core.io.Resource" value="file:/home/cass/apache_ignite/apache-ignite-fabric-2.4.0-bin/config/persistance-settings.xml" />
</bean>
<!-- Ignite configuration -->
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="cacheConfiguration">
<list>
<!-- Configuring persistence for "cache1" cache -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache1"/>
<property name="readThrough" value="true"/>
<property name="writeThrough" value="true"/>
<property name="writeBehindEnabled" value="true"/>
<property name="cacheStoreFactory">
<bean class="org.apache.ignite.cache.store.cassandra.CassandraCacheStoreFactory">
<property name="dataSourceBean" value="cassandraAdminDataSource"/>
<property name="persistenceSettingsBean" value="cache1_persistence_settings"/>
</bean>
</property>
</bean>
</list>
</property>
<!-- Explicitly configure TCP discovery SPI to provide list of initial nodes. -->
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<!--
Ignite provides several options for automatic discovery that can be used
instead os static IP based discovery. For information on all options refer
to our documentation: http://apacheignite.readme.io/docs/cluster-config
-->
<!-- Uncomment static IP finder to enable static-based discovery of initial nodes. -->
<!--<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">-->
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
<property name="addresses">
<list>
<!-- In distributed environment, replace with actual host IP address. -->
<value>172.16.129.68:47500..47509</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
</bean>
connection-settings.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<bean id="loadBalancingPolicy" class="com.datastax.driver.core.policies.TokenAwarePolicy">
<constructor-arg type="com.datastax.driver.core.policies.LoadBalancingPolicy">
<bean class="com.datastax.driver.core.policies.RoundRobinPolicy"/>
</constructor-arg>
</bean>
<bean id="cassandraAdminDataSource" class="org.apache.ignite.cache.store.cassandra.datasource.DataSource">
<property name="port" value="9042"/>
<property name="contactPoints" value="172.16.129.68"/>
<property name="readConsistency" value="ONE"/>
<property name="writeConsistency" value="ONE"/>
<property name="loadBalancingPolicy" ref="loadBalancingPolicy"/>
</bean>
persistance-settings.xml
<persistence keyspace="test" table="epc_table">
<keyPersistence class="java.lang.String" strategy="PRIMITIVE" column="imsi"/>
<valuePersistence strategy="BLOB"/>
</persistence>
I run the following command to start Ignite from bin folder.
ignite.sh ../config/cassandra-config.xml
Now, I want to take a look at the cassandra table via sqlline. I've tried the following:
./sqlline.sh -u jdbc:cassandra://172.16.129.68:9042/test //(test is the name of the keyspace)
I get the following output:
No known driver to handle "jdbc:cassandra://172.16.129.68:9042/test". Searching for known drivers...
java.lang.NullPointerException
sqlline version 1.3.0
0: jdbc:cassandra://172.16.129.68:9042/test>
I've also tried:
./sqlline.sh -u jdbc:ignite:thin://172.16.129.68
but when I use "!tables", I'm not able to see any table.
What exactly has been missing? How to access/modify the tables present in Cassandra using sqlline?
Operating System: RHEL 6.5

Apache Ignite is a key-value database and there are no tables created by default that you are able to view with JDBC connector. CacheStore is a way to integrate Ignite with external DB or any other storage, and it loads data as a key-value pair.
In your config you said Ignite that you want to store and load entries in/from Cassandra, but still Ignite doesn't know entries structure (BTW Ignite really doesn't care what objects were putted into it).
To be able to list tables and do queries on it, you need to create tables. For that you need to have ignite-indexing in /lib directory and set QueryEntity or indexed types if you have annotated POJOs. Here is example of such configuration:
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="mycache"/>
<!-- Configure query entities -->
<property name="queryEntities">
<list>
<bean class="org.apache.ignite.cache.QueryEntity">
<property name="keyType" value="java.lang.Long"/>
<property name="valueType" value="org.apache.ignite.examples.Person"/>
<property name="fields">
<map>
<entry key="id" value="java.lang.Long"/>
<entry key="orgId" value="java.lang.Long"/>
<entry key="firstName" value="java.lang.String"/>
<entry key="lastName" value="java.lang.String"/>
<entry key="resume" value="java.lang.String"/>
<entry key="salary" value="java.lang.Double"/>
</map>
</property>
<property name="indexes">
<list>
<bean class="org.apache.ignite.cache.QueryIndex">
<constructor-arg value="id"/>
</bean>
<bean class="org.apache.ignite.cache.QueryIndex">
<constructor-arg value="orgId"/>
</bean>
<bean class="org.apache.ignite.cache.QueryIndex">
<constructor-arg value="salary"/>
</bean>
</list>
</property>
</bean>
</list>
</property>
If you configure that, you'll get an ability to enlist and query that tables over SQLine. (Please note, that you cannot query data that are not loaded into Ignite. To load them, you may use IgniteCache.get() with enabled readThrough option or IgniteCache.loadCache() to load everything from Cassandra table).
To query Cassandra with JDBC, you need a JDBC driver for it, try, for example DBSchema.

duplicate message processed when polling files from s3

I am using s3 module to poll files from s3.It downloads the file to local system and starts processing it.I am running this on 3 node cluster with module count as 1.Now lets assume the file is downloaded to local system from s3 and xd is processing it.If xd node goes down it would have processed half the message.When the server comes up it will start processing file again hence I will get duplicate message.I am trying to change to idempotent pattern with message store to change the module count to 3 but still this duplicate message issues will be there.
<int:poller fixed-delay="${fixedDelay}" default="true">
<int:advice-chain>
<ref bean="pollAdvise"/>
</int:advice-chain>
</int:poller>
<bean id="pollAdvise" class="org.springframework.integration.scheduling.PollSkipAdvice">
<constructor-arg ref="healthCheckStrategy"/>
</bean>
<bean id="healthCheckStrategy" class="ServiceHealthCheckPollSkipStrategy">
<property name="url" value="${url}"/>
<property name="doHealthCheck" value="${doHealthCheck}"/>
</bean>
<bean id="credentials" class="org.springframework.integration.aws.core.BasicAWSCredentials">
<property name="accessKey" value="${accessKey}"/>
<property name="secretKey" value="${secretKey}"/>
</bean>
<bean id="clientConfiguration" class="com.amazonaws.ClientConfiguration">
<property name="proxyHost" value="${proxyHost}"/>
<property name="proxyPort" value="${proxyPort}"/>
<property name="preemptiveBasicProxyAuth" value="false"/>
</bean>
<bean id="s3Operations" class="org.springframework.integration.aws.s3.core.CustomC1AmazonS3Operations">
<constructor-arg index="0" ref="credentials"/>
<constructor-arg index="1" ref="clientConfiguration"/>
<property name="awsEndpoint" value="s3.amazonaws.com"/>
<property name="temporaryDirectory" value="${temporaryDirectory}"/>
<property name="awsSecurityKey" value=""/>
</bean>
<!-- aws-endpoint="https://s3.amazonaws.com" -->
<int-aws:s3-inbound-channel-adapter aws-endpoint="s3.amazonaws.com"
bucket="${bucket}"
s3-operations="s3Operations"
credentials-ref="credentials"
file-name-wildcard="${fileNameWildcard}"
remote-directory="${remoteDirectory}"
channel="splitChannel"
local-directory="${localDirectory}"
accept-sub-folders="false"
delete-source-files="true"
archive-bucket="${archiveBucket}"
archive-directory="${archiveDirectory}">
</int-aws:s3-inbound-channel-adapter>
<int-file:splitter input-channel="splitChannel" output-channel="output" markers="false" charset="UTF-8">
<int-file:request-handler-advice-chain>
<bean class="org.springframework.integration.handler.advice.ExpressionEvaluatingRequestHandlerAdvice">
<property name="onSuccessExpression" value="payload.delete()"/>
</bean>
</int-file:request-handler-advice-chain>
</int-file:splitter>
<int:idempotent-receiver id="expressionInterceptor" endpoint="output"
metadata-store="redisMessageStore"
discard-channel="nullChannel"
throw-exception-on-rejection="false"
key-expression="payload"/>
<bean id="redisMessageStore" class="o.s.i.redis.store.RedisChannelMessageStore">
<constructor-arg ref="redisConnectionFactory"/>
</bean>
<bean id="redisConnectionFactory"
class="o.s.data.redis.connection.jedis.JedisConnectionFactory">
<property name="port" value="7379" />
</bean>
<int:channel id="output"/>
Update 2
This configuration worked for me Thanks for your help.
<int:idempotent-receiver id="s3Interceptor" endpoint="s3splitter"
metadata-store="redisMessageStore"
discard-channel="nullChannel"
throw-exception-on-rejection="false"
key-expression="payload.name"/>
<bean id="redisMessageStore" class="org.springframework.integration.redis.metadata.RedisMetadataStore">
<constructor-arg ref="redisConnectionFactory"/>
</bean>
<bean id="redisConnectionFactory"
class="org.springframework.data.redis.connection.jedis.JedisConnectionFactory">
<property name="port" value="6379" />
</bean>
<int:bridge id="batchBridge" input-channel="bridge" output-channel="output">
</int:bridge>
<int:idempotent-receiver id="splitterInterceptor" endpoint="batchBridge"
metadata-store="redisMessageStore"
discard-channel="nullChannel"
throw-exception-on-rejection="false"
key-expression="payload"/>
<int:channel id="output"/>
I had few doubts wanted to clarify If i am doing right.
1)As you can see I have ExpressionEvaluatingRequestHandlerAdvice to delete the file.Will the file get deleted after i read the file into redis or after last record is read?
2)I explored redis using desktop manager I see this I have a MetaData as man Key
Both (file and payload) metadatastore key and value are going to same table is this fine?or should it be different metadatastore?
Can i use hash of payload instead of payload as key?Is there something like payload.hash!

Looks like it is continuation of the Multiple message processed, but unfortunately we don't see <idempotent-receiver> configuration in your case.
According to your comment there looks like you continue to use SimpleMetadataStore or clean the shared one (Redis/Mongo) very often.
You should share more info where to dig. Some logs and DEBUG investigation would be good, too.
UPDATE
The Idempotent Receiver is exactly for endpoint. In your config it is for the MessageChannel. That's why you don't achieve any proper work, because the MessageChannel is just ignored from the IdempotentReceiverInterceptor.
you should add an id for your <int-file:splitter> and use that id from the endpoint attribute. Not should if that would be good idea to use File as a key for idempotency. The name sounds better.
UPDATE 2
If a node goes down and lets assume a file is dowloaded(file size with million records may be gb) to xd node and I would have processed half the records and node crashes .When server comes up I think we will process same records again?
OK. I got your point finally! you have an issue with splitted lines from the file already.
Also I'd use Idempotent Receiver for the <splitter> as well to avoid duplicate files from S3.
To fix your use-case you should place in between <splitter> and output channel one more endpoint - <bridge> to skip duplicate lines with the Idempotent Receiver.

Spring Batch -- Multi-File-Resource -- Takes same time as single Thread?

I am using Spring Batch for data migration from XML to Oracle Database.
With Single Thread execution, process takes 80-90 Mins to insert 20K users approx.
I want to reduce it to more than half but even using Multi File Resource, I am not able to achieve that.
I have a single XML to be processed so I started simply by adding
task executor and making Reader synchronized but not able to achieve gain.
So what I am doing, I split XML into multiple XMLS and want to try with Multi File Resource. Here is the configuration.
<batch:job id="importJob">
<batch:step id="step1Master">
<batch:partition handler="handler" partitioner="partitioner" />
</batch:step>
</batch:job>
<bean id="handler"
class="org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler">
<property name="taskExecutor" ref="taskExecutor" />
<property name="step" ref="slaveStep" />
<property name="gridSize" value="20" />
</bean>
<batch:step id="slaveStep">
<batch:tasklet transaction-manager="transactionManager"
allow-start-if-complete="true">
<batch:chunk reader="reader" writer="writer"
processor="processor" commit-interval="1000" skip-limit="1500000">
<batch:skippable-exception-classes>
<batch:include class="java.lang.Exception" />
</batch:skippable-exception-classes>
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="taskExecutor"
class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor">
<property name="corePoolSize" value="100" />
<property name="maxPoolSize" value="300" />
<property name="allowCoreThreadTimeOut" value="true" />
</bean>
<bean id="partitioner"
class="org.springframework.batch.core.partition.support.MultiResourcePartitioner"
scope="step">
<property name="keyName" value="inputFile" />
<property name="resources"
value="file:/.../*.xml" />
</bean>
<bean id="processor"
class="...Processor"
scope="step" />
<bean id="reader" class="org.springframework.batch.item.xml.StaxEventItemReader"
scope="step">
<property name="fragmentRootElementName" value="user" />
<property name="unmarshaller" ref="userDetailUnmarshaller" />
<property name="resource" value="#{stepExecutionContext[inputFile]}" />
</bean>
My Single XML file contains users around 1000 and I am trying by having 20 files.
I kept commit-interval=1000 as each file has 1000 records to be insert in DB.
Do commit-interval needs to adjusted accordingly?
I am using ORACLE DB, Do I need to do any pool management there.
Current Pool of ORACLE DB configured in JBOSS
Min Pool = 100
Max Pool = 300
I see logging like
17:01:50,553 DEBUG [Writer] (taskExecutor-11) [UserDetailWriter] | user added
17:01:50,683 DEBUG [Writer] (taskExecutor-15) [UserDetailWriter] | user added
17:01:51,093 DEBUG [Writer] (taskExecutor-11) [UserDetailWriter] | user added
17:01:59,795 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
17:02:00,385 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
17:02:00,385 DEBUG [Writer] (taskExecutor-12) [UserDetailWriter] | user added
It seems multiple threads are being created but still I am not seeing any performance improvement here?
Please suggest what I am doing wrong?

go through this documentation for parallel processing
http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html#scalabilityParallelSteps

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Crawling rules in heritrix, how to load embedded content? - heritrix

Related

How to set expiry policy for ignite Cache [Inserting data using SPARK Dataframe]

no bean named > 'mystoreBrandCategoryCodeValueProvider' available (hybris)

Integrating Apache Cassandra with Apache Ignite

duplicate message processed when polling files from s3

Spring Batch -- Multi-File-Resource -- Takes same time as single Thread?

Categories

Resources