Process multiple reads in parallel in Spring Integration - spring-integration

I have one file which contains some records, and a second file which contains the same records but with more details. I want to process them in such a way that I read one record from the first file and search for it in the second.
How can I read the two files in parallel?
Update:
return IntegrationFlows.from(readFilefromDirectory(), fileInboundPollingConsumer())
        .split(fileSplitter())
        .channel(c -> c.executor(id, executor))
Here in the first line we read the first file, then split it and send it to an executor channel. I want to know where I should write the logic to read the second file, i.e. read it from the directory and then search for the record in that Excel file.

All the parallel work in Spring Integration can be done via an ExecutorChannel.
I'd suggest using a FileSplitter for the first file and an ExecutorChannel as its output.
As for the second file... Well, I'd read it once into memory, e.g. as a Map, if you have some delimiter between a record and its details, and use that in-memory store for each incoming record.
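As a rough illustration of that in-memory Map idea (the class name, the '|' delimiter and the key layout are all assumptions, not from the original post):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DetailStore {

    private final Map<String, String> detailsByKey = new HashMap<>();

    // Load the whole detail file once; the '|' delimiter and key position are assumptions.
    public DetailStore(String detailFilePath) throws IOException {
        for (String line : Files.readAllLines(Paths.get(detailFilePath), StandardCharsets.UTF_8)) {
            int idx = line.indexOf('|');
            if (idx > 0) {
                detailsByKey.put(line.substring(0, idx), line);
            }
        }
    }

    // Called for each record coming from the FileSplitter.
    public String lookup(String record) {
        return detailsByKey.get(record.trim());
    }
}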
UPDATE
If the second file is very big, you can do something like this:
// 'payload' is the record coming from the splitter; 'file' is the second (detail) file
try (Scanner scanner = new Scanner(file)) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        if (line.startsWith(payload)) {
            return line;
        }
    }
}
UPDATE 2
return IntegrationFlows.from(readFilefromDirectory(), fileInboundPollingConsumer())
        .split(fileSplitter())
        .channel(c -> c.executor(id, executor))
        .handle("searchService", "searchMethod")
where searchService is some service @Bean (or just @Service) and searchMethod accepts the String line and implements the logic to search in the other file. That search operation will be done in parallel for each split line.
The logic to read an Excel file is out of the Spring Integration scope. That is a fully different question.
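For illustration only, a minimal sketch of such a service, reusing the hypothetical DetailStore from the sketch above; the bean and method names must match what is passed to .handle("searchService", "searchMethod"):
import org.springframework.stereotype.Service;

// Hypothetical service bean; the name must match .handle("searchService", "searchMethod")
@Service("searchService")
public class SearchService {

    private final DetailStore detailStore;

    public SearchService(DetailStore detailStore) {
        this.detailStore = detailStore;
    }

    // Invoked once per split line, concurrently on the executor channel's threads
    public String searchMethod(String line) {
        return detailStore.lookup(line);
    }
}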

Related

Camel ${body} sort and transmit via SFTP

I need some advice on the proper Camel EIP to use to accomplish the following:
I have a Camel ${body} that consists of a List of Maps ... an example (as Groovy code):
def y = [
    ['sku':'VT-KK2150', 'Performance':'act7', 'ProductGroup':'Jammers', 'name':'ProductPerformance1'],
    ['sku':'VT-LL2150', 'Performance':'act8', 'ProductGroup':'Sammers', 'name':'ProductPerformance2'],
    ['sku':'VT-MM2150', 'Performance':'act9', 'ProductGroup':'Bammers', 'name':'ProductPerformance3'],
    ['sku':'VT-RR2150', 'Performance':'act29', 'ProductGroup':'Bammers', 'name':'ProductPerformance3'] ]
I wish to send an SFTP copy of the Map content to a plain file.
So far, I CAN accomplish the above without trouble. As a matter of clarification, I can see the expected content of the List by including the following in my Camel route:
<log message="Sku from record 1 = ${body[0]['sku']}"/>
<log message="Performance from record 1 = ${body[0]['Performance']}"/>
<log message="ProductGroup from record 1 = ${body[0]['ProductGroup']}"/>
The problem is that I now need to copy the above content as 3 separate SFTP transmissions (with 3 different filenames) based on the value of one of the above Map fields (in this case the ProductGroup field).
For example, with the data above I would need to send 3 files via SFTP: one would contain the record for the Jammers ProductGroup, one would contain the record for the Sammers ProductGroup, and the last one would contain the 2 records for the Bammers ProductGroup.
Basically, as a first step, I am thinking I need to somehow sort the List based on the ProductGroup field.
I can accomplish that with one line of Groovy code:
y.sort{m1, m2 -> m1.ProductGroup <=> m2.ProductGroup}
The issue now is how to iterate over the Camel ${body} and create a new SFTP transmission when the ProductGroup name changes.
I am thinking I may be able to do this by implementing some Processor logic and storing the result as header properties. That said, I am wondering if there is a simpler approach or an existing EIP for this case.
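Not an authoritative answer, but a minimal sketch of the grouping step in plain Java (callable from a Camel Processor or bean), assuming the body is a List<Map<String, String>>; grouping by ProductGroup sidesteps the sort entirely:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ProductGroupSplitter {

    // Returns one list of records per ProductGroup; each list can then become
    // its own SFTP transmission (e.g. by splitting on the resulting map's values in the route).
    public Map<String, List<Map<String, String>>> groupByProductGroup(List<Map<String, String>> body) {
        return body.stream()
                .collect(Collectors.groupingBy(record -> record.get("ProductGroup")));
    }
}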

Avoid duplicate entries when inserting Excel or CSV-like entries into a Neo4j graph

I have the following .xlsx file:
My software, regardless of its language, will return the following graph:
My software iterates line by line and on each iteration executes the following query:
MERGE (A:POINT {x:{xa},y:{ya}}) MERGE (B:POINT {x:{xb},y:{yb}}) MERGE (C:POINT {x:{xc},y:{yc}}) MERGE (A)-[:LINKS]->(B)-[:LINKS]->(C) MERGE (C)-[:LINKS]->(A)
Will this avoid inserting duplicate entries?
According to this question, yes, it will avoid writing duplicate entries.
The query above will match any existing nodes and it will avoid writing duplicates.
A good rule of thumb is: for each node that may be a duplicate, write a separate MERGE statement, and afterwards write the MERGE statements for each relationship between 2 nodes.
Update
After some experience, when using asynchronous technologies such as Node.js, or even parallel threads, you must verify that you read the next line AFTER you have inserted the previous one. The reason is that doing multiple insertions asynchronously may result in multiple nodes in your graph that are actually the same one.
In a Node.js project of mine I read the Excel file like this:
const iterateWorksheet = function(worksheet, maxRows, row, callback) {
    process.nextTick(function() {
        // Skipping the header row
        if (row == 1) {
            return iterateWorksheet(worksheet, maxRows, 2, callback);
        }
        if (row > maxRows) {
            return;
        }
        const alphas = _.range('A'.charCodeAt(0), config.excell.maxColumn.charCodeAt(0));
        let rowData = {};
        _.each(alphas, (column) => {
            column = String.fromCharCode(column);
            const item = column + row;
            const key = config.excell.columnMap[column];
            if (worksheet[item] && key) {
                rowData[key] = worksheet[item].v;
            }
        });
        // The callback is the insertion into a neo4j db
        return callback(rowData, (error) => {
            if (!error) {
                return iterateWorksheet(worksheet, maxRows, row + 1, callback);
            }
        });
    });
}
As you can see, I visit the next line only when I have successfully inserted the previous one. I have found no way yet to serialize the inserts the way most conventional RDBMSs do.
In the case of web or server applications, another UNTESTED approach is to use a queue server such as RabbitMQ or similar in order to queue the queries. Then the code responsible for the insertion reads from the queue, so the whole isolation is handled by the queue.
Furthermore, ensure that all inserts are done inside a transaction.
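For reference, a minimal sketch of running that MERGE inside an explicit transaction with the official Neo4j Java driver (4.x); the connection details and coordinates are placeholders, and the parameters use the current $name syntax:
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class TriangleInserter {

    private static final String MERGE_QUERY =
            "MERGE (a:POINT {x:$xa, y:$ya}) " +
            "MERGE (b:POINT {x:$xb, y:$yb}) " +
            "MERGE (c:POINT {x:$xc, y:$yc}) " +
            "MERGE (a)-[:LINKS]->(b)-[:LINKS]->(c) " +
            "MERGE (c)-[:LINKS]->(a)";

    public static void main(String[] args) {
        // Placeholder connection details
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // writeTransaction commits the whole MERGE as one unit of work
            session.writeTransaction(tx -> tx.run(MERGE_QUERY,
                    Values.parameters("xa", 0, "ya", 0, "xb", 1, "yb", 0, "xc", 0, "yc", 1)));
        }
    }
}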

In Spark Streaming, how to process old data and delete processed data

We are running a Spark streaming job that retrieves files from a directory (using textFileStream).
One concern we are having is the case where the job is down but files are still being added to the directory.
Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
The article below pretty much covers all your questions.
https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/
1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?
The stream reader initiates the batch window using the system clock when a job/application is launched, so all the files created before that will be ignored. Try enabling checkpointing.
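A minimal sketch of what enabling checkpointing can look like with the Java API (paths and batch interval are placeholders); on restart, getOrCreate rebuilds the streaming context from the checkpoint data instead of starting a fresh batch window:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedFileStream {

    public static void main(String[] args) throws InterruptedException {
        final String checkpointDir = "hdfs:///tmp/stream-checkpoint";   // placeholder path
        final String inputDir = "hdfs:///data/incoming";                // placeholder path

        // Recreate the context from the checkpoint if one exists, otherwise build it fresh
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            SparkConf conf = new SparkConf().setAppName("file-stream");
            JavaStreamingContext newContext = new JavaStreamingContext(conf, Durations.seconds(60));
            newContext.checkpoint(checkpointDir);
            newContext.textFileStream(inputDir)
                      .foreachRDD(rdd -> System.out.println("records in batch: " + rdd.count()));
            return newContext;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}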
2) Is there a way to delete the processed files?
Deleting the files might be unnecessary. If checkpointing works, the files that have not yet been processed are identified by Spark. If for some reason the files have to be deleted, implement a custom input format and reader (please refer to the article) to capture the file name and use this information as appropriate. But I wouldn't recommend this approach.
Is there a way to delete the processed files?
In my experience, I couldn't get the checkpointing feature to work, so I had to delete/move the processed files that entered each batch.
The way to get those files is a bit tricky, but basically we can say that they are ancestors (dependencies) of the current RDD. What I use, then, is a recursive method that crawls that structure and recovers the names of the RDDs that begin with hdfs.
/**
 * Recursive method to extract the original metadata files involved in this batch.
 * @param rdd Each RDD created for each batch.
 * @return All HDFS files originally read.
 */
def extractSourceHDFSFiles(rdd: RDD[_]): Set[String] = {
  def extractSourceHDFSFilesWithAcc(rdds: List[RDD[_]]): Set[String] = {
    rdds match {
      case Nil => Set()
      case head :: tail =>
        val name = head.toString()
        if (name.startsWith("hdfs")) {
          Set(name.split(" ")(0)) ++
            extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        } else {
          extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        }
    }
  }
  extractSourceHDFSFilesWithAcc(rdd.dependencies.map(_.rdd).toList)
}
So, in the foreachRDD method you can easily invoke it:
stream.foreachRDD { rdd =>
  val filesInBatch = extractSourceHDFSFiles(rdd)
  logger.info("Files to be processed:")
  // Process them
  // Delete them when you are done
}
The answer to your second question:
It is now possible in Spark 3. You can use the "cleanSource" option for readStream.
Thanks to the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video https://www.youtube.com/watch?v=EM7T34Uu2Gg.
After searching for many hours, I finally got the solution.
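For illustration, a rough sketch of that option with the Structured Streaming file source (Spark 3.x, Java API); the paths are placeholders, and "archive" can be swapped for "delete" if you simply want the processed files removed:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CleanSourceExample {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("clean-source").getOrCreate();

        // Move each fully processed input file into an archive directory (placeholder paths)
        Dataset<Row> lines = spark.readStream()
                .format("text")
                .option("cleanSource", "archive")
                .option("sourceArchiveDir", "hdfs:///data/archive")
                .load("hdfs:///data/incoming");

        lines.writeStream()
                .format("console")
                .option("checkpointLocation", "hdfs:///tmp/clean-source-checkpoint")
                .start()
                .awaitTermination();
    }
}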

Reading files from SFTP with a file name filter on a poll basis in Mule

I am using a Groovy script with a file name filter to read files from an SFTP location on a poll basis, using the code below:
<poll doc:name="Poll">
    <schedulers:cron-scheduler expression="${payment.schedule}"/>
    <scripting:transformer doc:name="Groovy">
        <scripting:script engine="Groovy">
            <scripting:text><![CDATA[
                def endpointBuilder = muleContext.endpointFactory.getEndpointBuilder(
                    "sftp://${sftp.user}:${sftp.password}@${sftp.host}:${sftp.port}/${sftp.root.path}")
                endpointBuilder.addMessageProcessor(new org.mule.routing.MessageFilter(
                    new org.mule.transport.file.filters.FilenameWildcardFilter('payment_*')))
                def inboundEndpoint = endpointBuilder.buildInboundEndpoint()
                inboundEndpoint.request(3000L)
            ]]></scripting:text>
        </scripting:script>
    </scripting:transformer>
</poll>
The issue I am facing here is that it reads only one file on each poll schedule.
However, my expectation is that it should read all the files which satisfy the filter condition from the SFTP location on each poll schedule.
How can I achieve this?
Thanks.
You may want to use the SFTP Inbound Endpoint with file:filename-wildcard-filter instead of relying on a Groovy script wrapped in a poll. Such as:
<sftp:inbound-endpoint host="${sftp.host}" port="${sftp.port}" path="/home/test/sftp-files" user="${sftp.user}" password="${sftp.password}">
    <file:filename-wildcard-filter pattern="payment_*"/>
</sftp:inbound-endpoint>
See the related example in the SFTP documentation.

Making log files in NiFi

I want to make log files for each processor in NiFi. I use SplitText for splitting log files and then processing them; after that, I have one log message distributed across 5 files. I want to keep this data and write it into one log file per processor (for example, I use this expression for getting the ExecuteScript processor: ${regex:toLower():contains('executescript')}).
How can I write these logs into one log file for each processor?
Should I use a native NiFi processor or do it with Groovy code?
Is it possible to get the flowfile data? I used this, but the processor seems to give a bad response:
import groovy.json.JsonBuilder
import java.nio.charset.StandardCharsets
import org.apache.nifi.processor.io.OutputStreamCallback

def flowFile = session.get()
while (flowFile != null) {
    def flowFile1 = session.create()
    flowFile1 = session.write(flowFile1, { outputStream ->
        // serialize the incoming flowfile's attributes as JSON
        def builder = new JsonBuilder(flowFile.getAttributes())
        outputStream.write(builder.toPrettyString().getBytes(StandardCharsets.UTF_8))
    } as OutputStreamCallback)
    flowFile1 = session.putAttribute(flowFile1, 'filename', 'ExecuteScriptLog')
    session.remove(flowFile)
    session.transfer(flowFile1, REL_SUCCESS)
    flowFile = session.get() // fetch the next flowfile from the queue
}
I have a workflow like this, and I want to get the connection name (for example 'executescript'), make a flowfile with this name, take all the flowfile data which is placed inside this 'executescript' queue, and write it into one file created by me (in this case 'executescript').
You can manage the logging configuration through the NIFI_HOME/conf/logback.xml file.
There you can define logging files (appenders) and which messages should be logged.
The logback manual: https://logback.qos.ch/manual/index.html
Each processor has a class name that you can see on the screen (e.g. org.apache.nifi.processors.attributes.UpdateAttribute) -
you need this info to configure a logger in logback.xml and direct it to an appender (file).
You can also define filtering in logback.xml for each appender,
so that only messages that match a regexp will be appended into it.
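A minimal sketch of what that could look like in logback.xml, assuming you want the ExecuteScript processor's messages in their own file (the path, pattern and class name are illustrative):
<!-- Dedicated appender (file) for one processor's log messages -->
<appender name="EXECUTE_SCRIPT_FILE" class="ch.qos.logback.core.FileAppender">
    <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-executescript.log</file>
    <encoder>
        <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
</appender>

<!-- Route the processor's logger to that appender only -->
<logger name="org.apache.nifi.processors.script.ExecuteScript" level="INFO" additivity="false">
    <appender-ref ref="EXECUTE_SCRIPT_FILE"/>
</logger>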
