Making log files in NiFi - groovy

I want to make a log file for each processor in NiFi. I use SplitText for splitting log files and then process the splits, so afterwards one log message ends up distributed across 5 files. I want to keep this data and write it into one log file per processor (for example, I use this expression to catch the ExecuteScript processor: ${regex:toLower():contains('executescript')}).
How can I write these logs into one log file per processor?
Should I use a native NiFi processor or do it with Groovy code?
Is it possible to get the flowfile data? I tried the code below, but the processor seems to give a bad response:
import java.nio.charset.StandardCharsets
import org.apache.nifi.processor.io.OutputStreamCallback

def flowFile = session.get()
while (flowFile != null) {
    // overwrite the flowfile content with a JSON rendering of its attributes
    flowFile = session.write(flowFile, { outputStream ->
        def builder = new groovy.json.JsonBuilder(flowFile.getAttributes())
        outputStream.write(builder.toPrettyString().getBytes(StandardCharsets.UTF_8))
    } as OutputStreamCallback)
    flowFile = session.putAttribute(flowFile, 'filename', 'ExecuteScriptLog')
    session.transfer(flowFile, REL_SUCCESS)
    flowFile = session.get()
}
I have a workflow like this, and I want to get the connection name (for example 'executescript'), make a flowfile with this name, take all the flowfile data queued in this 'executescript' connection, and write it into one file created by me (in this case named 'executescript').

The logging configuration can be managed through the NIFI_HOME/conf/logback.xml file.
You can define logging files (appenders) there and which messages should be logged.
The logback manual: https://logback.qos.ch/manual/index.html
Each processor has a class name that you can see in the UI (e.g. org.apache.nifi.processors.attributes.UpdateAttribute) -
you need this info to configure a logger in logback.xml and direct it to an appender (file).
You can also define filtering in logback.xml for each appender,
so that only messages matching a regexp are appended into it.
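For example, a minimal sketch of what that could look like in logback.xml (the appender name, file paths, rolling policy and log level are placeholders; the logger name must match the processor's class name shown in the UI, which for ExecuteScript should be org.apache.nifi.processors.script.ExecuteScript, but verify it in your own instance):
<appender name="EXECUTESCRIPT_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/nifi-executescript.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <fileNamePattern>logs/nifi-executescript_%d.log</fileNamePattern>
        <maxHistory>10</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
</appender>

<!-- route everything logged by the ExecuteScript processor class to that file only -->
<logger name="org.apache.nifi.processors.script.ExecuteScript" level="INFO" additivity="false">
    <appender-ref ref="EXECUTESCRIPT_FILE"/>
</logger>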

Related

Jmeter tearDown Thread Group can't access previous thread group used file

I have a test plan with several thread groups that write Summary Report results to the same csv file hosted on a server. This works fine using a network drive (z:) and changing jmeter.properties -> resultcollector.action_if_file_exists=APPEND.
Finally, I have a tearDown Thread Group that inserts the csv data into a SQL Server (the network drive used earlier is hosted on this server as c:\jmeter\results.csv) and then deletes the csv.
The problem is that when I run the full test plan I always get this error: "Cannot bulk load because the file "c:\jmeter\results.csv" could not be opened. Operating system error code 32"
The strange thing is that if I start the tearDown Thread Group alone it works fine: it does the bulk insert in SQL Server and then deletes the csv.
I started 2 days ago with JMeter, so I'm sure I am misunderstanding something :S
(Screenshots: Summary Report config, JDBC Request, BeanShell PostProcessor that deletes the csv, test plan structure.)
It happens because the Summary Report (as well as other listeners) keeps the file(s) open until the test ends, so you need to trigger this "close" event somehow.
Since JMeter 3.1 you're supposed to be using JSR223 Test Elements and the Groovy language for scripting, therefore replace this BeanShell PostProcessor with a JSR223 PostProcessor and use the following code:
import org.apache.jmeter.reporters.ResultCollector
import org.apache.jorphan.collections.SearchByClass

// get hold of the running engine's test plan tree via reflection
def engine = ctx.getEngine()
def test = engine.getClass().getDeclaredField('test')
test.setAccessible(true)
def testPlanTree = test.get(engine)

// find every listener (ResultCollector) in the plan and close its output file
SearchByClass<ResultCollector> listenerSearch = new SearchByClass<>(ResultCollector.class)
testPlanTree.traverse(listenerSearch)
Collection<ResultCollector> listeners = listenerSearch.getSearchResults()
listeners.each { listener ->
    def files = listener.files
    files.each { file ->
        file.value.pw.close()
    }
}
new File('z:/result.csv').delete()
More information on Groovy scripting in JMeter: Apache Groovy - Why and How You Should Use It

In Spark Streaming how to process old data and delete processed Data

We are running a Spark streaming job that retrieves files from a directory (using textFileStream).
One concern we are having is the case where the job is down but files are still being added to the directory.
Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
The article below pretty much covers all your questions.
https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/
1) Is there a solution for that? Is there a way to keep track what files have been processed and can we "force" older files to be picked up?
The stream reader initiates its batch window using the system clock when the job/application is launched, so all files created before that point are ignored. Try enabling checkpointing.
2) Is there a way to delete the processed files?
Deleting files might be unnecessary. If checkpointing works, the files that have not yet been processed are identified by Spark. If for some reason the files have to be deleted, implement a custom input format and reader (please refer to the article) to capture the file name and use this information as appropriate. But I wouldn't recommend this approach.
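For illustration, a minimal sketch of how checkpointing is usually wired up with getOrCreate (the directory paths, batch interval and processing logic are placeholders, not from the original job):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("file-stream-job")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)
  ssc.textFileStream("hdfs:///data/incoming").foreachRDD { rdd =>
    println(s"records in batch: ${rdd.count()}")         // placeholder processing
  }
  ssc
}

// rebuilds the context from the checkpoint after a restart, otherwise creates it fresh
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()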
Is there a way to delete the processed files?
In my experience, I couldn't get the checkpointing feature to work, so I had to delete/move the processed files that entered each batch.
The way to get those files is a bit tricky, but basically we can say that they are ancestors (dependencies) of the current RDD. What I use, then, is a recursive method that crawls that structure and recovers the names of the RDDs that begin with hdfs.
/**
 * Recursive method to extract the original metadata files involved in this batch.
 * @param rdd Each RDD created for each batch.
 * @return All HDFS files originally read.
 */
def extractSourceHDFSFiles(rdd: RDD[_]): Set[String] = {
  def extractSourceHDFSFilesWithAcc(rdds: List[RDD[_]]): Set[String] = {
    rdds match {
      case Nil => Set()
      case head :: tail =>
        val name = head.toString()
        if (name.startsWith("hdfs")) {
          Set(name.split(" ")(0)) ++
            extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        } else {
          extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        }
    }
  }
  extractSourceHDFSFilesWithAcc(rdd.dependencies.map(_.rdd).toList)
}
So, in foreachRDD you can easily invoke it:
stream.foreachRDD { rdd =>
  val filesInBatch = extractSourceHDFSFiles(rdd)
  logger.info("Files to be processed:")
  // Process them
  // Delete them when you are done
}
The answer to your second question:
it is now possible in Spark 3. You can use the "cleanSource" option for readStream.
Thanks to the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video https://www.youtube.com/watch?v=EM7T34Uu2Gg. After searching for many hours, I finally got the solution.
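As a rough sketch of that option in Spark 3 Structured Streaming (format and paths are placeholders; cleanSource accepts "archive", "delete" or "off", and "archive" also requires sourceArchiveDir):
// assumes an existing SparkSession named spark
val df = spark.readStream
  .format("text")
  .option("cleanSource", "archive")                    // or "delete" / "off"
  .option("sourceArchiveDir", "hdfs:///data/archived") // required when cleanSource=archive
  .load("hdfs:///data/incoming")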

Reading files from SFTP with file name filter on poll basis in mule

I am using a Groovy script with a file name filter to read files from an SFTP location on a poll basis, using the code below:
<poll doc:name="Poll">
    <schedulers:cron-scheduler expression="${payment.schedule}"/>
    <scripting:transformer doc:name="Groovy">
        <scripting:script engine="Groovy">
            <scripting:text><![CDATA[
                def endpointBuilder = muleContext.endpointFactory.getEndpointBuilder(
                    "sftp://${sftp.user}:${sftp.password}@${sftp.host}:${sftp.port}/${sftp.root.path}")
                endpointBuilder.addMessageProcessor(new org.mule.routing.MessageFilter(new org.mule.transport.file.filters.FilenameWildcardFilter('payment_*')))
                def inboundEndpoint = endpointBuilder.buildInboundEndpoint()
                inboundEndpoint.request(3000L)
            ]]></scripting:text>
        </scripting:script>
    </scripting:transformer>
</poll>
The issue I am facing is that it reads only one file on each poll schedule.
However, my expectation is that it should read all the files that satisfy the filter condition from the SFTP location on each poll schedule.
How can I achieve this?
Thanks.
You may want to use the SFTP Inbound Endpoint with file:filename-wildcard-filter instead of relying on a Groovy script wrapped in a poll. Such as:
<sftp:inbound-endpoint host="${sftp.host}" port="${sftp.port}" path="/home/test/sftp-files" user="${sftp.user}" password="${sftp.password}">
    <file:filename-wildcard-filter pattern="payment_*"/>
</sftp:inbound-endpoint>
See the related example in SFTP documentation.

Process parallel multiple read in spring Integration

I have one file which contains some records, and a second file which contains the same records but with more details. I want to process them by reading one record from the first file and searching for it in the second.
How can I read two files in parallel?
Update:
return IntegrationFlows.from(readFilefromDirectory(), fileInboundPollingConsumer())
.split(fileSplitter())
.channel(c ->c.executor(id, executor))
Here, in the first line, we are reading the first file, then splitting it and sending the splits to the executor channel. I want to know where I should write the logic to read the second file, i.e. read it from the directory and then search for the record in that Excel file.
All the parallel work in Spring Integration can be done via an ExecutorChannel.
I'd suggest using a FileSplitter for the first file and an ExecutorChannel as its output.
As for the second file... Well, I'd read it once into memory, e.g. as a Map, if you have some delimiter between a record and its details, and use that in-memory store for the incoming records.
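For example, a rough sketch of loading the second file into a Map once at startup (the comma delimiter, the id-first layout and the class name are only assumptions about your file format):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class DetailFileLoader {

    // Builds a lookup of record id -> full detail line; assumes "id,details..." lines
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> detailsById = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.forEach(line -> {
                int idx = line.indexOf(',');
                if (idx > 0) {
                    detailsById.put(line.substring(0, idx), line.substring(idx + 1));
                }
            });
        }
        return detailsById;
    }
}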
UPDATE
If the second file is very big, you can do something like this:
// 'file' is the second (details) file; 'payload' is the record read from the first file
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (line.startsWith(payload)) {
        return line;
    }
}
UPDATE 2
return IntegrationFlows.from(readFilefromDirectory(), fileInboundPollingConsumer())
.split(fileSplitter())
.channel(c ->c.executor(id, executor))
.handle("searchService", "searchMethod")
where searchService is some service @Bean (or just @Service) and searchMethod contains the logic to accept the String line and do the search in the other file. That search operation will be done in parallel for each split line.
The logic to read an Excel file is out of Spring Integration's scope; that is a fully different question.
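A rough sketch of what such a service could look like for a plain text file (the class name, file path and matching rule are made up for illustration; the Excel case would need a different reader):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

import org.springframework.stereotype.Service;

@Service
public class SearchService {

    private final File detailsFile = new File("/data/details.txt"); // placeholder path

    // called via .handle("searchService", "searchMethod") for every split line
    public String searchMethod(String payload) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(detailsFile)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.startsWith(payload)) {
                    return line;
                }
            }
        }
        return null; // no matching detail record found
    }
}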

Need Help in Setting Appender Name through a separate configuration file like app.config or web.config

I have four appenders, namely:
appender name="LogFileAppender"            // writes general logs to a file
appender name="LogDatabaseAppender"        // writes general logs to the db via an Oracle stored proc
appender name="ExceptionFileAppender"      // writes exception logs to a file
appender name="ExceptionDatabaseAppender"  // writes exception logs to the db via an Oracle stored proc
I want to have an app.config file where I can set which appender to use.
Moreover, I have methods as follows:
Method_WriteLogOnly --> which will use appender 1 or 2
Method_WriteExceptionLogs --> which will use appender 3 or 4
The problem is, I don't know, if I am using the same log4net.config.xml file for both methods, how to set the appender.
What is the best practice: to set the appender programmatically, or through another configuration place, e.g. an app.config or web.config file where I write a key-value pair (some sort of code like this) for choosing the appender?
I think you should not decide in code what appender to use: you should decide what is logged, but the people that run your application should decide how it is logged.
While I can understand that you want a separate file for exception, I wonder a bit why you want to use two database appenders. If you need to write to different tables then you can easily do this inside your stored procedure. This has several advantages: Configuration will be easier, you will only have one database connection...
Assuming that it would be good enough for you to say that "exceptions == messages with level ERROR" then you can easily create two appenders and use filters to make sure only messages with level ERROR end up in the "exception" log file.
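For instance, a minimal log4net sketch of that idea (appender types, file names and layout are placeholders; the LevelRangeFilter keeps only ERROR and above in the exception file):
<log4net>
  <appender name="LogFileAppender" type="log4net.Appender.RollingFileAppender">
    <file value="logs\general.log" />
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%date %-5level %logger - %message%newline" />
    </layout>
  </appender>
  <appender name="ExceptionFileAppender" type="log4net.Appender.RollingFileAppender">
    <file value="logs\exceptions.log" />
    <!-- only ERROR and FATAL events reach this appender -->
    <filter type="log4net.Filter.LevelRangeFilter">
      <levelMin value="ERROR" />
      <levelMax value="FATAL" />
    </filter>
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%date %-5level %logger - %message%newline" />
    </layout>
  </appender>
  <root>
    <level value="INFO" />
    <appender-ref ref="LogFileAppender" />
    <appender-ref ref="ExceptionFileAppender" />
  </root>
</log4net>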
