How to transfer data from Kafka to Cassandra using NiFi?

I want to collect data from Kafka using NiFi and write it into Cassandra. For this, I created a flow like this.
My database connection configuration is like this:
This is my configurations for my ConvertJsonToSQL processor:
I encounter the following error on my ConvertJsonToSQL processor.
ConvertJSONToSQL[id=d25a7e27-0167-1000-2d9a-2c969b33482a] ConvertJSONToSQL[id=d25a7e27-0167-1000-2d9a-2c969b33482a] failed to process session due to null; Processor Administratively Yielded for 1 sec: java.lang.NullPointerException
Note: I added the DbSchema driver jar to the NiFi library.
What do you think I should do to solve this problem?

Based on the available information it is difficult to troubleshoot the error; the most likely reason for ConvertJSONToSQL to fail is invalid JSON. Just one point from the documentation:
The incoming FlowFile is expected to be "flat" JSON message, meaning that it consists of a single JSON element and each field maps to a simple type.
I cannot see what you did in the AttributesToJSON processor, but I believe Twitter will typically return nested JSON (a tweet object containing, say, a nested user object), and you might not have flattened it enough for each field to map to a simple type.
A simple, generic way to troubleshoot this is to start the processors from the top and inspect the queue before/after each processor until you see something you don't expect.
With this you should be able to pinpoint the problem exactly, and if needed you can use the information discovered in this way to create a reproducible example and ask a more detailed question.

Related

Accessing lineage of a NiFi flow file

I'm developing some kind of error handling for flow files in NiFi, e.g. for when a database sub-system refuses to write the data from a flow file because the data is not as expected, for instance because the source system of the data is missing some master data.
So this error handling writes the data into a MongoDB with additional information about what went wrong.
One piece of that additional information is a kind of stack trace for the flow file, i.e. its data lineage. For this purpose I wrote an InvokeScriptedProcessor with a Groovy script to achieve this.
Here is the important part of the script:
ArrayList getStacktrace(flowfileUuid) {
    // Ask the NiFi REST API to build the lineage graph for this flow file.
    def lineage = this.provenanceRepository.createLineageQuery(flowfileUuid)
    def lineageData = this.provenanceRepository.getLineageData(lineage.id)
    if (lineageData.results == null || lineageData.results.nodes.size() == 0) {
        println "cannot find stacktrace for ${flowfileUuid}."
        return []
    }
    // Collect the provenance event ids from the lineage graph, oldest first.
    def eventIds = lineageData.results.nodes.findAll { n -> n.type == 'EVENT' }.collect { n -> n.id }.sort()
    def provenanceEvents = []
    for (eventId in eventIds) {
        // Resolve each event to the name of the component (processor) that emitted it.
        provenanceEvents << this.provenanceRepository.getProvenanceEvent(eventId).provenanceEvent.componentName
    }
    // Clean up the lineage query on the server, as the documentation requires.
    this.provenanceRepository.deleteLineageQuery(lineage.id)
    return provenanceEvents
}
For createLineageQuery I'm POSTing to the NiFi API at /nifi-api/provenance/lineage with the UUID of the flow file in the body. The response contains, among other things, the ID of the query. I use this ID in getLineageData; there is also a finished property, and I wait until the query is finished.
With this lineage data I call getProvenanceEvent for each event and write the name of the component (processor) into an array.
After that I deleteLineageQuery as stated in the documentation.
So this would be my stack trace.
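For reference, a minimal sketch of the createLineageQuery / deleteLineageQuery round trip, shown here in Scala with java.net.http (the base URL and the exact JSON layout of the lineage request are assumptions on my side; my actual script does the same calls in Groovy):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LineageQuerySketch {
  // Assumed base URL of an unsecured local NiFi instance.
  private val base = "http://localhost:8080/nifi-api"
  private val client = HttpClient.newHttpClient()

  // POST /provenance/lineage with the flow file UUID; the response JSON carries the query id.
  def createLineageQuery(flowFileUuid: String): String = {
    val body = s"""{"lineage":{"request":{"uuid":"$flowFileUuid","lineageRequestType":"FLOWFILE"}}}"""
    val request = HttpRequest.newBuilder(URI.create(s"$base/provenance/lineage"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    // The query id has to be parsed out of this JSON before polling GET /provenance/lineage/{id}.
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  // DELETE /provenance/lineage/{id} once the results have been read, as the documentation asks.
  def deleteLineageQuery(queryId: String): Unit = {
    val request = HttpRequest.newBuilder(URI.create(s"$base/provenance/lineage/$queryId"))
      .DELETE()
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString())
  }
}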
The problem now is that the lineage data is empty when the flow file first hits this InvokeScriptedProcessor. I tried a lot of things, like adding waits, but it doesn't help.
Now the odd thing is that the lineage data is not empty when I replay the flow file through this processor.
So the behavior is not deterministic, which is not what I expect.
Sometimes the lineage data is not empty even when I'm processing the flow file for the first time.
I also tried the same requests with Fiddler, and there it worked every time.
Is there a problem with my approach?
I'm currently using NiFi 1.6.0.
EDIT:
I'll accept Bryan's answer as the solution.
I'll investigate it as soon as I have the time, but it sounds correct. Nevertheless, I tried my solution with NiFi 1.8.0 and it works as intended. So for now I'm fine with the way I implemented it in the first step, but I'll improve my solution with Bryan's suggestion.
I'm not totally sure what the problem is, but in general provenance data is not really meant to be accessed from a processor, which is why there is no API provided by the session or context that lets you retrieve provenance events; only creating events is allowed.
In order to run a provenance query the events need to be indexed, and there are no guarantees about when the indexing takes place relative to when the flow file is being processed. So it is possible the events are not visible yet.
A ReportingTask is the intended way to access provenance events and can be used to push them out of NiFi to some external system for longer term storage.
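As a rough sketch of what that could look like (the class name, batch size and the publish helper are placeholders, not NiFi API; the relevant hooks are AbstractReportingTask, ReportingContext.getEventAccess() and getProvenanceEvents()):

import org.apache.nifi.reporting.{AbstractReportingTask, ReportingContext}
import scala.jdk.CollectionConverters._

class ProvenancePublisher extends AbstractReportingTask {
  // Track how far we have read so each run only sees new events.
  private var lastEventId: Long = 0L

  override def onTrigger(context: ReportingContext): Unit = {
    val events = context.getEventAccess.getProvenanceEvents(lastEventId, 1000).asScala
    events.foreach { event =>
      // Component id, flow file UUID and event type are all on the event record.
      publish(event.getFlowFileUuid, event.getComponentId, event.getEventType.name())
      lastEventId = math.max(lastEventId, event.getEventId + 1)
    }
  }

  // Placeholder for pushing the events to an external store such as MongoDB.
  private def publish(flowFileUuid: String, componentId: String, eventType: String): Unit = ()
}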

Run time Application Logging during spark execution

I have an application written for Spark using Scala language. My application code is kind of ready and the job runs for around 10-15 mins.
There is an additional requirement to provide the status of the application while the Spark job is executing at run time. I know that Spark evaluates lazily and that it is not nice to retrieve data back to the driver program during execution. Typically, I would be interested in providing status at regular intervals.
E.g. if there are 20 functional points configured in the Spark application, then I would like to provide the status of each of these functional points as and when they are executed or their steps complete during Spark execution.
The incoming status of these functional points will then be fed to a custom user interface to display the status of the job.
Can someone give me some pointers on how this can be achieved?
There are a few things I can think of that you can do on this front.
If your job contains multiple actions, you can write a script to poll for the expected output of those actions. For example, imagine your job has 4 different DataFrame save calls. You could have your status script poll HDFS/S3 to see if the data has shown up in the expected output location yet. As another example, I have used Spark to index to ElasticSearch, and I have written status logging that polls for how many records are in the index to print periodic progress.
Another thing I have tried before is using Accumulators to keep rough track of progress and how much data has been written. This works OK, but it is a little arbitrary when Spark updates the visible totals with information from the executors, so I haven't found it to be too helpful for this purpose in general.
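A rough sketch of that accumulator idea (the input/output paths and record handling are made up for illustration; longAccumulator and its value are the actual Spark API):

import org.apache.spark.sql.SparkSession

object AccumulatorProgress {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("progress-demo").getOrCreate()
    val sc = spark.sparkContext

    // Named accumulator, also visible in the Spark UI.
    val processed = sc.longAccumulator("recordsProcessed")

    val lines = sc.textFile("hdfs:///data/input")        // hypothetical input path
    val cleaned = lines.map { line =>
      processed.add(1)                                   // incremented on the executors
      line.trim
    }
    cleaned.saveAsTextFile("hdfs:///data/output")        // hypothetical output path

    // The driver only sees what executors have reported back so far, so the value
    // is approximate while a stage is still running and exact once the action completes.
    println(s"records processed: ${processed.value}")
  }
}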
Another approach you could take is to poll Spark's status and metrics APIs directly. You will be able to pull all of the information backing the Spark UI into your code and do with it whatever you want. It won't necessarily tell you exactly where you are in your driver code, but if you manually figure out how your driver maps to stages, you can work that out. For reference, here is the documentation on polling the status API:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
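If you would rather stay inside the application than poll the REST endpoint, SparkStatusTracker exposes similar stage-level progress programmatically; a small sketch (the polling interval and log format are arbitrary choices):

import org.apache.spark.SparkContext

object StatusPoller {
  // Starts a daemon thread that reports stage progress every few seconds.
  def start(sc: SparkContext): Thread = {
    val t = new Thread(() => {
      while (!sc.isStopped) {
        val tracker = sc.statusTracker
        for {
          stageId <- tracker.getActiveStageIds
          info    <- tracker.getStageInfo(stageId)
        } println(s"stage $stageId: ${info.numCompletedTasks}/${info.numTasks} tasks done")
        Thread.sleep(5000)
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }
}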

Learning Spark Streaming

I am learning Spark Streaming using the book "Learning Spark Streaming". In the book I found the following in a section talking about DStreams, RDDs, and blocks/partitions.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head against this and simply can't understand what the author is talking about, although I feel like I should understand it. Can someone give me some pointers on that?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has 2 working modes: one where the producing side delivers one message at a time, and the other where the receiver may deliver many messages at once (bulk).
In the one-message-at-a-time mode, Spark is responsible for buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
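A toy Scala sketch of a custom receiver using both variants (the fabricated in-memory "source" and the sleep are only for illustration):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ToyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("toy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // Bulk mode: hand Spark a whole collection, stored as one data block.
          store(ArrayBuffer("a", "b", "c"))

          // One-at-a-time mode: Spark itself buffers items into blocks per block interval.
          store("single-item")

          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = () // nothing to clean up in this sketch
}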
I agree with you that the paragraph is convoluted and might not convey the message as clearly as we would like. I'll take care of improving it.
Thanks for your feedback!

What is the simplest way to write to Kafka from a Spark stream

I would like to write Spark Streaming data to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt the approach from the sample above, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task head on, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory / sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send that partition's records.
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
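For illustration, a minimal Scala sketch of that "producer per partition" pattern (broker address, topic name and the String value type are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaWriter {
  def write(stream: DStream[String], topic: String): Unit = {
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Created on the executor, once per partition, so nothing
        // non-serializable has to travel from the driver.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        try {
          records.foreach(value => producer.send(new ProducerRecord[String, String](topic, value)))
        } finally {
          producer.close()
        }
      }
    }
  }
}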

Parallelism of Streams in Spark Streaming Context

I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't throw them all into one topic in a simple way without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it is only running the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome, if there is not an easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (I couldn't set it programmatically for some reason).
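For reference, the configuration involved looks roughly like this in Scala (the pool name and file path are examples; the fairscheduler.xml file itself still has to exist with matching pool definitions, and as noted above I had to write the file explicitly rather than rely on the programmatic setting):

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("multi-stream")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Jobs submitted from threads that set different pools can then run
    // concurrently instead of queueing up FIFO.
    sc.setLocalProperty("spark.scheduler.pool", "streams")
    // ... define and start the streaming context / jobs here ...
  }
}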
