Accessing lineage of a NiFi flow file - groovy

I'm developing a kind of error handling for NiFi flow files, e.g. for the case where a database sub-system refuses to write the data from a flow file because the data is not as expected, since the source system of that data is missing some master data.
This error handling writes the data into a MongoDB together with more information about what went wrong.
One piece of that additional information is a kind of stack trace for the flow file, meaning its data lineage. To achieve this I wrote an InvokeScriptedProcessor with a Groovy script.
Here is the important part of the script:
ArrayList getStacktrace(flowfileUuid) {
    def lineage = this.provenanceRepository.createLineageQuery(flowfileUuid)
    def lineageData = this.provenanceRepository.getLineageData(lineage.id)

    if (lineageData.results == null || lineageData.results.nodes.size() == 0) {
        println "cannot find stacktrace for ${flowfileUuid}."
        return []
    }

    def eventIds = lineageData.results.nodes.findAll { n -> n.type == 'EVENT' }.collect { n -> n.id }.sort()
    def provenanceEvents = []
    for (eventId in eventIds) {
        provenanceEvents << this.provenanceRepository.getProvenanceEvent(eventId).provenanceEvent.componentName
    }

    this.provenanceRepository.deleteLineageQuery(lineage.id)
    return provenanceEvents
}
For createLineageQuery I'm POSTing to the NiFi API at /nifi-api/provenance/lineage, with the UUID of the flow file in the body. The result contains, among other things, the ID of the query. I use this ID in getLineageData; the response also has a finished property, and I wait until the query is finished.
With this lineage data I call getProvenanceEvent for each event and write the name of the component (processor) into an array.
After that I call deleteLineageQuery, as stated in the documentation.
So this would be my stack trace.
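For reference, the REST round trip described above could be sketched as standalone Java (outside of NiFi, using the JDK 11 HttpClient and Jackson). The endpoint paths and the lineage/finished/results fields are the ones mentioned above; the request-body field names are from memory and should be double-checked against the NiFi REST docs, and the base URL is an assumption:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LineageQuerySketch {

    private static final String BASE = "http://localhost:8080/nifi-api"; // assumed NiFi URL
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final ObjectMapper JSON = new ObjectMapper();

    // POST the lineage query for a flow file UUID and return the query id.
    static String createLineageQuery(String flowFileUuid) throws Exception {
        String body = "{\"lineage\":{\"request\":{\"lineageRequestType\":\"FLOWFILE\","
                + "\"uuid\":\"" + flowFileUuid + "\"}}}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(BASE + "/provenance/lineage"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        JsonNode response = JSON.readTree(HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body());
        return response.path("lineage").path("id").asText();
    }

    // Poll the query until 'finished' is true, then return its results (or null on timeout).
    static JsonNode waitForLineage(String queryId, int maxAttempts) throws Exception {
        for (int i = 0; i < maxAttempts; i++) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(BASE + "/provenance/lineage/" + queryId))
                    .GET()
                    .build();
            JsonNode lineage = JSON.readTree(HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body())
                    .path("lineage");
            if (lineage.path("finished").asBoolean(false)) {
                return lineage.path("results");
            }
            Thread.sleep(500L); // the events may simply not be indexed yet
        }
        return null;
    }
}

Even with a generous number of attempts the results can still come back empty right after the flow file was processed, which matches the indexing delay described in the accepted answer below.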
The problem now is that the lineage data is empty when the flow file first hits this InvokeScriptedProcessor. I tried a lot of things, like waiting before the query, but it doesn't help.
The odd thing is that the lineage data is not empty when I replay the flow file through this processor.
So the behavior is not as deterministic as I would expect.
Sometimes the lineage data is not empty even when I'm processing the flow file for the first time.
I also tried the same requests with Fiddler, and there it worked every time.
Is there a problem with my approach?
I'm currently using NiFi 1.6.0.
EDIT:
I'll accept Bryan's answer as the solution.
I'll investigate it as soon as I have the time, but it sounds correct. That said, I tried my solution with NiFi 1.8.0 and it works as intended, so for now I'm fine with the way I implemented it initially, but I'll improve it along the lines of Bryan's suggestion.

I'm not totally sure what the problem is, but in general provenance data is not really meant to be accessed from a processor, which is why there is no API provided by the session or context that lets you retrieve provenance events; only creating events is allowed.
In order to run a provenance query the events need to be indexed, and there are no guarantees about when the indexing will take place relative to when the flow file is being processed. So it is possible the events are not visible yet.
A ReportingTask is the intended way to access provenance events, and it can be used to push them out of NiFi to some external system for longer-term storage.
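For illustration, here is a minimal sketch of such a ReportingTask (the NiFi types are real; the class name, the batch size, and the writeToMongo sink are made up). It pages through the provenance repository and ships the fields relevant to the lineage/stack-trace use case to an external store:

import java.io.IOException;
import java.util.List;

import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.EventAccess;
import org.apache.nifi.reporting.ReportingContext;

public class ProvenanceToMongoReportingTask extends AbstractReportingTask {

    private volatile long lastEventId = -1L; // resume point between runs (could be kept in state)

    @Override
    public void onTrigger(final ReportingContext context) {
        final EventAccess eventAccess = context.getEventAccess();
        try {
            final List<ProvenanceEventRecord> events = eventAccess.getProvenanceEvents(lastEventId + 1, 1000);
            for (final ProvenanceEventRecord event : events) {
                // The raw record carries the component id/type; a display name can be resolved
                // from context.getEventAccess().getControllerStatus() if needed.
                writeToMongo(event.getFlowFileUuid(), event.getComponentId(),
                        event.getComponentType(), event.getEventType().name());
                lastEventId = event.getEventId();
            }
        } catch (IOException e) {
            getLogger().error("Failed to read provenance events", e);
        }
    }

    private void writeToMongo(String flowFileUuid, String componentId, String componentType, String eventType) {
        // hypothetical sink; use your MongoDB client of choice here
    }
}

NiFi itself ships reporting tasks built on the same mechanism (for example SiteToSiteProvenanceReportingTask), which are probably a better starting point for production use.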

Related

Handling Out-Of-Order Event Windowing in Apache Beam from a Multitenant Kafka Topic

I've been mulling over how to solve a given problem in Beam and thought I'd reach out to a larger audience for some advice. At present things only seem to be working sporadically, and I was curious if someone could provide a sounding board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order and need to be windowed in event time according to another property found on the records, eventually emitting the contents of those windows and writing them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
    .apply("Read Events from Kafka",
        KafkaIO
            .read<String, Log>()
            .withBootstrapServers(options.brokerUrl)
            .withTopic(options.incomingEventsTopic)
            .withKeyDeserializer(StringDeserializer::class.java)
            .withValueDeserializerAndCoder(
                SpecificAvroDeserializer<Log>()::class.java,
                AvroCoder.of(Log::class.java)
            )
            .withReadCommitted()
            .commitOffsetsInFinalize()
            // Set the watermark to use a specific field for event time
            .withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
            .withConsumerConfigUpdates(
                ImmutableMap.of<String, Any?>(
                    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
                    ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
                    "schema.registry.url", options.schemaRegistryUrl
                )
            ).withoutMetadata()
    )
    .apply("Logging Incoming Logs", ParDo.of(Events.log()))
    .apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
    .apply("Partition Logs by Source",
        // This is a custom function that will partition incoming records by a specific
        // datasource field
        Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
    )

dataSources.forEach { dataSource ->
    // Store a reference to the data source name to avoid serialization issues
    val sourceName = dataSource.name
    val tempDirectory = Directories.resolveTemporaryDirectory(options.output)

    // Grab all of the events for this specific partition and apply the source-specific windowing
    // strategies
    partitionedEvents[dataSource.partition]
        .apply(
            "Building Windows for $sourceName",
            SourceSpecificWindow.of<KV<String, Log>>(dataSource)
        )
        .apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
        .apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
        .apply(
            "Writing Windowed Logs to Files for $sourceName",
            FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
                .withNumShards(1)
                .by { row -> "${row.key}/${sourceName}" }
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
                .to(options.output)
                .withNaming { partition -> Files.name(partition) }
                .withTempDirectory(tempDirectory)
        )
}
In a simpler, bulleted form, it might look like this:
- Read records from a single Kafka topic
- Key all records by their tenant
- Partition the stream by another event property
- Iterate through the known partitions from the previous step
- Apply custom windowing rules for each partition (related to the datasource, custom window rules)
- Group windowed items by key (tenant)
- Write tenant-key pair groupings to GCS via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but a few minutes later you'll get events for tenant2 in the same partition, etc.). This would cause the watermark to bounce back and forth in time, as each incoming record's timestamp is not guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
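For context on what drives that watermark: the WatermarkPolicy used in the pipeline above plugs into KafkaIO's TimestampPolicy hook. Its implementation is not shown in the question, so the following is only a generic Java sketch of such a policy (the Log type, its getEventTime() accessor, and the ten-minute skew are assumptions), holding the per-partition watermark a fixed delay behind the maximum event time seen:

import java.util.Optional;

import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Assigns event time from a field on the record and keeps the watermark a fixed
// amount behind the maximum event time seen on this Kafka partition.
public class FieldBasedTimestampPolicy extends TimestampPolicy<String, Log> {

    private static final Duration MAX_DELAY = Duration.standardMinutes(10); // assumed skew

    private Instant maxEventTime;

    public FieldBasedTimestampPolicy(Optional<Instant> previousWatermark) {
        this.maxEventTime = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Log> record) {
        Instant eventTime = new Instant(record.getKV().getValue().getEventTime()); // assumed accessor
        if (eventTime.isAfter(maxEventTime)) {
            maxEventTime = eventTime;
        }
        return eventTime;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        // A late tenant only holds the watermark back by MAX_DELAY; anything older counts as late data.
        return maxEventTime.minus(MAX_DELAY);
    }
}

Whether late tenants are dropped or merely delayed hinges on this hook together with the allowed lateness; the built-in CustomTimestampPolicyWithLimitedDelay behaves essentially like this sketch.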
The custom windowing function is extremely simple and was aimed to emit a single window once the allowed lateness and windowing duration has elapsed:
object SourceSpecificWindow {
    fun <T> of(dataSource: DataSource): Window<T> {
        return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
            .triggering(Never.ever())
            .withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes()
    }
}
However, it seemed inconsistent since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? The data can come in out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and covers multiple tenants, but the aim is to ensure that one tenant that keeps up to date won't drown out tenants whose data arrives further in the past.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each watermark progresses on its own)? Is that where a SplittableDoFn would come in? I'm running on the SparkRunner, which doesn't appear to support that, but it seems as though it would be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent about the end result. Eventually the logic required for this specific use case extended far beyond the built-in capabilities of Apache Beam, primarily in the area of windowing and the governance of time.
The solution we landed on was to switch our preferred streaming technology from Apache Beam to Apache Flink, which as you might imagine was quite a leap. Flink's state-centric nature allowed us to handle our use cases more easily and to define custom eviction criteria (and ordering) around windowing, at the cost of losing a layer of abstraction.
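For readers who want a concrete feel for that style, here is a minimal Flink sketch (not the pipeline the answer refers to; the Log type, its accessors, and the five-minute window are assumptions) of per-tenant buffering with an explicit event-time timer, applied after keying the stream by tenant:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers logs per tenant key and emits the buffered contents when the
// event-time timer for the end of an element's window fires.
public class TenantWindower extends KeyedProcessFunction<String, Log, List<Log>> {

    private static final long WINDOW_MS = 5 * 60 * 1000L; // assumed window size

    private transient ListState<Log> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered-logs", Log.class));
    }

    @Override
    public void processElement(Log log, Context ctx, Collector<List<Log>> out) throws Exception {
        buffer.add(log);
        // assumes event-time timestamps were assigned by the upstream watermark strategy;
        // timers are de-duplicated per key and timestamp, so re-registering is cheap
        long windowEnd = (ctx.timestamp() / WINDOW_MS + 1) * WINDOW_MS;
        ctx.timerService().registerEventTimeTimer(windowEnd);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<Log>> out) throws Exception {
        List<Log> window = new ArrayList<>();
        for (Log log : buffer.get()) {
            window.add(log);
        }
        if (!window.isEmpty()) {
            out.collect(window);
        }
        buffer.clear();
    }
}

A custom Trigger or Evictor on a regular Flink window is another common route; the KeyedProcessFunction variant simply makes the state and the timing fully explicit.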

How to transfer data from Kafka to Cassandra using Nifi?

I want to collect data from Kafka into Cassandra using NiFi. I created a flow like this for that purpose.
My database connection configuration is like this:
These are my configurations for the ConvertJSONToSQL processor:
I encounter the following error on my ConvertJSONToSQL processor:
ConvertJSONToSQL[id=d25a7e27-0167-1000-2d9a-2c969b33482a] ConvertJSONToSQL[id=d25a7e27-0167-1000-2d9a-2c969b33482a] failed to process session due to null; Processor Administratively Yielded for 1 sec: java.lang.NullPointerException
Note: I added the dbschema driver JAR to the NiFi library.
What do you think I should do to solve this problem?
Based on the available information it is difficult to troubleshoot the error; the most likely reason for ConvertJSONToSQL to fail is invalid JSON. Just one point from the documentation:
The incoming FlowFile is expected to be "flat" JSON message, meaning that it consists of a single JSON element and each field maps to a simple type.
I cannot see what you did in the AttributesToJSON processor, but I believe Twitter will typically return nested JSON, and you might not have flattened it enough.
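As a purely hypothetical illustration (these field names are made up, not the actual Twitter schema), a nested payload such as
{"user": {"name": "alice", "followers": 42}, "text": "hello"}
would have to be flattened to something like
{"user_name": "alice", "user_followers": 42, "text": "hello"}
before ConvertJSONToSQL can map each field to a table column.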
A simple, generic way to troubleshoot this is to start the processors from the top and inspect the queue before/after each processor until you see something you don't expect.
With this you should be able to pinpoint the problem exactly, and if needed you can use the information discovered this way to create a reproducible example and ask a more detailed question.

Spring batch remote chunking - returning data from slave node

I am using spring batch remote chunking for distributed processing.
When a slave node is done processing a chunk, I would like to return some additional data along with the ChunkResponse.
For example, if a chunk consists of 10 user IDs, I would like to return in the response how many of them were processed successfully.
The response could include some other data as well. I have spent considerable time trying to figure out ways to achieve this, but without any success.
For example, I have tried to extend the ChunkResponse class and add some additional fields to it, and then extend ChunkProcessorChunkHandler and return the customized ChunkResponse from it. But I am not sure if this is the proper approach.
I also need a way on the master node to read the ChunkResponse in some callback. I guess I could use the afterChunk(ChunkContext) method of ChunkListener, but I couldn't find a way to get the ChunkResponse from the ChunkContext in that method.
So to sum it up, I would like to know how I can pass data from slave to master per chunk, and how the master node can read this data.
Thanks a lot.
EDIT
In my case the master node reads user records and the slave nodes process these records. At the end of the job the master needs to take conditional action based on whether processing of a particular user failed or succeeded. Failure or success on the slave node is not based on any exception thrown there but on some business rules. And there is other data that the master needs to know about, for example how many emails were sent for each user. If I were using remote partitioning I could use the jobContext to put and get this data, but in remote chunking the jobContext is not available. So I was wondering if I could send some additional data back from slave to master along with the ChunkResponse.

Learning Spark Streaming

I am learning Spark Streaming using the book "Learning Spark Streaming". In the book I found the following in a section talking about DStreams, RDDs, and blocks/partitions.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head against this and simply can't understand what the author is talking about, although I feel like I should. Can someone give me some pointers?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has two working modes: one where the producing side delivers one message at a time, and another where the receiver may deliver many messages at once (bulk).
In the one-message-at-a-time mode, Spark is responsible for buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
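To make the two modes concrete, a minimal custom receiver might look like this (sketched in Java; fetchBatchFromSource() is a placeholder for a real source and is assumed to block until data is available, and the storage level is just an example):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class BulkReceiver extends Receiver<String> {

    public BulkReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // onStart must not block, so the polling loop runs on its own thread
        new Thread(this::poll).start();
    }

    @Override
    public void onStop() {
        // the polling loop checks isStopped()
    }

    private void poll() {
        while (!isStopped()) {
            List<String> batch = fetchBatchFromSource(); // placeholder for the real source call
            if (batch.isEmpty()) {
                continue;
            }
            if (batch.size() == 1) {
                // single-item mode: Spark buffers items and cuts blocks on the block interval
                store(batch.get(0));
            } else {
                // bulk mode: the producer has already grouped the data; Spark stores it directly
                store(batch.iterator());
            }
        }
    }

    private List<String> fetchBatchFromSource() {
        return new ArrayList<>(); // hypothetical source
    }
}

The store(item) path corresponds to the one-message-at-a-time mode where Spark does the block-building itself, while store(iterator) (or the Scala store(ArrayBuffer)) hands over a producer-built group of records, which is the bulk mode described above.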
I agree with you that the paragraph is convoluted and might not convey the message as clearly as we would like. I'll take care of improving it.
Thanks for your feedback!

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved sorted by that field, which can easily be done client side.
The issue is that, if the database grows big, I'm worried that it will either take too long to return or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that keeps a cache of the top N objects and updates it with a listener whenever the score of any object changes.
Then the client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a JSON with the top N levels.
Is this reasonable? If so, how can I approach it? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side, but I'd really like to have this, both for performance and for learning purposes.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not expose a child count, so with my newbie JavaScript knowledge I didn't find a way to implement it. Second, and most important, I'm pretty sure it won't scale up to millions of entries (it will have at most 10K), and Firebase has rules for sorted-read optimization. For more information please check out this link.
Also, I'll post a simple code snippet for retrieving data from your database via an HTTP request using Cloud Functions, in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request method checks, etc.
exports.request_all_levels = functions.https.onRequest((req, res) => {
    const ref = admin.database().ref('CustomLevels');
    ref.once('value').then(function(snapshot) {
        res.status(200).send(JSON.stringify(snapshot.val()));
    });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, their resources are still limited. So reading a huge list of items just to determine the latest 10 is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list of items for the latest 10 on every write, which is an O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
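As a language-agnostic sketch of that bookkeeping (plain Java with made-up names; in practice this logic would run inside a Realtime Database write trigger in a Cloud Function), the incremental "top N" update looks roughly like this:

import java.util.NavigableMap;
import java.util.TreeMap;

// Keeps only the N highest-scoring levels, updated incrementally on each write.
public class TopLevelsCache {

    private final int capacity;
    // score -> levelId; a real implementation would also handle duplicate scores
    private final NavigableMap<Long, String> byScore = new TreeMap<>();

    public TopLevelsCache(int capacity) {
        this.capacity = capacity;
    }

    // Called once per write: at most capacity + 1 entries are ever considered.
    public void onWrite(String levelId, long score) {
        byScore.put(score, levelId);
        if (byScore.size() > capacity) {
            byScore.pollFirstEntry(); // drop the lowest-scoring entry
        }
    }

    // Highest scores first, ready to be serialized to JSON for the HTTP endpoint.
    public NavigableMap<Long, String> top() {
        return byScore.descendingMap();
    }
}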
