Flink fish tagging - log4j

I wonder if anyone has any experience with fish tagging* Flink batch runs.
*Just as a fish can be tagged and have its movement tracked, stamping log events with a common tag or set of data elements allows the complete flow of a transaction or a request to be tracked. We call this Fish Tagging.
source
Specifically, I would like to make sure that a batch ID is added to each log line that has anything to do with that particular batch execution. This will allow me to track batches in Kibana.
I don't see how using log4j's MDC would propagate through multiple Flink nodes, and using a system property lookup to inject an ID through VM params would not allow me to run batches concurrently (would it?)
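For concreteness, here is a rough sketch of the kind of tagging I have in mind, assuming the batch ID could be handed to the job via Flink's global job parameters and pushed into the MDC when each task opens (all names are illustrative, and I have not verified that this actually shows up in every task manager's logs):

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.slf4j.{Logger, LoggerFactory, MDC}

// Hypothetical operator: each parallel task puts the batch ID into its own MDC on open(),
// so a %X{batchId} conversion pattern in log4j would stamp the lines this operator logs.
class TaggedMapper extends RichMapFunction[String, String] {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(classOf[TaggedMapper])

  override def open(parameters: Configuration): Unit = {
    // batch ID set on the driver via env.getConfig.setGlobalJobParameters(...)
    val globals = getRuntimeContext.getExecutionConfig.getGlobalJobParameters
    val batchId = if (globals == null) "unknown" else globals.toMap.getOrDefault("batchId", "unknown")
    MDC.put("batchId", batchId)
  }

  override def map(value: String): String = {
    log.info("processing record") // would carry batchId if the MDC survives on this task's thread
    value
  }
}
```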
Thanks in advance for any pointers

Related

How to run apache-beam in batches on bounded data?

I am trying to understand how Apache Beam works and I'm not quite sure I do. So, I want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow, etc. Not quite every functionality is covered, but that is almost the case.
Beam treats data in two forms, bounded and unbounded. Bounded is something like a .csv file and unbounded is something like a Kafka subscription. There are different I/O read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and a trigger (a timestamp). A batch here would be all the data points in a window until a trigger is hit. For bounded datasets, however, the whole dataset is loaded into RAM (? if yes, how do I make Beam work in batches?). The output of an I/O method is a PCollection.
There are PTransforms (these are the operations I want to run on the data) that apply to each element of the PCollection. I can make these PTransforms run over a Spark or Flink cluster (this choice goes in the initial options set for the pipeline). Each PTransform emits a PCollection, and that is how we chain various PTransforms together. The end result is a PCollection that can be saved to disk.
The end of the pipeline could be a save to some file system (how does this happen when I am reading a .csv in batches?)
Please point out any lapses in my understanding.
Beam is not the same as Google Cloud Dataflow; Cloud Dataflow is a runner on top of Apache Beam that executes Apache Beam pipelines. But you can run an Apache Beam job with a local runner, not on the cloud. There are plenty of different runners that you can find in the documentation: https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for batch and streaming, and that's the point. You can pass --streaming as an argument to execute your pipeline in streaming mode; without it, it should execute in batch. But it mostly depends on your inputs; the data will just flow into the pipeline. And that's one important point: PCollections do not contain persistent data, just like RDDs in Spark.
You can apply a PTransform to part of your data; it doesn't necessarily have to apply to all the data. All the PTransforms together form the pipeline.
It really depends on where and in what format you want your output...
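For illustration, a minimal sketch of that runner/streaming wiring, written against the Beam Java SDK (the object name is arbitrary, and a non-default runner such as --runner=FlinkRunner needs the matching runner dependency on the classpath):

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.{PipelineOptionsFactory, StreamingOptions}

object BeamRunnerSketch {
  def main(args: Array[String]): Unit = {
    // Parses flags such as --runner=FlinkRunner and --streaming from the command line;
    // with no flags it falls back to the DirectRunner in batch mode.
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().as(classOf[StreamingOptions])
    println(s"streaming mode: ${options.isStreaming}")

    val pipeline = Pipeline.create(options)
    // ... apply the read, your PTransforms and the write here ...
    pipeline.run().waitUntilFinish()
  }
}
```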

How do akka-persistence-cassandra and Tags play together?

Until now I have only used akka-persistence-cassandra with the journal plugin and didn't pay too much attention to Tags.
Lately I experimented a little to understand how it works, but there are some points that really confuse me, so I'd like to ask about them here...
Now, I understand that Tags exist so that partitions can be created over tags and time buckets, to prevent hotspots in Cassandra.
When I configure the cassandra-journal plugin and event tags, I see the following tables in the Cassandra keyspace: messages, metadata, tag_scanning, tag_views, tag_write_progress...
Now if no tags are configured, the journal plugin writes only to the messages table, but if event tags are also configured, it persists both to messages and to the tag_scanning, tag_views, tag_write_progress tables...
So first question: what is the advantage of, or reason for, writing to both messages and tag_scanning, tag_views, tag_write_progress? Does this not mean more load for Cassandra?
Second question: am I doing something wrong? Do I have to turn off something in the journal plugin somehow, so it will not persist to messages...
What am I missing here?
Thx for answers...
Tags are not specific to the Cassandra journal plugin, but a generic concept for Akka Persistence/Akka Persistence Query allowing an event sourced application to tag a subset of the events and separately consume those events as a stream.
Tagging is commonly used to split/shard the work of updating a projection across several workers; see for example the CQRS Akka samples here: https://github.com/akka/akka-samples/tree/2.6/akka-sample-cqrs-scala https://github.com/akka/akka-samples/tree/2.6/akka-sample-cqrs-java
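For reference, a minimal sketch of how events are typically tagged through a WriteEventAdapter (the event type and tag name here are made up, and the adapter still has to be registered under event-adapters / event-adapter-bindings in the journal configuration):

```scala
import akka.persistence.journal.{Tagged, WriteEventAdapter}

// Hypothetical domain event, for illustration only
final case class OrderPlaced(orderId: String)

// Wraps matching events in Tagged so they also become consumable via eventsByTag("order", ...)
class OrderTaggingAdapter extends WriteEventAdapter {
  override def manifest(event: Any): String = ""

  override def toJournal(event: Any): Any = event match {
    case e: OrderPlaced => Tagged(e, Set("order")) // still written to messages; tag_views gets the extra copy
    case other          => other
  }
}
```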
The events are always stored in the messages table, even if they are tagged, so tagging rather leads to some additional writes.
The tag_write_progress and tag_scanning are implementation details related to consistency and ordering of the tagged events.
If you are not using tags in your application you can disable events-by-tag support in the plugin completely since there is some overhead attached to maintaining the related tables.
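For example, a sketch of switching it off programmatically; treat the exact setting name as an assumption to check against your plugin version's reference.conf (shown here for the akka-persistence-cassandra 1.x layout):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Assumed key for akka-persistence-cassandra 1.x; older cassandra-journal releases
// use a different config path, so verify against the reference.conf you ship with.
val overrides = ConfigFactory.parseString(
  "akka.persistence.cassandra.events-by-tag.enabled = off")

val system = ActorSystem("app", overrides.withFallback(ConfigFactory.load()))
```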

Run-time application logging during Spark execution

I have an application written for Spark in Scala. My application code is more or less ready and the job runs for around 10-15 minutes.
There is an additional requirement to provide the status of the application's execution while the Spark job is running. I know that Spark evaluates lazily and that it is not nice to retrieve data back to the driver program during Spark execution. Typically, I would be interested in providing status at regular intervals.
E.g. if there are 20 functional points configured in the Spark application, then I would like to provide the status of each of these functional points as and when they are executed or their steps are completed during Spark execution.
These incoming statuses of the functional points will then be fed to some custom user interface to display the status of the job.
Can someone give me some pointers on how this can be achieved?
There are a few things I can think of that you can do on this front.
If your job contains multiple actions, you can write a script to poll for the expected output of those actions. For example, imagine your job has 4 different DataFrame save calls. You could have your status script poll HDFS/S3 to see if the data has shown up in the expected output location yet. As another example, I have used Spark to index to Elasticsearch, and I have written status logging that polls for how many records are in the index in order to print periodic progress.
Another thing I have tried is using accumulators to keep rough track of progress and of how much data has been written. This works OK, but it is a little arbitrary when Spark updates the visible totals with information from the executors, so I haven't found it to be too helpful for this purpose generally.
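A rough sketch of that accumulator approach (the input path and per-record handling are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("progress-sketch").getOrCreate()

// Named accumulators show up in the Spark UI and can be read from the driver
val processed = spark.sparkContext.longAccumulator("recordsProcessed")

val df = spark.read.text("hdfs:///input/path") // placeholder input
df.foreach { _ => processed.add(1) }           // executors add; the driver sees the (rough) running total

// Poll processed.value from a separate driver thread to report progress while the action runs
println(s"records processed: ${processed.value}")
```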
The other approach is to poll Spark's status and metrics APIs directly. You will be able to pull all of the information backing the Spark UI into your code and do whatever you want with it. It won't necessarily tell you exactly where you are in your driver code, but if you manually work out how your driver maps to stages, you can figure that out. For reference, here is the documentation on the status API:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
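A minimal polling sketch against that REST API, assuming the driver UI is reachable on its default port 4040 and that a SparkSession named spark is in scope:

```scala
import scala.io.Source

// The same endpoints that back the Spark UI; 4040 is the default port of a live driver
val appId  = spark.sparkContext.applicationId
val stages = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/stages").mkString
println(stages) // JSON; parse it and map stage names/IDs back to your functional points
```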

Records processed metric for intermediate datasets

I have created a Spark job using the Dataset API. There is a chain of operations performed until the final result, which is collected on HDFS.
But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (could be map, groupBy, etc.); I need to know how many records there were for each of the 5 intermediate datasets. Can anybody suggest how this can be obtained at the dataset level? I guess I can find this out at the task level (using listeners) but am not sure how to get it at the dataset level.
Thanks
The nearest thing in the Spark documentation related to metrics is accumulators. However, these are reliable only for actions; the documentation mentions that accumulator updates performed inside transformations are not guaranteed (they may be applied more than once if tasks are re-executed).
You can still use count to get the latest counts after each operation. But keep in mind that it's an extra action like any other, and you need to decide whether the ingestion should run faster with fewer metrics or slower with all the metrics.
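A quick sketch of that approach (the record type and the two operations are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

final case class Txn(userId: String, amount: Double) // illustrative record type

val spark = SparkSession.builder().appName("count-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Txn("a", 10.0), Txn("b", -1.0), Txn("a", 5.0)).toDS()

val filtered = ds.filter(_.amount > 0)
println(s"records after filter: ${filtered.count()}")   // extra action, extra job

val perUser = filtered.groupByKey(_.userId).count()
println(s"groups after groupByKey: ${perUser.count()}") // another extra action
```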
Now coming back to listeners: a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like executors being added or removed, when an RDD is unpersisted, or when environment properties change. All the information you can find about the health of Spark applications and the entire infrastructure is in the Web UI.
Your requirement is more of a custom implementation. Not sure if you can achieve this. Some info regarding exporting metrics is here.
All the metrics you can collect are at job start, job end, task start, and task end. You can check the docs here.
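As a sketch, a listener that prints record counts from the task metrics as tasks finish; mapping stages back to your five dataset operations is still a manual exercise:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Prints per-task record counts; aggregate per stage/job as needed
class RecordCountListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(
        s"stage ${taskEnd.stageId}: read=${m.inputMetrics.recordsRead}, " +
          s"shuffleRead=${m.shuffleReadMetrics.recordsRead}, " +
          s"written=${m.outputMetrics.recordsWritten}")
    }
  }
}

// Register on an existing SparkSession (assumed to be named `spark`):
// spark.sparkContext.addSparkListener(new RecordCountListener)
```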
Hope the above info guides you towards finding a better solution.

Per-user stream processing

I need to process data from a set of streams, applying the same processing to each stream independently of the other streams.
I've already seen frameworks like Storm, but it appears that they allow the processing of static streams only (e.g. tweets from Twitter), while I need to process the data of each user separately.
A simple example of what I mean could be a system where each user can track his GPS location and see statistics like average velocity, acceleration, burnt calories and so on in real time. Of course, each user would have his own stream(s), and the system should process the stream of each user separately, as if each user had his own dedicated topology processing his data.
Is there a way to achieve this with a framework like storm, spark streaming or samza?
It would be even better if python is supported, since I already have a lot of code I'd like to reuse.
Thank you very much for your help
Using Storm, you can group data using the fields-grouping connection pattern if you have a user-id in your tuples. This ensures that data is partitioned by user-id, and thus you get logical substreams. Your code only needs to be able to process multiple groups/substreams, because a single bolt instance gets multiple groups to process. But Storm supports your use case for sure. It can also run Python code.
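A rough sketch of that wiring, assuming Storm 2.x package names; the spout and bolt are hypothetical stand-ins for your GPS source and per-user statistics logic:

```scala
import java.util.{Map => JMap}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Hypothetical spout: a real one would read GPS fixes from a queue or socket
class GpsSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _

  override def open(conf: JMap[String, AnyRef], ctx: TopologyContext, coll: SpoutOutputCollector): Unit =
    collector = coll

  override def nextTuple(): Unit = {
    collector.emit(new Values("user-42", Double.box(12.34), Double.box(56.78)))
    Thread.sleep(100)
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("userId", "lat", "lon"))
}

// Hypothetical bolt: because of the fields grouping below, each task sees every tuple
// for the user-ids hashed to it, i.e. a logical per-user substream
class UserStatsBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    println(s"user ${tuple.getStringByField("userId")}: update velocity/acceleration stats here")

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}

val builder = new TopologyBuilder()
builder.setSpout("gps-spout", new GpsSpout())
builder.setBolt("user-stats", new UserStatsBolt(), 4)
  .fieldsGrouping("gps-spout", new Fields("userId")) // partition by user-id
```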
In Samza, similar to Storm, one would partition the individual streams on some user ID. This would guarantee that the same processor would see all the events for some particular user (as well as other user IDs that the partition function [a hash, for instance] assigns to that processor). Your description sounds like something that would more likely run on the client's system rather than being a server-side operation, however.
Non-JVM language support has been proposed for Samza, but not yet implemented.
You can use WSO2 Stream Processor to achieve this. You can partition the input stream by user name and process the events pertaining to each user separately. The processing logic has to be written in Siddhi QL, which is a SQL-like language.
WSO2 SP also has a Python wrapper; it will allow you to perform administrative tasks such as submitting and editing jobs. But you can't write the processing logic in Python code.
