Spark: get multiple DStreams out of a single DStream - apache-spark

Is it possible to get multiple DStreams out of a single DStream in Spark?
My use case is as follows: I am getting a stream of log data from an HDFS file.
Each log line contains an id (id=xyz).
I need to process the log lines differently based on the id.
So I was trying to create a different DStream for each id from the input DStream.
I couldn't find anything related in the documentation.
Does anyone know how this can be achieved in Spark, or can you point me to a link for this?
Thanks

You cannot split a single DStream into multiple DStreams.
The best you can do is:
1. Modify your source system to have different streams for different IDs; then you can have different jobs to process the different streams.
2. If your source cannot be changed and provides a stream that mixes IDs, then you need to write custom logic to identify the ID and perform the appropriate operation.
I would always prefer #1 as that is the cleaner solution, but there are exceptions for which #2 needs to be implemented; a sketch of #2 follows.
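A minimal sketch of #2 with the DStream API, assuming the log lines literally contain "id=xyz" and the ids of interest are known up front; the HDFS path and the id values are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SplitById {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SplitById")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical HDFS directory monitored for new log files.
    val lines = ssc.textFileStream("hdfs:///logs/incoming")

    // Pull the id out of lines of the form "... id=xyz ...".
    val idPattern = """id=(\w+)""".r
    def extractId(line: String): Option[String] =
      idPattern.findFirstMatchIn(line).map(_.group(1))

    // One filtered DStream per id; both are views over the same input DStream.
    val xyzStream = lines.filter(line => extractId(line).contains("xyz"))
    val abcStream = lines.filter(line => extractId(line).contains("abc"))

    // Process each id's stream differently (printing stands in for the real logic).
    xyzStream.foreachRDD(rdd => rdd.foreach(l => println(s"xyz handler: $l")))
    abcStream.foreachRDD(rdd => rdd.foreach(l => println(s"abc handler: $l")))

    ssc.start()
    ssc.awaitTermination()
  }
}
```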

Related

Use RichMap in Flink like Scala MapPartition

In Spark, we have the mapPartitions function, which can be used to do some initialization for a group of entries, like some DB operation.
Now I want to do the same thing in Flink. After some research I found out that I can use RichMap for the same purpose, but it has the drawback that the operation can be done only in the open method, which runs at the start of the streaming job. I will explain my use case, which will clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and is available in a DB. I want to look up the current users every 10 minutes, so that I filter out and store the data for only those users. In Spark (mapPartitions) it would do the user lookup for every group, and there I had configured it to fetch the users from the DB every 10 minutes. But with Flink, using RichMap, I can do that only in the open function when my job starts.
How can I do this operation in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but it seems the easiest one here would be to use the Broadcast State pattern.
The idea is to define a custom DataSource that periodically queries the data from the SQL table (or, even better, use CDC), use that table stream as broadcast state, and connect it with the actual user stream.
Inside the ProcessFunction for the connected streams you will have access to the broadcasted table data, and you can perform a lookup for every user you receive and decide what to do with it.
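A rough sketch of that idea with Flink's Scala DataStream API; the event type, the allowed-user set, the 10-minute refresh interval and the stubbed-out DB lookup are all assumptions for illustration:

```scala
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(userId: String, payload: String)

// Custom source that periodically re-reads the allowed user ids; the DB call is stubbed out.
class AllowedUsersSource extends SourceFunction[Set[String]] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[Set[String]]): Unit =
    while (running) {
      ctx.collect(loadUsersFromDb())   // placeholder for the real SQL lookup
      Thread.sleep(10 * 60 * 1000L)    // refresh every 10 minutes
    }

  override def cancel(): Unit = running = false

  private def loadUsersFromDb(): Set[String] = Set("user-1", "user-2")
}

object FilterByDynamicUsers {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // In the real job this would be a Kafka source; fromElements keeps the sketch self-contained.
    val events: DataStream[Event] =
      env.fromElements(Event("user-1", "a"), Event("user-3", "b"))

    val usersDescriptor = new MapStateDescriptor[String, Set[String]](
      "allowed-users",
      BasicTypeInfo.STRING_TYPE_INFO,
      TypeInformation.of(classOf[Set[String]]))

    // The periodically refreshed user set becomes broadcast state.
    val allowedUsers = env.addSource(new AllowedUsersSource).broadcast(usersDescriptor)

    val filtered = events
      .connect(allowedUsers)
      .process(new BroadcastProcessFunction[Event, Set[String], Event] {
        override def processElement(
            e: Event,
            ctx: BroadcastProcessFunction[Event, Set[String], Event]#ReadOnlyContext,
            out: Collector[Event]): Unit = {
          // Look up the latest broadcasted user set for every incoming event.
          val users = ctx.getBroadcastState(usersDescriptor).get("users")
          if (users != null && users.contains(e.userId)) out.collect(e)
        }

        override def processBroadcastElement(
            users: Set[String],
            ctx: BroadcastProcessFunction[Event, Set[String], Event]#Context,
            out: Collector[Event]): Unit =
          ctx.getBroadcastState(usersDescriptor).put("users", users)
      })

    filtered.print()
    env.execute("filter-by-dynamic-users")
  }
}
```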

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark but I'm struggling to understand what data is kept in memory. I'm currently running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this as: Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late I want to accept data, and let Spark drop everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of datapoints. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg etc.). So my question is: will Spark keep all the "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include this in my groupBy as well?
And in the other use case, where I do not even want to do a groupBy, I simply want to do some transformation on each message and pass it along; I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to do a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to add the event-time column to the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or a join, to prevent late events from being missed in the aggregation/join (and affecting the output). It is not required for events that just need to be transformed and passed along, since the output is not affected by late events there; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to:
https://medium.com/@ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
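A small sketch of the watermark-plus-groupBy approach discussed above; the topic name, the message schema, the broker address and the 10-minute threshold are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("bounded-agg").getOrCreate()

// Assumed shape of the Kafka message value: {"messageId": ..., "receivedAt": ..., "points": [...]}
val schema = new StructType()
  .add("messageId", StringType)
  .add("receivedAt", TimestampType)
  .add("points", ArrayType(DoubleType))

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "measurements")
  .load()

val aggregated = raw
  .select(from_json(col("value").cast("string"), schema).as("m"))
  .select("m.*")
  .select(col("messageId"), col("receivedAt"), explode_outer(col("points")).as("point"))
  // "Fictional" watermark on the receive time: it lets Spark drop aggregation state
  // for messages older than 10 minutes, keeping the state store bounded.
  .withWatermark("receivedAt", "10 minutes")
  .groupBy(col("messageId"), col("receivedAt"))
  .agg(max("point").as("max"), min("point").as("min"), avg("point").as("avg"))

aggregated.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()
```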

How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Does anyone know how to achieve this? I tried to use the window function examples from the Structured Streaming documentation, but they did not help here.
Thank you very much
Since we're speaking about Structured Streaming and "every line and for every user" that tells me that you should use a streaming query with some sort of streaming aggregation (groupBy and groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That means that records for a single user could be part of two different micro-batches, which in turn means that you need state.
All together, that means you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
The state would be per user, holding the last record seen. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more users, the bigger the state)
When to clean up the state (of users that are no longer expected in the stream)? (this would keep the state at a reasonable size)
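mapGroupsWithState / flatMapGroupsWithState are exposed on the Scala/Java KeyValueGroupedDataset rather than in PySpark, so here is a rough Scala sketch of the idea; the socket source, the "user,timestamp" line format and the lack of a state timeout are assumptions for illustration:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class UserEvent(userId: String, ts: Timestamp)
case class IntervalRow(userId: String, ts: Timestamp, intervalSeconds: Option[Long])

object IntervalsPerUser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("interval-per-user").getOrCreate()
    import spark.implicits._

    // Hypothetical socket source with lines like "A,2021-01-01 08:00:00".
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .map { line =>
        val Array(user, ts) = line.split(",", 2)
        UserEvent(user.trim, Timestamp.valueOf(ts.trim))
      }

    // State per user: epoch millis of the last event seen for that user.
    def computeIntervals(userId: String,
                         batch: Iterator[UserEvent],
                         state: GroupState[Long]): Iterator[IntervalRow] = {
      var last = state.getOption
      val rows = batch.toSeq.sortBy(_.ts.getTime).map { e =>
        val interval = last.map(prev => (e.ts.getTime - prev) / 1000)
        last = Some(e.ts.getTime)
        IntervalRow(userId, e.ts, interval)
      }
      last.foreach(state.update)
      rows.iterator
    }

    val intervals = events
      .groupByKey(_.userId)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(computeIntervals _)

    intervals.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```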

What is the most simple way to write to kafka from spark stream

I would like to write data to Kafka from a Spark stream.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt an approach like the above sample, I must create many classes...
Do you know of a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, that blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task head on, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory / sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send the partition's records.
I'd suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
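For reference, the per-partition pattern described above looks roughly like this in Scala; the topic name and broker address are placeholders, and a pooled or lazily created producer would avoid building a new one for every batch:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// stream is whatever DStream[String] the job already produces.
def writeToKafka(stream: DStream[String], topic: String, brokers: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition, created on the executor that owns the partition.
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
      } finally {
        producer.close()
      }
    }
  }
}
```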

Spark transformations and ordering

I am working on parsing different types of files (text, xml, csv, etc.) into a specific text file format using the Spark Java API. The output file maintains the order of file header, start tag, data header, data and end tag. All of these elements are extracted from the input file at some point.
I tried to achieve this in the following 2 ways:
1. Read the file into an RDD using Spark's textFile and perform the parsing using map or mapPartitions, which returns a new RDD.
2. Read the file using Spark's textFile, reduce it to 1 partition using coalesce, and perform the parsing using mapPartitions, which returns a new RDD.
While I am not concerned about the ordering of the actual data, with the first approach I am not able to keep the required order of file header, start tag, data header and end tag.
The latter works for me, but I know it is not an efficient way and may cause problems with big files.
Is there any efficient way to achieve this?
You are correct in your assumptions. The second choice simply cancels the distributed aspect of your application, so it is not scalable. As for the order issue: since execution is asynchronous, we cannot keep track of order once the data resides on different nodes. What you could do is some preprocessing that removes the need for order, meaning: merge lines up to the point where line order no longer matters, and only then distribute your file. Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above.
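The answer leaves the preprocessing idea abstract. One concrete alternative, not from the answer itself, is to keep a global order while staying distributed: tag each line with its original position via zipWithIndex, parse in parallel, and sort by that index before writing. A rough sketch, with placeholder paths and parsing logic:

```scala
import org.apache.spark.sql.SparkSession

object OrderedParse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ordered-parse").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder parsing logic; the real job would build the target file format here.
    def parseLine(line: String): String = line.toUpperCase

    val parsed = sc.textFile("hdfs:///input/file.txt")    // hypothetical input path
      .zipWithIndex()                                     // (line, original position)
      .map { case (line, idx) => (idx, parseLine(line)) } // parse in parallel, keep the index
      .sortByKey()                                        // restore the original order
      .values

    // Output is split across ordered part files; no single-partition coalesce needed.
    parsed.saveAsTextFile("hdfs:///output/parsed")        // hypothetical output path
  }
}
```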
