Read a topic with multiple events - apache-spark

What should be done in a Spark Structured Streaming job so that it can read a multi-event Kafka topic?
I am trying to read a topic that has multiple types of events. Every event may have a different schema. How does a streaming job determine the type of an event, or which schema to use for it?

Kafka DataFrames are always bytes. You deserialize the key/value columns yourself, for example with a UDF.
For example, assuming the data is JSON, you first cast the bytes to a string; then you can use get_json_object to extract or filter on specific fields.
If the data is in another format, you could use Kafka record headers to designate the event type of each record (you'll need to add those in your own producer code), then filter on those headers and add logic for processing the different sub-DataFrames. Or you could wrap the binary data in a more consistent envelope such as the CloudEvents spec, which includes a type field and nested binary content that then needs further deserialization.
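As a rough illustration of the JSON case, here is a minimal PySpark sketch. It assumes every event carries a top-level "type" field; that field name, the function names, and the event type names used below are hypothetical and not something Kafka or Spark provides. The plain-Python helper performs the same check on a single record, which can be handy for unit-testing the routing rule without a cluster.

```python
import json
from typing import Optional


def event_type_of(value: bytes) -> Optional[str]:
    """Plain-Python view of the same rule: read the assumed "type" field."""
    try:
        doc = json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None  # not valid UTF-8 or not valid JSON
    if not isinstance(doc, dict):
        return None
    event_type = doc.get("type")
    return event_type if isinstance(event_type, str) else None


def split_by_event_type(kafka_df, event_types):
    """Return one filtered DataFrame per event type (names are assumptions).

    kafka_df is a DataFrame read with format("kafka"); its value column is binary.
    """
    from pyspark.sql import functions as F  # imported lazily; needs pyspark installed

    with_type = (
        kafka_df
        .select(F.col("value").cast("string").alias("json"))  # bytes -> string
        .withColumn("event_type", F.get_json_object("json", "$.type"))
    )
    # Each sub-DataFrame can then be parsed with the schema for its own type.
    return {t: with_type.filter(F.col("event_type") == t) for t in event_types}
```

Each entry of the returned dictionary can then be parsed with the schema that belongs to its event type and written out by its own query.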

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark, but I'm struggling to understand what data is kept in memory. Currently I'm running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this to mean that Spark appends all incoming data to an unbounded table that is never truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time, I can use withWatermark to tell Spark which column holds the event time, specify how late I am willing to receive data, and let Spark discard everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of data points. I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all "old" data, since that is how Structured Streaming works, and will that lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include this in my groupBy as well?
In the other use case, I do not even want to do a groupBy. I simply want to apply some transformation to each message and then pass it along; I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and the event-time column has to be included in the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Is there any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or a join, where late events would otherwise affect the output. It is not required for events that only need to be transformed and passed along, since late events have no effect on that output. If you want very late events to be dropped anyway, you might still want to add a watermark. Some links for reference:
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
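For the aggregation case in the question, the watermark approach might look roughly like the PySpark sketch below. The column names message_id, received_at and datapoints, the function name, and the one-minute watermark and window are all assumptions made for illustration; the right thresholds depend on how late data can actually arrive.

```python
WATERMARK_DELAY = "1 minute"  # assumed lateness tolerance
WINDOW_LENGTH = "1 minute"    # assumed window size on the receive time


def aggregate_per_message(messages):
    """Aggregate exploded data points per message with bounded state.

    messages is assumed to be a streaming DataFrame with the columns
    message_id, received_at (timestamp) and datapoints (array of numbers).
    """
    from pyspark.sql import functions as F  # imported lazily; needs pyspark installed

    # One row per data point; explode_outer keeps messages with empty arrays.
    points = messages.select(
        "message_id",
        "received_at",
        F.explode_outer("datapoints").alias("value"),
    )
    return (
        points
        # Lets Spark drop state for windows older than the watermark.
        .withWatermark("received_at", WATERMARK_DELAY)
        .groupBy(F.window("received_at", WINDOW_LENGTH), "message_id")
        .agg(
            F.min("value").alias("min_value"),
            F.max("value").alias("max_value"),
            F.avg("value").alias("avg_value"),
        )
    )
```

The pure transformation case needs none of this: a select or withColumn on the stream carries no aggregation state between micro-batches.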

Eventhub Stream not catching schema mismatch

We are trying to implement badRecordsPath when reading events from an Event Hub. As an example to try to get it working, I have supplied a schema that should make the event fail:
eventStreamDF = (spark.readStream
    .format("eventhubs")
    .options(**eventHubsConf)
    .option("badRecordsPath", "/tmp/badRecordsPath/test1")
    .schema(badSchema)
    .load()
)
Yet this never fails and always reads the events. Is this the expected behaviour of readStream for Event Hubs on Databricks? Our current workaround is to check the inferred schema against our own schema.
The schema of the data in EventHubs is fixed (see docs); the same is true for Kafka. The actual payload is always encoded as a binary field named body, and it's up to the developer to decode this binary payload according to the "contract" between the producer(s) and the consumers of that data. So even if you specify a schema and the badRecordsPath option, they aren't used.
You will need to implement a function that decodes the data from JSON (or whatever format is used) and, for example, returns null if the data is broken. You can then filter on null values to split the stream into two substreams, one for good and one for bad data.
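A minimal sketch of such a decode function, in plain Python so that it can be tested on its own; in a real job it would typically be wrapped in a UDF and applied to the body column. The required field names id and ts are hypothetical stand-ins for whatever the producer contract actually specifies.

```python
import json
from typing import Optional

# Hypothetical contract: every valid event has these fields.
REQUIRED_FIELDS = ("id", "ts")


def decode_body(body: bytes) -> Optional[dict]:
    """Return the decoded event, or None if the payload is broken."""
    try:
        event = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None  # not UTF-8 or not JSON
    if not isinstance(event, dict):
        return None  # valid JSON, but not an object
    if any(field not in event for field in REQUIRED_FIELDS):
        return None  # violates the assumed contract
    return event


# Splitting a handful of payloads the same way the stream would be split.
bodies = [
    b'{"id": 1, "ts": "2021-01-01T00:00:00Z"}',  # good
    b'{"id": 2}',                                # missing "ts"
    b"\xff\xfe",                                 # not valid UTF-8
]
decoded = [(body, decode_body(body)) for body in bodies]
good = [event for _, event in decoded if event is not None]
bad = [body for body, event in decoded if event is None]
```

Keeping the raw bytes of the bad records, as the bad list does here, makes it possible to write them to a separate location for later inspection.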

How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Is there anyone who knows how to achieve this? I tried to use the window functions examples of the Structured Streaming documentation but it was useless.
Thank you very much
Since we're talking about Structured Streaming and "every line and for every user", that tells me you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That means records for a single user could end up in two different micro-batches, which in turn means you need state.
Taken together, that means you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
A state would be per user with the last record found. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in a stream)? (which would keep the state of a reasonable size)
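To make the idea of per-user state concrete, here is a plain-Python simulation of what the state function would have to do across micro-batches. It is not Spark code: the function names, the dictionary used as the state store, and the idle timeout are assumptions for illustration only.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, datetime]                       # (user id, timestamp)
Output = Tuple[str, datetime, Optional[float]]      # adds interval in seconds


def add_intervals(state: Dict[str, datetime], batch: List[Record]) -> List[Output]:
    """Process one micro-batch, updating the last timestamp seen per user."""
    out = []
    for user, ts in sorted(batch, key=lambda record: record[1]):
        previous = state.get(user)
        # The first record of a user has no predecessor, so no interval.
        interval = (ts - previous).total_seconds() if previous is not None else None
        out.append((user, ts, interval))
        state[user] = ts
    return out


def expire_idle_users(state: Dict[str, datetime], now: datetime, max_idle_seconds: float) -> None:
    """Drop users not seen for longer than max_idle_seconds to bound the state."""
    idle = [u for u, ts in state.items() if (now - ts).total_seconds() > max_idle_seconds]
    for user in idle:
        del state[user]


state: Dict[str, datetime] = {}
first = add_intervals(state, [("A", datetime(2021, 1, 1, 8, 0, 0))])
second = add_intervals(state, [("A", datetime(2021, 1, 1, 8, 0, 10))])
```

Here the second call sees the timestamp stored by the first one, which is exactly what the state has to provide when a user's records land in different micro-batches. The expiry function corresponds to the clean-up concern: without it, the state grows with every user ever seen.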

Can we channel different kind of message through same kafka topic?

I have a scenario where different types of messages are streamed from a Kafka producer.
If I don't want to use a separate topic per message type, how do I handle this on the Spark Structured Streaming consumer side?
That is, I want to use only one topic for different types of messages, say Student records, Customer records, etc.
How do I identify which type of message has been received from the Kafka topic?
Please let me know how to handle this scenario on the Kafka consumer side.
Kafka topics don't inherently have "types of data". It's all bytes, so yes, you can serialize completely separate objects into the same topic, but consumers must then add logic to know all the possible types that can be added to the topic.
That being said, Structured Streaming is built on the idea of having structured data with a schema, so it likely will not work if you had completely different types in the same topic without at least performing a filter first based on some inner attribute that is always present among all types.
Yes, you can do this by adding "some attribute" to the message itself when producing, which signifies a logical topic or operation, and differentiating on the Spark side with the Structured Streaming Kafka integration, e.g. checking the message content for "some attribute" and processing accordingly.
Partitioning is, as always, what is used for ordering.

Spark : get Multiple DStream out of a single DStream

Is it possible to get multiple DStreams out of a single DStream in Spark?
My use case is as follows: I am getting a stream of log data from an HDFS file.
Each log line contains an id (id=xyz).
I need to process log lines differently based on the id.
So I was trying to create a different DStream for each id from the input DStream.
I couldn't find anything related in the documentation.
Does anyone know how this can be achieved in Spark, or can you point me to a link for this?
Thanks
You cannot split a single DStream into multiple DStreams.
The best you can do is:
1. Modify your source system to provide different streams for different IDs, and then have different jobs process the different streams.
2. If your source cannot change and provides a stream with a mix of IDs, write custom logic to identify the ID and then perform the appropriate operation.
I would always prefer #1, as it is the cleaner solution, but there are exceptions for which #2 needs to be implemented.
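The custom logic of option #2 could be sketched in plain Python as below, as a function that would be applied to each log line of the mixed stream. The id values, the handler functions and their outputs are made up for illustration; only the id=xyz format comes from the question.

```python
import re
from typing import Callable, Dict, Optional

# Matches "id=<value>" as a separate token, e.g. "id=xyz".
ID_PATTERN = re.compile(r"\bid=(\w+)")


def extract_id(line: str) -> Optional[str]:
    """Return the id found in a log line, or None if there is none."""
    match = ID_PATTERN.search(line)
    return match.group(1) if match else None


# Hypothetical per-id processing steps.
def handle_xyz(line: str) -> str:
    return "xyz:" + line.upper()


def handle_abc(line: str) -> str:
    return "abc:" + line.lower()


HANDLERS: Dict[str, Callable[[str], str]] = {"xyz": handle_xyz, "abc": handle_abc}


def process_line(line: str) -> Optional[str]:
    """Dispatch a log line to the handler registered for its id."""
    handler = HANDLERS.get(extract_id(line) or "")
    return handler(line) if handler else None  # unknown ids are skipped
```

Adding a new id then only means registering another handler, while the stream itself stays a single mixed stream.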
