Structured streaming multiple watermarks - apache-spark

I use Spark 2.3.0 if that matters.
According to the Structured Streaming documentation, it handles late data using watermarks. It also mentions that streaming deduplication is also achieved by using watermarking to keep a limit on how much of enter link description here an intermediate state is stored.
So, my question is if these watermarks can have different values or is the watermark specified only once? I ask this because I will be deduplicating values after aggregation so the tolerance for handling late data is different.

From the Policy for handling multiple watermarks:
A streaming query can have multiple input streams that are unioned or joined together. Each of the input streams can have a different threshold of late data that needs to be tolerated for stateful operations. You specify these thresholds using withWatermarks("eventTime", delay) on each of the input streams.
While executing the query, Structured Streaming individually tracks the maximum event time seen in each input stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark with them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for example, one of the streams stop receiving data due to upstream failures). In other words, the global watermark will safely move at the pace of the slowest stream and the query output will be delayed accordingly.
Since Spark 2.4, you can set the multiple watermark policy to choose the maximum value as the global watermark by setting the SQL configuration spark.sql.streaming.multipleWatermarkPolicy to max (default is min).
In fact, this also applies to any watermark-sensitive operator.

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using structured streaming in spark but I'm struggeling to understand the data kept in memory. Currently I'm running Spark 2.4.7 which says (Structured Streaming Programming Guide)
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
Which I understand as that Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinetly.
I understand the concept and why it is good, for example when I want to aggregaet based on event-time I can use withWatermarkto tell spark which column that is the event-time and then specify how late I want to receive data, and let spark know to throw everything older than that.
However lets say I want to aggregate on something that is not event-time. I have a usecase where each message in kafka contains an array of datapoints. So, I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg e.t.c.). So my question is, will Spark keep all "old" data since that how Structured Streaming work which will lead to OOM-issues? And is the only way to prevent this to add a "fictional" withWatermark on for example the time i received the message and include this in my groupByas well?
And the other usecase, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along, I only care about the current "batch". Will spark in that case also keep all old messages forcing me to to a"fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking for example max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message seperatly, but then I loose all the good things about Strucutred Streaming.
Yes watermarking is necessary to bound the result table and to add event time in groupby.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that ?
And watermarking is "strictly" required only if you have aggregation or join to avoid late events being missed in the aggregation/join(and affect the output) but not for events which just needed to transform and flow since output will not have any effect by late events but if you want very late events to be dropped you might want to add watermarking. Some links to refer.
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba

How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Is there anyone who knows how to achieve this? I tried to use the window functions examples of the Structured Streaming documentation but it was useless.
Thank you very much
Since we're speaking about Structured Streaming and "every line and for every user" that tells me that you should use a streaming query with some sort of streaming aggregation (groupBy and groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That gives that records for a single user could be part of two different micro-batches. That gives that you need a state.
That all together gives that you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
A state would be per user with the last record found. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in a stream)? (which would keep the state of a reasonable size)

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, Kappa architecture is some kind of simplification of Lambda architecture. Kappa doesn't need batch layer, instead speed layer have to guarantee computation precision and enough throughput (more parallelism/resources) on historical data re-computation.
Still Kappa architecture requires two serving layers in case when you need to do analytic based on historical data. For example, data that have age < 2 weeks are stored at Redis (streaming serving layer), while all older data are stored somewhere at HBase (batch serving layer).
When (due to Kappa architecture) I have to insert data to batch serving layer?
If streaming layer inserts data immidiately to both batch & stream serving layers - than how about late data arrival? Or streaming layer should backup speed serving layer to batch serving layer on regular basis?
Example: let say source of data is Kafka, data are processed by Spark Structured Streaming or Flink, sinks are Redis and HBase. When write to Redis & HBase should happen?
If we perform stream processing, we want to make sure that output data is firstly made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
have secondary jobs that reads from that Kafka topic and writes to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems and you set up a Kafka connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operations overhead. (That's a bit of the backup option that you mentioned).
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means, windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yield an updated result. While this approach nicely lowers the burden to users, as they do not need to understand watermarks, it has some severe short-comings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges to downward users to process intermediate results as also explained in the article about Suppression. If it's not obvious if a result is temporary or "final" (in the sense that all expected events have been processed) then many applications are much harder to implement. In particular, windowing operations need to be replicated on consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you'd have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement will add a constant volume factor as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tweaked that a late event is a rare case though.

Spark 2.2 Structured Streaming - duplicate check without watermark - controlling the duplicates state size

We are implementing a real-time aggregation process with Spark 2.2 structured streaming.
The windows calculated can be based on event time that comes as an attribute on the object being processed in the stream and also can be calculated according to the processing time of the system like was possible in previous versions.
In addition we need to implement duplicate check to prevent the processing of duplicate events.
In the documentation it is stated that
Without watermark - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state. (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication)
The problem is that looks like this state can grow forever and cause performance and memory size issues. There seems to be no way in the API to control it
The question is is there a way to implement duplicate check on events that do not have a notion of event time and still maintain control on the duplicate check sate like clearing it etc.
Thanks!

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to spark streaming and I need some basic clarification that I couldn't fully understand reading the documentation.
The use case is that I have a set of files containing dumping EVENTS, and each events has already inside a field TIMESTAMP.
At the moment I'm loading this file and extracting all the events in a JavaRDD and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is if it is possible to process these event using the EVENT TIMESTAMP as temporal reference instead of the actual time of the machine (sorry for the silly question).
In case it is possible, will I need simply spark streaming or I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
yes you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, how much of the complexity you address depends on the cost of inaccuracy in the Streaming job output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to setup a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event-time with Spark SQL (in either batch or structured streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket you data into 1 minute intervals and give you the count for each bucket.
df.groupBy(window($"timestamp", "1 minute") as 'time)
.count()

Resources