Workaround for joining two streams in structured streaming in Spark 2.x - apache-spark

I have a stream of configurations (not changed often, but if there's an update, it will be a message), and another stream of raw data points.
As I understand it, Spark doesn't currently support joining two streaming Datasets or DataFrames. Is there a good way to work around this?
Is it possible to "snapshot" one of the streaming Datasets into a static Dataset (probably the configuration one, since it updates less often), and then join it with the other streaming Dataset?
Open to suggestions!

"Workaround" is to use current master branch ;)
It's not released yet, but current master branch already has stream-stream inner join and there is outer join in progress. See this Jira ticket for reference, in sub-task you see possible joins to use.
There's no other easy workaround. Streaming joins requires saving state of streams and then correct updates of state. You can see code in pull requests, it's quite complex to implement stream-stream join.

So here is what I ended up doing.
Write the stream with fewer updates into a memory sink, then select from that table. At that point it is a static Dataset and can be joined with the other stream; no trigger is needed. Of course, you have to keep the table up to date yourself.
This is not very robust, but it's the best I could come up with before official support lands.
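A rough sketch of the idea in Scala, assuming an active SparkSession spark, two streaming DataFrames configStream and rawStream, and a made-up join key configId:

// Stream the (rarely changing) configurations into an in-memory table.
val configQuery = configStream.writeStream
  .format("memory")
  .queryName("configs")        // registers the sink as the temp view "configs"
  .outputMode("append")
  .start()

// Reading the memory sink back gives a static Dataset ("snapshot").
val configSnapshot = spark.table("configs")

// Stream-static inner join with the raw data stream.
val joined = rawStream.join(configSnapshot, Seq("configId"))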

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark, but I'm struggling to understand what data is kept in memory. I'm currently running Spark 2.4.7, whose Structured Streaming Programming Guide says
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this to mean that Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good: for example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late data may arrive, and let Spark throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of data points, so I use explode_outer to create multiple rows per message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all the "old" data, since that is how Structured Streaming works, and will this lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include this in my groupBy as well?
And for the other use case, where I do not even want to do a groupBy: I simply want to do some transformation on each message and then pass it along, and I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to include the event time in the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or a join, to avoid late events being missed in the aggregation/join (and affecting the output). It is not required for events that just need to be transformed and passed along, since late events have no effect on that output; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to:
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
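A rough sketch of that "fictional" watermark, assuming the Kafka-sourced DataFrame is kafkaDf with a messageId column and an array column points of doubles (all names invented for illustration):

import org.apache.spark.sql.functions._

// Tag each row with the time we received it, since there is no real event time.
val exploded = kafkaDf
  .withColumn("receivedAt", current_timestamp())
  .select(col("messageId"), col("receivedAt"), explode_outer(col("points")).as("point"))

// The watermark bounds the state kept for the aggregation, so old
// message groups are eventually dropped instead of piling up.
val aggregated = exploded
  .withWatermark("receivedAt", "10 minutes")
  .groupBy(col("messageId"), window(col("receivedAt"), "5 minutes"))
  .agg(min("point"), max("point"), avg("point"))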

Ordering of events in structured streaming window aggregates in append mode

I am facing an issue with Structured Streaming in Spark.
Current setup: I have a data stream coming from Kafka. Each message has an event time. I am using these event times to build window aggregates, with a watermark rule to discard state.
The output mode is append.
Aim: I need to get the window aggregates in the order in which they expire, so that I can process them in order of event-time windows. I expect the window state to expire sequentially because of my sliding window.
Problem: Sometimes the order of the printed messages is not sequential with respect to the windows. For example:
|[2020-06-11 08:02:00, 2020-06-11 08:03:00]|
|[2020-06-11 08:01:00, 2020-06-11 08:02:00]|
Why are the windows not dropped in order? I want this to be ordered.
Please help.
It's not in the architecture (yet). Many posts here on SO address this point.
As cricket_007 states in numerous posts, you are generally best off leaving the sorting to the downstream system(s). It is also more flexible that way; even in the RDBMS world, fixing the data's sort order is less and less relevant, clustering aside.
If you look at this use case https://mapr.com/blog/real-time-analysis-popular-uber-locations-spark-structured-streaming-machine-learning-kafka-and-mapr-db/ you can see that sorting plays no role there. That said, I see many such requests, but most goals can be achieved without sorting.
Single-partition topics are an option for lower volumes, provided the sort order is set by the producer and that order is acceptable. You can also consider KSQL, writing to a single-partition Kafka topic to read from subsequently, or Kafka Streams with Java or Scala.
I think the issue is, as a post I saw a few years ago put it:
The basic tenet of structured streaming is that a query should return
the same answer in streaming or batch mode. We support sorting in
complete mode because we have all the data and can sort it correctly
and return the full answer. In update or append mode, sorting would
only return a correct answer if we could promise that records that
sort lower are going to arrive later (and we can't). Therefore, it is
disallowed.
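To make the quoted rule concrete, here is a minimal sketch (stream and column names invented): sorting a streaming aggregation is only accepted in complete output mode; in append mode the orderBy is rejected with an AnalysisException, so any ordering has to happen downstream.

import org.apache.spark.sql.functions.{col, window}

val counts = events                        // hypothetical streaming DataFrame
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "1 minute"))
  .count()

counts
  .orderBy(col("window"))                  // allowed only with complete mode
  .writeStream
  .outputMode("complete")                  // in "append" mode this query would fail
  .format("console")
  .start()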

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, the Kappa architecture is a kind of simplification of the Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computational precision and enough throughput (more parallelism/resources) when re-computing historical data.
Still, the Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data less than 2 weeks old is stored in Redis (streaming serving layer), while all older data is stored somewhere in HBase (batch serving layer).
When (in a Kappa architecture) do I have to insert data into the batch serving layer?
If the streaming layer inserts data immediately into both the batch and stream serving layers, then what about late data arrivals? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?
Example: say the source of data is Kafka, the data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis and HBase happen?
If we perform stream processing, we want to make sure that the output data is first made available as a data stream. In your example that means we write to Kafka as the primary sink.
Now you have two options:
Have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems, so you set up a Kafka Connect job. These secondary jobs can be tailored to the specific sinks, but they add additional operational overhead. (That's a bit like the backup option you mentioned.)
With Spark and Flink you also have the option of adding the secondary sinks directly in your job (see the sketch after this list). You can add processing steps to transform the Kafka output into a form more suitable for the sink, but you are more limited when configuring the job. For example, in Flink you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you only need to run one streaming job instead of 2 or 3.
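For the Spark variant of the second option, a rough sketch using foreachBatch (Spark 2.4+); enriched stands for the processed stream, and writeToRedis / writeToHBase are hypothetical helpers wrapping the concrete connector calls:

import org.apache.spark.sql.DataFrame

// Fan each micro-batch out to both serving layers from a single streaming job.
val query = enriched.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()                      // avoid recomputing the batch per sink
    writeToRedis(batchDf)                  // streaming serving layer (recent data)
    writeToHBase(batchDf)                  // batch serving layer (full history)
    batchDf.unpersist()
  }
  .option("checkpointLocation", "/checkpoints/kappa-fanout")
  .start()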
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges for downstream users to process intermediate results, as also explained in the article about Suppression. If it's not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), then many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you have a high aggregation factor, watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement adds a constant volume factor, as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you, is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag, continuously or periodically. That approach simulates watermark-based emission again on the consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that a late event is a rare case, though.

Spark Structured Streaming Checkpoint Compatibility

Am I safe to use Kafka and Spark Structured Streaming (SSS) (>=v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or change the query? I'd like to seamlessly continue with the offsets left behind even in those cases.
I've found different answers when searching the net for compatibility issues in SSS's (>=2.2) checkpoint mechanism. Maybe someone out there can shed some light on the situation, ideally backed up with facts/references or first-person experience?
In Spark's programming guide (currently v2.3) they just say it "..should be a directory in an HDFS-compatible file system" but don't say a single word about compatibility constraints.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks at least gives some hints that this is an issue at all.
https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
A Cloudera blog recommends storing the offsets in ZooKeeper instead, but this actually refers to the "old" Spark Streaming implementation. Whether this applies to Structured Streaming too is unclear.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Someone in this conversation claims that there is no problem in that regard anymore ... but without pointing to facts.
How to get Kafka offsets for structured query for manual and reliable offset management?
Help is highly appreciated.
Checkpoints are great when you don't need to change the code; fire-and-forget jobs are the perfect use case.
I read the Databricks post you linked; the truth is that you can't know what kind of changes you will need until you have to make them. I wonder how they can predict the future.
About the Cloudera link: yes, they are talking about the old procedure, but with Structured Streaming code changes still void your checkpoints.
So, in my opinion, all that automation is only good for fire-and-forget jobs.
If that is not your case, saving the Kafka offsets elsewhere is a good way to restart from where you left off last time. Remember that Kafka can hold a lot of data, so restarting from zero to avoid data loss, or accepting a restart from the latest offset, is not always acceptable.
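One rough sketch of "saving the offsets elsewhere": a StreamingQueryListener that records the end offsets of every completed micro-batch (saveOffsets is a hypothetical persistence call, e.g. to a database or ZooKeeper); on a restart without checkpoints those offsets can be passed back to the Kafka source via its startingOffsets option.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { src =>
      // src.endOffset is a JSON string such as {"my-topic":{"0":1234,"1":5678}}
      saveOffsets(src.description, src.endOffset)
    }
  }
})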
Remember: any change to the streaming logic will be ignored as long as there are checkpoints, so you can't change your job once it's deployed unless you accept throwing the checkpoints away.
By throwing away the checkpoints you force the job either to reprocess the entire Kafka topic (earliest) or to start right at the end (latest), skipping unprocessed data.
It's great, isn't it?

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to Spark Streaming and I need some basic clarification that I couldn't fully get from reading the documentation.
The use case is that I have a set of files containing dumped EVENTS, and each event already has a TIMESTAMP field inside.
At the moment I'm loading these files and extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the machine's actual time (sorry for the silly question).
If it is possible, will plain Spark Streaming be enough, or do I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
Yes, you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, there's no simple answer to this one. Spark Streaming might be better for per-event processing if you need to examine each event individually. Structured Streaming is a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements; how much of it you address depends on the cost of inaccuracy in the streaming job's output.
Spark Streaming makes no guarantee that events will be processed in any particular order. To impose ordering, you will need to set up a window for your processing that reduces the risk of out-of-order processing to an acceptable level. You will need a window of data big enough to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following buckets your data into 1-minute intervals and gives you the count for each bucket.
import spark.implicits._                       // for the $"..." column syntax
import org.apache.spark.sql.functions.window

df.groupBy(window($"timestamp", "1 minute") as 'time)
  .count()
