Spark stream-stream join where one stream is CDC - apache-spark

In a stream-stream join, when one of the streams is referential data for the other (with no watermark), is there a way to alter the state buffer of one of the streams?
I’m looking to implement something close to this: https://aws.amazon.com/blogs/big-data/handle-fast-changing-reference-data-in-an-aws-glue-streaming-etl-job/
But instead of the ProductPriority table being a manually refreshed static table, I am hoping to replace it with a CDC stream. That would mean when deletes or updates come in, they need to alter the rows living in the state buffer for that stream.
Is this possible?

Related

Apply ForEach (Structured Streaming Programming) to move data from one delta table to another

Let's say I have a Delta table that was processed using ForEachBatch to apply transformations and was finally saved as a Delta table (let's call it Table1).
However, due to some requirements, the data in this table needs to be merged or appended to another Delta table (Table2), which is being updated by another stream.
My question is: how can I use the ForEach option instead of ForEachBatch in the new stream to save that data into Table2? Our requirements force us to append the data from Table1 to Table2 record by record, because with ForEachBatch, when the process fails, it generates duplicate data and ends up breaking the stream.
Or is there another way to approach the problem without using it?
It is important to note that each table is a streaming table.
We have tried to implement this idea with two stream writes using ForEachBatch, but we got errors and duplicates in different scenarios. First, because we need to use a surrogate key (an identity column), the two streams fail.
We worked around that by writing to a staging table without the identity column and applying it later, but the problem is that when the stream fails, foreachBatch generates duplicated data and breaks the whole process.
That's why we thought we could use foreach to append the data to Table2, but we have no idea how it works or how to implement it, since it has to be record by record, and we haven't found an example of it anywhere.
Any help would be appreciated.
If code is needed, I can try to provide it.
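A minimal sketch of PySpark's record-by-record foreach sink, for orientation only: the Table2Writer class, the write_single_record_to_table2 helper, and the table1_stream variable are hypothetical placeholders, not part of the original question.

from pyspark.sql import Row

class Table2Writer:
    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch; set up any connection here.
        # Returning True tells Spark to call process() for each row.
        return True

    def process(self, row: Row):
        # Called once per record; write the single row to the target system.
        write_single_record_to_table2(row.asDict())  # hypothetical helper

    def close(self, error):
        # Called when the partition/epoch finishes; clean up and inspect `error` if set.
        pass

query = (
    table1_stream.writeStream   # assumed: a streaming read of Table1
    .foreach(Table2Writer())
    .outputMode("append")
    .start()
)

Note that the foreach sink only gives at-least-once guarantees, so duplicates after a failure are still possible unless each per-row write is itself idempotent.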

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark but I'm struggling to understand the data kept in memory. Currently I'm running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this as Spark appending all incoming data to an unbounded table which never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time, I can use withWatermark to tell Spark which column is the event time, specify how late I want to accept data, and let Spark throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of data points. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message and include this in my groupBy as well?
And for the other use case, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along; I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to add event time to the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Watermarking is strictly required only if you have an aggregation or a join, to avoid late events being missed in the aggregation/join (and affecting the output). It is not required for events that just need to be transformed and passed along, since late events have no effect on that output; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to:
https://medium.com/@ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
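A minimal sketch of the "fictional" watermark approach from the question, assuming the stream has already been parsed into message_id and datapoints columns (all column names and durations here are assumptions):

from pyspark.sql.functions import current_timestamp, explode_outer, window, col
from pyspark.sql.functions import max as max_, min as min_, avg

# Stamp each message with a processing-time column at ingestion
parsed = kafka_stream.withColumn("receive_time", current_timestamp())

# One row per datapoint in the message's array
exploded = parsed.select(
    "message_id", "receive_time",
    explode_outer("datapoints").alias("dp"),
)

aggregated = (
    exploded
    .withWatermark("receive_time", "10 minutes")   # lets Spark drop aggregation state older than ~10 minutes
    .groupBy(window(col("receive_time"), "5 minutes"), col("message_id"))
    .agg(max_("dp").alias("max_dp"), min_("dp").alias("min_dp"), avg("dp").alias("avg_dp"))
)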

Preventing Spark from storing state in stream/stream joins

I have two streaming datasets, let's call them fastStream and slowStream.
The fastStream is a streaming dataset that I am consuming from Kafka via the structured streaming API. I am expecting to receive potentially thousands of messages a second.
The slowStream is actually a reference (or lookup) table that is being 'upserted' by another stream and contains data that I want to join on to each message in the fastStream before I save the records to a table. The slowStream is only updated when someone changes the metadata, which can happen at any time but we would expect to change maybe once every few days.
Each record in the fastStream will have exactly one corresponding message in the slowStream and I essentially want to make that join happen immediately with whatever data is in the slowStream table. I don't want to wait to see if a potential match could occur if new data arrives in the slowStream.
The problem that I have is that according to the Spark docs:
Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results.
I have tried adding a watermark to the fastStream, but I think it has no effect, since the docs indicate that the watermarked column needs to be referenced in the join condition.
Ideally I would write something like:
from pyspark.sql.functions import expr

# Apply a watermark to the fast stream
fastStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/fastStream") \
    .withWatermark("timestamp", "1 hour") \
    .alias("fastStream")

# The slowStream cannot be watermarked since it is only slowly changing
slowStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/slowStream") \
    .alias("slowStream")

# Prevent the join from buffering the fast stream by 'telling' Spark that there will never be new matches.
fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > watermark
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")
But I don't think you can reference the watermark in the SQL expression.
Essentially, while I'm happy to have the slowStream buffered (so the whole table is in memory), I can't have the fastStream buffered, as this table will quickly consume all memory. I would simply like to drop messages from the fastStream that aren't matched, rather than retaining them to see if they might match in the future.
Any help very gratefully appreciated.
For inner stream-stream joins, watermarking and event-time constraints (join conditions) are optional.
If unbounded state is not an issue for you in terms of volume, you can choose not to specify them. In that case, all data will be buffered and your data from the fastStream will immediately be joined with all the data from the slowStream.
Only when both parameters are specified will your state be cleaned up. Note the purpose of those two parameters:
Event-time constraint (time range join condition): What is the maximum time range between the generation of the two events at their respective sources?
Watermark: What is the maximum duration an event can be delayed in transit between the source and the processing engine?
To define the two parameters you need to first answer the above questions (which are quoted from the book "Learning Apache Spark, 2nd edition" published by O'Reilly).
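For reference, a minimal sketch with both parameters specified, following the pattern in the Spark docs; the updated_at column and the 2-hour constraint are assumptions:

from pyspark.sql.functions import expr

fast = (
    spark.readStream.format("delta")
    .load("dbfs:/mnt/some_file/fastStream")
    .withWatermark("timestamp", "1 hour")      # max delay for fastStream events
    .alias("fastStream")
)

slow = (
    spark.readStream.format("delta")
    .load("dbfs:/mnt/some_file/slowStream")
    .withWatermark("updated_at", "2 hours")    # max delay for slowStream updates (assumed column)
    .alias("slowStream")
)

joined = fast.join(
    slow,
    expr("""
        fastStream.slow_id = slowStream.id AND
        fastStream.timestamp BETWEEN slowStream.updated_at AND slowStream.updated_at + INTERVAL 2 HOURS
    """),
    "inner",
)

With both the watermarks and the time-range condition in place, Spark can expire buffered rows on both sides once they can no longer produce matches.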
Regarding your code comment:
"Prevent the join from buffering the fast stream by 'telling' spark that there will never be new matches."
Remember that buffering in stream-stream join is necessary. Otherwise you would just be able to join the data that is available within the current micro-batch. As the slowStream does not have regular updates but the fastStream is updating its data quite fast you would probably never get any join matches at all without buffering the data.
Overall, for the use case you are describing ("Join fast changing data with slow changing metadata") it is usually better to use a stream-static join approach where the slow changing data becomes the static part.
In a stream-static join, every row in the stream data is joined with the full static data, and the static table is loaded in every single micro-batch. If loading the static table reduces your performance, you may think about caching it and having it updated regularly, as described in Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically.
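A minimal sketch of that stream-static approach, reusing the Delta paths from the question:

from pyspark.sql.functions import expr

# Static (batch) read of the slow-changing reference data; per the point above,
# it is resolved again in every micro-batch of the streaming query.
slow_static = (
    spark.read.format("delta")
    .load("dbfs:/mnt/some_file/slowStream")
    .alias("slowStream")
)

fast = (
    spark.readStream.format("delta")
    .load("dbfs:/mnt/some_file/fastStream")
    .alias("fastStream")
)

# No streaming state is kept for the static side, and unmatched
# fastStream rows are simply dropped by the inner join.
out = fast.join(
    slow_static,
    expr("fastStream.slow_id = slowStream.id"),
    "inner",
).select("fastStream.*", "slowStream.metadata")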
Answering my own question with what I ended up going with. It's certainly not ideal but for all my searching, there doesn't seem to be the control within Spark structured streaming to address this use case.
So my solution was to read the dataset and conduct the join inside a foreachBatch. This way I prevent Spark from storing a ton of unnecessary state and get the joins conducted immediately. On the downside, there seems to be no way to incrementally read a stream table so instead, I am re-reading the entire table every time...
from pyspark.sql.functions import expr

def join_slow_stream(df, batchID):
    # Read the slow table as a static table rather than a stream
    slowdf = spark.read \
        .format("delta") \
        .load("dbfs:/mnt/some_file/slowStream") \
        .alias("slowStream")

    # Alias the micro-batch DataFrame so the join expression below resolves
    out_df = df.alias("fastStream").join(
        slowdf,
        expr("fastStream.slow_id = slowStream.id"),
        "inner"
    ).select("fastStream.*", "slowStream.metadata")

    # write data to database
    db_con.write(out_df)

fastStream.writeStream.foreachBatch(join_slow_stream).start()
If you are interested in referencing the "time that was watermarked", i.e. the 1 hour, you may replace watermark in the expression with current_timestamp - interval '1' hour.
Since you are attempting to join two streams, Spark will insist that both use watermarks.
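A hedged sketch of that substitution, reusing the question's streams; whether it actually limits the buffered state in your workload would need to be verified:

from pyspark.sql.functions import expr

joined = fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > current_timestamp() - INTERVAL 1 HOUR
    """),
    "inner",
).select("fastStream.*", "slowStream.metadata")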
Reference
Spark Stream to Stream Joins

Apache Spark Structured Streaming for Window Aggregation and Custom Triggering

Say I have some streaming data with the following schema:
uid: string
ts: timestamp
Now assume the data has been partitioned by uid (and in each partition the data volume is minimal, e.g. less than 1 row/sec).
I would like to put the data (in each partition) into windows based on the event time ts, then sort all the elements within each window (based on ts as well), and finally apply a custom transformation to each element in the window, in order.
Q1: Is there any way to get an aggregated view of the window but keep each element, e.g. materialize all the elements in a window into a list?
Q2: If Q1 is possible, I would like to set a watermark and trigger combination which triggers once at the end of the window, and then either triggers periodically or triggers every time late data arrives. Is that possible?
Before I answer the questions, let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation (with explicit state logic), which gives you the most control for manual streaming state management.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize the all the elements in a window into a list?
That sounds like a streaming join where you have the input stream on your left and the aggregated stream (streaming aggregation) on your right. That should be doable (but I'm leaving it without example code, as I'm still unsure whether I understood your question right).
Q2: If Q1 is possible, I would like to set a watermark and trigger combination, which triggers once at the end of the window, then either trigger periodically or trigger every time late data arrives. Is it possible?
Use window standard function to define the window and a watermark to "close" windows at proper times. That is also doable (but no example again as I'm not sure of the merit of the question).
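For Q1, a minimal sketch of materializing a window's elements into an ordered list with a streaming aggregation; the window and watermark durations are arbitrary assumptions:

from pyspark.sql.functions import window, collect_list, struct, sort_array, col

windowed = (
    events                                              # streaming DataFrame with uid: string, ts: timestamp
    .withWatermark("ts", "30 seconds")                  # closes windows once the watermark passes their end
    .groupBy(col("uid"), window(col("ts"), "1 minute"))
    .agg(sort_array(collect_list(struct(col("ts"), col("uid")))).alias("elements"))
)

query = (
    windowed.writeStream
    .outputMode("append")       # each (uid, window) group is emitted once, when the watermark passes the window end
    .format("console")
    .start()
)

Regarding Q2: with append output mode each window is emitted once after the watermark passes its end; to also emit updates as data arrives you would switch to update output mode instead.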

Get all rows of a window in Spark structured streaming

I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continuous stream of events and are looking for patterns like: event A (device disconnect) is followed by event B (device reconnect) within 10 seconds, or event A (disconnect) is not followed by event B (reconnect) within 10 seconds.
I was thinking of using a window function to group the dataset into 10-second window buckets and check for the pattern every time the window values are updated. It looks like the window function is really used as a groupBy in Structured Streaming, which forces me to use aggregate functions to get high-level aggregations of column values.
I am wondering if there is a way to loop through all values of a column when using the window function in Structured Streaming.
You might want to try using mapGroupsWithState (Structured Streaming) or mapWithState (DStreams); it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state every time an update comes in. You can also set a timeout for each key, after which its state will be removed. For your use case, you could store the initial state for event A as the timestamp of when A arrived, and when event B comes you can check whether its timestamp is within 10s of A's. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the initial state when A comes, set the timeout to 10s, and if A is still around when B comes then generate an event.
Good blog post on the differences b/w mapGroupsWithState and mapWithState
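mapGroupsWithState itself is only exposed in the Scala/Java API; in recent PySpark (3.4+) the closest counterpart is applyInPandasWithState. A rough, untested sketch of the approach described above, with assumed column names (device_id, event_type, ts):

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def detect_pattern(key, pdfs, state: GroupState):
    (device_id,) = key
    if state.hasTimedOut:
        # No reconnect arrived within 10s of the stored disconnect.
        state.remove()
        yield pd.DataFrame({"device_id": [device_id], "pattern": ["no_reconnect_within_10s"]})
    else:
        for pdf in pdfs:
            for row in pdf.sort_values("ts").itertuples():
                if row.event_type == "disconnect":
                    state.update((row.ts,))                    # remember when the disconnect happened
                    state.setTimeoutDuration("10 seconds")
                elif row.event_type == "reconnect" and state.exists:
                    (disconnect_ts,) = state.get
                    state.remove()
                    if (row.ts - disconnect_ts).total_seconds() <= 10:
                        yield pd.DataFrame({"device_id": [device_id],
                                            "pattern": ["reconnect_within_10s"]})

result = events.groupBy("device_id").applyInPandasWithState(
    detect_pattern,
    outputStructType="device_id string, pattern string",
    stateStructType="disconnect_ts timestamp",
    outputMode="append",
    timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
)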
