Preventing Spark from storing state in stream/stream joins - apache-spark

I have two streaming datasets, let's call them fastStream and slowStream.
The fastStream is a streaming dataset that I am consuming from Kafka via the structured streaming API. I am expecting to receive potentially thousands of messages a second.
The slowStream is actually a reference (or lookup) table that is being 'upserted' by another stream and contains data that I want to join onto each message in the fastStream before I save the records to a table. The slowStream is only updated when someone changes the metadata, which can happen at any time, but we would expect it to change maybe once every few days.
Each record in the fastStream will have exactly one corresponding message in the slowStream and I essentially want to make that join happen immediately with whatever data is in the slowStream table. I don't want to wait to see if a potential match could occur if new data arrives in the slowStream.
The problem that I have is that according to the Spark docs:
Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results.
I have tried adding a watermark to the fastStream, but I think it has no effect, since the docs indicate that the watermarked columns need to be referenced in the join condition.
Ideally I would write something like:
from pyspark.sql.functions import expr

# Apply a watermark to the fast stream
fastStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/fastStream") \
    .withWatermark("timestamp", "1 hour") \
    .alias("fastStream")

# The slowStream cannot be watermarked since it is only slowly changing
slowStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/slowStream") \
    .alias("slowStream")

# Prevent the join from buffering the fast stream by 'telling' Spark that there will never be new matches.
fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > watermark
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")
But I don't think you can reference the watermark in the SQL expression.
Essentially, while I'm happy to have the slowStream buffered (so the whole table is in memory), I can't have the fastStream buffered, as that table will quickly consume all memory. I would simply like to drop messages from the fastStream that aren't matched, rather than retaining them to see if they might match in future.
Any help very gratefully appreciated.

For inner stream-stream joins, watermarking and the event-time constraint (join condition) are optional.
If unbounded state is not an issue for you in terms of volume, you can choose not to specify them. In that case, all data will be buffered and your data from the fastStream will immediately be joined with all the data from the slowStream.
Only when both parameters are specified will your state be cleaned up. Note the purpose of those two parameters:
Event-time constraint (time range join condition): What is the maximum time range between the generation of the two events at their respective sources?
Watermark: What is the maximum duration an event can be delayed in transit between the source and the processing engine?
To define the two parameters you first need to answer the questions above (which are quoted from the book "Learning Spark, 2nd Edition", published by O'Reilly).
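To make those two parameters concrete, here is a minimal PySpark sketch of an inner stream-stream join with both of them specified, loosely following the ad impressions/clicks example from the Spark documentation; the column names, watermark durations and Delta paths are placeholders, not values from your job:

from pyspark.sql.functions import expr

# Both streams are watermarked, so Spark knows how late events can arrive.
impressions = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/example/impressions") \
    .withWatermark("impressionTime", "2 hours") \
    .alias("impressions")

clicks = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/example/clicks") \
    .withWatermark("clickTime", "3 hours") \
    .alias("clicks")

# The time-range condition bounds how long each side must be buffered,
# so state older than the watermark plus this range can be dropped.
joined = impressions.join(
    clicks,
    expr("""
        clicks.adId = impressions.adId
        AND clicks.clickTime >= impressions.impressionTime
        AND clicks.clickTime <= impressions.impressionTime + interval 1 hour
    """),
    "inner"
)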
Regarding your code comment:
"Prevent the join from buffering the fast stream by 'telling' spark that there will never be new matches."
Remember that buffering in a stream-stream join is necessary. Otherwise you would only be able to join the data that is available within the current micro-batch. As the slowStream does not have regular updates while the fastStream is updating its data quite fast, you would probably never get any join matches at all without buffering the data.
Overall, for the use case you are describing ("join fast-changing data with slowly changing metadata") it is usually better to use a stream-static join approach where the slowly changing data becomes the static part.
In a stream-static join, every row of the streaming data is joined with the full static data, and the static table is loaded in every single micro-batch. If loading the static table reduces your performance you may think about caching it and having it refreshed regularly, as described in Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically.
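As a hedged sketch of what that could look like for your pipeline (the Delta paths and column names are taken from your example, while the checkpoint and output locations are assumptions), the fast stream stays a stream and the slow table is read as a plain batch DataFrame:

from pyspark.sql.functions import expr

# Streaming side: the high-volume fast stream.
fastStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/fastStream") \
    .alias("fastStream")

# Static side: the slowly changing reference table, read as a batch DataFrame.
# As described above, the static side is resolved per micro-batch, so no
# streaming state is buffered for the join.
slowStatic = spark.read \
    .format("delta") \
    .load("dbfs:/mnt/some_file/slowStream") \
    .alias("slowStream")

joined = fastStream \
    .join(slowStatic, expr("fastStream.slow_id = slowStream.id"), "inner") \
    .select("fastStream.*", "slowStream.metadata")

query = joined.writeStream \
    .format("delta") \
    .option("checkpointLocation", "dbfs:/mnt/some_file/_checkpoints/joined") \
    .start("dbfs:/mnt/some_file/joined")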

Answering my own question with what I ended up going with. It's certainly not ideal but for all my searching, there doesn't seem to be the control within Spark structured streaming to address this use case.
So my solution was to read the dataset and conduct the join inside a foreachBatch. This way I prevent Spark from storing a ton of unnecessary state and get the joins conducted immediately. On the downside, there seems to be no way to incrementally read a stream table so instead, I am re-reading the entire table every time...
from pyspark.sql.functions import expr

def join_slow_stream(df, batchID):
    # Read the slow table as a static table rather than a stream
    slowdf = spark.read \
        .format("delta") \
        .load("dbfs:/mnt/some_file/slowStream") \
        .alias("slowStream")

    # Alias the micro-batch so the join condition can reference it by name
    out_df = df.alias("fastStream").join(
        slowdf,
        expr("fastStream.slow_id = slowStream.id"),
        "inner"
    ).select("fastStream.*", "slowStream.metadata")

    # write data to database
    db_con.write(out_df)

fastStream.writeStream.foreachBatch(join_slow_stream).start()

If you are interested in referencing the "time that was watermarked" i.e. the 1 hour, you may replace watermark in the expression with current_timestamp - interval '1' hour.
Since you are attempting to join two streams, Spark will insist that both use watermarks.
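As a rough sketch of that suggestion (untested, reusing the DataFrames from the question, and keeping in mind that current_timestamp() is processing time rather than a true watermark), the join condition could be written as:

from pyspark.sql.functions import expr

joined = fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > current_timestamp() - interval 1 hour
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")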
Reference
Spark Stream to Stream Joins

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using Structured Streaming in Spark but I'm struggling to understand what data is kept in memory. Currently I'm running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this to mean that Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good: for example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late I am willing to accept data, and let Spark know to throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of datapoints. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all "old" data, since that is how Structured Streaming works, and will this lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message and include that column in my groupBy as well?
And in the other use case, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along, and I only care about the current "batch": will Spark in that case also keep all old messages, forcing me to do a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to include the event-time column in your groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
And watermarking is "strictly" required only if you have an aggregation or a join, to avoid late events being missed in the aggregation/join (and affecting the output). It is not required for events that just need to be transformed and passed along, since late events have no effect on that output; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to are below, followed by a sketch of the watermark-plus-groupBy approach.
https://medium.com/@ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
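For the message-id aggregation case, here is a hedged PySpark sketch of that "fictional" watermark approach; exploded (the DataFrame after explode_outer), receive_time, message_id and value are assumed names, not ones from your job:

import pyspark.sql.functions as F

# Watermark on the receive timestamp so old groups can be dropped from state,
# and group by message_id plus a time window so the watermark actually bounds
# the state kept for the aggregation.
aggregated = exploded \
    .withWatermark("receive_time", "10 minutes") \
    .groupBy(
        F.window("receive_time", "10 minutes"),
        "message_id") \
    .agg(
        F.min("value").alias("min_value"),
        F.max("value").alias("max_value"),
        F.avg("value").alias("avg_value"))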

Spark Structured Streaming Deduplication with Watermark

I would like to use Spark Structured Streaming for an ETL job where each event is of form:
{
"signature": "uuid",
"timestamp: "2020-01-01 00:00:00",
"payload": {...}
}
The events can arrive late up to 30 days and can include duplicates. I would like to deduplicate them based on the "signature" field.
If I use the recommended solution:
streamingDf \
    .withWatermark("timestamp", "30 days") \
    .dropDuplicates(["signature", "timestamp"]) \
    .writeStream
would that track (keep in memory, store etc) a buffer of the full event content (which can be quite large) or will it just track the "signature" field values ?
Also, would the simple query like the above write new events immediately as new data arrives or would it "block" for 30 days?
"would that track (keep in memory, store etc) a buffer of the full event content (which can be quite large) or will it just track the "signature" field values ?"
Yes, it will keep all columns of streamingDf and not only the signature and timestamp columns.
"Also, would the simple query like the above write new events immediately as new data arrives or would it "block" for 30 days?"
This query will write events immediately as new data arrives; the state is only kept (for at least 30 days) in order to be able to identify duplicates, it does not block the output.
From my personal experience with streaming applications, I really do not recommend your approach to de-duplicating messages. Keeping the state for up to 30 days is quite challenging from an operational point of view. Remember that any small network glitch, power outage, planned/unplanned maintenance of your OS etc. could cause your application to fail or produce wrong results.
I highly recommend de-duplicating your data through another approach, such as writing the data into a Delta table or any other format or database.
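One possible shape of that alternative, sketched under the assumption that the target is a Delta table and that a merge per micro-batch is acceptable; the table path and checkpoint location are hypothetical:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # De-duplicate within the micro-batch first, then merge on signature so
    # events already present in the target table are simply skipped.
    deduped = batch_df.dropDuplicates(["signature"])
    target = DeltaTable.forPath(spark, "dbfs:/mnt/events/deduped")
    target.alias("t") \
        .merge(deduped.alias("s"), "t.signature = s.signature") \
        .whenNotMatchedInsertAll() \
        .execute()

query = streamingDf.writeStream \
    .foreachBatch(upsert_batch) \
    .option("checkpointLocation", "dbfs:/mnt/events/_checkpoints/dedup") \
    .start()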

Does Spark guarantee consistency when reading data from S3?

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantee that I read the data in the same order each time? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files again with the same content, you will get the same blocks / partitions again and the same id using f.monotonically_increasing_id().
If the total number of rows differs on the successive read(s), with different partitioning applied before this function, then typically you will get different ids.
If you have more data the second time round and apply coalesce(1), then the prior entries will still have the same id and the newer rows will have other ids. A less than realistic scenario, of course.
Blocks for files at rest remain static (in general) on HDFS. So partitions 0..N will be the same upon reading the data at rest again. Otherwise zipWithIndex would not be usable either.
I would never rely on the same data being in the same place when read twice, unless there were no updates (you could cache as well).
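A quick sanity check of that guideline could look like the following PySpark sketch; the S3 path and file format are placeholders, and the comparison only holds if the files and their partitioning are unchanged between the two reads:

import pyspark.sql.functions as f

def read_with_ids(path):
    # Assign ids after reading; the ids depend on the blocks/partitions read,
    # not on the row contents.
    return spark.read.parquet(path).withColumn("id", f.monotonically_increasing_id())

first = read_with_ids("s3://my-bucket/data/")
second = read_with_ids("s3://my-bucket/data/")

# If both reads produced the same partitions, the (id, row) pairs match and
# the difference is empty.
assert first.exceptAll(second).count() == 0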

Can intermediate state be dropped/controlled in Spark structured streaming in Complete Output mode? (Spark 2.4.0)

I have a scenario where I want to process data from a Kafka topic. I have this particular Java code to read the data as a stream from the Kafka topic.
Dataset<Row> streamObjs = sparkSession.readStream().format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option("subscribe", streamTopic)
        .option("failOnDataLoss", false)
        .load();
I cast it to String, define the schema, then try to use watermark (for late data) and window (for grouping and aggregations) and finally output to kafka sink.
Dataset<Row> selectExprImporter = streamObjs.selectExpr("CAST(value AS STRING)");

StructType streamSchema = new StructType().add("id", DataTypes.StringType)
        .add("timestamp", DataTypes.LongType)
        .add("values", new MapType(DataTypes.StringType, DataTypes.DoubleType, false));

Dataset<Row> selectValueImporter = selectExprImporter
        .select(functions.from_json(new Column("value"), streamSchema).alias("data"));
.
.
(More transformations/operations)
.
.
Dataset<Row> aggCount_15min = streamData.withWatermark("timestamp", "2 minute")
        .withColumn("frequency", functions.lit(15))
        .groupBy(new Column("id"), new Column("frequency"),
                functions.window(new Column("timestamp"), "15 minute").as("time_range"))
        .agg(functions.mean("value").as("mean_value"), functions.sum("value").as("sum"),
                functions.count(functions.lit(1)).as("number_of_values"))
        .filter("mean_value > 35").orderBy("id", "frequency", "time_range");

aggCount_15min.selectExpr("to_json(struct(*)) AS value").writeStream()
        .outputMode(OutputMode.Complete()).format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option("topic", outputTopic)
        .option("checkpointLocation", checkpointLocation)
        .start().awaitTermination();
Questions
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way so that I can force spark to drop the state it has after every say 30 mins and work again with new data?
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
The desired result is to aggregate and group the data as per the query above, then, once the 1st batch has been created, drop all state and start fresh for the next batch. Here a batch can be a function of the last processed timestamp: say, drop all state and start fresh when the current timestamp has crossed 20 minutes from the first received timestamp. Or better, make it a function of the window time (15 minutes in this example): say, when 4 batches of 15-minute windows have been processed and the timestamp for the 5th batch arrives, drop the state for the previous 4 batches and start fresh for this batch.
The question asks many things and focuses less on what Spark Structured Streaming (SSS) actually does. Answering your numbered questions, title question and non-numbered question then:
A. Title Question:
Not as such: Complete mode only stores aggregates, so not all data is stored, just the state needed to re-compute the results as data is incrementally added. I find the manual misleading in terms of its description, but it may be just me. You will get this error otherwise:
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
The Kafka sink does not figure here. The intermediate state is what Spark Structured Streaming needs to store. It stores the aggregates and discards the input data once it has been folded in. But in the end you would get an OOM due to this, or some other error, I suspect.
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
For aggregations over all data received. The second part of your question is not logical, so I cannot answer it. The state will generally increase over time.
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way so that I can force spark to drop the state it has after every say 30 mins and work again with new data?
No, there is not. Even stopping gracefully and then re-starting is not really an option, as the period would then no longer be exactly 15 minutes. And it is against the SSS approach anyway. From the manuals: Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode. You cannot drop the state as you would like; again, see the discussion of aggregates above.
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
No, as you have many requirements that cannot be satisfied by the current implementation. Unless you drop the order by and do a non-overlapping window operation (15, 15) in Append mode with a minuscule watermark, if memory serves correctly. You would then rely on sorting later on in downstream processing, as order by is not supported. A sketch of that alternative follows below.
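A hedged PySpark sketch of that alternative, using the column names from the question; streamData, the watermark duration, the Kafka broker, topic and checkpoint location are placeholders:

import pyspark.sql.functions as F

agg_15min = streamData \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(
        F.col("id"),
        F.window("timestamp", "15 minutes").alias("time_range")) \
    .agg(
        F.mean("value").alias("mean_value"),
        F.sum("value").alias("sum"),
        F.count(F.lit(1)).alias("number_of_values")) \
    .filter("mean_value > 35")

# Append mode emits each window once the watermark has passed it, after which
# the state for that window is dropped; any required ordering is left to
# downstream consumers.
query = agg_15min \
    .selectExpr("to_json(struct(*)) AS value") \
    .writeStream \
    .outputMode("append") \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("topic", "output-topic") \
    .option("checkpointLocation", "/tmp/checkpoints/agg_15min") \
    .start()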
Final overall question: aggregating and grouping the data as per the query above, then dropping all state once a batch has been created and starting fresh for the next batch, where a batch is defined by the last processed timestamp or by a number of completed 15-minute windows.
Whilst your ideas may be considered understandable, the SSS framework does not support all of it, and specifically not what you want (just yet).

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to Spark Streaming and I need some basic clarification that I couldn't fully get from reading the documentation.
The use case is that I have a set of files containing dumped EVENTS, and each event already has a TIMESTAMP field inside it.
At the moment I'm loading these files and extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the actual time of the machine (sorry for the silly question).
If it is possible, will I simply need Spark Streaming, or do I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
Yes, you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, how much of the complexity you address depends on the cost of inaccuracy in the Streaming job output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to set up a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket your data into 1-minute intervals and give you the count for each bucket.
df.groupBy(window($"timestamp", "1 minute") as 'time)
.count()
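Run as a Structured Streaming query, the same idea could look like the following PySpark sketch; df is assumed to be a streaming DataFrame with an event-time timestamp column, and the sink path and checkpoint location are placeholders. The watermark lets Spark finalize and drop old one-minute buckets:

import pyspark.sql.functions as F

counts = df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(F.window("timestamp", "1 minute").alias("time")) \
    .count()

query = counts.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "/tmp/event_counts") \
    .option("checkpointLocation", "/tmp/checkpoints/event_counts") \
    .start()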
