I would like to use Spark Structured Streaming for an ETL job where each event is of the form:
{
"signature": "uuid",
"timestamp: "2020-01-01 00:00:00",
"payload": {...}
}
The events can arrive up to 30 days late and can include duplicates. I would like to deduplicate them based on the "signature" field.
If I use the recommended solution:
streamingDf \
    .withWatermark("timestamp", "30 days") \
    .dropDuplicates(["signature", "timestamp"]) \
    .writeStream
would that track (keep in memory, store, etc.) a buffer of the full event content (which can be quite large), or will it just track the "signature" field values?
Also, would a simple query like the above write new events immediately as new data arrives, or would it "block" for 30 days?
"would that track (keep in memory, store etc) a buffer of the full event content (which can be quite large) or will it just track the "signature" field values ?"
Yes, it will keep all columns of streamingDf and not only the signature and timestamp columns.
"Also, would the simple query like the above write new events immediately as new data arrives or would it "block" for 30 days?"
This query will write events immediately as new data arrives and will keep the state for at least 30 days in order to be able to identify duplicates.
From my personal experience with streaming applications, I really do not recommend your approach for de-duplicating messages. Keeping the state for up to 30 days is quite challenging from an operational point of view. Remember that any small network glitch, power outage, planned/unplanned maintenance of your OS, etc. could cause your application to fail or produce wrong results.
I highly recommend de-duplicating your data through another approach, e.g. by writing the data into a Delta table or any other format or database.
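For illustration, here is a minimal sketch of the Delta-based approach. The paths are assumptions and the target Delta table is assumed to already exist; each micro-batch is merged on the signature so only previously unseen events are inserted.

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Remove duplicates inside the micro-batch itself first.
    deduped = batch_df.dropDuplicates(["signature"])
    target = DeltaTable.forPath(spark, "/mnt/events")   # hypothetical target path
    # Insert only events whose signature has not been seen before.
    (target.alias("t")
        .merge(deduped.alias("s"), "t.signature = s.signature")
        .whenNotMatchedInsertAll()
        .execute())

(streamingDf.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/events")   # hypothetical path
    .start())

This way the "state" is the Delta table itself rather than 30 days of in-memory streaming state.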
It was my understanding that references to streaming delta live tables require the use of the function STREAM(), supplying the table name as an argument.
Given below is a code snippet that I found in one of the demo notebooks that Databricks provides. Here I see the use of STREAM() in the FROM clause, but it has not been used for the LEFT JOIN, even though that table is also a streaming table. This query still works.
What exactly is the correct syntax here?
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_cleaned(
CONSTRAINT valid_order_number EXPECT (order_number IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned sales orders with valid order_number(s) and partitioned by order_datetime."
AS
SELECT f.customer_id, f.customer_name, f.number_of_line_items,
timestamp(from_unixtime((cast(f.order_datetime as long)))) as order_datetime,
date(from_unixtime((cast(f.order_datetime as long)))) as order_date,
f.order_number, f.ordered_products, c.state, c.city, c.lon, c.lat, c.units_purchased, c.loyalty_segment
FROM STREAM(LIVE.sales_orders_raw) f
LEFT JOIN LIVE.customers c
ON c.customer_id = f.customer_id
AND c.customer_name = f.customer_name
Just for reference, given below are the other two tables that act as inputs to the above query:
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json", map("cloudFiles.inferColumnTypes", "true"))
CREATE OR REFRESH STREAMING LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");
There are different types of joins on Spark streams:
Stream-static join (doc). This is exactly your case: you have STREAM(LIVE.sales_orders_raw) for the orders, but the customers table is treated as static (it is read on each micro-batch and represents the state of the table at the moment of invocation). This is usually the right choice for your kind of functionality (see the sketch after this list).
Stream-stream join. In this case both streams may need to be aligned against each other because data may arrive late, etc., so both sides use the STREAM(LIVE....) syntax. It may not be the best fit for you, because both streams have to wait for late data, you need to define a watermark for both streams, and so on. See the Spark documentation on stream-stream joins for details.
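Outside DLT, the same stream-static pattern can be sketched in plain Structured Streaming. The snippet below is only an illustration under assumed Delta paths and is not taken from the Databricks demo:

# Plain Structured Streaming sketch of a stream-static join (paths are hypothetical).
# The static side is re-read on every micro-batch, so it reflects the table's
# contents at the moment each batch runs.
orders_stream = spark.readStream.format("delta").load("/mnt/sales_orders_raw")
customers_static = spark.read.format("delta").load("/mnt/customers")

joined = orders_stream.join(
    customers_static,
    on=["customer_id", "customer_name"],
    how="left")

(joined.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/sales_orders_cleaned")
    .start("/mnt/sales_orders_cleaned"))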
I'm using Structured Streaming in Spark but I'm struggling to understand what data is kept in memory. Currently I'm running Spark 2.4.7, whose documentation says (Structured Streaming Programming Guide):
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
Which I understand as: Spark appends all incoming data to an unbounded table that never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late I am willing to receive data, and let Spark throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of datapoints. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg, etc.). So my question is: will Spark keep all "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message and include this in my groupBy as well?
And the other use case, where I do not even want to do a groupBy: I simply want to do some transformation on each message and then pass it along; I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to add the event time to the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Watermarking is "strictly" required only if you have an aggregation or join, to avoid late events being missed in the aggregation/join (and affecting the output). It is not required for events that just need to be transformed and passed along, since the output is not affected by late events; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to:
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
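To make this concrete, here is a hedged PySpark sketch of the "fictional" watermark idea from the question: watermark on the time the message was received and group by the message id. The DataFrame kafkaDf and the column names message_id, datapoints and timestamp are assumptions, not code from the question.

from pyspark.sql import functions as F

# kafkaDf is assumed to be the parsed Kafka stream; column names are hypothetical.
exploded = (kafkaDf
    .withColumn("datapoint", F.explode_outer("datapoints"))
    .withWatermark("timestamp", "10 minutes"))   # watermark on receive time bounds the state

aggregated = (exploded
    .groupBy("message_id", F.window("timestamp", "5 minutes"))
    .agg(F.max("datapoint").alias("max_value"),
         F.min("datapoint").alias("min_value"),
         F.avg("datapoint").alias("avg_value")))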
I have two streaming datasets, let's call them fastStream and slowStream.
The fastStream is a streaming dataset that I am consuming from Kafka via the structured streaming API. I am expecting to receive potentially thousands of messages a second.
The slowStream is actually a reference (or lookup) table that is being 'upserted' by another stream and contains data that I want to join onto each message in the fastStream before I save the records to a table. The slowStream is only updated when someone changes the metadata, which can happen at any time, but we would expect it to change maybe once every few days.
Each record in the fastStream will have exactly one corresponding message in the slowStream and I essentially want to make that join happen immediately with whatever data is in the slowStream table. I don't want to wait to see if a potential match could occur if new data arrives in the slowStream.
The problem that I have is that according to the Spark docs:
Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results.
I have tried adding a watermark to the fastStream, but I think it has no effect, since the docs indicate that the watermarked columns need to be referenced in the join condition.
Ideally I would write something like:
# Apply a watermark to the fast stream
fastStream = spark.readStream \
.format("delta") \
.load("dbfs:/mnt/some_file/fastStream") \
.withWatermark("timestamp", "1 hour") \
.alias("fastStream")
# The slowStream cannot be watermarked since it is only slowly changing
slowStream = spark.readStream \
.format("delta") \
.load("dbfs:/mnt/some_file/slowStream") \
.alias("slowStream")
# Prevent the join from buffering the fast stream by 'telling' spark that there will never be new matches.
fastStream.join(
slowStream,
expr("""
fastStream.slow_id = slowStream.id
AND fastStream.timestamp > watermark
"""
),
"inner"
).select("fastStream.*", "slowStream.metadata")
But I don't think you can reference the watermark in the SQL expression.
Essentially, while I'm happy to have the slowStream buffered (so the whole table is in memory) I can't have the fastStream buffered as this table will quickly consume all memory. Instead, I would simply like to drop messages from the fastStream that aren't matched instead of retaining them to see if they might match in future.
Any help very gratefully appreciated.
For inner stream-stream joins, watermarking and event-time constraints (in the join condition) are optional.
If an unbounded state is not an issue for you in terms of volume you can choose not to specify them. In that case, all data will be buffered and your data from the fastStream will immediately be joined with all the data from the slowStream.
Only when both parameters are specified will your state be cleaned up. Note the purpose of those two parameters:
Event-time constraint (time-range join condition): What is the maximum time range between the generation of the two events at their respective sources?
Watermark: What is the maximum duration an event can be delayed in transit between the source and the processing engine?
To define the two parameters you first need to answer the questions above (which are quoted from the book "Learning Spark, 2nd Edition", published by O'Reilly).
Regarding your code comment:
"Prevent the join from buffering the fast stream by 'telling' spark that there will never be new matches."
Remember that buffering in a stream-stream join is necessary. Otherwise you would only be able to join data that is available within the current micro-batch. As the slowStream does not receive regular updates while the fastStream updates quite fast, you would probably never get any join matches at all without buffering the data.
Overall, for the use case you are describing ("join fast-changing data with slowly changing metadata"), it is usually better to use a stream-static join, where the slowly changing data becomes the static part.
In a stream-static join, every row in the streaming data is joined with the full static data, and the static table is loaded in every single micro-batch. If loading the static table reduces your performance, you may think about caching it and having it refreshed regularly, as described in Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically.
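As a rough illustration of that caching idea (not the exact code from the linked answer; the paths, column names, and refresh interval are assumptions):

import time

# The slowly changing table is loaded once, cached, and refreshed from inside
# foreachBatch once it is older than the chosen interval (all names/paths assumed).
SLOW_PATH = "dbfs:/mnt/some_file/slowStream"
REFRESH_INTERVAL_SEC = 15 * 60
_state = {"ts": 0.0, "df": None}

def get_slow_df():
    now = time.time()
    if _state["df"] is None or now - _state["ts"] > REFRESH_INTERVAL_SEC:
        if _state["df"] is not None:
            _state["df"].unpersist()
        _state["df"] = spark.read.format("delta").load(SLOW_PATH).cache()
        _state["ts"] = now
    return _state["df"]

def process_batch(batch_df, batch_id):
    slow_df = get_slow_df()
    out = batch_df.join(slow_df, batch_df["slow_id"] == slow_df["id"], "inner")
    out.write.format("delta").mode("append").save("dbfs:/mnt/some_file/joined")

fastStream.writeStream.foreachBatch(process_batch).start()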
Answering my own question with what I ended up going with. It's certainly not ideal but for all my searching, there doesn't seem to be the control within Spark structured streaming to address this use case.
So my solution was to read the dataset and conduct the join inside a foreachBatch. This way I prevent Spark from storing a ton of unnecessary state and get the joins conducted immediately. On the downside, there seems to be no way to incrementally read a stream table so instead, I am re-reading the entire table every time...
def join_slow_stream(df, batchID):
    # Read the slowly changing table as a static table rather than a stream
    slowdf = spark.read \
        .format("delta") \
        .load("dbfs:/mnt/some_file/slowStream") \
        .alias("slowStream")

    out_df = df.alias("fastStream").join(
        slowdf,
        expr("fastStream.slow_id = slowStream.id"),
        "inner"
    ).select("fastStream.*", "slowStream.metadata")

    # write data to the database
    db_con.write(out_df)

fastStream.writeStream.foreachBatch(join_slow_stream).start()
If you are interested in referencing the "time that was watermarked", i.e. the 1 hour, you may replace watermark in the expression with current_timestamp - interval '1' hour.
Since you are attempting to join two streams, Spark will insist that both use watermarks.
Reference
Spark Stream to Stream Joins
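Concretely, the first suggestion might look roughly like this (a sketch only; it builds on the aliased fastStream and slowStream DataFrames from the question):

from pyspark.sql.functions import expr

# Bound the join by processing time instead of referencing the watermark directly.
joined = fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > current_timestamp() - interval 1 hour
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")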
I have a scenario where I want to process data from a kafka topic. I have this particular java code to read the data as a stream from kafka topic.
Dataset<Row> streamObjs = sparkSession.readStream().format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", streamTopic)
.option("failOnDataLoss", false).load();
I cast it to String, define the schema, then try to use a watermark (for late data) and a window (for grouping and aggregations), and finally output to a Kafka sink.
Dataset<Row> selectExprImporter = streamObjs.selectExpr("CAST(value AS STRING)");
StructType streamSchema = new StructType().add("id", DataTypes.StringType)
.add("timestamp", DataTypes.LongType)
.add("values", new MapType(DataTypes.StringType, DataTypes.DoubleType, false));
Dataset<Row> selectValueImporter = selectExprImporter
.select(functions.from_json(new Column("value"), streamSchema ).alias("data"));
.
.
(More transformations/operations)
.
.
Dataset<Row> aggCount_15min = streamData.withWatermark("timestamp", "2 minute")
.withColumn("frequency", functions.lit(15))
.groupBy(new Column("id"), new Column("frequency"),
functions.window(new Column("timestamp"), "15 minute").as("time_range"))
.agg(functions.mean("value").as("mean_value"), functions.sum("value").as("sum"),
functions.count(functions.lit(1)).as("number_of_values"))
.filter("mean_value > 35").orderBy("id", "frequency", "time_range");
aggCount_15min.selectExpr("to_json(struct(*)) AS value").writeStream()
.outputMode(OutputMode.Complete()).format("kafka").option("kafka.bootstrap.servers", bootstrapServers)
.option("topic", outputTopic).option("checkpointLocation", checkpointLocation).start().awaitTermination();
Questions
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way I can force Spark to drop the state it holds after, say, every 30 minutes and start again with new data?
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
The desired result being aggregating and grouping data as per the query above; then, when the 1st batch has been created, drop all state and start fresh for the next batch. Here a batch can be a function of the last processed timestamp: say, drop all state and start fresh when the current timestamp has crossed 20 minutes from the first received timestamp, or better, make it a function of the window time (15 minutes in this example), e.g. when 4 batches of 15-minute windows have been processed and the timestamp for the 5th batch arrives, drop the state for the previous 4 batches and start fresh for this batch.
The question asks many things and focuses less on what Spark Structured Streaming (SSS) actually does. Answering your title question, your numbered questions, and your final non-numbered question in turn:
A. Title Question:
Not as such, but Complete mode only stores aggregates, so not all data is stored, but rather a state allowing re-computation based on incremental adding of data. I find the manual misleading in its description, but it may be just me. But you will get this error otherwise:
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
The Kafka sink does not figure here. The intermediate state is what Spark Structured Streaming needs to store. It stores the aggregates and discards the incoming data once it has been aggregated. But in the end you would get an OOM due to this, or some other error, I suspect.
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
For aggregations over all data received. The 2nd part of your question is not logical, so I cannot answer it. The state will generally increase over time.
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way I can force Spark to drop the state it holds after, say, every 30 minutes and start again with new data?
No, there is not. Even stopping gracefully and then re-starting is not really an option, as the period would then not really be 15 minutes. And it is against the SSS approach anyway. From the manuals: Sorting operations are supported on streaming Datasets only after an aggregation and in Complete output mode. You cannot drop the state as you would like; again, see the aggregates discussion.
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
No, as you have many requirements that cannot be satisfied by the current implementation. Unless you drop the orderBy and do a non-overlapping window operation (15, 15) in Append mode with a minuscule watermark, if memory serves correctly. You would then rely on sorting later on in down-stream processing, as orderBy is not supported.
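A hedged PySpark sketch of that alternative: tumbling 15-minute windows in Append mode with a small watermark, leaving the sorting to downstream consumers. streamData and the configuration variables mirror the question's Java code and are assumed to exist.

from pyspark.sql import functions as F

# Non-overlapping 15-minute windows, Append output mode, small watermark;
# the orderBy from the original query is dropped.
agg_15min = (streamData
    .withWatermark("timestamp", "2 minutes")
    .withColumn("frequency", F.lit(15))
    .groupBy("id", "frequency", F.window("timestamp", "15 minutes").alias("time_range"))
    .agg(F.mean("value").alias("mean_value"),
         F.sum("value").alias("sum"),
         F.count(F.lit(1)).alias("number_of_values"))
    .filter("mean_value > 35"))

(agg_15min.selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .outputMode("append")
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", outputTopic)
    .option("checkpointLocation", checkpointLocation)
    .start())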
Final overall question: The desired result being aggregating and grouping data as per the query above; then, when the 1st batch has been created, drop all state and start fresh for the next batch. Here a batch can be a function of the last processed timestamp: say, drop all state and start fresh when the current timestamp has crossed 20 minutes from the first received timestamp, or better, make it a function of the window time (15 minutes in this example), e.g. when 4 batches of 15-minute windows have been processed and the timestamp for the 5th batch arrives, drop the state for the previous 4 batches and start fresh for this batch.
Whilst your ideas may be considered understandable, the SSS framework does not support all of it, and specifically not what you want, just yet.
The Cassandra database is not very good at aggregation, and that is why I decided to do the aggregation before the write. I am storing some data (e.g. transactions) for each user, which I am aggregating by hour. That means for one user there will be only one row for each hour.
Whenever I receive new data, I read the row for the current hour, aggregate it with the received data and write it back. I use this data to generate hourly reports.
This works fine with low-velocity data, but I observed considerable data loss when the velocity is very high (e.g. 100 records for 1 user in a minute). This is because reads and writes happen very fast and, because of the "delayed write", I am not getting updated data.
I think my "aggregate before write" approach itself is wrong. I was thinking about a UDF, but I am not sure how it will impact performance.
What is the best way to store aggregated data in Cassandra?
My idea would be:
Model your data in Cassandra in hour-by-hour buckets.
Store the plain data into Cassandra immediately when it arrives.
At hour X, process all the data of hour X-1 and store the aggregated result in another table.
This would allow you to handle very fast incoming rates, process the data only once, and store the aggregates in another table for fast reads.
I use Cassandra to pre-aggregate as well. I have different tables for hourly, daily, weekly, and monthly aggregates. I think you are probably getting data loss because you are selecting the data before your last inserts have replicated to the other nodes.
Look into the counter data type to get around this.
You may also be able to specify a higher consistency level on either the inserts or the selects to ensure you're reading the most recent data.
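For illustration, a minimal sketch with the Python cassandra-driver, assuming a hypothetical table hourly_agg(user_id text, hour_bucket text, txn_count counter):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # contact point is a placeholder
session = cluster.connect("analytics")  # keyspace name is a placeholder

# Counter increments are resolved server-side, so concurrent writers do not
# need a read-modify-write cycle on the client.
increment = SimpleStatement(
    "UPDATE hourly_agg SET txn_count = txn_count + 1 "
    "WHERE user_id = %s AND hour_bucket = %s",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(increment, ("user-42", "2020-01-01T13"))

# Reading at QUORUM (with QUORUM writes) ensures you see the latest
# acknowledged updates, since R + W > RF.
read = SimpleStatement(
    "SELECT txn_count FROM hourly_agg WHERE user_id = %s AND hour_bucket = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(read, ("user-42", "2020-01-01T13")).one()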