Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I receive about 50 rows. On this sliding window I manually calculate my desired metrics like min, max, etc.
Spark also provides a sliding window function, but I have two problems with it:
The interval at which the sliding window is updated can only be configured as a time period; there seems to be no way to force an update with every single incoming micro-batch. Is there an option I am missing?
The bigger problem: it seems I can only do aggregations using grouping, like:
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that gives me access not only to the current batch but also to the current sliding window. Is there something like that?
Thank you for your time!

You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you can store and keep updating the previous batches in an internal state (only the information you need) and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
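As a rough, untested sketch of that idea in Scala: the state holds the most recent 200000 rows per group, every micro-batch is appended to it, and the metrics are recomputed over the retained rows. The Reading and WindowMetrics case classes and the key field are made-up placeholders for your actual schema.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical row and result shapes; adapt them to your actual schema.
case class Reading(key: String, value: Double, timestamp: Timestamp)
case class WindowMetrics(key: String, count: Int, min: Double, max: Double)

val spark = SparkSession.builder.appName("manual-sliding-window").getOrCreate()
import spark.implicits._

val readings: Dataset[Reading] = ??? // your streaming source, e.g. parsed Kafka records

val maxEntries = 200000 // size of the manually maintained window

val metrics = readings
  .groupByKey(_.key)
  .flatMapGroupsWithState[List[Reading], WindowMetrics](
    OutputMode.Update(), GroupStateTimeout.NoTimeout()) { (key, newRows, state) =>
      // Append the new micro-batch to the stored rows and keep only the newest N.
      val window = (state.getOption.getOrElse(Nil) ++ newRows.toList)
        .sortBy(_.timestamp.getTime)
        .takeRight(maxEntries)
      state.update(window)
      // Recompute the metrics over the whole retained window on every micro-batch.
      val values = window.map(_.value)
      Iterator(WindowMetrics(key, window.size, values.min, values.max))
  }

Keep in mind that storing and re-sorting 200000 raw rows per group in the state store on every micro-batch can get expensive; if your metrics allow it (counts, min, max, sums), keeping running aggregates in the state instead of raw rows is much cheaper.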

Related

How do sliding window functions work in Spark?

I have a question regarding the sliding window function in Spark. I'll be receiving JSON events from a stream, and I would like to see the top value out of all messages I've received in the last 1 hour.
So if I apply this logic using a tumbling or sliding window function, does that mean I'll get the output only after every one hour?
Does Spark store all events for the time frame I've mentioned in the window function in order to get the largest/top value?
Please help me to understand.
Spark supports three types of time windows: tumbling (fixed), sliding and session.
A tumbling window is different from a sliding window:
A tumbling window is fixed-size and non-overlapping. If you use a tumbling window, you can get the output every one hour with the appropriate configuration. If you use a sliding window instead, you can get the top N of all messages received in the last 1 hour (the previous 60 minutes), re-evaluated at every slide interval. See the sketch after the doc links below.
Spark uses the state store provider to handle stateful operations.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#state-store
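For illustration, here is a minimal Scala sketch of the difference; the streaming DataFrame events and its timestamp and value columns are made-up names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, window}

val events: DataFrame = ??? // your streaming source (assumed columns: timestamp, value)

// Tumbling: fixed, non-overlapping 1-hour buckets, so one result per hour.
val tumblingTop = events
  .withWatermark("timestamp", "1 hour")
  .groupBy(window(col("timestamp"), "1 hour"))
  .agg(max(col("value")).as("top_value"))

// Sliding: 1-hour windows that advance every 5 minutes. Each event falls into
// several overlapping windows, so the top value of the last hour is refreshed
// every 5 minutes instead of once per hour.
val slidingTop = events
  .withWatermark("timestamp", "1 hour")
  .groupBy(window(col("timestamp"), "1 hour", "5 minutes"))
  .agg(max(col("value")).as("top_value"))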

Single Sliding Window in Spark Structured Streaming

I am trying to implement a system which measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages with the following format: {"temp": <temp>, createdAt: <timestamp>}. I want to calculate the average temperature for the last 1 hour, so if the time is currently 15:10 I want to consider only those records whose createdAt field is set to a time after 14:10. Others should be dropped.
Is there any way I could do this cleanly in Spark Structured Streaming and keep only one window? I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.
I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective, you use the same syntax as for a sliding window, but you leave the slideDuration param at its default of None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
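The slideDuration parameter is from the PySpark signature; in Scala you get the same behaviour by using the two-argument window overload. A minimal sketch for the question's use case, with temps, temp and createdAt as made-up names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, window}

val temps: DataFrame = ??? // your streaming source (assumed columns: temp, createdAt)

// Omitting the slide duration yields a tumbling (non-overlapping) window:
// one 1-hour bucket per hour, with each record counted exactly once.
val hourlyAvg = temps
  .withWatermark("createdAt", "1 hour")
  .groupBy(window(col("createdAt"), "1 hour"))
  .agg(avg(col("temp")).as("avg_temp"))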

Windowed lag/delta with Spark Structured Streaming

First of all, I'm pretty new to spark, so apologies if I'm missing the obvious!
I'm developing a POC using Spark, which consumes a stream of data from Apache Kafka. My first goal was general moving averages, which was simple using the 'window' function in Spark to calculate some averages based on some keys.
My next goal is to calculate the 'delta' since the last window. So, if I have a parameter called 'noise', the 'window' function calculates avg(noise). But I also want to include the delta of avg(noise) between the current window and the previous window.
I tried using the lag function; however, it does not look like it is supported:
Non-time-based windows are not supported on streaming DataFrames/Datasets
My question is, does Spark Structured Streaming provide some way to calculate this out of the box? I've contemplated using MapGroupsWithStateFunction, which I think might work, but a built-in approach would obviously be preferable.
My bit of code for this is:
WindowSpec w = Window
    .partitionBy(functions.col("window.start"), functions.col("keyName"))
    .orderBy(functions.col("window.start"));

Dataset<Row> outputDS = inputDataset
    .withWatermark("timeStamp", "1 days")
    .groupBy(functions.col("keyName"),
             functions.window(functions.col("timeStamp"), "1 hours", "1 hours"))
    .avg("noise")
    .withColumn("delta", functions.lag("avg(noise)", 1).over(w));

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features in a streaming job out of NetFlow connections and then apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specified time window before. They count how many connections in the last second were to the same host or service as the current one. I decided to use the SQL window functions for this.
So I built window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function which is called to extract these features on every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
  dataset
    .withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
    .withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And I use this function as follows:
stream.map(...).map(...).foreachRDD { rdd =>
  val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
  ...
}
The problem is that this has very bad performance. A batch needs between 1 and 3 seconds for an average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
.partitionBy("plainrecord.ip_dst")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
and once for:
Window
.partitionBy("service")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
This will have a huge impact on the runtime and if these are the hard requirements you won't be able to do much better.

Spark Streaming Window width and slide

I wanted to use Spark Streaming to process events from Kafka and want to set the window width and slide in terms of the number of messages instead of time. Is this possible? I didn't see anything obvious in the API for this; I only saw time-based window options.
That's not possible. You can only set window duration.
This gets near enough to what I would want:
[Output an event based on a number of messages within a specific time period]
https://damieng.com/blog/2015/06/27/time-window-events-with-apache-spark-streaming
