Windowed lag/delta with Spark Structured Streaming - apache-spark

First of all, I'm pretty new to Spark, so apologies if I'm missing the obvious!
I'm developing a POC using Spark, which consumes a stream of data from Apache Kafka. My first goal was general moving averages, which was simple: I used the 'window' function in Spark to calculate some averages based on some keys.
My next goal is to calculate the 'delta' since the last window. So, if I have a parameter called 'noise', the 'window' function calculates avg(noise). But I also want to include the delta of avg(noise) between the current window and the previous window.
I tried using the lag function, however it does not look like it is supported:
Non-time-based windows are not supported on streaming DataFrames/Datasets
My question is, does Spark Structured Streaming provide some way to calculate this out of the box? I've contemplated using MapGroupsWithStateFunction, which I think might work, but if there is a built-in approach, that is obviously preferable.
My bit of code for this is:
WindowSpec w = Window.partitionBy(functions.col("window.start"), functions.col("keyName")).orderBy(functions.col("window.start"));
Dataset<Row> outputDS = inputDataset.withWatermark("timeStamp", "1 days")
    .groupBy(functions.col("keyName"), functions.window(functions.col("timeStamp"), "1 hours", "1 hours"))
    .avg("noise").withColumn("delta", functions.lag("avg(noise)", 1).over(w));
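To make the goal concrete, here is a minimal plain-Python sketch of the per-key logic a stateful approach would need: keep the previous window's average for each key and subtract it from the current one. It does not use any Spark API. The names `window_start` and `averages_with_delta` are hypothetical, timestamps are assumed to be epoch seconds, and the delta is taken against the previous non-empty window for the key.

```python
from collections import defaultdict


def window_start(ts, width):
    """Start of the tumbling window of `width` seconds that contains `ts`."""
    return ts - (ts % width)


def averages_with_delta(events, width=3600):
    """events: iterable of (key, ts_seconds, noise) tuples.

    Returns rows (key, window_start, avg_noise, delta), ordered by key and
    then by window. delta is None for the first window seen for a key.
    """
    # Aggregate [sum, count] per (key, window), like groupBy(key, window).avg().
    sums = defaultdict(lambda: [0.0, 0])
    for key, ts, noise in events:
        acc = sums[(key, window_start(ts, width))]
        acc[0] += noise
        acc[1] += 1

    previous_avg = {}  # per-key state: average of the previous window
    rows = []
    for key, start in sorted(sums):
        total, count = sums[(key, start)]
        avg = total / count
        prev = previous_avg.get(key)
        delta = None if prev is None else avg - prev
        rows.append((key, start, avg, delta))
        previous_avg[key] = avg
    return rows


if __name__ == "__main__":
    sample = [("a", 0, 10.0), ("a", 1800, 20.0), ("a", 3600, 30.0)]
    for row in averages_with_delta(sample):
        print(row)
```

In a streaming job the `previous_avg` dictionary would have to live in Spark-managed state rather than in a local variable; that part is not shown here.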

Related

Single Sliding Window in Spark Structured Streaming

I am trying to implement a system that measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages with the following format: {"temp": <temp>, createdAt: <timestamp>}. I want to calculate the average temperature for the last hour, so if the time is currently 15:10, I want to consider only those records whose createdAt field is later than 14:10. Others should be dropped.
Is there any way I could do this cleanly in Spark Structured Streaming and keep only one window? I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective, you use the same syntax as for a sliding window, but you leave the slideDuration parameter at its default of None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
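The difference between the two window types can be illustrated with a small plain-Python model of window assignment. This is a simplified sketch, not Spark's implementation: the function name `windows_for` is hypothetical, times are plain integers (minutes here), and windows are assumed to be aligned to zero with no start offset.

```python
def windows_for(ts, window, slide=None):
    """Return the [start, end) windows that contain `ts`.

    slide=None models a tumbling window (slide equal to the window length),
    so every timestamp falls into exactly one window.
    """
    if slide is None:
        slide = window
    start = ts - (ts % slide)  # latest window start that is <= ts
    starts = []
    while start > ts - window:  # window [start, start + window) still covers ts
        starts.append(start)
        start -= slide
    return [(s, s + window) for s in reversed(starts)]


if __name__ == "__main__":
    # 15:10 expressed as minutes since midnight
    print(windows_for(910, 60))      # tumbling: one window
    print(windows_for(910, 60, 30))  # sliding: overlapping windows
```

With a 60-minute window and no slide, the record at 15:10 lands only in the 15:00 to 16:00 window; with a 30-minute slide it lands in two overlapping windows.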

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every microbatch I receive about 50 rows. On this sliding window I manually calculate my desired metrics, like min, max, etc.
Spark also provides a sliding window function. But I have two problems with it:
The interval at which the sliding window is updated can only be configured based on a time period; there seems to be no way to force an update with each single microbatch coming in. Is there a possibility that I do not see?
The bigger problem: it seems I can only do aggregations using grouping like:
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that allows me to access not only the current batch but also/or the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you can store and keep updating previous batches in an internal state (only the information you need) and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
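As a rough illustration of the "keep only what you need in state" idea, here is a plain-Python sketch of a count-based window that is updated once per microbatch. It does not call any Spark API; the class name `LastNWindow` is hypothetical, and in a real job the deque would have to be held in Spark-managed state instead of a local object.

```python
from collections import deque


class LastNWindow:
    """Keeps the last `size` values across batches and reports simple metrics."""

    def __init__(self, size):
        # deque with maxlen silently drops the oldest entries when full
        self.values = deque(maxlen=size)

    def update(self, batch):
        """Add one microbatch of values and return metrics over the window."""
        self.values.extend(batch)
        if not self.values:
            return None
        return {
            "count": len(self.values),
            "min": min(self.values),
            "max": max(self.values),
        }


if __name__ == "__main__":
    window = LastNWindow(size=5)
    print(window.update([3, 1, 4]))
    print(window.update([1, 5, 9, 2]))
```

Recomputing min and max over the whole window on every batch is linear in the window size; for 200000 entries and about 50 new rows per batch that may or may not be acceptable, so treat this as a starting point only.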

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features in a streaming job out of NetFlow connections and then apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specified preceding time window. They count how many connections in the last second were to the same host or service as the current one. I decided to use the SQL window functions for this.
So I build window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function which is called to extract these features on every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
  dataset
    .withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
    .withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And use this function as follows:
stream.map(...).map(...).foreachRDD { rdd =>
  val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
  ...
}
The problem is that this performs very badly. A batch needs between 1 and 3 seconds for an average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
.partitionBy("plainrecord.ip_dst")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
and once for:
Window
.partitionBy("service")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
This will have a huge impact on the runtime and if these are the hard requirements you won't be able to do much better.
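For reference, the feature itself (how many recent records share the same host or service) can be sketched in plain Python as a single pass that keeps one queue of timestamps per key. This is only an illustration of the computation, not a Spark replacement and not a performance claim. The function name `trailing_counts` is hypothetical, records are assumed to arrive in timestamp order, and the window is assumed to be [t - 1, t] inclusive. Unlike a SQL range window, records with the same timestamp that arrive later are not counted for earlier ones.

```python
from collections import defaultdict, deque


def trailing_counts(records, key_fields=("ip_dst", "service"), span=1):
    """For each record, count earlier records (plus itself) within `span`
    time units that share the same value for each field in `key_fields`."""
    # One queue of recent timestamps per (field, value) pair.
    recent = {field: defaultdict(deque) for field in key_fields}
    out = []
    for rec in records:
        t = rec["timestamp"]
        counts = {}
        for field in key_fields:
            queue = recent[field][rec[field]]
            queue.append(t)
            # Drop timestamps that fell out of the [t - span, t] range.
            while queue[0] < t - span:
                queue.popleft()
            counts[field] = len(queue)
        out.append(counts)
    return out


if __name__ == "__main__":
    flows = [
        {"timestamp": 0, "ip_dst": "h1", "service": "http"},
        {"timestamp": 1, "ip_dst": "h1", "service": "dns"},
    ]
    print(trailing_counts(flows))
```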

Spark DataSet filter performance

I have been experimenting with different ways to filter a typed data set. It turns out the performance can be quite different.
The data set was created from 1.6 GB of data with 33 columns and 4226047 rows. The DataSet is created by loading CSV data and mapping it to a case class.
val df = spark.read.csv(csvFile).as[FireIncident]
A filter on UnitId = 'B02' should return 47980 rows. I tested three ways as below:
1) Use typed column (~ 500 ms on local host)
df.where($"UnitID" === "B02").count()
2) Use temp table and sql query (~ same as option 1)
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()
3) Use strongly typed class field (14,987 ms, i.e. 30 times as slow)
df.filter(_.UnitID.orNull == "B02").count()
I tested it again with the Python API. For the same data set, the timing is 17,046 ms, comparable to the performance of the Scala API option 3.
df.filter(df['UnitID'] == 'B02').count()
Could someone shed some light on how 3) and the python API are executed differently from the first two options?
It's because of step 3 here.
In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.
In the third, since you're using a lambda function, Spark can't tell that you just want the one field, so it pulls all 33 fields out of memory for each row so that you can check the one field.
I'm not sure why the fourth is so slow. It seems like it would work the same way as the first.
When running Python, what happens is that your code is first loaded onto the JVM, interpreted, and then finally compiled into bytecode. When using the Scala API, Scala runs natively on the JVM, so you're cutting out the entire step of loading Python code into the JVM.

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to different values as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the KafkaDStream? Or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
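To show why the state sizes differ, here is a plain-Python sketch of both aggregates. The class names `RunningMean` and `RunningMedian` are hypothetical and nothing here is Spark API. The mean needs only a sum and a count, while the median keeps every value split across two heaps. This sketch handles insertions only; evicting values that leave a window would need extra bookkeeping that is not shown.

```python
import heapq


class RunningMean:
    """Average from two numbers of state: running sum and count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        self.total += x
        self.count += 1

    def value(self):
        # Assumes at least one value has been added.
        return self.total / self.count


class RunningMedian:
    """Median via a max-heap for the lower half and a min-heap for the upper."""

    def __init__(self):
        self.low = []   # max-heap, stored as negated values
        self.high = []  # min-heap

    def add(self, x):
        # Push to the lower half, then move its largest item to the upper half.
        heapq.heappush(self.low, -x)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so that the lower half is never the smaller one.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def value(self):
        # Assumes at least one value has been added.
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2
```

The heaps grow with the number of values seen, which is the reason the median is the more worrying of the two for an hour-long window.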
One thing you haven't made clear, though, is if you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour updated every 10 minutes.
If you don't need that freshness, not requiring it should of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 mins and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5).
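The idea of deriving the hourly result from 5-minute batches can be sketched in plain Python by keeping one (sum, count) pair per batch and combining the most recent twelve. This is an illustration only, with a hypothetical class name `MeanOverRecentBatches`, and it does not use the DStream API.

```python
from collections import deque


class MeanOverRecentBatches:
    """Combines per-batch (sum, count) pairs into a mean over recent batches."""

    def __init__(self, batches_per_window=12):
        # 12 batches of 5 minutes cover one hour; older batches drop out.
        self.partials = deque(maxlen=batches_per_window)

    def add_batch(self, values):
        """Record one batch and return the mean over the retained batches."""
        self.partials.append((sum(values), len(values)))
        total = sum(s for s, _ in self.partials)
        count = sum(c for _, c in self.partials)
        return total / count if count else None


if __name__ == "__main__":
    hourly = MeanOverRecentBatches()
    print(hourly.add_batch([1.0, 3.0]))
    print(hourly.add_batch([5.0]))
```

Only twelve small pairs are retained rather than an hour of raw records, which works for the average but not for the median.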
