Output top n records for the last hour every minute - apache-spark

What is the best way to keep an updated table containing the top n records for, say, the last 60 minutes of a stream using Spark Structured Streaming? I only want to update my table once a minute. I thought I would achieve this with the window function, setting the window duration to 60 minutes and the slide duration to 1 minute. The problem is that I get data for all the sliding windows in the output, but I only need the last completely calculated (closed) window. Is there a way to achieve this? Or should I perhaps tackle the problem (keeping near-real-time rankings for the past hour) in a different way?
My incomplete solution:
val entityCounts = entities
  .withWatermark("timestamp", "1 minute")
  .groupBy(
    window(col("timestamp"), "60 minutes", "1 minute").as("time_window"),
    col("entity_type"),
    col("entity"))
  .count()

val query = entityCounts.writeStream
  .foreachBatch { (batchDF, _) =>
    batchDF
      .withColumn(
        "row_number",
        row_number() over Window
          .partitionBy("entity_type")
          .orderBy(col("count").desc))
      .filter(col("row_number") <= 5)
      .select(
        col("entity_type"),
        col("entity"),
        col("count").as("occurrences"))
      .write
      .cassandraFormat("table", "keyspace")
      .mode("append")
      .save()
  }
  .outputMode("append")
  .start()
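One possible way to keep only the last fully calculated window, sketched below as an untested idea rather than a confirmed answer: inside foreachBatch, find the latest window end present in the micro-batch and keep only those rows before ranking. Column, table, and keyspace names are taken from the snippet above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: keep just the most recent window contained in each micro-batch,
// then rank entities inside that single window.
val topNQuery = entityCounts.writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    if (!batchDF.isEmpty) {
      // Latest window end present in this micro-batch.
      val latestEnd = batchDF.agg(max(col("time_window.end"))).head.getTimestamp(0)
      batchDF
        .filter(col("time_window.end") === lit(latestEnd))
        .withColumn("row_number",
          row_number().over(Window.partitionBy("entity_type").orderBy(col("count").desc)))
        .filter(col("row_number") <= 5)
        .select(col("entity_type"), col("entity"), col("count").as("occurrences"))
        .write
        .cassandraFormat("table", "keyspace") // table and keyspace names as in the question
        .mode("append")
        .save()
    }
  }
  .outputMode("append")
  .start()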

Related

Pyspark data aggregation with Window and sliding interval on index

I am currently running into an issue where I want to use a window with a sliding interval on my CSV, and for each window perform an aggregation to get the most common category. However, I do not have a timestamp, and I want to slide the window over the index column. Can anyone point me in the right direction on how to use windows + sliding intervals on the index?
In short, I want to create windows + intervals over the index column.
Currently I have something like this:
from pyspark.sql.types import StructType

schema = StructType() \
    .add("index", "string") \
    .add("Category", "integer")

dataframe = spark \
    .readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("./tmp/input")

# TODO: perform window + sliding interval on the dataframe, then aggregate per window
aggr = dataframe.groupBy("Category").count().orderBy("count", ascending=False).limit(3)

query = aggr \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
For aggregating data on a per-window basis you can use the window function from the pyspark.sql.functions package.
For a time interval you need to add a timestamp column to your dataframe.
newDf = csvFile.withColumn("TimeStamp", current_timestamp())
This adds the current time to the dataframe as the data is read from the csv.
trimmedDf2 = newDf.groupBy(window(col("TimeStamp"), "5 seconds")).agg(sum("value")).select("window.start", "window.end", "sum(value)")
display(trimmedDf2)
The code above sums the value column and groups the results into tumbling 5-second windows; a sliding variant is sketched after the link below.
You can also use Weekly Aggregation using Windows Function in Spark as a reference.
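Since the question asks specifically about a sliding interval, note that window() also takes a slide duration as a third argument. A rough Scala sketch of the sliding variant (the 10-second window and 5-second slide are illustrative values, not from the answer above):

import org.apache.spark.sql.functions._

// Sliding variant: a 10-second window that advances every 5 seconds, so each
// row is counted in two overlapping windows. Durations are illustrative only;
// newDf is the dataframe with the added TimeStamp column from above.
val slidingCounts = newDf
  .groupBy(window(col("TimeStamp"), "10 seconds", "5 seconds"), col("Category"))
  .count()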

Create candle data from tick data using Apache Spark Stream in java

We are getting tick data on Kafka, which we are streaming into Apache Spark. We need to create candle data from that stream.
The first option I thought of was to create a dataframe and run SQL queries on it, like:
SELECT t1.price AS open,
       m.high,
       m.low,
       t2.price AS close,
       m.open_time
FROM (SELECT MIN(timeInMilliseconds) AS min_time,
             MAX(timeInMilliseconds) AS max_time,
             MIN(price) AS low,
             MAX(price) AS high,
             FLOOR(timeInMilliseconds / (1000 * 60)) AS open_time
      FROM ticks
      GROUP BY open_time) m
JOIN ticks t1 ON t1.timeInMilliseconds = m.min_time
JOIN ticks t2 ON t2.timeInMilliseconds = m.max_time
But I am not sure whether that would be able to get data for old ticks.
Is it possible to use some methods of the Spark library to create something similar to this?
Please take a look at Window Operations on Event Time: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
That is exactly what you need. Here is a sketch of the code:
val windowedCounts = tickStream
  .groupBy(window($"timeInMilliseconds", "1 minute"))
  .agg(
    first("price").alias("open"),
    min("price").alias("min"),
    max("price").alias("max"),
    last("price").alias("close"))

Spark structured streaming watermark with OutputMode.Complete

I wrote a simple query which should ignore data where created < last event time - 5 seconds, but the query doesn't work: all the data is printed out.
I also tried using the window function, window($"created", "10 seconds", "10 seconds"), but that didn't help.
val inputStream = new MemoryStream[(Timestamp, String)](1, spark.sqlContext)
val df = inputStream.toDS().toDF("created", "animal")

val query = df
  .withWatermark("created", "5 seconds")
  .groupBy($"animal")
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Complete())
  .start()
You need to include more grouping information, like this:
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
Moreover, from the manual:
Output mode must be Append or Update. Complete mode requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state.
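Putting both points together for the query in the question, a rough sketch (the 10-second window size is an arbitrary choice) could look like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._ // for the $"" column syntax

// Sketch: group on a time window over "created" and switch from Complete to
// Update output mode, so the 5-second watermark can drop late data and state.
val query = df
  .withWatermark("created", "5 seconds")
  .groupBy(window($"created", "10 seconds"), $"animal")
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .start()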

Spark Structured Streaming Output Mode problem

I am currently looking for a workable solution to the following problem using the Spark Structured Streaming API. I have searched through a lot of blog posts and Stack Overflow, but unfortunately I still can't find a solution, hence this question to call for expert help.
Use Case
Let's say I have a Kafka topic (user_creation_log) that contains all the real-time user_creation_events. For users who haven't made any transaction within 10 secs, 20 secs, or 30 secs, we will assign them a certain voucher. (The time windows are shortened for testing purposes.)
Flagging and sending the timed-out rows (more than 10 sec, more than 20 sec, more than 30 sec) to Kafka is the most problematic part! There are too many rules; perhaps I should break the 10 sec, 20 sec, and 30 sec cases into different scripts.
My Tracking Table
I am able to track a user's no_action_sec via the no_action_10sec, no_action_20sec, and no_action_30sec flags (shown in the code below). no_action_sec is derived from (current_time - creation_time), which is recalculated in every micro-batch.
Complete Output Mode
outputMode("complete") writes all the rows of a Result Table (and corresponds to a traditional batch structured query).
Update Output Mode
outputMode("update") writes only the rows that were updated (every time there are updates).
In this case the Update output mode seems very suitable because it writes only the updated rows to the output. However, whenever the flag10, flag20, or flag30 columns are updated, the row is not written to the desired location.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("Notification") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

split_col = split(lines.value, ' ')

df = lines.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('create_date_time', split_col.getItem(1)) \
    .groupBy("user_id", "create_date_time").count()

df = df.withColumn("create_date_time", col("create_date_time").cast(LongType())) \
    .withColumn("no_action_sec", current_timestamp().cast(LongType()) - col("create_date_time").cast(LongType())) \
    .withColumn("no_action_10sec", when(col("no_action_sec") >= 10, True)) \
    .withColumn("no_action_20sec", when(col("no_action_sec") >= 20, True)) \
    .withColumn("no_action_30sec", when(col("no_action_sec") >= 30, True))

query = df \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
Current Output
UserId = 0 disappears in Batch 2. It is supposed to show up because no_action_30sec changes from null to True.
Expected output
The user id should be written to the output three times, once each time it triggers the flag logic at 10 sec, 20 sec, and 30 sec.
Can anyone shed light on this problem? What can I do to get rows written to the output when no_action_10sec, no_action_20sec, or no_action_30sec is flagged to True?
Debug
OutputMode = Complete outputs too much redundant data.
Mock Data Generator
for i in {0..10000}; do echo "${i} $(date +%s)"; sleep 1; done | nc -lk 9999
Assume that a row that has shown up in console mode (.format("console")) will be sent to Kafka for the chained action.

How to do multiple window transformations in Apache Spark Structured Streaming

I am working on a Spark Structured Streaming project in which I need to do aggregations over multiple windows (per minute and per hour) on the same data.
I am facing the error: Multiple aggregations are not supported in streaming DataFrames.
For a single window (per minute) I am able to do the transformations, but I have had no luck figuring out how to do multiple window transformations on the same data.
df.withWatermark("timestamp", "60 seconds")
  .groupBy(col("assetId"), col("organization"), col("tag"),
    functions.window(col("timestamp"), "60 seconds", "60 seconds"),
    functions.window(col("timestamp"), "3600 seconds", "3600 seconds"))
  .mean("value");
