Spark structured streaming watermark with OutputMode.Complete - apache-spark

I wrote a simple query that should ignore data where created < last event time - 5 seconds, but it doesn't work: all data is printed out.
I also tried the window function window($"created", "10 seconds", "10 seconds"), but that didn't help.
val inputStream = new MemoryStream[(Timestamp, String)](1, spark.sqlContext)
val df = inputStream.toDS().toDF("created", "animal")
val query = df
  .withWatermark("created", "5 seconds")
  .groupBy($"animal")
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Complete())
  .start()

You need to group by more information, such as a time window:
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
Moreover, from the manual:
Output mode must be Append or Update. Complete mode requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state.
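Putting the two points together, here is a minimal sketch of the question's query rewritten along those lines (it keeps the MemoryStream source and column names from the question; the 10-second window and the switch to Update output mode are assumptions, not the only possible choice):
val inputStream = new MemoryStream[(Timestamp, String)](1, spark.sqlContext)
val df = inputStream.toDS().toDF("created", "animal")
val query = df
  .withWatermark("created", "5 seconds")
  // group on an event-time window as well as the key, so the watermark has state it can drop
  .groupBy(window($"created", "10 seconds", "10 seconds"), $"animal")
  .count()
  .writeStream
  .format("console")
  // Update (or Append) instead of Complete, otherwise watermarking cannot drop state
  .outputMode(OutputMode.Update())
  .start()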

Related

Output top n records for the last hour every minute

What is the best way to keep an updated table containing the top n records for the last, say, 60 minutes of a stream using Spark Structured Streaming? I only want to update my table once a minute. I thought I would achieve this with the window function, setting the window duration to 60 minutes and the slide duration to 1 minute. The problem is that I get data for all the sliding windows in the output, but I only need the last completely calculated (closed) window. Is there a way to achieve this? Or should I tackle the problem (keeping nearly real-time rankings for the past hour) in a different way?
My incomplete solution:
val entityCounts = entities
  .withWatermark("timestamp", "1 minute")
  .groupBy(
    window(col("timestamp"), "60 minutes", "1 minute")
      .as("time_window"),
    col("entity_type"),
    col("entity"))
  .count()

val query = entityCounts.writeStream
  .foreachBatch({ (batchDF, _) =>
    batchDF
      .withColumn(
        "row_number", row_number() over Window
          .partitionBy("entity_type")
          .orderBy(col("count").desc))
      .filter(col("row_number") <= 5)
      .select(
        col("entity_type"),
        col("entity"),
        col("count").as("occurrences"))
      .write
      .cassandraFormat("table", "keyspace")
      .mode("append")
      .save
  })
  .outputMode("append")
  .start()
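One possible direction (not from the original post): in append mode, foreachBatch only receives windows that the watermark has already finalized, so the batch can be restricted to its most recent window before ranking. A hedged sketch, where the helper name and the max-over-window-end logic are illustrative assumptions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max}

// Keep only the rows belonging to the latest (most recently closed) window in this batch.
def latestClosedWindowOnly(batchDF: DataFrame): DataFrame = {
  val latestEnd = batchDF.agg(max(col("time_window.end"))).head.get(0)
  if (latestEnd == null) batchDF // empty batch, nothing to filter
  else batchDF.filter(col("time_window.end") === latestEnd)
}
Calling this at the top of the foreachBatch body, before the row_number ranking, would leave at most one window per trigger.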

Aggregation with distinct count in Spark structured streaming throwing error

I am trying to get a distinct count of id for each group of Parentgroup, childgroup and MountingType in Spark Structured Streaming.
Code: the code below throws an error
.withWatermark("timestamp", "1 minutes")
val aggDF = JSONDF.groupBy("Parentgroup","childgroup","MountingType")
.agg(countDistinct("id"))
Error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
Could someone please help me with how to do the aggregation and write the result to CSV in Structured Streaming?
Thanks a lot
Data:
{"id":"7CE3A7CA","Faulttime":1544362500,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"7CE3A7CA","Faulttime":1544362509,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"010004FF,"Faulttime":1551339188,"name":"Philips","Parentgroup":"Light","childgroup":"Other","MountingType":"Solder"}
{"id":"010004FF","Faulttime":1551339188,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"Solder"}
{"id":"010004FF,"Faulttime":1551339191,"name":"Sansui","Parentgroup":"AC","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"CE361405","Faulttime":1552159061,"name":"Hyndai","Parentgroup":"SBAR","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"CE361405","Faulttime":1552159061,"name":"sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"7BE446C0","Faulttime":1553022095,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"Solder"}
{"id":"7BE446C0","Faulttime":1553022095,"name":"Philips","Parentgroup":"LIGHT","childgroup":"Other","MountingType":"Solder"}
GroupBy operations on a streaming DataFrame need to specify a window or time period, and the watermark has to be applied to the DataFrame you aggregate.
Try this pseudo code:
val JSONDF = df.withWatermark("timestamp", "1 minutes")
val aggDF = JSONDF
  .groupBy(
    window($"timestamp", "5 minutes", "1 minutes"),
    $"Parentgroup", $"childgroup", $"MountingType")
  .agg(countDistinct("id"))
Reference :
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
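For the second part of the question (writing to CSV), here is a minimal sketch of the sink side, assuming the aggDF from the pseudo code above and hypothetical output and checkpoint paths; the file sink only supports append mode, which is why the windowed aggregation with a watermark is needed in the first place:
val csvQuery = aggDF
  // the CSV sink cannot write the window struct column, so flatten it first
  .withColumn("window_start", $"window.start")
  .withColumn("window_end", $"window.end")
  .drop("window")
  .writeStream
  .outputMode("append") // file sinks only support append mode
  .format("csv")
  .option("path", "/tmp/agg_out") // hypothetical output path
  .option("checkpointLocation", "/tmp/agg_checkpoint") // hypothetical checkpoint path
  .start()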

Why does streaming aggregation delay until two batches of data always?

I use Spark 2.3.0.
My issue is that whenever I add a third batch of data to my input directory, the first batch of data gets processed and printed to the console. Why?
import org.apache.spark.sql.SparkSession
import java.util.Calendar // needed for the time-zone id used in to_utc_timestamp below

val spark = SparkSession
  .builder()
  .appName("micro1")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "/user/sas/sparkCheckpoint")
  .config("spark.sql.parquet.cacheMetadata", "false")
  .getOrCreate()

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Left side of a join
val mySchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("year", IntegerType)
  .add("rating", DoubleType)
  .add("duration", IntegerType)

val xmlData = spark
  .readStream
  .option("sep", ",")
  .schema(mySchema)
  .csv("tostack")

// Right side of a join
val mappingSchema = new StructType()
  .add("id", StringType)
  .add("megavol", StringType)

val staticData = spark
  .read
  .option("sep", ",")
  .schema(mappingSchema)
  .csv("input_tost_static.csv")

xmlData.createOrReplaceTempView("xmlupdates")
staticData.createOrReplaceTempView("mappingdata")

spark
  .sql("select * from xmlupdates a join mappingdata b on a.id=b.id")
  .withColumn(
    "event_time",
    to_utc_timestamp(current_timestamp, Calendar.getInstance().getTimeZone().getID()))
  .withWatermark("event_time", "10 seconds")
  .groupBy(window($"event_time", "10 seconds", "10 seconds"), $"year")
  .agg(
    sum($"rating") as "rating",
    sum($"duration") as "duration",
    sum($"megavol") as "sum_megavol")
  .drop("window")
  .writeStream
  .outputMode("append")
  .format("console")
  .start
My output shows the data below. I started the stream first and later added data to the particular folder; when I add my third file, the aggregated results of the first file are printed. Why?
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
+----+------+--------+-----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
+----+------+--------+-----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
|1963| 2.8| 5126| 46.0|
|1921| 6.0| 15212| 3600.0|
+----+------+--------+-----------+
The input data is as follows:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1993,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1921,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1963,3.8,5333
7,Muriel's Wedding,1963,3.5,6323
8,Mother's Boys,1963,3.4,5733
input_tost_static.csv dataset is as follows:
3,3000
4,600
5,46
Can someone help me understand why Spark Structured Streaming shows this behaviour? Do I need to add any settings here?
UPDATE: I get results in batch 1 itself if I print the val before the JOIN operation... the issue appears after joining... it is delayed by more than 3 batches....
"I started the stream first"
Batch: 0 was executed right after you started the query, and given that no events had been streamed yet, there was no output.
At this point, the event-time watermark is not set at all.
"and later added data to the particular folder"
That could be Batch: 1.
The event-time watermark was then set to current_timestamp. In order to get any output, we have to wait "10 seconds" (according to withWatermark("event_time", "10 seconds")).
"when I add my third file, the aggregated results of the first file are printed. Why?"
That could be Batch: 2.
I assume that the next time you added new files was after the previous current_timestamp + "10 seconds", and so you got the output.
Please note that a watermark can be just 0, which means that no late data is expected.
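As a sketch of that last point (not part of the original answer; it reuses the temp views from the question and simplifies the aggregation): a zero watermark delay tells Spark that no late data is expected, so a window's result is emitted as soon as a later batch advances the watermark past the window end.
val joined = spark
  .sql("select * from xmlupdates a join mappingdata b on a.id = b.id")
  .withColumn("event_time", current_timestamp())

val zeroDelayQuery = joined
  .withWatermark("event_time", "0 seconds") // no allowance for late data
  .groupBy(window($"event_time", "10 seconds", "10 seconds"), $"year")
  .agg(sum($"rating") as "rating")
  .writeStream
  .outputMode("append")
  .format("console")
  .start()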

Spark Structured Streaming Kafka Microbatch count

I am using Spark Structured Streaming to read records from a Kafka topic, and I intend to count the number of records received in each micro-batch of the Spark readStream.
This is a snippet:
val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .load()
I understand from the docs that kafka_df will be lazily evaluated when a streamingQuery is started (to come next), and as it is evaluated, it holds a micro-batch. So, I figured doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
  .groupBy("topic")
  .count()
Now, to evaluate all of this, we need a streamingQuery, let's say a console sink query to print it on the console. And this is where I see the problem: a streamingQuery on aggregated DataFrames such as kafka_df works only with outputMode complete/update and not with append.
This effectively means that the count reported by the streamingQuery is cumulative.
Like this:
val counter_json = counter.toJSON // to jsonify
val count_query = counter_json
  .writeStream
  .outputMode("update")
  .format("console")
  .start() // kicks off lazy evaluation
  .awaitTermination()
In a controlled setup, where:
actual published records: 1500
actual received micro-batches: 3
actual received records: 1500
The count of each micro-batch is supposed to be 500, so I hoped (wished) that the query would print this to the console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count:1000
topic: test-count
count: 1500
This, I understand, is because of the cumulative nature of outputMode complete/update.
My question: is it possible to accurately get the count of each micro-batch in Spark-Kafka Structured Streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
  .count()

val console_query = windowedCounts
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()
But the results of this console_query are inaccurate and appear way off the mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.
If you only want to process a specific number of records with every trigger in a Structured Streaming application that uses Kafka, use the option maxOffsetsPerTrigger:
val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .option("maxOffsetsPerTrigger", 500)
  .load()
"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information on your query. If no data is flowing into the query, onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("NumInputRows: " + queryProgress.progress.numInputRows)
  }
})
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
In case input is higher than processed you might increase resources for your job or reduce the maximum limit (by reducing the readStream option maxOffsetsPerTrigger). If processed is higher, you may want to increase this limit.
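A minimal sketch of such a check, extending the listener pattern above (illustrative only, not part of the original answer; it assumes the same StreamingQueryListener imports):
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(e: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    val in = e.progress.inputRowsPerSecond
    val processed = e.progress.processedRowsPerSecond
    println(s"inputRowsPerSecond=$in, processedRowsPerSecond=$processed")
    // Input consistently above processed means the query is falling behind:
    // add resources or lower maxOffsetsPerTrigger; otherwise the limit can be raised.
    if (in > processed) println("WARNING: query is falling behind")
  }
})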

How to write parquet files from streaming query?

I'm reading from a CSV file using Spark 2.2 structured streaming.
My query for writing the result to the console is this:
val consoleQuery = exceptions
  .withWatermark("time", "5 years")
  .groupBy(window($"time", "1 hour"), $"id")
  .count()
  .writeStream
  .format("console")
  .option("truncate", value = false)
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Complete())
The result looks fine:
+---------------------------------------------+-------------+-----+
|window |id |count|
+---------------------------------------------+-------------+-----+
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000001|1 |
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000002|8 |
|[2017-02-17 08:00:00.0,2017-02-17 09:00:00.0]|EXC2200002 |1 |
+---------------------------------------------+-------------+-----+
But when writing it to a Parquet file
val parquetQuery = exceptions
  .withWatermark("time", "5 years")
  .groupBy(window($"time", "1 hour"), $"id")
  .count()
  .coalesce(1)
  .writeStream
  .format("parquet")
  .option("path", "src/main/resources/parquet")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .option("checkpointLocation", "src/main/resources/checkpoint")
  .outputMode(OutputMode.Append())
and reading it in with another job,
val data = spark.read.parquet("src/main/resources/parquet/")
the result is this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
TL;DR parquetQuery has not been started, and so there is no output from the streaming query.
Check out the type of parquetQuery: it is org.apache.spark.sql.streaming.DataStreamWriter, which is simply a description of a query that at some point is supposed to be started. Since it was not, the query has never been able to do anything that would write the stream.
Add start at the very end of the parquetQuery declaration (right after, or as part of, the call chain).
val parquetQuery = exceptions
  .withWatermark("time", "5 years")
  .groupBy(window($"time", "1 hour"), $"id")
  .count()
  .coalesce(1)
  .writeStream
  .format("parquet")
  .option("path", "src/main/resources/parquet")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .option("checkpointLocation", "src/main/resources/checkpoint")
  .outputMode(OutputMode.Append())
  .start // <-- that's what you miss
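With start in place, parquetQuery is a StreamingQuery, so a standalone application would typically also block on it (same pattern as the Kafka examples above):
parquetQuery.awaitTermination()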
