Why does streaming aggregation always delay output by two batches? - apache-spark

I use Spark 2.3.0.
My issue: only when I add a third batch of data to my input directory does the first batch of data get processed and printed to the console. Why?
val spark = SparkSession
.builder()
.appName("micro1")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.streaming.checkpointLocation", "/user/sas/sparkCheckpoint")
.config("spark.sql.parquet.cacheMetadata","false")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
import java.util.Calendar // needed for Calendar.getInstance() used below
// Left side of a join
import org.apache.spark.sql.types._
val mySchema = new StructType()
.add("id", IntegerType)
.add("name", StringType)
.add("year", IntegerType)
.add("rating", DoubleType)
.add("duration", IntegerType)
val xmlData = spark
.readStream
.option("sep", ",")
.schema(mySchema)
.csv("tostack")
// Right side of a join
val mappingSchema = new StructType()
.add("id", StringType)
.add("megavol", StringType)
val staticData = spark
.read
.option("sep", ",")
.schema(mappingSchema)
.csv("input_tost_static.csv")
xmlData.createOrReplaceTempView("xmlupdates")
staticData.createOrReplaceTempView("mappingdata")
spark
.sql("select * from xmlupdates a join mappingdata b on a.id=b.id")
.withColumn(
"event_time",
to_utc_timestamp(current_timestamp, Calendar.getInstance().getTimeZone().getID()))
.withWatermark("event_time", "10 seconds")
.groupBy(window($"event_time", "10 seconds", "10 seconds"), $"year")
.agg(
sum($"rating") as "rating",
sum($"duration") as "duration",
sum($"megavol") as "sum_megavol")
.drop("window")
.writeStream
.outputMode("append")
.format("console")
.start
My output shows the data below. I started the streaming query first and later added data to the input folder. Only when I add my third file do the first file's aggregated results get printed. Why?
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
+----+------+--------+-----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
+----+------+--------+-----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+------+--------+-----------+
|year|rating|duration|sum_megavol|
+----+------+--------+-----------+
|1963| 2.8| 5126| 46.0|
|1921| 6.0| 15212| 3600.0|
+----+------+--------+-----------+
The input data is as follows:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1993,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1921,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1963,3.8,5333
7,Muriel's Wedding,1963,3.5,6323
8,Mother's Boys,1963,3.4,5733
The input_tost_static.csv dataset is as follows:
3,3000
4,600
5,46
Can someone help me understand why Spark Structured Streaming shows this behaviour? Do I need to add any settings here?
UPDATE: I get results in batch 1 itself if I print the val before the JOIN operation. The issue only appears after joining; the output is delayed by more than 3 batches.

"I started the streaming query first"
Batch: 0 is executed right after you started the query; given that no events were streamed, there is no output.
At this point, the event-time watermark is not set at all.
"and later added data to the input folder"
That could be Batch: 1.
The event-time watermark was then set to current_timestamp. In order to get any output, we have to wait "10 seconds" (according to withWatermark("event_time", "10 seconds")).
"Only when I add my third file do the first file's aggregated results get printed. Why?"
That could be Batch: 2.
I assume that the next time you added new files it was after the previous current_timestamp + "10 seconds", and so you got the output.
Please note that a watermark can be just 0, which means that no late data is expected.
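One way to see this timing (a minimal sketch, assuming you keep the StreamingQuery handle returned by start above) is to inspect the query progress after each micro-batch; the eventTime map reports the watermark that will be used for the next trigger:
// Sketch only: `query` is the StreamingQuery returned by .start (here taken from the active queries).
val query = spark.streams.active.head
// lastProgress is null until the first batch completes; afterwards its eventTime map
// shows the watermark that decides which windows may be emitted in the next trigger.
Option(query.lastProgress).foreach { p =>
  println(s"batch=${p.batchId} watermark=${p.eventTime.get("watermark")}")
}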

Related

Spark structured streaming watermark with OutputMode.Complete

I wrote a simple query which should ignore data where created < last event time - 5 seconds, but this query doesn't work: all data is printed out.
I also tried to use the window function window($"created", "10 seconds", "10 seconds"), but that didn't help.
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

val inputStream = new MemoryStream[(Timestamp, String)](1, spark.sqlContext)
val df = inputStream.toDS().toDF("created", "animal")
val query = df
.withWatermark("created", "5 seconds")
.groupBy($"animal")
.count()
.writeStream
.format("console")
.outputMode(OutputMode.Complete())
.start()
You need more grouping information, i.e. an event-time window in the groupBy, like this:
val windowedCounts = words
.withWatermark("timestamp", "10 minutes")
.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word")
.count()
Moreover, from the manual:
Output mode must be Append or Update. Complete mode requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state.
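Putting both points together, a minimal sketch (reusing the df from the question, names assumed) where the watermark can actually finalize windows and drop state could look like this:
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
// Sketch only: group by an event-time window as well as the key, and use Append mode
// so results are emitted once the watermark passes the end of each window.
val query = df
  .withWatermark("created", "5 seconds")
  .groupBy(window($"created", "10 seconds", "10 seconds"), $"animal")
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()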

Empty CSV files are getting generated after processing with Spark Structured Streaming

When I try to write some Spark Structured Streaming data to CSV, I see empty part files getting generated at the HDFS location. I tried writing the same data to the console, and there the data does get generated.
val spark =SparkSession.builder().appName("micro").
enableHiveSupport().config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("spark.sql.streaming.checkpointLocation", "/user/sasidhr1/sparkCheckpoint").
config("spark.debug.maxToStringFields",100).
getOrCreate()
val mySchema = StructType(Array(
StructField("id", IntegerType),
StructField("name", StringType),
StructField("year", IntegerType),
StructField("rating", DoubleType),
StructField("duration", IntegerType)
))
val xmlData = spark.readStream.option("sep", ",").schema(mySchema).csv("file:///home/sa1/kafdata/")
import java.util.Calendar
val df_agg_without_time= xmlData.withColumn("event_time", to_utc_timestamp(current_timestamp, Calendar.getInstance().getTimeZone().getID()))
val df_agg_with_time = df_agg_without_time.withWatermark("event_time", "10 seconds").groupBy(window($"event_time", "10 seconds", "5 seconds"),$"year").agg(sum($"rating").as("rating"),sum($"duration").as("duration"))
val pr = df_agg_with_time.drop("window")
pr.writeStream.outputMode("append").format("csv").
option("path", "hdfs://ccc/apps/hive/warehouse/rta.db/sample_movcsv/").start()
If I don't drop the (window) column, another issue occurs; I have already posted that issue here: How to write windowed aggregation in CSV format?
Can someone help with this? How do I write to HDFS as CSV files after aggregation? Please help.

Spark Streaming aggregation and filter in the same window

I have a fairly easy task: events are coming in, and I want to filter those whose value is higher than the per-key group average in the same window.
I think this is the relevant part of the code:
val avgfuel = events
.groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
.agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")
val joined = events.join(avgfuel, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"avg_fuel")
val streamingQuery1 = joined.writeStream
.outputMode("append")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("checkpointLocation", checkpointLocation)
.format("json").option("path", containerOutputLocation).start()
events is a DStream.
The problem is that I'm getting empty files in the output location.
I'm using Databricks 3.5 - Spark 2.2.1 with Scala 2.11
What have I done wrong?
Thanks!
EDIT: here is a more complete version of the code:
val inputStream = spark.readStream
.format("eventhubs") // working with azure event hubs
.options(eventhubParameters)
.load()
val schema = (new StructType)
.add("id", StringType)
.add("latitude", StringType)
.add("longitude", StringType)
.add("tirePressure", FloatType)
.add("fuelEfficiencyPercentage", FloatType)
.add("weatherCondition", StringType)
val df1 = inputStream.select($"body".cast("string").as("value")
, from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
).withWatermark("enqueuedTime", "1 minutes")
val df2 = df1.select(from_json(($"value"), schema).as("body")
, $"enqueuedTime")
val df3 = df2.select(
$"enqueuedTime"
, $"body.id".cast("integer")
, $"body.latitude".cast("float")
, $"body.longitude".cast("float")
, $"body.tirePressure"
, $"body.fuelEfficiencyPercentage"
, $"body.weatherCondition"
)
val avgfuel = df3
.groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )
.agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
.select($"weatherCondition", $"fuel_avg")
val broadcasted = sc.broadcast(avgfuel)
val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"fuel_avg")
val streamingQuery1 = joined.writeStream.
outputMode("append").
trigger(Trigger.ProcessingTime("10 seconds")).
option("checkpointLocation", checkpointLocation).
format("json").option("path", outputLocation).start()
This executes without errors, and after a while results start to be written. It might be due to the broadcast of the aggregation result, but I'm not sure.
Small investigation ;)
events can't be a DStream, because you are able to use Dataset operations on it; it must be a Dataset.
Stream-stream joins are not allowed in Spark 2.2. I tried to run your code with events as a rate source and I get:
org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)
The result is quite unexpected: you probably used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work, of course only after upgrading to 2.3.
Apart from the points above, the code is correct and should run on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.
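For reference, a minimal sketch of a rate-source reproduction of the Spark 2.2 restriction mentioned above (column names assumed):
// Sketch only: two streaming Datasets from the built-in rate source; in Spark 2.2 an inner
// join between them fails at analysis time with
// "Inner join between two streaming DataFrames/Datasets is not supported".
val left = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
val right = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
  .withColumnRenamed("value", "eventValue")
left.join(right, left("value") === right("eventValue"))
  .writeStream
  .format("console")
  .start()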

How to write parquet files from streaming query?

I'm reading from a CSV file using Spark 2.2 structured streaming.
My query for writing the result to the console is this:
val consoleQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.writeStream
.format("console")
.option("truncate", value = false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete())
The result looks fine:
+---------------------------------------------+-------------+-----+
|window |id |count|
+---------------------------------------------+-------------+-----+
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000001|1 |
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000002|8 |
|[2017-02-17 08:00:00.0,2017-02-17 09:00:00.0]|EXC2200002 |1 |
+---------------------------------------------+-------------+-----+
But when writing it to a Parquet file
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
and reading it in with another job,
val data = spark.read.parquet("src/main/resources/parquet/")
the result is this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
TL;DR: parquetQuery has not been started, and so there is no output from the streaming query.
Check the type of parquetQuery: it is org.apache.spark.sql.streaming.DataStreamWriter, which is simply a description of a query that at some point is supposed to be started. Since it was not started, the query was never able to do anything that would write the stream.
Add start at the very end of the parquetQuery declaration (as part of the call chain).
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
.start // <-- that's what you miss
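As a side note (not part of the original answer): once start is called, parquetQuery is a StreamingQuery rather than a DataStreamWriter, and in a self-contained application you would typically block on it so the process stays alive long enough for micro-batches to be written:
// Keep the application running until the query stops or fails; otherwise the JVM may
// exit before any file is written.
parquetQuery.awaitTermination()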

Executing separate streaming queries in spark structured streaming

I am trying to aggregate a stream with two different windows and print the results to the console. However, only the first streaming query is printed; tenSecsQ is not printed to the console.
SparkSession spark = SparkSession
.builder()
.appName("JavaStructuredNetworkWordCountWindowed")
.config("spark.master", "local[*]")
.getOrCreate();
Dataset<Row> lines = spark
.readStream()
.format("socket")
.option("host", host)
.option("port", port)
.option("includeTimestamp", true)
.load();
Dataset<Row> words = lines
.as(Encoders.tuple(Encoders.STRING(), Encoders.TIMESTAMP()))
.toDF("word", "timestamp");
// 5 second window
Dataset<Row> fiveSecs = words
.groupBy(
functions.window(words.col("timestamp"), "5 seconds"),
words.col("word")
).count().orderBy("window");
// 10 second window
Dataset<Row> tenSecs = words
.groupBy(
functions.window(words.col("timestamp"), "10 seconds"),
words.col("word")
).count().orderBy("window");
Trigger streaming queries for both the 5s and 10s aggregated streams. The output for the 10s stream is not printed; only the 5s stream is printed to the console.
// Start writeStream() for 5s window
StreamingQuery fiveSecQ = fiveSecs.writeStream()
.queryName("5_secs")
.outputMode("complete")
.format("console")
.option("truncate", "false")
.start();
// Start writeStream() for 10s window
StreamingQuery tenSecsQ = tenSecs.writeStream()
.queryName("10_secs")
.outputMode("complete")
.format("console")
.option("truncate", "false")
.start();
tenSecsQ.awaitTermination();
I've been investigating this question.
Summary: Each query in Structured Streaming consumes the source data. The socket source creates a new connection for each query defined. The behavior seen in this case is because nc is only delivering the input data to the first connection.
Hence, it's not possible to define multiple aggregations over the socket connection unless we can ensure that the connected socket server delivers the same data to each open connection.
I discussed this question on the Spark mailing list.
Databricks developer Shixiong Zhu answered:
Spark creates one connection for each query. The behavior you observed is because how "nc -lk" works. If you use netstat to check the tcp connections, you will see there are two connections when starting two queries. However, "nc" forwards the input to only one connection.
I verified this behavior by defining a small experiment:
First, I created a SimpleTCPWordServer that delivers random words to each open connection, and a basic Structured Streaming job that declares two queries. The only difference between them is that the 2nd query defines an extra constant column to differentiate its output:
val lines = spark
.readStream
.format("socket")
.option("host", "localhost")
.option("port", "9999")
.option("includeTimestamp", true)
.load()
val q1 = lines.writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
val q2 = lines.withColumn("foo", lit("foo")).writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("7 seconds"))
.start()
If Structured Streaming consumed only one stream, we would see the same words delivered to both queries. If each query consumes a separate stream, we will see different words reported by each query.
This is the observed output:
-------------------------------------------
Batch: 0
-------------------------------------------
+--------+-------------------+
| value| timestamp|
+--------+-------------------+
|champion|2017-08-14 13:54:51|
+--------+-------------------+
+------+-------------------+---+
| value| timestamp|foo|
+------+-------------------+---+
|belong|2017-08-14 13:54:51|foo|
+------+-------------------+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+-------+-------------------+---+
| value| timestamp|foo|
+-------+-------------------+---+
| agenda|2017-08-14 13:54:52|foo|
|ceiling|2017-08-14 13:54:52|foo|
| bear|2017-08-14 13:54:53|foo|
+-------+-------------------+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+----------+-------------------+
| value| timestamp|
+----------+-------------------+
| breath|2017-08-14 13:54:52|
|anticipate|2017-08-14 13:54:52|
| amazing|2017-08-14 13:54:52|
| bottle|2017-08-14 13:54:53|
| calculate|2017-08-14 13:54:53|
| asset|2017-08-14 13:54:54|
| cell|2017-08-14 13:54:54|
+----------+-------------------+
We can clearly see that the streams for each query are different. It looks like it is not possible to define multiple aggregations over the data delivered by the socket source unless we can guarantee that the backend TCP server delivers exactly the same data to each open connection.
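For completeness, a rough sketch of what a SimpleTCPWordServer like the one used in the experiment could look like (an assumption, not the author's actual code): it pushes one random word per second to every connection it accepts, so each connected query sees its own independent stream of words.
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object SimpleTCPWordServer {
  private val words = Seq("agenda", "amazing", "anticipate", "asset", "bear",
    "belong", "bottle", "breath", "calculate", "ceiling", "cell", "champion")

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept() // each query's socket source opens its own connection
      new Thread(new Runnable {
        override def run(): Unit = {
          val out = new PrintWriter(socket.getOutputStream, true)
          while (!socket.isClosed) {
            out.println(words(Random.nextInt(words.length))) // independent words per connection
            Thread.sleep(1000)
          }
        }
      }).start()
    }
  }
}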

Resources