Is there any way to access current watermark value in Spark Structured Streaming?
I'd like to process events in their event-time order to find patterns in sequences. To do this I was thinking of using flatMapGroupsWithState to buffer events until the watermark passes them (thereby avoiding buffering late events) and then process them one by one. But I don't know how to access the current watermark. Is it even possible in Spark Structured Streaming?
You can access the StreamingQueryProgress from your StreamingQuery object:
query.lastProgress / query.recentProgress
It will contain an eventTime.watermark field; the output looks something like this:
{
"id" : "eb7202da-9e60-4983-89fc-e1251aebf89d",
"runId" : "969555bd-6189-4b70-a101-3b5917cea965",
"name" : "my-query",
"timestamp" : "2023-01-05T16:46:43.372Z",
"batchId" : 6,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 7,
"triggerExecution" : 7
},
"eventTime" : {
"watermark" : "2023-01-01T09:44:11.000Z"
},
"stateOperators" : [ {
"operatorName" : "stateStoreSave",
...etc
}
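Programmatically, you can read the watermark from the progress object. A minimal sketch, assuming query is your running StreamingQuery (eventTime is a java.util.Map[String, String]):

// lastProgress is null until the first batch completes, and the
// "watermark" key only appears once a watermark has been defined
// on the query and reported by a finished batch.
val progress = query.lastProgress
if (progress != null && progress.eventTime.containsKey("watermark")) {
  println(s"Current watermark: ${progress.eventTime.get("watermark")}")
}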
Related
Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check if a producer with the same name (and the same topic) already exists?
You can use the stats command from the pulsar-admin CLI tool to list all of the producers attached to the topic, as shown below; then just look inside the publishers section of the JSON output for the producerName.
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"bytesInCounter" : 65442,
"msgInCounter" : 1002,
"bytesOutCounter" : 0,
"msgOutCounter" : 0,
"averageMsgSize" : 63.0,
"msgChunkPublished" : false,
"storageSize" : 65442,
"backlogSize" : 0,
"publishers" : [ {
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"averageMsgSize" : 63.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"metadata" : { },
"producerName" : "standalone-3-1",
"connectedSince" : "2020-08-06T15:51:48.279Z",
"clientVersion" : "2.6.0",
"address" : "/127.0.0.1:53058"
} ],
"subscriptions" : { },
"replication" : { },
"deduplicationStatus" : "Disabled"
}
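If you'd rather check programmatically, the same information is exposed through the Pulsar Admin API. A minimal sketch (the service URL and producer name are assumptions; recent clients expose getters on TopicStats, while older ones like 2.6 use public fields such as stats.publishers instead):

import org.apache.pulsar.client.admin.PulsarAdmin
import scala.jdk.CollectionConverters._ // scala.collection.JavaConverters on Scala 2.12

// Assumed admin service URL for a standalone broker; adjust for your cluster.
val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://localhost:8080")
  .build()

val stats = admin.topics().getStats("persistent://public/default/test-topic")
// Each PublisherStats entry corresponds to one currently connected producer.
val taken = stats.getPublishers.asScala.exists(_.getProducerName == "standalone-3-1")
println(s"Producer name already in use: $taken")
admin.close()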
spark.streams.addListener(new StreamingQueryListener() {
  ......
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
  ......
})
When a StreamingQueryListener is added to a Spark Structured Streaming session and the query progress is logged continuously, one of the metrics you get is durationMs:
Query made progress: {
......
"durationMs" : {
"addBatch" : 159136,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 38,
"setOffsetRange" : 14,
"triggerExecution" : 159518,
"walCommit" : 182
}
......
}
Can anyone tell me what those sub-metrics in durationMs mean in the Spark context? For example, what does "addBatch" : 159136 mean?
https://www.waitingforcode.com/apache-spark-structured-streaming/query-metrics-apache-spark-structured-streaming/read
This is an excellent site that addresses these aspects and more; the credit therefore goes to that site. In short, addBatch is the time spent executing the batch and writing the results to the sink, so "addBatch" : 159136 means that phase took roughly 159 seconds.
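If you want the numbers broken out at runtime, the listener from the question can print each phase individually. A sketch, assuming spark is your SparkSession (durationMs is a java.util.Map[String, java.lang.Long]; the converters import is for Scala 2.13, use scala.collection.JavaConverters on 2.12):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import scala.jdk.CollectionConverters._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs maps each execution phase to its duration in milliseconds.
    event.progress.durationMs.asScala.foreach { case (phase, millis) =>
      println(s"durationMs.$phase = $millis ms")
    }
  }
})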
I have a simple Spark Structured Streaming job that uses the Kafka 0.10 API to read data from Kafka and write to our S3 storage. From the logs I can see that for each batch that is triggered the streaming application is making progress and is consuming data from the source, because the endOffset is greater than the startOffset and both keep increasing with each batch. But numInputRows is always zero and no rows are written to S3.
Why is there a progressive increase in offsets but no data is consumed by the Spark batch?
19/09/10 15:55:01 INFO MicroBatchExecution: Streaming query made progress: {
"id" : "90f21e5f-270d-428d-b068-1f1aa0861fb1",
"runId" : "f09f8eb4-8f33-42c2-bdf4-dffeaebf630e",
"name" : null,
"timestamp" : "2019-09-10T15:55:00.000Z",
"batchId" : 189,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 127,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 24,
"setOffsetRange" : 36,
"triggerExecution" : 1859,
"walCommit" : 1032
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[my_kafka_topic]]",
"startOffset" : {
"my_kafka_topic" : {
"23" : 1206926686,
"8" : 1158514946,
"17" : 1258387219,
"11" : 1263091642,
"2" : 1226741128,
"20" : 1229560889,
"5" : 1170304913,
"14" : 1207333901,
"4" : 1274242728,
"13" : 1336386658,
"22" : 1260210993,
"7" : 1288639296,
"16" : 1247462229,
"10" : 1093157103,
"1" : 1219904858,
"19" : 1116269615,
"9" : 1238935018,
"18" : 1069224544,
"12" : 1256018541,
"3" : 1251150202,
"21" : 1256774117,
"15" : 1170591375,
"6" : 1185108169,
"24" : 1202342095,
"0" : 1165356330
}
},
"endOffset" : {
"my_kafka_topic" : {
"23" : 1206928043,
"8" : 1158516721,
"17" : 1258389219,
"11" : 1263093490,
"2" : 1226743225,
"20" : 1229562962,
"5" : 1170307882,
"14" : 1207335736,
"4" : 1274245585,
"13" : 1336388570,
"22" : 1260213582,
"7" : 1288641384,
"16" : 1247464311,
"10" : 1093159186,
"1" : 1219906407,
"19" : 1116271435,
"9" : 1238936994,
"18" : 1069226913,
"12" : 1256020926,
"3" : 1251152579,
"21" : 1256776910,
"15" : 1170593216,
"6" : 1185110032,
"24" : 1202344538,
"0" : 1165358262
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "FileSink[s3://my-s3-bucket/data/kafka/my_kafka_topic]"
}
}
A simplified version of the Spark code is shown below:
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val df = sparkSession
  .readStream
  .format("kafka")
  .options(Map(
    "kafka.bootstrap.servers" -> "host:1009",
    "subscribe" -> "my_kafka_topic",
    "kafka.client.id" -> "my-client-id",
    "maxOffsetsPerTrigger" -> "1000", // option values must be strings
    "fetch.message.max.bytes" -> "6048576"
  ))
  .load()

df
  .writeStream
  .partitionBy("date", "hour")
  .outputMode(OutputMode.Append())
  .format("parquet")
  .options(Map("checkpointLocation" -> "checkpoint", "path" -> "data"))
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()
  .awaitTermination()
Edit: from the logs I also see these before each batch is executed
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-5 to offset 1168959116.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-1 to offset 1218619371.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-8 to offset 1157205346.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-21 to offset 1255403059.
Can you check whether any of the cases related to the output directory and checkpoint location mentioned in the link below applies in your case?
https://kb.databricks.com/streaming/file-sink-streaming.html
This exact issue with updating offsets but no input rows happened to me when I cleaned my checkpoint location to start the streaming afresh but kept the old target location (not cleared) for writing the streamed data. After cleaning (changing) both the checkpoint and write locations it worked just fine.
In this particular case, since I had cleared the checkpoint location, the offsets were getting updated properly. But I hadn't cleared the target location (it held data from 5-6 months of continuous streaming, i.e. hundreds of thousands of small files to delete), and apparently Spark checks the sink's _spark_metadata directory: since it found old batch metadata in there, it didn't consume any new data.
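A quick way to check for this condition, as a sketch assuming a spark SparkSession and the sink path from the question (the FileStreamSink keeps its commit log under _spark_metadata inside the output directory):

import org.apache.hadoop.fs.Path

// If _spark_metadata survives a checkpoint wipe, the restarted batch IDs
// look "already committed" and the sink silently writes nothing.
val metadataPath = new Path("s3://my-s3-bucket/data/kafka/my_kafka_topic/_spark_metadata")
val fs = metadataPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(s"_spark_metadata present: ${fs.exists(metadataPath)}")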
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces incorrect metrics for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        (FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
          List<TestOutputRecord> list = new ArrayList<>();
          try {
            TestOutputRecord result = transformer.convert(u);
            list.add(result);
          } catch (Throwable t) {
            System.err.println("Failed to convert a record");
            t.printStackTrace();
          }
          return list.iterator();
        },
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan that StreamExecution doesn't know about, so it is unable to collect metrics for the batch.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
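Inside Sink.addBatch that looks roughly like this. A sketch against the Spark 2.x internal Sink API; MySink and the write logic are hypothetical placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class MySink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // data.rdd would trigger a fresh, non-incremental plan, hiding the batch
    // from StreamExecution's metrics; toRdd reuses the streaming plan and
    // yields InternalRow values.
    data.queryExecution.toRdd.foreachPartition { rows =>
      // hypothetical: open a connection, write each row, close the connection
      rows.foreach(row => println(row)) // placeholder write
    }
  }
}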
I am familiar with explain() (and the web UI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, that is, the information returned by explain() rendered as an image.
A picture like a PNG or JPG? I've never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned).
The other phases of query execution are available via TreeNode methods, which (among many methods that could help you out) give you my favorite: numberedTreeString.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Range (0, 5, step=1, splits=Some(8))
scala> println(q.queryExecution.executedPlan.numberedTreeString)
00 *Range (0, 5, step=1, splits=8)
You could save the output as JSON using toJSON or prettyJson and generate a PNG from that (but I've never tried it out myself); see the sketch after the output below.
scala> println(q.queryExecution.executedPlan.prettyJson)
[ {
"class" : "org.apache.spark.sql.execution.WholeStageCodegenExec",
"num-children" : 1,
"child" : 0
}, {
"class" : "org.apache.spark.sql.execution.RangeExec",
"num-children" : 0,
"range" : [ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Range",
"num-children" : 0,
"start" : 0,
"end" : 5,
"step" : 1,
"numSlices" : 8,
"output" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "id",
"dataType" : "long",
"nullable" : false,
"metadata" : { },
"exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 0,
"jvmId" : "cb497d01-3b90-42a7-9ebf-ebe85578f763"
},
"isGenerated" : false
} ] ]
} ]
} ]
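Alternatively, here is a minimal, untested sketch (all names beyond the Spark API are illustrative) that walks the physical plan tree directly and emits Graphviz DOT, which dot -Tpng plan.dot -o plan.png can then render as an image; q is the Dataset from the earlier examples:

import java.io.PrintWriter
import org.apache.spark.sql.execution.SparkPlan

// Emit one DOT node per plan operator and an edge to each child.
def toDot(plan: SparkPlan): String = {
  val sb = new StringBuilder("digraph plan {\n")
  def walk(node: SparkPlan, id: String): Unit = {
    sb.append(s"  $id [label=\"${node.nodeName}\"];\n")
    node.children.zipWithIndex.foreach { case (child, i) =>
      val childId = s"${id}_$i"
      sb.append(s"  $id -> $childId;\n")
      walk(child, childId)
    }
  }
  walk(plan, "n0")
  sb.append("}").toString
}

new PrintWriter("plan.dot") { write(toDot(q.queryExecution.executedPlan)); close() }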