Is there any way to access current watermark value in Spark Structured Streaming?
I'd like to process events in their event-time order to find patterns in sequences. To do this I was thinking of using flatMapGroupsWithState to buffer events until the watermark passes them (thereby avoiding buffering late events) and then process them one by one. But I don't know how to access the current watermark. Is it even possible in Spark Structured Streaming?
You can access the StreamingQueryProgress from your StreamingQuery object:
query.lastProgress / query.recentProgress
It will contain an eventTime.watermark field; the output looks something like this:
{
"id" : "eb7202da-9e60-4983-89fc-e1251aebf89d",
"runId" : "969555bd-6189-4b70-a101-3b5917cea965",
"name" : "my-query",
"timestamp" : "2023-01-05T16:46:43.372Z",
"batchId" : 6,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 7,
"triggerExecution" : 7
},
"eventTime" : {
"watermark" : "2023-01-01T09:44:11.000Z"
},
"stateOperators" : [ {
"operatorName" : "stateStoreSave",
...etc
}
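Programmatically, you can read the watermark from the progress object. A minimal sketch, assuming query is your running StreamingQuery (eventTime is a java.util.Map[String, String]):

// lastProgress is null until the first batch completes, and the
// "watermark" key only appears once a watermark has been defined
// on the query and reported by a finished batch.
val progress = query.lastProgress
if (progress != null && progress.eventTime.containsKey("watermark")) {
  println(s"Current watermark: ${progress.eventTime.get("watermark")}")
}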
Related
Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check if a producer with the same name (and the same topic) already exists?
You can use the stats command from the pulsar-admin CLI tool to list all of the producers attached to the topic, as shown below; then just look inside the publishers section of the JSON output for the producerName.
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"bytesInCounter" : 65442,
"msgInCounter" : 1002,
"bytesOutCounter" : 0,
"msgOutCounter" : 0,
"averageMsgSize" : 63.0,
"msgChunkPublished" : false,
"storageSize" : 65442,
"backlogSize" : 0,
"publishers" : [ {
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"averageMsgSize" : 63.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"metadata" : { },
"producerName" : "standalone-3-1",
"connectedSince" : "2020-08-06T15:51:48.279Z",
"clientVersion" : "2.6.0",
"address" : "/127.0.0.1:53058"
} ],
"subscriptions" : { },
"replication" : { },
"deduplicationStatus" : "Disabled"
}
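If you'd rather check programmatically, the same information is exposed through the Pulsar Admin API. A minimal sketch (the service URL and producer name are assumptions; recent clients expose getters on TopicStats, while older ones like 2.6 use public fields such as stats.publishers instead):

import org.apache.pulsar.client.admin.PulsarAdmin
import scala.jdk.CollectionConverters._ // scala.collection.JavaConverters on Scala 2.12

// Assumed admin service URL for a standalone broker; adjust for your cluster.
val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://localhost:8080")
  .build()

val stats = admin.topics().getStats("persistent://public/default/test-topic")
// Each PublisherStats entry corresponds to one currently connected producer.
val taken = stats.getPublishers.asScala.exists(_.getProducerName == "standalone-3-1")
println(s"Producer name already in use: $taken")
admin.close()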
spark.streams.addListener(new StreamingQueryListener() {
  ......
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
  ......
})
When a StreamingQueryListener is added to a Spark Structured Streaming session and the query progress is logged continuously, one of the metrics you get is durationMs:
Query made progress: {
......
"durationMs" : {
"addBatch" : 159136,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 38,
"setOffsetRange" : 14,
"triggerExecution" : 159518,
"walCommit" : 182
}
......
}
Can anyone tell me what those sub-metrics in durationMs mean in the Spark context? For example, what does "addBatch" : 159136 mean?
https://www.waitingforcode.com/apache-spark-structured-streaming/query-metrics-apache-spark-structured-streaming/read
This is an excellent site that addresses these aspects and more; the credit therefore goes to that site. In short, addBatch is the time spent executing the batch and writing the results to the sink, so "addBatch" : 159136 means that phase took roughly 159 seconds.
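If you want the numbers broken out at runtime, the listener from the question can print each phase individually. A sketch, assuming spark is your SparkSession (durationMs is a java.util.Map[String, java.lang.Long]; the converters import is for Scala 2.13, use scala.collection.JavaConverters on 2.12):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import scala.jdk.CollectionConverters._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs maps each execution phase to its duration in milliseconds.
    event.progress.durationMs.asScala.foreach { case (phase, millis) =>
      println(s"durationMs.$phase = $millis ms")
    }
  }
})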
I have a simple Spark Structured Streaming job that uses the Kafka 0.10 API to read data from Kafka and write to our S3 storage. From the logs I can see that for each batch that is triggered the streaming application is making progress and is consuming data from the source, because the endOffset is greater than the startOffset and both keep increasing with each batch. But numInputRows is always zero and no rows are written to S3.
Why is there a progressive increase in offsets but no data is consumed by the Spark batch?
19/09/10 15:55:01 INFO MicroBatchExecution: Streaming query made progress: {
"id" : "90f21e5f-270d-428d-b068-1f1aa0861fb1",
"runId" : "f09f8eb4-8f33-42c2-bdf4-dffeaebf630e",
"name" : null,
"timestamp" : "2019-09-10T15:55:00.000Z",
"batchId" : 189,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 127,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 24,
"setOffsetRange" : 36,
"triggerExecution" : 1859,
"walCommit" : 1032
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[my_kafka_topic]]",
"startOffset" : {
"my_kafka_topic" : {
"23" : 1206926686,
"8" : 1158514946,
"17" : 1258387219,
"11" : 1263091642,
"2" : 1226741128,
"20" : 1229560889,
"5" : 1170304913,
"14" : 1207333901,
"4" : 1274242728,
"13" : 1336386658,
"22" : 1260210993,
"7" : 1288639296,
"16" : 1247462229,
"10" : 1093157103,
"1" : 1219904858,
"19" : 1116269615,
"9" : 1238935018,
"18" : 1069224544,
"12" : 1256018541,
"3" : 1251150202,
"21" : 1256774117,
"15" : 1170591375,
"6" : 1185108169,
"24" : 1202342095,
"0" : 1165356330
}
},
"endOffset" : {
"my_kafka_topic" : {
"23" : 1206928043,
"8" : 1158516721,
"17" : 1258389219,
"11" : 1263093490,
"2" : 1226743225,
"20" : 1229562962,
"5" : 1170307882,
"14" : 1207335736,
"4" : 1274245585,
"13" : 1336388570,
"22" : 1260213582,
"7" : 1288641384,
"16" : 1247464311,
"10" : 1093159186,
"1" : 1219906407,
"19" : 1116271435,
"9" : 1238936994,
"18" : 1069226913,
"12" : 1256020926,
"3" : 1251152579,
"21" : 1256776910,
"15" : 1170593216,
"6" : 1185110032,
"24" : 1202344538,
"0" : 1165358262
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "FileSink[s3://my-s3-bucket/data/kafka/my_kafka_topic]"
}
}
A simplified version of the Spark code is shown below:
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val df = sparkSession
  .readStream
  .format("kafka")
  .options(Map(
    "kafka.bootstrap.servers" -> "host:1009",
    "subscribe" -> "my_kafka_topic",
    "kafka.client.id" -> "my-client-id",
    "maxOffsetsPerTrigger" -> "1000", // option values must be strings
    "fetch.message.max.bytes" -> "6048576"
  ))
  .load()

df
  .writeStream
  .partitionBy("date", "hour")
  .outputMode(OutputMode.Append())
  .format("parquet")
  .options(Map("checkpointLocation" -> "checkpoint", "path" -> "data"))
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()
  .awaitTermination()
Edit: from the logs I also see these before each batch is executed
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-5 to offset 1168959116.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-1 to offset 1218619371.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-8 to offset 1157205346.
19/09/11 02:49:42 INFO Fetcher: [Consumer clientId=my_client_id, groupId=spark-kafka-source-5496988b-3f5c-4342-9361-917e4f3ece51-1340785812-driver-0] Resetting offset for partition my-topic-21 to offset 1255403059.
Can you check whether any of the cases related to the output directory and checkpoint location mentioned in the link below applies in your case?
https://kb.databricks.com/streaming/file-sink-streaming.html
This exact issue with updating offsets but no input rows happened to me when I cleaned my checkpoint location to start the streaming afresh but kept the old target location (not cleared) for writing the streamed data. After cleaning (changing) both the checkpoint and write locations it worked just fine.
In this particular case, since I had cleared the checkpoint location, the offsets were getting updated properly. But I hadn't cleared the target location (it held data from 5-6 months of continuous streaming, i.e. hundreds of thousands of small files to delete), and apparently Spark checks the sink's _spark_metadata directory: since it found old batch metadata in there, it didn't consume any new data.
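A quick way to check for this condition, as a sketch assuming a spark SparkSession and the sink path from the question (the FileStreamSink keeps its commit log under _spark_metadata inside the output directory):

import org.apache.hadoop.fs.Path

// If _spark_metadata survives a checkpoint wipe, the restarted batch IDs
// look "already committed" and the sink silently writes nothing.
val metadataPath = new Path("s3://my-s3-bucket/data/kafka/my_kafka_topic/_spark_metadata")
val fs = metadataPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(s"_spark_metadata present: ${fs.exists(metadataPath)}")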
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces incorrect metrics for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        (FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
          List<TestOutputRecord> list = new ArrayList<>();
          try {
            TestOutputRecord result = transformer.convert(u);
            list.add(result);
          } catch (Throwable t) {
            System.err.println("Failed to convert a record");
            t.printStackTrace();
          }
          return list.iterator();
        },
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan that StreamExecution doesn't know about, so it is unable to collect metrics for the batch.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
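Inside Sink.addBatch that looks roughly like this. A sketch against the Spark 2.x internal Sink API; MySink and the write logic are hypothetical placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class MySink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // data.rdd would trigger a fresh, non-incremental plan, hiding the batch
    // from StreamExecution's metrics; toRdd reuses the streaming plan and
    // yields InternalRow values.
    data.queryExecution.toRdd.foreachPartition { rows =>
      // hypothetical: open a connection, write each row, close the connection
      rows.foreach(row => println(row)) // placeholder write
    }
  }
}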
I am familiar with explain() (and the web UI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, that is, the information returned by explain() rendered as an image.
A picture like a PNG or JPG? I've never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned).
The other phases of query execution are available via TreeNode methods, which (among many methods that could help you out) give you my favorite: numberedTreeString.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Range (0, 5, step=1, splits=Some(8))
scala> println(q.queryExecution.executedPlan.numberedTreeString)
00 *Range (0, 5, step=1, splits=8)
You could save the output as JSON using toJSON or prettyJson and generate a PNG from that (but I've never tried it out myself); see the sketch after the output below.
scala> println(q.queryExecution.executedPlan.prettyJson)
[ {
"class" : "org.apache.spark.sql.execution.WholeStageCodegenExec",
"num-children" : 1,
"child" : 0
}, {
"class" : "org.apache.spark.sql.execution.RangeExec",
"num-children" : 0,
"range" : [ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Range",
"num-children" : 0,
"start" : 0,
"end" : 5,
"step" : 1,
"numSlices" : 8,
"output" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "id",
"dataType" : "long",
"nullable" : false,
"metadata" : { },
"exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 0,
"jvmId" : "cb497d01-3b90-42a7-9ebf-ebe85578f763"
},
"isGenerated" : false
} ] ]
} ]
} ]
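Alternatively, here is a minimal, untested sketch (all names beyond the Spark API are illustrative) that walks the physical plan tree directly and emits Graphviz DOT, which dot -Tpng plan.dot -o plan.png can then render as an image; q is the Dataset from the earlier examples:

import java.io.PrintWriter
import org.apache.spark.sql.execution.SparkPlan

// Emit one DOT node per plan operator and an edge to each child.
def toDot(plan: SparkPlan): String = {
  val sb = new StringBuilder("digraph plan {\n")
  def walk(node: SparkPlan, id: String): Unit = {
    sb.append(s"  $id [label=\"${node.nodeName}\"];\n")
    node.children.zipWithIndex.foreach { case (child, i) =>
      val childId = s"${id}_$i"
      sb.append(s"  $id -> $childId;\n")
      walk(child, childId)
    }
  }
  walk(plan, "n0")
  sb.append("}").toString
}

new PrintWriter("plan.dot") { write(toDot(q.queryExecution.executedPlan)); close() }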