import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = { }

  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = { }
})
When a StreamingQueryListener is added to a Spark Structured Streaming session and the query progress is printed continuously, one of the metrics you will get is durationMs:
Query made progress: {
......
"durationMs" : {
"addBatch" : 159136,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 38,
"setOffsetRange" : 14,
"triggerExecution" : 159518,
"walCommit" : 182
}
......
}
Can anyone tell me what those sub-metrics in durationMs mean in the Spark context? For example, what is the meaning of "addBatch" : 159136?
https://www.waitingforcode.com/apache-spark-structured-streaming/query-metrics-apache-spark-structured-streaming/read
This is an excellent site that addresses these aspects and more, so credit goes to it.
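Briefly, each entry in durationMs is the wall-clock time in milliseconds spent in one phase of the micro-batch: "addBatch" : 159136 means roughly 159 seconds went into processing the batch and writing it to the sink, while triggerExecution is the total for the whole trigger. If you want to inspect these values programmatically rather than parsing the printed JSON, a minimal sketch (it assumes at least one active query on the session):

import scala.collection.JavaConverters._

val query = spark.streams.active.head  // assumes an active query exists
Option(query.lastProgress).foreach { progress =>
  // durationMs is a java.util.Map[String, java.lang.Long]
  progress.durationMs.asScala.foreach { case (phase, millis) =>
    println(s"$phase took $millis ms")
  }
}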
Related
Is there any way to access current watermark value in Spark Structured Streaming?
I'd like to process events in their event-time order to find patterns in sequences. To do this I was thinking of using flatMapGroupsWithState, buffering events until the watermark passes (to avoid buffering late events) and then processing them one by one. But I don't know how to access the current watermark to do it. Is it even possible in Spark Structured Streaming?
You can access the StreamingQueryProgress from your StreamingQuery object:
query.lastProgress / query.recentProgress
It will contain an eventTime.watermark field, something like:
{
"id" : "eb7202da-9e60-4983-89fc-e1251aebf89d",
"runId" : "969555bd-6189-4b70-a101-3b5917cea965",
"name" : "my-query",
"timestamp" : "2023-01-05T16:46:43.372Z",
"batchId" : 6,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 7,
"triggerExecution" : 7
},
"eventTime" : {
"watermark" : "2023-01-01T09:44:11.000Z"
},
"stateOperators" : [ {
"operatorName" : "stateStoreSave",
...etc
}
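As a minimal sketch (assuming query is the StreamingQuery handle returned by start()), you can pull the watermark out of the last progress like this:

// eventTime is a java.util.Map[String, String]; the "watermark" key may be
// absent before the first watermark has been computed.
val watermark: Option[String] =
  Option(query.lastProgress)
    .flatMap(p => Option(p.eventTime.get("watermark")))

watermark.foreach(w => println(s"current watermark: $w"))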
Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check if a producer with the same name (and the same topic) already exists?
You can use the stats command from the pulsar-admin CLI tool to list all of the producers attached to the topic as follows, then just look inside the publishers section of the JSON output for the producerName.
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"bytesInCounter" : 65442,
"msgInCounter" : 1002,
"bytesOutCounter" : 0,
"msgOutCounter" : 0,
"averageMsgSize" : 63.0,
"msgChunkPublished" : false,
"storageSize" : 65442,
"backlogSize" : 0,
"publishers" : [ {
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"averageMsgSize" : 63.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"metadata" : { },
"producerName" : "standalone-3-1",
"connectedSince" : "2020-08-06T15:51:48.279Z",
"clientVersion" : "2.6.0",
"address" : "/127.0.0.1:53058"
} ],
"subscriptions" : { },
"replication" : { },
"deduplicationStatus" : "Disabled"
}
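If you want a quick scriptable check instead of reading the JSON by eye, you can filter the stats output, for example with jq (assuming it is installed; "my-producer" is a hypothetical name):

# List the names of all producers currently attached to the topic
./bin/pulsar-admin topics stats persistent://public/default/test-topic \
  | jq '.publishers[].producerName'

# Exit status is 0 only if a producer named "my-producer" exists
./bin/pulsar-admin topics stats persistent://public/default/test-topic \
  | jq -e '.publishers[] | select(.producerName == "my-producer")' > /dev/null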
I have an application using Spark Structured Streaming and I would like to get some metrics like scheduling delay, latency, etc. Usually such metrics can be found in the Spark UI Streaming tab; however, such functionality does not exist for Structured Streaming as far as I know.
So how can I get these metrics values?
For now I have tried to use the query progress, but not all of the required metrics can be found in the results:
QueryProgress {
"timestamp" : "2019-11-19T20:14:07.011Z",
"batchId" : 1,
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564,
"durationMs" : {
"addBatch" : 6902,
"getBatch" : 1,
"getEndOffset" : 0,
"queryPlanning" : 81,
"setOffsetRange" : 20,
"triggerExecution" : 7136,
"walCommit" : 41
},
"stateOperators" : [ {
"numRowsTotal" : 2,
"numRowsUpdated" : 2,
"memoryUsedBytes" : 75415,
"customMetrics" : {
"loadedMapCacheHitCount" : 400,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 17815
}
} ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[tweets]]",
"startOffset" : {
"tweets" : {
"0" : 579
}
},
"endOffset" : {
"tweets" : {
"0" : 587
}
},
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564
} ]
}
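Not all of those values are exposed directly, but some latency figures can be derived from durationMs. A rough sketch (assuming query is your StreamingQuery handle; field names match the progress above):

val p = query.lastProgress
// triggerExecution covers the whole micro-batch; addBatch is the time spent
// processing the batch and writing it to the sink.
val totalMs = p.durationMs.get("triggerExecution").longValue
val batchMs = p.durationMs.get("addBatch").longValue
val overheadMs = totalMs - batchMs  // offset handling, planning, WAL commit
println(s"batch=$batchMs ms, overhead=$overheadMs ms, total=$totalMs ms")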
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces an incorrect metric for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        ((FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
            List<TestOutputRecord> list = new ArrayList<>();
            try {
                TestOutputRecord result = transformer.convert(u);
                list.add(result);
            } catch (Throwable t) {
                System.err.println("Failed to convert a record");
                t.printStackTrace();
            }
            return list.iterator();
        }),
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan that StreamExecution doesn't know about, so it is not able to collect metrics.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
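For illustration, a sketch of what the fix looks like inside a Spark 2.2-era custom Sink (the class name and write logic are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class MySink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // data.rdd would build a new plan that StreamExecution cannot track,
    // which is why numInputRows stays at zero; toRdd reuses the plan of
    // the running streaming query.
    data.queryExecution.toRdd.foreachPartition { rows =>
      rows.foreach { row =>
        // row is an InternalRow; write it to the external system here
      }
    }
  }
}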
How do I get the elapsed query time in the REST interface with ArangoDB? (An additional JSON field with the elapsed time.)
Thanks.
It's possible to get profile information for the different execution phases of AQL queries by setting the profile option to true.
It can be done in arangosh like this:
q = "FOR doc IN _users RETURN doc";
s = db._createStatement({ query: q, options: { profile: true } });
res = s.execute().getExtra();
The resulting JSON from getExtra() will look like this:
{
"stats" : {
"writesExecuted" : 0,
"writesIgnored" : 0,
"scannedFull" : 1,
"scannedIndex" : 0,
"filtered" : 0
},
"profile" : {
"initializing" : 0.0000040531158447265625,
"parsing" : 0.00003600120544433594,
"optimizing ast" : 0.0000040531158447265625,
"instantiating plan" : 0.000010967254638671875,
"optimizing plan" : 0.000023126602172851562,
"executing" : 0.00004601478576660156
},
"warnings" : [ ]
}
For sure, https://docs.arangodb.com/Aql/Invoke.html should and will mention this.
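Since the question asks about the REST interface: the same profile option can be passed to the cursor endpoint. A minimal sketch (server address and query are examples; authentication omitted):

curl -X POST http://localhost:8529/_api/cursor \
  --data '{ "query": "FOR doc IN _users RETURN doc", "options": { "profile": true } }'

The per-phase timings then come back under the extra attribute of the response, as in the arangosh example above.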