import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = { }

  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = { }
})
When a StreamingQueryListener is added to a Spark Structured Streaming session and the query progress is printed continuously, one of the metrics you will get is durationMs:
Query made progress: {
......
"durationMs" : {
"addBatch" : 159136,
"getBatch" : 0,
"getEndOffset" : 0,
"queryPlanning" : 38,
"setOffsetRange" : 14,
"triggerExecution" : 159518,
"walCommit" : 182
}
......
}
Can anyone tell me what those sub-metrics in durationMs mean in the Spark context? For example, what is the meaning of "addBatch" : 159136?
https://www.waitingforcode.com/apache-spark-structured-streaming/query-metrics-apache-spark-structured-streaming/read
This is an excellent site that addresses these aspects and more, so credit goes to it.
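Briefly, each entry in durationMs is the wall-clock time in milliseconds spent in one phase of the micro-batch: "addBatch" : 159136 means roughly 159 seconds went into processing the batch and writing it to the sink, while triggerExecution is the total for the whole trigger. If you want to inspect these values programmatically rather than parsing the printed JSON, a minimal sketch (it assumes at least one active query on the session):

import scala.collection.JavaConverters._

val query = spark.streams.active.head  // assumes an active query exists
Option(query.lastProgress).foreach { progress =>
  // durationMs is a java.util.Map[String, java.lang.Long]
  progress.durationMs.asScala.foreach { case (phase, millis) =>
    println(s"$phase took $millis ms")
  }
}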
Related
Is there any way to access current watermark value in Spark Structured Streaming?
I'd like to process events in their event-time order to find patterns in sequences. To do this I was thinking of using flatMapGroupsWithState, buffering events until the watermark passes (to avoid buffering late events) and then processing them one by one. But I don't know how to access the current watermark to do it. Is it even possible in Spark Structured Streaming?
You can access the StreamingQueryProgress from your StreamingQuery object:
query.lastProgress / query.recentProgress
It will contain an eventTime.watermark field, something like:
{
"id" : "eb7202da-9e60-4983-89fc-e1251aebf89d",
"runId" : "969555bd-6189-4b70-a101-3b5917cea965",
"name" : "my-query",
"timestamp" : "2023-01-05T16:46:43.372Z",
"batchId" : 6,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 7,
"triggerExecution" : 7
},
"eventTime" : {
"watermark" : "2023-01-01T09:44:11.000Z"
},
"stateOperators" : [ {
"operatorName" : "stateStoreSave",
...etc
}
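As a minimal sketch (assuming query is the StreamingQuery handle returned by start()), you can pull the watermark out of the last progress like this:

// eventTime is a java.util.Map[String, String]; the "watermark" key may be
// absent before the first watermark has been computed.
val watermark: Option[String] =
  Option(query.lastProgress)
    .flatMap(p => Option(p.eventTime.get("watermark")))

watermark.foreach(w => println(s"current watermark: $w"))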
Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check if a producer with the same name (and the same topic) already exists?
You can use the stats command from the pulsar-admin CLI tool to list all of the producers attached to the topic as follows, then just look inside the publishers section of the JSON output for the producerName.
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"bytesInCounter" : 65442,
"msgInCounter" : 1002,
"bytesOutCounter" : 0,
"msgOutCounter" : 0,
"averageMsgSize" : 63.0,
"msgChunkPublished" : false,
"storageSize" : 65442,
"backlogSize" : 0,
"publishers" : [ {
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"averageMsgSize" : 63.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"metadata" : { },
"producerName" : "standalone-3-1",
"connectedSince" : "2020-08-06T15:51:48.279Z",
"clientVersion" : "2.6.0",
"address" : "/127.0.0.1:53058"
} ],
"subscriptions" : { },
"replication" : { },
"deduplicationStatus" : "Disabled"
}
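If you want a quick scriptable check instead of reading the JSON by eye, you can filter the stats output, for example with jq (assuming it is installed; "my-producer" is a hypothetical name):

# List the names of all producers currently attached to the topic
./bin/pulsar-admin topics stats persistent://public/default/test-topic \
  | jq '.publishers[].producerName'

# Exit status is 0 only if a producer named "my-producer" exists
./bin/pulsar-admin topics stats persistent://public/default/test-topic \
  | jq -e '.publishers[] | select(.producerName == "my-producer")' > /dev/null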
I have an application using Spark Structured Streaming and I would like to get some metrics like scheduling delay, latency, etc. Usually such metrics can be found in the Spark UI Streaming tab; however, such functionality does not exist for Structured Streaming as far as I know.
So how can I get these metrics values?
For now I have tried to use the query progress, but not all of the required metrics can be found in the results:
QueryProgress {
"timestamp" : "2019-11-19T20:14:07.011Z",
"batchId" : 1,
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564,
"durationMs" : {
"addBatch" : 6902,
"getBatch" : 1,
"getEndOffset" : 0,
"queryPlanning" : 81,
"setOffsetRange" : 20,
"triggerExecution" : 7136,
"walCommit" : 41
},
"stateOperators" : [ {
"numRowsTotal" : 2,
"numRowsUpdated" : 2,
"memoryUsedBytes" : 75415,
"customMetrics" : {
"loadedMapCacheHitCount" : 400,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 17815
}
} ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[tweets]]",
"startOffset" : {
"tweets" : {
"0" : 579
}
},
"endOffset" : {
"tweets" : {
"0" : 587
}
},
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564
} ]
}
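Not all of those values are exposed directly, but some latency figures can be derived from durationMs. A rough sketch (assuming query is your StreamingQuery handle; field names match the progress above):

val p = query.lastProgress
// triggerExecution covers the whole micro-batch; addBatch is the time spent
// processing the batch and writing it to the sink.
val totalMs = p.durationMs.get("triggerExecution").longValue
val batchMs = p.durationMs.get("addBatch").longValue
val overheadMs = totalMs - batchMs  // offset handling, planning, WAL commit
println(s"batch=$batchMs ms, overhead=$overheadMs ms, total=$totalMs ms")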
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces an incorrect metric for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        ((FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
            List<TestOutputRecord> list = new ArrayList<>();
            try {
                TestOutputRecord result = transformer.convert(u);
                list.add(result);
            } catch (Throwable t) {
                System.err.println("Failed to convert a record");
                t.printStackTrace();
            }
            return list.iterator();
        }),
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan that StreamExecution doesn't know about, so it is not able to collect metrics.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
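For illustration, a sketch of what the fix looks like inside a Spark 2.2-era custom Sink (the class name and write logic are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class MySink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // data.rdd would build a new plan that StreamExecution cannot track,
    // which is why numInputRows stays at zero; toRdd reuses the plan of
    // the running streaming query.
    data.queryExecution.toRdd.foreachPartition { rows =>
      rows.foreach { row =>
        // row is an InternalRow; write it to the external system here
      }
    }
  }
}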
How do I get the elapsed query time in the REST interface with ArangoDB? (An additional JSON field with the elapsed time.)
Thanks.
It's possible to get profile information for the different execution phases of AQL queries by setting the profile option to true.
It can be done in arangosh like this:
q = "FOR doc IN _users RETURN doc";
s = db._createStatement({ query: q, options: { profile: true } });
res = s.execute().getExtra();
The resulting JSON from getExtra() will look like this:
{
"stats" : {
"writesExecuted" : 0,
"writesIgnored" : 0,
"scannedFull" : 1,
"scannedIndex" : 0,
"filtered" : 0
},
"profile" : {
"initializing" : 0.0000040531158447265625,
"parsing" : 0.00003600120544433594,
"optimizing ast" : 0.0000040531158447265625,
"instantiating plan" : 0.000010967254638671875,
"optimizing plan" : 0.000023126602172851562,
"executing" : 0.00004601478576660156
},
"warnings" : [ ]
}
For sure, https://docs.arangodb.com/Aql/Invoke.html should and will mention this.
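Since the question asks about the REST interface: the same profile option can be passed to the cursor endpoint. A minimal sketch (server address and query are examples; authentication omitted):

curl -X POST http://localhost:8529/_api/cursor \
  --data '{ "query": "FOR doc IN _users RETURN doc", "options": { "profile": true } }'

The per-phase timings then come back under the extra attribute of the response, as in the arangosh example above.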