Structured Spark streaming metrics retrieval

I have a Spark Structured Streaming application and I would like to get some metrics like scheduling delay, latency, etc. Usually such metrics can be found in the Spark UI's Streaming tab; however, as far as I know, no such tab exists for Structured Streaming.
So how can I get these metric values?
For now I have tried to use the query progress, but not all of the required metrics can be found in the results:
QueryProgress {
  "timestamp" : "2019-11-19T20:14:07.011Z",
  "batchId" : 1,
  "numInputRows" : 8,
  "inputRowsPerSecond" : 0.8429038036034138,
  "processedRowsPerSecond" : 1.1210762331838564,
  "durationMs" : {
    "addBatch" : 6902,
    "getBatch" : 1,
    "getEndOffset" : 0,
    "queryPlanning" : 81,
    "setOffsetRange" : 20,
    "triggerExecution" : 7136,
    "walCommit" : 41
  },
  "stateOperators" : [ {
    "numRowsTotal" : 2,
    "numRowsUpdated" : 2,
    "memoryUsedBytes" : 75415,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 400,
      "loadedMapCacheMissCount" : 0,
      "stateOnCurrentVersionSizeBytes" : 17815
    }
  } ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[tweets]]",
    "startOffset" : {
      "tweets" : {
        "0" : 579
      }
    },
    "endOffset" : {
      "tweets" : {
        "0" : 587
      }
    },
    "numInputRows" : 8,
    "inputRowsPerSecond" : 0.8429038036034138,
    "processedRowsPerSecond" : 1.1210762331838564
  } ]
}
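If you need these values programmatically rather than in the UI, you can register a StreamingQueryListener, which is handed the same StreamingQueryProgress shown above after every trigger. A minimal Java sketch, assuming spark is your running SparkSession; note that metrics like scheduling delay are not exposed directly and would have to be derived from fields such as durationMs:

import org.apache.spark.sql.streaming.StreamingQueryListener;

// Called by Spark after every trigger with the same StreamingQueryProgress
// object that is printed above.
spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent event) { }

    @Override
    public void onQueryProgress(QueryProgressEvent event) {
        long batchId = event.progress().batchId();
        long rows = event.progress().numInputRows();
        Long triggerMs = event.progress().durationMs().get("triggerExecution");
        System.out.printf("batch %d: %d rows, trigger took %d ms%n",
            batchId, rows, triggerMs);
    }

    @Override
    public void onQueryTerminated(QueryTerminatedEvent event) { }
});

Alternatively, StreamingQuery.lastProgress() and recentProgress() return the same objects on demand. Either way, only the fields shown in the JSON above are available; latency-style metrics have to be computed from them (e.g. triggerExecution versus the configured trigger interval).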

Related

Is there a way to check if there exists a Pulsar producer with the same name on the same topic?

Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check whether a producer with the same name (and the same topic) already exists?
You can use the stats command of the pulsar-admin CLI tool to list all of the producers attached to the topic, as shown below, and then look inside the publishers section of the JSON output for the producerName:
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
  "msgRateIn" : 19.889469865137894,
  "msgThroughputIn" : 1253.0366015036873,
  "msgRateOut" : 0.0,
  "msgThroughputOut" : 0.0,
  "bytesInCounter" : 65442,
  "msgInCounter" : 1002,
  "bytesOutCounter" : 0,
  "msgOutCounter" : 0,
  "averageMsgSize" : 63.0,
  "msgChunkPublished" : false,
  "storageSize" : 65442,
  "backlogSize" : 0,
  "publishers" : [ {
    "msgRateIn" : 19.889469865137894,
    "msgThroughputIn" : 1253.0366015036873,
    "averageMsgSize" : 63.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "metadata" : { },
    "producerName" : "standalone-3-1",
    "connectedSince" : "2020-08-06T15:51:48.279Z",
    "clientVersion" : "2.6.0",
    "address" : "/127.0.0.1:53058"
  } ],
  "subscriptions" : { },
  "replication" : { },
  "deduplicationStatus" : "Disabled"
}
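If you would rather do this check from code than from the CLI, the Java admin client exposes the same stats. A minimal sketch, assuming the admin endpoint is at http://localhost:8080 and using the getter-style API of recent client versions (older 2.x releases expose the same data as public fields):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class ProducerNameCheck {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // admin URL is an assumption
                .build()) {
            TopicStats stats =
                admin.topics().getStats("persistent://public/default/test-topic");
            // Same data as the "publishers" section of the CLI output above.
            boolean nameTaken = stats.getPublishers().stream()
                .anyMatch(p -> "standalone-3-1".equals(p.getProducerName()));
            System.out.println("Producer name already attached: " + nameTaken);
        }
    }
}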

How is bounce rate calculated in Google Analytics?

I am pulling some metrics from Google Analytics (including bounce rate); the saved data is given below.
Which metrics are used to calculate bounce rate?
How can I calculate bounce rate from the other metric values?
{
  "_id" : ObjectId("5ecd09c83f80224b219b6827"),
  "source" : "(direct)",
  "medium" : "(none)",
  "pagePath" : "/",
  "channelGrouping" : "Direct",
  "deviceCategory" : "desktop",
  "date" : "20180326",
  "users" : 6,
  "sessions" : 6,
  "bounces" : 3,
  "avgSessionDuration" : 95.3333333333333,
  "pageviews" : 6,
  "newUsers" : 5,
  "sessionDuration" : 572,
  "pageviewsPerSession" : 1,
  "bounceRate" : 50,
  "goal" : 0,
  "accId" : "92025510",
  "agencyId" : ObjectId("5e3136e4c2a1b60c89ae07cc"),
  "accountMongoId" : ObjectId("5e4ee454cdc4db6a02696405"),
  "dataForDate" : "2018-04-01",
  "dataForDateTime" : ISODate("2018-03-26T00:00:00.000Z")
}
/* 2 */
{
  "_id" : ObjectId("5ecd09c83f80224b219b682c"),
  "source" : "(direct)",
  "medium" : "(none)",
  "pagePath" : "/",
  "channelGrouping" : "Direct",
  "deviceCategory" : "desktop",
  "date" : "20180401",
  "users" : 1,
  "sessions" : 1,
  "bounces" : 1,
  "avgSessionDuration" : 0,
  "pageviews" : 1,
  "newUsers" : 1,
  "sessionDuration" : 0,
  "pageviewsPerSession" : 1,
  "bounceRate" : 100,
  "goal" : 0,
  "accId" : "92025510",
  "agencyId" : ObjectId("5e3136e4c2a1b60c89ae07cc"),
  "accountMongoId" : ObjectId("5e4ee454cdc4db6a02696405"),
  "dataForDate" : "2018-04-01",
  "dataForDateTime" : ISODate("2018-04-01T00:00:00.000Z")
}
Bounce rate is the percentage of single-page sessions (i.e., sessions in which the person left the property from the first page).
This is an internal calculation done by Google; it is not one that you are expected to compute yourself. I suggest requesting ga:bounceRate directly.
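For what it's worth, the stored documents above are consistent with the documented definition (bounces divided by sessions, expressed as a percentage), so you can at least sanity-check the reported value. A quick check in Java using the numbers from the two documents shown:

public class BounceRateCheck {
    // Bounce rate as GA defines it: single-page sessions / all sessions, in percent.
    static double bounceRate(long bounces, long sessions) {
        return 100.0 * bounces / sessions;
    }

    public static void main(String[] args) {
        System.out.println(bounceRate(3, 6)); // 50.0  -- matches "bounceRate" : 50
        System.out.println(bounceRate(1, 1)); // 100.0 -- matches "bounceRate" : 100
    }
}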

Number of input rows in Spark Structured Streaming with a custom sink

I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces incorrect metrics for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        (FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
            List<TestOutputRecord> list = new ArrayList<>();
            try {
                TestOutputRecord result = transformer.convert(u);
                list.add(result);
            } catch (Throwable t) {
                System.err.println("Failed to convert a record");
                t.printStackTrace();
            }
            return list.iterator();
        },
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
  "id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
  "runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
  "name" : "custom-sink-stream",
  "timestamp" : "2018-01-25T18:39:52.949Z",
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 781,
    "triggerExecution" : 781
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "FileStreamSource[s3n://test-bucket/test]",
    "startOffset" : {
      "logOffset" : 0
    },
    "endOffset" : {
      "logOffset" : 0
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "com.mycompany.spark.MySink@f82a99"
  }
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan, so StreamExecution doesn't know about it and therefore cannot collect the metrics.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
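In sink code the change looks roughly like this (a minimal sketch against Spark 2.2's internal Sink trait; the actual write logic is elided):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.execution.streaming.Sink;

public class MySink implements Sink {
    @Override
    public void addBatch(long batchId, Dataset<Row> data) {
        // BAD: data.rdd() re-plans the query, so StreamExecution never sees
        // the rows and reports numInputRows = 0.
        //
        // GOOD: reuse the incremental plan StreamExecution is already tracking.
        data.queryExecution().toRdd().toJavaRDD().foreachPartition(rows -> {
            while (rows.hasNext()) {
                InternalRow row = rows.next();
                // write row to the external system (elided)
            }
        });
    }
}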

MongoDB taking too much time for old entries

I am new to MongoDB and I am facing an issue. I have hundreds of millions of documents in my collection, and I am trying to find a single entry with findOne(). When I look up recent entries the response comes back in milliseconds, but when I fetch older entries (around the 600 millionth document) the query takes around 2 minutes in the mongo shell, and my Node.js server gives
{ MongoError: connection 1 to 127.0.0.1:27017 timed out }
and sends an empty response. Can anyone tell me what I should do to resolve this issue? Thanks in advance.
explain gives me:
db.contacts.find({"phoneNumber":"9165900137"}).explain("executionStats")
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "meanApp.contacts",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "phoneNumber" : {
        "$eq" : "9165900137"
      }
    },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "phoneNumber" : {
          "$eq" : "9165900137"
        }
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 321188,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 495587806,
    "executionStages" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "phoneNumber" : {
          "$eq" : "9165900137"
        }
      },
      "nReturned" : 1,
      "executionTimeMillisEstimate" : 295230,
      "works" : 495587808,
      "advanced" : 1,
      "needTime" : 495587806,
      "needYield" : 0,
      "saveState" : 3871779,
      "restoreState" : 3871779,
      "isEOF" : 1,
      "invalidates" : 0,
      "direction" : "forward",
      "docsExamined" : 495587806
    }
  },
  "serverInfo" : {
    "host" : "li1025-15.members.linode.com",
    "port" : 27017,
    "version" : "3.2.16",
    "gitVersion" : "056bf45128114e44c5358c7a8776fb582363e094"
  },
  "ok" : 1
}
As indicated in the explain output, the current query is doing a collection scan (COLLSCAN). This means it has to examine every document in the collection to produce the match, and you have about half a billion documents.
Try adding the following index; it may take a while to build on a collection this size.
db.contacts.createIndex( { phoneNumber: 1 }, { background: true } )
Run the query again once the index build succeeds; you should see a dramatic improvement in performance. To be certain the index is being picked up, run explain again: the winning plan should show IXSCAN instead of COLLSCAN.
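If you would rather create the index from application code, the equivalent with the MongoDB Java driver looks like this (the connection string is an assumption; the database name comes from the namespace in the explain output):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class CreatePhoneNumberIndex {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> contacts =
                client.getDatabase("meanApp").getCollection("contacts");
            // Same as db.contacts.createIndex({ phoneNumber: 1 }, { background: true })
            contacts.createIndex(Indexes.ascending("phoneNumber"),
                new IndexOptions().background(true));
        }
    }
}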

Mongo Secondary Sync Stuck in Recovery State

I have one primary node and two secondary nodes. Due to some issues, one secondary node was stopped a few days back. Now I am trying to resync that secondary, but after some time it goes into the RECOVERING state and gets stuck there.
To resync, I deleted the data directory and restarted the mongod service.
DB size: 1.1 TB
sms3:PRIMARY> rs.status()
{
  "set" : "sms3",
  "date" : ISODate("2015-09-01T08:33:40Z"),
  "myState" : 1,
  "members" : [
    {
      "_id" : 9,
      "name" : "abc:27117",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 9415375,
      "optime" : Timestamp(1441096420, 7),
      "optimeDate" : ISODate("2015-09-01T08:33:40Z"),
      "self" : true
    },
    {
      "_id" : 10,
      "name" : "def:27117",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 9411728,
      "optime" : Timestamp(1441096418, 159),
      "optimeDate" : ISODate("2015-09-01T08:33:38Z"),
      "lastHeartbeat" : ISODate("2015-09-01T08:33:38Z"),
      "lastHeartbeatRecv" : ISODate("2015-09-01T08:33:39Z"),
      "pingMs" : 0,
      "syncingTo" : "db330.oak1.omniture.com:27117"
    },
    {
      "_id" : 11,
      "name" : "ghi:27117",
      "health" : 1,
      "state" : 3,
      "stateStr" : "RECOVERING",
      "uptime" : 53615,
      "optime" : Timestamp(1441042830, 300),
      "optimeDate" : ISODate("2015-08-31T17:40:30Z"),
      "lastHeartbeat" : ISODate("2015-09-01T08:33:39Z"),
      "lastHeartbeatRecv" : ISODate("2015-09-01T08:33:39Z"),
      "pingMs" : 0,
      "syncingTo" : "db330.oak1.omniture.com:27117"
    }
  ],
  "ok" : 1
}
sms3:PRIMARY> rs.config()
{
  "_id" : "sms3",
  "version" : 87615,
  "members" : [
    {
      "_id" : 9,
      "host" : "abc:27117"
    },
    {
      "_id" : 10,
      "host" : "def:27117",
      "priority" : 0.5
    },
    {
      "_id" : 11,
      "host" : "ghi:27117",
      "priority" : 0.5
    }
  ]
}
As I understand it, the secondary has fallen off the end of the oplog (a capped collection), so it can no longer catch up. You will need to clear everything from the secondary and then restart the sync.
I found the solution: I copied the data directory of the secondary node that was down from the other secondary node that was up, and restarted mongod.
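A likely reason the member kept dropping back into RECOVERING is that the initial sync took longer than the primary's oplog window, so the copy was already stale when it finished. Before retrying a resync, you can check the window on the primary in the mongo shell; the "log length start to end" it reports needs to comfortably exceed the time a full copy of the 1.1 TB takes, otherwise increase the oplog size first:

sms3:PRIMARY> rs.printReplicationInfo()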
