I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark reports an incorrect metric for the number of input rows - it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        (FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
            List<TestOutputRecord> list = new ArrayList<>();
            try {
                TestOutputRecord result = transformer.convert(u);
                list.add(result);
            } catch (Throwable t) {
                System.err.println("Failed to convert a record");
                t.printStackTrace();
            }
            return list.iterator();
        },
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();
writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was related to using dataset.rdd in my custom sink: it creates a new plan, so StreamExecution doesn't know about it and therefore cannot collect the metrics.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.
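For illustration, a minimal sketch in Scala of what the change looks like inside the sink's addBatch. The class body and write logic here are placeholders; only the switch from data.rdd to data.queryExecution.toRdd reflects the actual fix:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical custom sink; only the toRdd call mirrors the real fix.
class MySink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // data.rdd would build a brand-new plan that StreamExecution doesn't track,
    // so numInputRows stays 0; queryExecution.toRdd reuses the incremental plan
    // of the current micro-batch instead.
    data.queryExecution.toRdd.mapPartitions { rows =>
      rows.map { internalRow =>
        // write the row to the external store here (omitted)
        1L
      }
    }.count() // force the write to actually run
  }
}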
Is there any way to access current watermark value in Spark Structured Streaming?
I'd like to process events in their event-time order to find patterns in sequences. To do this I was thinking of using flatMapGroupsWithState, buffering events until the watermark passes (and avoiding buffering late events), and processing them one by one. But I don't know how to access the current watermark to do it. Is it even possible in Spark Structured Streaming?
You can access the StreamingQueryProgress from your StreamingQuery object:
query.lastProgress()/recentProgress()
It will contain an eventTime.watermark field, something like:
{
"id" : "eb7202da-9e60-4983-89fc-e1251aebf89d",
"runId" : "969555bd-6189-4b70-a101-3b5917cea965",
"name" : "my-query",
"timestamp" : "2023-01-05T16:46:43.372Z",
"batchId" : 6,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 7,
"triggerExecution" : 7
},
"eventTime" : {
"watermark" : "2023-01-01T09:44:11.000Z"
},
"stateOperators" : [ {
"operatorName" : "stateStoreSave",
...etc
}
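If you need the value programmatically rather than from the printed progress, a minimal sketch (assuming a running query object and that at least one batch has completed, since lastProgress can be null right after start):

val progress = query.lastProgress
if (progress != null && progress.eventTime.containsKey("watermark")) {
  // eventTime is a java.util.Map[String, String]; the watermark is an ISO-8601 timestamp
  val watermark = progress.eventTime.get("watermark")
  println(s"current watermark: $watermark")
}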
Pulsar allows multiple producers to attach to the same topic only if they have different producer names. Is there a way to check if a producer with the same name (and same topic) already exists?
You can use the stats command from the pulsar-admin CLI tool to list all of the producers attached to the topic as follows; then just look inside the publishers section of the JSON output for the producerName:
root@6b40ffcc05ec:/pulsar# ./bin/pulsar-admin topics stats persistent://public/default/test-topic
{
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"bytesInCounter" : 65442,
"msgInCounter" : 1002,
"bytesOutCounter" : 0,
"msgOutCounter" : 0,
"averageMsgSize" : 63.0,
"msgChunkPublished" : false,
"storageSize" : 65442,
"backlogSize" : 0,
"publishers" : [ {
"msgRateIn" : 19.889469865137894,
"msgThroughputIn" : 1253.0366015036873,
"averageMsgSize" : 63.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"metadata" : { },
"producerName" : "standalone-3-1",
"connectedSince" : "2020-08-06T15:51:48.279Z",
"clientVersion" : "2.6.0",
"address" : "/127.0.0.1:53058"
} ],
"subscriptions" : { },
"replication" : { },
"deduplicationStatus" : "Disabled"
}
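If you would rather check this from code than from the CLI, here is a rough sketch using the Pulsar admin client (called from Scala here). The serviceHttpUrl, the topic, and the exact stats accessors are assumptions based on the 2.6-era API, so verify them against your client version:

import scala.collection.JavaConverters._
import org.apache.pulsar.client.admin.PulsarAdmin

val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
  .build()

val stats = admin.topics().getStats("persistent://public/default/test-topic")
// publishers lists every producer currently attached to the topic
val alreadyThere = stats.publishers.asScala.exists(_.producerName == "standalone-3-1")
println(s"producer already connected: $alreadyThere")
admin.close()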
I have an application with Spark Structured Streaming and I would like to get some metrics like scheduling delay, latency, etc. Usually such metrics can be found in the Spark UI Streaming tab; however, such functionality does not exist for Structured Streaming as far as I know.
So how can I get these metric values?
For now, I have tried to use the query progress, but not all of the required metrics can be found in the results:
QueryProgress {
"timestamp" : "2019-11-19T20:14:07.011Z",
"batchId" : 1,
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564,
"durationMs" : {
"addBatch" : 6902,
"getBatch" : 1,
"getEndOffset" : 0,
"queryPlanning" : 81,
"setOffsetRange" : 20,
"triggerExecution" : 7136,
"walCommit" : 41
},
"stateOperators" : [ {
"numRowsTotal" : 2,
"numRowsUpdated" : 2,
"memoryUsedBytes" : 75415,
"customMetrics" : {
"loadedMapCacheHitCount" : 400,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 17815
}
} ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[tweets]]",
"startOffset" : {
"tweets" : {
"0" : 579
}
},
"endOffset" : {
"tweets" : {
"0" : 587
}
},
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564
} ]
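For reference, a progress object like the one above can also be captured continuously with a StreamingQueryListener rather than by polling lastProgress; a rough sketch, assuming spark is the existing SparkSession, from which batch duration and the per-stage timings in durationMs can be derived:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // durationMs breaks the trigger down into addBatch, getBatch, walCommit, etc.
    val triggerMs = p.durationMs.get("triggerExecution")
    println(s"batch=${p.batchId} rows=${p.numInputRows} " +
      s"processedRows/s=${p.processedRowsPerSecond} triggerMs=$triggerMs")
  }
})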
I am writing a program to analyze SQL queries, so I am using Spark's logical plan.
Below is the code I am using:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.slf4j.LoggerFactory

object QueryAnalyzer {
  val LOG = LoggerFactory.getLogger(this.getClass)

  // Spark conf
  val conf = new SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")

  // Spark context
  val sc = new SparkContext(conf)

  // SQL context
  val sqlContext = new SQLContext(sc)

  // Spark session
  val sparkSession = SparkSession
    .builder()
    .appName("Spark User Data")
    .config("spark.app.name", "LocalEdl")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    var inputDfColumns = Map[String, List[String]]()

    val dfSession = sparkSession.
      read.
      format("csv").
      option("header", EdlConstants.TRUE).
      option("inferschema", EdlConstants.TRUE).
      option("delimiter", ",").
      option("decoding", EdlConstants.UTF8).
      option("multiline", true)

    var oDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\order.csv")
    println("sample data in oDF====>")
    oDF.show()

    var cusDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\customer.csv")
    println("sample data in cusDF====>")
    cusDF.show()

    oDF.createOrReplaceTempView("orderTempView")
    cusDF.createOrReplaceTempView("customerTempView")

    // get input columns from all dataframes
    inputDfColumns += ("orderTempView" -> oDF.columns.toList)
    inputDfColumns += ("customerTempView" -> cusDF.columns.toList)

    val res = sqlContext.sql("""select OID, max(MID+CID) as MID_new, ROW_NUMBER() OVER (
        ORDER BY CID) as rn from
        (select OID_1 as OID, CID_1 as CID, OID_1+CID_1 as MID from
          (select min(ot.OrderID) as OID_1, ct.CustomerID as CID_1
           from orderTempView as ot inner join customerTempView as ct
           on ot.CustomerID = ct.CustomerID group by CID_1)) group by OID, CID""")
    res.show(false)

    val analyzedPlan = res.queryExecution.analyzed
    println(analyzedPlan.prettyJson)
  }
}
Now the problem is: with Spark 2.2.1 I get the JSON below, where the SubqueryAlias node provides the important piece of information, the alias name of the table used in the query:
...
...
...
[ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "OrderDate",
"dataType" : "string",
"nullable" : true,
"metadata" : { },
"exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 2,
"jvmId" : "acefe6e6-e469-4c9a-8a36-5694f054dc0a"
},
"isGenerated" : false
} ] ]
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical._**SubqueryAlias**_",
"num-children" : 1,
"alias" : "ct",
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical._**SubqueryAlias**_",
"num-children" : 1,
"alias" : "customertempview",
"child" : 0
}, {
"class" : "org.apache.spark.sql.execution.datasources.LogicalRelation",
"num-children" : 0,
"relation" : null,
"output" :
...
...
...
But with Spark 2.4, I am getting the SubqueryAlias name as null, as shown in the JSON below:
...
...
{
"class":
"org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children": 0,
"name": "CustomerID",
"dataType": "integer",
"nullable": true,
"metadata": {},
"exprId": {
"product-class":
"org.apache.spark.sql.catalyst.expressions.ExprId",
"id": 19,
"jvmId": "3b0dde0c-0b8f-4c63-a3ed-4dba526f8331"
},
"qualifier": "[ct]"
}]
}, {
"class":
"org.apache.spark.sql.catalyst.plans.logical._**SubqueryAlias**_",
"num-children": 1,
"name": null,
"child": 0
}, {
"class":
"org.apache.spark.sql.catalyst.plans.logical._**SubqueryAlias**_",
"num-children": 1,
"name": null,
"child": 0
}, {
"class":
"org.apache.spark.sql.execution.datasources.LogicalRelation",
"num-children": 0,
"relation": null,
"output":
...
...
So I am not sure whether this is a bug in Spark 2.4 that makes the SubqueryAlias name come out as null, or, if it is not a bug, how I can recover the relation between the alias name and the real table name.
Any idea on this?
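A possible workaround (a sketch only, not verified on every Spark version): instead of relying on the JSON rendering of the plan, walk the analyzed plan itself and collect the SubqueryAlias nodes together with the relations beneath them. The alias accessor has moved between releases (a plain field in 2.2, exposed through name/alias later), so adjust it to your Spark API:

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}

// Collect (alias, underlying leaf relations) pairs from an analyzed plan.
def collectAliases(plan: LogicalPlan): Seq[(String, Seq[String])] =
  plan.collect {
    case s: SubqueryAlias =>
      // s.alias is assumed here; in some versions the value lives on s.name instead
      (s.alias, s.child.collectLeaves().map(_.simpleString))
  }

// Usage with the res DataFrame from the question:
// collectAliases(res.queryExecution.analyzed).foreach(println)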
I have a collection with a sub-document consisting of more than 40K records.
My aggregate query takes about 300 seconds. I have tried optimizing it using compound as well as multi-key indexing, which brings it down to 180 seconds.
I still need to reduce the query execution time further.
Here is my collection:
{
"_id" : ObjectId("545b32cc7e9b99112e7ddd97"),
"grp_id" : 654,
"user_id" : 2,
"mod_on" : ISODate("2014-11-06T08:35:40.857Z"),
"crtd_on" : ISODate("2014-11-06T08:35:24.791Z"),
"uploadTp" : 0,
"tp" : 1,
"status" : 3,
"id_url" : [
{"mid":"xyz12793"},
{"mid":"xyz12794"},
{"mid":"xyz12795"},
{"mid":"xyz12796"}
],
"incl" : 1,
"total_cnt" : 25,
"succ_cnt" : 25,
"fail_cnt" : 0
}
and the following is my query:
db.member_id_transactions.aggregate([ { '$match':
{ id_url: { '$elemMatch': { mid: 'xyz12794' } } } },
{ '$unwind': '$id_url' },
{ '$match': { grp_id: 654, 'id_url.mid': 'xyz12794' } } ])
Has anyone faced the same issue?
Here is the output of the aggregate query with the explain option:
{
"result" : [
{
"_id" : ObjectId("546342467e6d1f4951b56285"),
"grp_id" : 685,
"user_id" : 2,
"mod_on" : ISODate("2014-11-12T11:24:01.336Z"),
"crtd_on" : ISODate("2014-11-12T11:19:34.682Z"),
"uploadTp" : 1,
"tp" : 1,
"status" : 3,
"id_url" : [
{"mid":"xyz12793"},
{"mid":"xyz12794"},
{"mid":"xyz12795"},
{"mid":"xyz12796"}
],
"incl" : 1,
"__v" : 0,
"total_cnt" : 21406,
"succ_cnt" : 21402,
"fail_cnt" : 4
}
],
"ok" : 1,
"$gleStats" : {
"lastOpTime" : Timestamp(0, 0),
"electionId" : ObjectId("545c8d37ab9cc679383a1b1b")
}
}
One way to reduce the number of records being filtered further is to include the field grp_id in the first $match operator:
db.member_id_transactions.aggregate([
{$match:{ "id_url.mid": 'xyz12794',"grp_id": 654 } },
{$unwind: "$id_url" },
{$match: { "id_url.mid": "xyz12794" } }
])
See how the performance is now, and add grp_id to the index to get a better response time.
The above aggregation query, though it works, is unnecessary: since you are not altering the structure of the document and you expect only one element in the array to match the filter condition, you could just use a simple find with a projection.
db.member_id_transactions.find(
{ "id_url.mid": "xyz12794","grp_id": 654 },
{"_id":0,"grp_id":1,"id_url":{$elemMatch:{"mid":"xyz12794"}},
"user_id":1,"mod_on":1,"crtd_on":1,"uploadTp":1,
"tp":1,"status":1,"incl":1,"total_cnt":1,
"succ_cnt":1,"fail_cnt":1
}
)