I am familiar with explain() (and also the web UI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, i.e. the information returned by explain(), but as an image.
A picture like a PNG or JPG? Never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned).
The other phases of query execution are available using TreeNode methods, which (among many methods that could help you out) give you my favorite, numberedTreeString.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Range (0, 5, step=1, splits=Some(8))
scala> println(q.queryExecution.executedPlan.numberedTreeString)
00 *Range (0, 5, step=1, splits=8)
You can also save the output as JSON using toJSON or prettyJson and generate a PNG from that (though I've never tried it out myself); see the sketch below the JSON output for one way to get a picture.
scala> println(q.queryExecution.executedPlan.prettyJson)
[ {
"class" : "org.apache.spark.sql.execution.WholeStageCodegenExec",
"num-children" : 1,
"child" : 0
}, {
"class" : "org.apache.spark.sql.execution.RangeExec",
"num-children" : 0,
"range" : [ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Range",
"num-children" : 0,
"start" : 0,
"end" : 5,
"step" : 1,
"numSlices" : 8,
"output" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "id",
"dataType" : "long",
"nullable" : false,
"metadata" : { },
"exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 0,
"jvmId" : "cb497d01-3b90-42a7-9ebf-ebe85578f763"
},
"isGenerated" : false
} ] ]
} ]
} ]
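If an actual picture is what you are after, one option is to walk the plan tree yourself and emit Graphviz DOT, then render it with the dot command. The toDot helper below is just my own sketch (untested); it only relies on nodeName and children from TreeNode:

import org.apache.spark.sql.execution.SparkPlan

// Emit Graphviz DOT for a physical plan; render with: dot -Tpng plan.dot -o plan.png
def toDot(plan: SparkPlan): String = {
  val sb = new StringBuilder("digraph plan {\n  node [shape=box];\n")
  def walk(node: SparkPlan, id: String): Unit = {
    sb.append(s"""  "$id" [label="${node.nodeName}"];""").append("\n")
    node.children.zipWithIndex.foreach { case (child, i) =>
      val childId = s"$id-$i"
      sb.append(s"""  "$id" -> "$childId";""").append("\n")
      walk(child, childId)
    }
  }
  walk(plan, "0")
  sb.append("}").toString
}

// scala> val dot = toDot(q.queryExecution.executedPlan)
// scala> java.nio.file.Files.write(java.nio.file.Paths.get("plan.dot"), dot.getBytes)

The same idea works for the logical plans (analyzed, optimizedPlan) if you switch the parameter type to LogicalPlan.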
** Asked through a translator.
I want to express 3D geometry in glTF (I want to use it in CesiumJS).
First, I converted the coordinates to ECEF coordinates.
Next, I created a glTF file.
However, it does not look right on the preview screen (VS Code extension).
Also, trembling occurs when moving.
I want to know the cause.
** coordinates (EPSG:4326)
-73.561001356474, 45.4966833629139, 19.9
-73.5610780383518, 45.4967314096038, 19.9
-73.5610843141094, 45.4967252603133, 19.9
-73.561001356474, 45.4966833629139, 19.9
** preview glTF (VS Code extension)
** glTF source
{
"scenes" : [
{
"nodes" : [ 0 ]
}
],
"nodes" : [
{
"mesh" : 0
}
],
"meshes" : [
{
"name": "gltf test",
"primitives" : [
{
"attributes" : {
"POSITION" : 1
},
"indices" : 0,
"mode": 4
}
]
}
],
"buffers" : [
{
"uri" : "data:application/octet-stream;base64,AAABAAIAAAAYtZpJTBWDyiIhikritJpJSBWDyiohikrftJpJSRWDyikhiko=",
"byteLength" : 44
}
],
"bufferViews" : [
{
"buffer" : 0,
"byteOffset" : 0,
"byteLength" : 6,
"target" : 34963
},
{
"buffer" : 0,
"byteOffset" : 8,
"byteLength" : 36,
"byteStride": 12,
"target" : 34962
}
],
"accessors" : [
{
"bufferView" : 0,
"byteOffset" : 0,
"componentType" : 5123,
"count" : 3,
"type" : "SCALAR"
},
{
"bufferView" : 1,
"byteOffset" : 0,
"componentType" : 5126,
"count" : 3,
"type" : "VEC3",
"min" : [ 1267355.8953368044, -4295333.93970134, 4526225.02467026],
"max" : [ 1267363.054331189, -4295331.983019848, 4526228.767742585]
}
],
"asset" : {
"version" : "2.0"
}
}
You haven't shared how you created this file, so we can't really tell you the cause of the issue. I would recommend adding enough information to fully reproduce the problem. But importantly, the file you've shared has only three vertices. Try inspecting it in VS Code's glTF addon.
I have an application using Spark Structured Streaming and I would like to get some metrics like scheduling delay, latency, etc. Usually, such metrics can be found in the Spark UI Streaming tab; however, such functionality does not exist for Structured Streaming as far as I know.
So how can I get these metrics values?
For now, I have tried to use the query progress (one way to capture it is sketched below the sample output), but not all of the required metrics can be found in the results:
QueryProgress {
"timestamp" : "2019-11-19T20:14:07.011Z",
"batchId" : 1,
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564,
"durationMs" : {
"addBatch" : 6902,
"getBatch" : 1,
"getEndOffset" : 0,
"queryPlanning" : 81,
"setOffsetRange" : 20,
"triggerExecution" : 7136,
"walCommit" : 41
},
"stateOperators" : [ {
"numRowsTotal" : 2,
"numRowsUpdated" : 2,
"memoryUsedBytes" : 75415,
"customMetrics" : {
"loadedMapCacheHitCount" : 400,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 17815
}
} ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[tweets]]",
"startOffset" : {
"tweets" : {
"0" : 579
}
},
"endOffset" : {
"tweets" : {
"0" : 587
}
},
"numInputRows" : 8,
"inputRowsPerSecond" : 0.8429038036034138,
"processedRowsPerSecond" : 1.1210762331838564
} ]
}
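For context, registering a StreamingQueryListener is one way to get hold of these progress objects programmatically; a minimal sketch (which metrics to print is of course up to you):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Prints a few metrics for every micro-batch; register it before starting the query.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} " +
      s"inputRowsPerSecond=${p.inputRowsPerSecond} " +
      s"processedRowsPerSecond=${p.processedRowsPerSecond} " +
      s"durationMs=${p.durationMs}")  // addBatch, triggerExecution, walCommit, ...
  }
})

The same progress object is also available on demand via query.lastProgress and query.recentProgress.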
I'm trying to find entries in my data which are equal in more than one aspect. I currently do this using a complex query which nests aggregations:
{
"size": 0,
"aggs": {
"duplicateFIELD1": {
"terms": {
"field": "FIELD1",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD2": {
"terms": {
"field": "FIELD2",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD3": {
"terms": {
"field": "FIELD3",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD4": {
"terms": {
"field": "FIELD4",
"min_doc_count": 2 },
"aggs": {
"duplicate_documents": {
"top_hits": {} } } } } } } } } } } }
This works to an extent, as the result I get when no duplicates are found looks something like this:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 27524067,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"duplicateFIELD1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 27524027,
"buckets" : [
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
...
I'm skipping some of the output which looks rather similar.
I can now scan through this complex, deeply nested data structure and find that no documents are stored in any of these nested buckets. But this seems rather cumbersome. I guess there might be a better (more straightforward) way of doing this.
Also, if I want to check more than four fields, this nested structure will grow and grow and grow. So it does not scale very well and I want to avoid this.
Can I improve my solution so that I get a simple list of all documents which are duplicates? (Maybe with the ones which are duplicates of each other grouped together somehow.) Or is there a completely different approach (such as one without aggregations) which does not have the drawbacks I described here?
EDIT: I found an approach using the script feature of ES here, but in my version of ES this returns just an error message. Maybe someone can point out to me how to do it in ES 5.0? My trials up to now did not work.
EDIT: I found a way to use a script for my approach using the modern scripting language ("painless"):
{
"size": 0,
"aggs": {
"duplicateFOO": {
"terms": {
"script": {
"lang": "painless",
"inline": "doc['FIELD1'].value + doc['FIELD2'].value + doc['FIELD3'].value + doc['FIELD4'].value"
},
"min_doc_count": 2
}
}
}
}
This seems to work for very small amounts of data but results in an error for realistic amounts of data (circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]). Any idea how I can fix this? Probably by adjusting some Elasticsearch configuration to make it use larger internal buffers or similar?
There does not seem to be a proper solution for my situation which avoids the nesting in a general way.
Fortunately, three of my four fields have a very limited value range: the first can only be 1 or 2, the second can be 1, 2, or 3, and the third can be 1, 2, 3, or 4. Since these are just 24 combinations, I currently filter the complete data set down to one 24th before applying the aggregation, which then only needs to cover the remaining fourth field. I then have to apply all actions 24 times (once for each combination of the three limited fields mentioned above), but this is still more feasible than handling the complete data set at once.
The query (i.e. one of the 24 queries) I send now looks something like this:
{
"size": 0,
"query": {
"bool": {
"must": [
{ "match": { "FIELD1": 2 } },
{ "match": { "FIELD2": 3 } },
{ "match": { "FIELD3": 4 } } ] } },
"aggs": {
"duplicateFIELD4": {
"terms": {
"field": "FIELD4",
"min_doc_count": 2 } } } }
The results for this, of course, are not nested anymore. But this approach cannot be used if more than one field holds arbitrary values from a larger range.
I also found out that, if nesting must be done, the fields with the most limited value range (e.g. just two values like "1 or 2") should be innermost, and the one with the largest value range should be outermost. This improves performance greatly (but still not enough in my case). Getting it wrong can leave you with an unusable query (no response within hours, and finally an out-of-memory error on the server side).
I now think that aggregating properly is the key to solving a problem like mine. The approach using a script to get a flat bucket list (as described in my question) is bound to overload the server, as it cannot distribute the task in any way: in the case that no duplicate is found at all, it has to hold a bucket for each document in memory (with just one document in it). Even if only a few duplicates exist, this cannot be done for larger data sets. If nothing else is possible, one will need to split the data set into groups artificially. E.g. one can create 16 sub-data sets by building a hash out of the relevant fields and using its last 4 bits to put each document into one of the 16 groups. Each group can then be handled separately; duplicates are bound to fall into the same group with this technique. A sketch of one of the 16 sub-queries follows below.
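To illustrate the splitting idea (untested, and whether the script filter itself is cheap enough on an index of this size is a separate question), one of the 16 sub-queries could combine a script filter selecting a single group with the flat terms script from my question, with the comparison value running from 0 to 15:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "inline": "((doc['FIELD1'].value + doc['FIELD2'].value + doc['FIELD3'].value + doc['FIELD4'].value).hashCode() & 15) == 0"
          }
        }
      }
    }
  },
  "aggs": {
    "duplicateFOO": {
      "terms": {
        "script": {
          "lang": "painless",
          "inline": "doc['FIELD1'].value + doc['FIELD2'].value + doc['FIELD3'].value + doc['FIELD4'].value"
        },
        "min_doc_count": 2
      }
    }
  }
}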
But independently of these general thoughts, the ES API should provide some means to paginate through the result of aggregations. It's a pity that there is no such option (yet).
Your last approach seems to be the best one. And you can update your Elasticsearch settings as follows:
indices.breaker.request.limit: "75%"
indices.breaker.total.limit: "85%"
I have chosen 75% because the default is 60%, which is 5.9gb in your Elasticsearch instance, while your query is reaching ~6.3gb, which is around 71.1% based on your log:
circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]
And finally, indices.breaker.total.limit must be greater than indices.breaker.fielddata.limit according to the Elasticsearch documentation.
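If editing elasticsearch.yml and restarting the nodes is inconvenient, these breaker limits can, as far as I know, also be changed at runtime through the cluster settings API:

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "75%",
    "indices.breaker.total.limit": "85%"
  }
}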
An idea that might work in a Logstash scenario is to use copy fields:
Copy all combinations into separate fields and concatenate them:
mutate {
add_field => {
"new_field" => "%{oldfield1} %{oldfield2}"
}
}
Aggregate over the new field.
Have a look here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html
I don't know if add_field supports arrays (others do if you look at the documentation). If it does not, you could try to add several new fields and use merge to end up with just one field, for example:
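A rough sketch of that variant (untested; the field names are placeholders):

mutate {
  add_field => {
    "combo_a" => "%{FIELD1}_%{FIELD2}"
    "combo_b" => "%{FIELD3}_%{FIELD4}"
  }
}
mutate {
  # merge turns combo_a into an array holding both values
  merge => { "combo_a" => "combo_b" }
}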
If you can do this at index time it would certainly be better.
You only need the combinations (A_B), not all permutations (A_B, B_A).
I've written a back end node server for a multiplayer game I'm developing and most of the time each request takes about 20-100ms to resolve. However, sometimes (Maybe 1 out of 50 requests) I will do the same request and it will take 2000+ms to resolve.
The server is written entirely in node.js and is hosted on heroku. I am using mongoose to make the calls to the database.
Here is a screenshot of the logs; at the top you can see how queries normally function. The request comes in at 19:03:03.68 and the response is sent out at 19:03:03.73; saving all the data finishes at 19:03:03.74. Heroku logs the request as taking 58ms, which is the desired and expected outcome.
Below that is when the issue occurs. You can see multiple requests come in from two separate clients (each client sends ~1 request per second, which is correct). However, the requests build up, and after about 2000-5000ms they all quickly resolve one after another. I’ve tried narrowing down the issue without much luck, but I believe it’s related to when I query the database: as you can see, multiple requests come in, but the first query to the database doesn’t actually resolve until around 2300ms later. As far as I can tell these requests are identical to the ones that resolve in 20-100ms and occur completely at random.
The actual code on the server is similar to this (simplified for the sake of this question):
console.log("request received");
Game.findOne({ 'id': gameID }, function(err, theGame) {
    console.log("First Query");
    // ... update and save the game, then send the response ...
});
I also opened up the mongo shell for the database to look for queries taking an excessive amount of time (>2000ms) with this code:
db.system.profile.find( {millis: {$gt : 2000} } ).sort( { ts: 1} );
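(This relies on database profiling being enabled; something along the lines of the following, with a threshold of my choosing here, tells MongoDB to record operations slower than 2000ms into system.profile.)

db.setProfilingLevel(1, 2000)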
Here are the slightly modified results which should include everything relevant:
{ "op" : "update", "ns" : "theDb.players", "query" :
{ "_id" : ObjectId("572b8eb242d70903005df0df")
}, "updateobj" :
{ "$set" :
{ "lastSeen" : ISODate("2016-05-05T18:19:30.761Z"), "timeElapsed" : 16
}
}, "nscanned" : 1, "nscannedObjects" : 1, "nMatched" : 1, "nModified" : 1, "fastmod" : true, "keyUpdates" : 0, "writeConflicts" : 0, "numYield" : 0, "locks" :
{ "Global" :
{ "acquireCount":
{ "r" : NumberLong(2), "w" : NumberLong(2) }
}, "MMAPV1Journal" :
{ "acquireCount" :
{ "w" : NumberLong(2) }, "acquireWaitCount" :
{ "w" : NumberLong(1) }, "timeAcquiringMicros" :
{ "w" : NumberLong(7294179) }
}, "Database" :
{ "acquireCount" :
{ "w" : NumberLong(2) }
}, "Collection" :
{ "acquireCount" :
{ "W" : NumberLong(1) }
}, "oplog" :
{ "acquireCount" :
{ "w" : NumberLong(1) }
}
}, "milli" : 2298, "execStats" : {}, "ts" : ISODate("2016-05-05T18:19:33.060Z")
Second Result:
{ "op" : "update", "ns" : "theDb.connections", "query" :
{ "_id" : ObjectId("572b8eaf42d70903005df0dd")
}, "updateobj" :
{ "$set" :
{ "internalCounter" : 3, "lastCount" : 3, "lastSeen" : ISODate("2016-05-05T18:19:30.761Z"), "playerID" : 128863276517, "sinceLast" : 0
}
}, "nscanned" : 1, "nscannedObjects" : 1, "nMatched" : 1, "nModified" : 1, "keyUpdates" : 0, "writeConflicts" : 0, "numYield" : 0, "locks" :
{ "Global" :
{ "acquireCount" :
{ "r" : NumberLong(2), "w" : NumberLong(2)
}
}, "MMAPV1Journal" :
{ "acquireCount" :
{ "w" : NumberLong(2) }, "acquireWaitCount" :
{ "w" : NumberLong(1) }, "timeAcquiringMicros" :
{ "w" :NumberLong(7294149) }
}, "Database" :
{ "acquireCount" :
{ "w" : NumberLong(2) }
}, "Collection" :
{ "acquireCount" :
{ "W" : NumberLong(1) }
}, "oplog" :
{ "acquireCount" :
{ "w" : NumberLong(1) }
}
}, "millis" : 2299, "execStats" : {},"ts" : ISODate("2016-05-05T18:19:33.061Z")
I really need to ensure the latency for any request never exceeds 500ms, otherwise it is extremely irritating in the game itself. I’m really at a loss as to what might be causing this and how to find out more.
I'm assuming the cause of the issue is that timeAcquiringMicros is so long. I'm unsure of what is causing that, though.
*Note: the client requests the data with standard HTTP requests; I’m not currently using any sockets.
Alright, I've finally solved the issue. The problem wasn't actually connected to anything I had done. I was using the sandbox plan that mLab offers in connection with Heroku, which had my application competing for processing time with other people also using the sandbox plan. Their queries were slowing down the database, causing those spikes in response times.
The solution: I had to upgrade to their shared cluster plan. Since upgrading I haven't had any irregularities in query times.
I have a collection with a sub-document consisting of more than 40K records.
My aggregate query takes about 300 seconds. I have tried optimizing it using compound as well as multikey indexes, which brings it down to 180 seconds.
I still need to reduce the query execution time further.
Here is a sample document from my collection:
{
"_id" : ObjectId("545b32cc7e9b99112e7ddd97"),
"grp_id" : 654,
"user_id" : 2,
"mod_on" : ISODate("2014-11-06T08:35:40.857Z"),
"crtd_on" : ISODate("2014-11-06T08:35:24.791Z"),
"uploadTp" : 0,
"tp" : 1,
"status" : 3,
"id_url" : [
{"mid":"xyz12793"},
{"mid":"xyz12794"},
{"mid":"xyz12795"},
{"mid":"xyz12796"}
],
"incl" : 1,
"total_cnt" : 25,
"succ_cnt" : 25,
"fail_cnt" : 0
}
and the following is my query:
db.member_id_transactions.aggregate([ { '$match':
{ id_url: { '$elemMatch': { mid: 'xyz12794' } } } },
{ '$unwind': '$id_url' },
{ '$match': { grp_id: 654, 'id_url.mid': 'xyz12794' } } ])
Has anyone faced the same issue?
Here's the output of the aggregate query with the explain option:
{
"result" : [
{
"_id" : ObjectId("546342467e6d1f4951b56285"),
"grp_id" : 685,
"user_id" : 2,
"mod_on" : ISODate("2014-11-12T11:24:01.336Z"),
"crtd_on" : ISODate("2014-11-12T11:19:34.682Z"),
"uploadTp" : 1,
"tp" : 1,
"status" : 3,
"id_url" : [
{"mid":"xyz12793"},
{"mid":"xyz12794"},
{"mid":"xyz12795"},
{"mid":"xyz12796"}
],
"incl" : 1,
"__v" : 0,
"total_cnt" : 21406,
"succ_cnt" : 21402,
"fail_cnt" : 4
}
],
"ok" : 1,
"$gleStats" : {
"lastOpTime" : Timestamp(0, 0),
"electionId" : ObjectId("545c8d37ab9cc679383a1b1b")
}
}
One way to reduce the number of records being filtered further down the pipeline is to include the field grp_id in the first $match stage.
db.member_id_transactions.aggregate([
{$match:{ "id_url.mid": 'xyz12794',"grp_id": 654 } },
{$unwind: "$id_url" },
{$match: { "id_url.mid": "xyz12794" } }
])
See how the performance is now. Add grp_id to the index to get a better response time, as sketched below.
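A compound index along these lines should cover both conditions (using the collection name from the question; the index on id_url.mid will be a multikey index since id_url is an array):

db.member_id_transactions.ensureIndex({ "grp_id": 1, "id_url.mid": 1 })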
The above aggregation query, though it works, is unnecessary. Since you are not altering the structure of the document and you expect only one element in the array to match the filter condition, you could just use a simple find with a projection:
db.member_id_transactions.find(
{ "id_url.mid": "xyz12794","grp_id": 654 },
{"_id":0,"grp_id":1,"id_url":{$elemMatch:{"mid":"xyz12794"}},
"user_id":1,"mod_on":1,"crtd_on":1,"uploadTp":1,
"tp":1,"status":1,"incl":1,"total_cnt":1,
"succ_cnt":1,"fail_cnt":1
}
)