MongoDB slow performance when fetching large array of documents - node.js

So I've been at this for a week now and I can't solve it. Queries for large arrays of documents are way too slow.
I'm running a super basic find query followed by a .toArray() and then sending it to our frontend.
const orders = await db.collection('Orders').find({ organization: new ObjectId('5fa28c7ad882490116f8761e') }).toArray()
The query itself takes 25 milliseconds, as seen in the executionStats below, but retrieving the data takes 4.5 seconds.
A couple of hundred documents work as expected; it takes a couple of hundred milliseconds at most. But beyond 500 documents it quickly slows down, and now I'm at a 4.5-second fetch for ~25k documents. And this isn't even the final query; I'm going to run an aggregation with a few steps on top of this, but the plain fetch seems to be the main issue, and it can't just be hardware related. I've tried bumping the database up to an M60 instead of the M30 mentioned below, with almost no difference, and it's way too expensive for our company size.
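For reference, this is roughly how I'm timing it with the standard Node.js driver (a minimal sketch; the explain call and the batchSize value are just experiments on my side, not production code):
const { MongoClient, ObjectId } = require('mongodb');
// db is the Db handle from an already connected MongoClient
const orgFilter = { organization: new ObjectId('5fa28c7ad882490116f8761e') };

// Server-side execution only (executionStats reports ~25 ms for this):
console.time('explain');
await db.collection('Orders').find(orgFilter).explain('executionStats');
console.timeEnd('explain');

// Full fetch: execution plus draining ~25k documents to the client (this is the ~4.5 s part):
console.time('toArray');
const orders = await db.collection('Orders').find(orgFilter).batchSize(10000).toArray();
console.timeEnd('toArray');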
What can I do? Is MongoDB just unable to return anything more than a few hundred documents?
Here's some stats:
Database: Hosted at MongoDB Atlas with the server in the same country, Cluster size M30 (8 GB RAM, 100 GB storage, 2 vCPUs, 288 IOPS). Up to 10 GBit/s network.
My machine: 64 GB 3600 MHz RAM, 5950x CPU, 970 EVO Plus M.2 SSD. 150 MBit/s network.
Collection stats:
{
"ns" : "[db].Orders",
"size" : 207216190,
"count" : 467771,
"avgObjSize" : 442,
"storageSize" : 59006976,
"freeStorageSize" : 12288,
"capped" : false,
"wiredTiger" : {
"metadata" : {
"formatVersion" : 1
}
}
}
Execution stats:
{
"executionSuccess": true,
"nReturned": 26385,
"executionTimeMillis": 25,
"totalKeysExamined": 26385,
"totalDocsExamined": 26385,
"executionStages": {
"stage": "FETCH",
"nReturned": 26385,
"executionTimeMillisEstimate": 3,
"works": 26386,
"advanced": 26385,
"needTime": 0,
"needYield": 0,
"saveState": 26,
"restoreState": 26,
"isEOF": 1,
"docsExamined": 26385,
"alreadyHasObj": 0,
"inputStage": {
"stage": "IXSCAN",
"nReturned": 26385,
"executionTimeMillisEstimate": 2,
"works": 26386,
"advanced": 26385,
"needTime": 0,
"needYield": 0,
"saveState": 26,
"restoreState": 26,
"isEOF": 1,
"keyPattern": {
"organization": 1,
"state.removed": 1
},
"indexName": "organization_1_state.removed_1",
"isMultiKey": false,
"multiKeyPaths": {
"organization": [],
"state.removed": []
},
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"organization": [
"[ObjectId('5fa28c7ad882490116f8761e'), ObjectId('5fa28c7ad882490116f8761e')]"
],
"state.removed": [
"[MinKey, MaxKey]"
]
},
"keysExamined": 26385,
"seeks": 1,
"dupsTested": 0,
"dupsDropped": 0
}
}
}

Related

Azure Gremlin edge traversal suspiciously high (Out() step) RU cost

I have a weird issue where doing an out() operation across a few edges causes my RU cost to triple. I hope someone can help me shed light on why, and on what I can do to mitigate it.
I have a Graph in CosmosDB, where there are two types of vertex labels: "Profile" and "Score". Each profile has 0 or 1 score-vertices via a "ProfileHasAggregatedScore" edge. The partitionKey is the ID of the Profile.
If I make the following query, the RU cost currently is:
g.V().hasLabel('Profile').out('ProfileHasAggregatedScore')
>78 RU (8 scores found)
And for reference, the cost of getting all vertices of a type is:
g.V().hasLabel('Profile')
>28 RU (110 profiles found)
g.E().hasLabel('ProfileHasAggregatedScore')
>11 RU (8 edges found)
g.V().hasLabel('AggregatedRating')
>11 RU (8 scores found)
And the cost of a single vertex or edge is:
g.V('aProfileId').hasLabel('Profile')
>4 RU (1 found)
g.E('anEdgeId')
> 7RU
g.V('aRatingId')
> 3.5 RU
Can someone please help me understand why a traversal that only touches a few vertices along the way (see the execution profile at the bottom) is more expensive than searching for everything? And is there something I can do to prevent it? Adding a has-filter with the partitionKey does not seem to help (sketched below). It seems odd that traversing/finding 16 more elements (8 edges and 8 vertices) after finding 110 vertices triples the cost of the operation.
(NB: with 1000 profiles, the cost of doing one traversal along an edge to the score node is 2200 RU. This seems high, considering the emphasis the Azure team puts on it being scalable.)
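For completeness, this is roughly what the partition-key-filtered attempt looks like, submitted through the gremlin JavaScript client (the endpoint, database/graph path, key, partition-key property name and profile id below are placeholders, not values from my actual setup):
const gremlin = require('gremlin');

// Placeholder Cosmos DB connection details
const authenticator = new gremlin.driver.auth.PlainTextSaslAuthenticator(
  '/dbs/myDatabase/colls/myGraph', 'myPrimaryKey');
const client = new gremlin.driver.Client('wss://myaccount.gremlin.cosmos.azure.com:443/', {
  authenticator,
  traversalsource: 'g',
  mimeType: 'application/vnd.gremlin-v2.0+json'
});

async function scoresForProfile(pk) {
  // has(label, partitionKeyProperty, value) scopes the traversal to a single partition
  const resultSet = await client.submit(
    "g.V().has('Profile', 'partitionKey', pk).out('ProfileHasAggregatedScore')",
    { pk });
  return resultSet.toArray();
}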
The traversal's execution profile, in case it helps (it seems most of the time is spent finding the edges in the out() step):
[
{
"gremlin": "g.V().hasLabel('Profile').out('ProfileHasAggregatedScore').executionProfile()",
"totalTime": 46,
"metrics": [
{
"name": "GetVertices",
"time": 13,
"annotations": {
"percentTime": 28.26
},
"counts": {
"resultCount": 110
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 110,
"size": 124649,
"time": 2.47
}
]
},
{
"name": "GetEdges",
"time": 26,
"annotations": {
"percentTime": 56.52
},
"counts": {
"resultCount": 8
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 8,
"size": 5200,
"time": 6.22
},
{
"fanoutFactor": 1,
"count": 0,
"size": 49,
"time": 0.88
}
]
},
{
"name": "GetNeighborVertices",
"time": 7,
"annotations": {
"percentTime": 15.22
},
"counts": {
"resultCount": 8
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 8,
"size": 6303,
"time": 1.18
}
]
},
{
"name": "ProjectOperator",
"time": 0,
"annotations": {
"percentTime": 0
},
"counts": {
"resultCount": 8
}
}
]
}
]

Fetching host availability to external webpage in Nagios

Is there any possible way to fetch the live availability of a host/hostgroup from the Nagios monitoring tool (where hosts/hostgroups are already configured) so that it can be redirected/captured to an external webpage?
Are there any exposed APIs to do that? I couldn't find a way.
Nagios is on a Linux host.
Any help or info is appreciated.
EDIT1:
I have a hostgroup, say for example 'All_prod'. In this hostgroup I will have around 20 Linux hosts, and for all of the hosts there are some metrics/checks defined (for example availability, CPU load, free memory, etc.). Here I want a report of only the availability metric for all of the hosts (for example: let's say the availability is down for 10 minutes within a 24-hour window; it should then provide me with a report saying it was down for 10 minutes in those 24 hours, or just give me any related info which I can evaluate).
It would be great if there are any APIs to fetch that information that return the data as JSON/XML.
You can use the Nagios JSON API; there is a query builder at http://NAGIOSURL/jsonquery.html.
But, to answer your specific question, the queries for hosts would look like this:
http://NAGIOSURL/cgi-bin/statusjson.cgi?query=host&hostname=localhost
Which will output something similar to the following:
{
"format_version": 0,
"result": {
"query_time": 1497384499000,
"cgi": "statusjson.cgi",
"user": "nagiosadmin",
"query": "host",
"query_status": "released",
"program_start": 1497368240000,
"last_data_update": 1497384489000,
"type_code": 0,
"type_text": "Success",
"message": ""
},
"data": {
"host": {
"name": "localhost",
"plugin_output": "egsdda",
"long_plugin_output": "",
"perf_data": "",
"status": 8,
"last_update": 1497384489000,
"has_been_checked": true,
"should_be_scheduled": false,
"current_attempt": 10,
"max_attempts": 10,
"last_check": 1496158536000,
"next_check": 0,
"check_options": 0,
"check_type": 1,
"last_state_change": 1496158536000,
"last_hard_state_change": 1496158536000,
"last_hard_state": 1,
"last_time_up": 1496158009000,
"last_time_down": 1496158536000,
"last_time_unreachable": 1480459504000,
"state_type": 1,
"last_notification": 1496158536000,
"next_notification": 1496165736000,
"no_more_notifications": false,
"notifications_enabled": true,
"problem_has_been_acknowledged": false,
"acknowledgement_type": 0,
"current_notification_number": 2,
"accept_passive_checks": true,
"event_handler_enabled": true,
"checks_enabled": false,
"flap_detection_enabled": true,
"is_flapping": false,
"percent_state_change": 0,
"latency": 0.49,
"execution_time": 0,
"scheduled_downtime_depth": 0,
"process_performance_data": true,
"obsess": true
}
}
}
And for hostgroups:
http://NAGIOSURL/nagios/cgi-bin/statusjson.cgi?query=hostlist&hostgroup=linux-servers
Which will output something similar to the following:
{
"format_version": 0,
"result": {
"query_time": 1497384613000,
"cgi": "statusjson.cgi",
"user": "nagiosadmin",
"query": "hostlist",
"query_status": "released",
"program_start": 1497368240000,
"last_data_update": 1497384609000,
"type_code": 0,
"type_text": "Success",
"message": ""
},
"data": {
"selectors": {
"hostgroup": "linux-servers"
},
"hostlist": {
"localhost": 8
}
}
}
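If you want to surface this on an external page, a rough Node.js sketch of consuming that endpoint could look like the following (this is my own illustration, not something shipped with Nagios; the URL, credentials and hostgroup name are placeholders, and it assumes Node 18+ for the built-in fetch):
async function getHostgroupStatus(hostgroup) {
  const url = `http://NAGIOSURL/nagios/cgi-bin/statusjson.cgi?query=hostlist&hostgroup=${hostgroup}`;
  // The status CGIs sit behind the same basic auth as the web UI
  const auth = Buffer.from('nagiosadmin:yourpassword').toString('base64');
  const res = await fetch(url, { headers: { Authorization: `Basic ${auth}` } });
  const json = await res.json();
  // json.data.hostlist maps each host name to its numeric state
  // (in Nagios Core's JSON CGIs: 1 = PENDING, 2 = UP, 4 = DOWN, 8 = UNREACHABLE)
  return json.data.hostlist;
}

getHostgroupStatus('linux-servers').then(console.log);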
Hope this helps!
EDIT 1 (To correspond with the question's EDIT 1):
What you're asking for isn't built in by default. You can use the above methods to grab the data for each host (but it sounds like you want it for each service), so again we will use the JSON API found at http://YOURNAGIOSURL/jsonquery.html to grab service data.
http://YOURNAGIOSURL/nagios/cgi-bin/statusjson.cgi?query=service&hostname=localhost&servicedescription=Current+Load
We'll get the following output (something similar, anyway):
{
"format_version": 0,
"result": {
"query_time": 1497875258000,
"cgi": "statusjson.cgi",
"user": "nagiosadmin",
"query": "service",
"query_status": "released",
"program_start": 1497800686000,
"last_data_update": 1497875255000,
"type_code": 0,
"type_text": "Success",
"message": ""
},
"data": {
"service": {
"host_name": "localhost",
"description": "Current Load",
"plugin_output": "OK - load average: 0.00, 0.00, 0.00",
"long_plugin_output": "",
"perf_data": "load1=0.000;5.000;10.000;0; load5=0.000;4.000;6.000;0; load15=0.000;3.000;4.000;0;",
"max_attempts": 4,
"current_attempt": 1,
"status": 2,
"last_update": 1497875255000,
"has_been_checked": true,
"should_be_scheduled": true,
"last_check": 1497875014000,
"check_options": 0,
"check_type": 0,
"checks_enabled": true,
"last_state_change": 1497019191000,
"last_hard_state_change": 1497019191000,
"last_hard_state": 0,
"last_time_ok": 1497875014000,
"last_time_warning": 1497019191000,
"last_time_unknown": 0,
"last_time_critical": 1497018891000,
"state_type": 1,
"last_notification": 0,
"next_notification": 0,
"next_check": 1497875314000,
"no_more_notifications": false,
"notifications_enabled": true,
"problem_has_been_acknowledged": false,
"acknowledgement_type": 0,
"current_notification_number": 0,
"accept_passive_checks": true,
"event_handler_enabled": true,
"flap_detection_enabled": true,
"is_flapping": false,
"percent_state_change": 0,
"latency": 0,
"execution_time": 0,
"scheduled_downtime_depth": 0,
"process_performance_data": true,
"obsess": true
}
}
}
The most important line for what you're trying to do (as far as I understand it) is the perf_data line:
"perf_data": "load1=0.000;5.000;10.000;0; load5=0.000;4.000;6.000;0; load15=0.000;3.000;4.000;0;",
This is the data you'd use to generate whatever custom metrics report you're trying to generate.
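If it helps, here's a quick sketch of parsing that perf_data string on the consuming side (my own code, not part of Nagios; it assumes labels without embedded spaces):
function parsePerfData(perfData) {
  // Each entry has the form label=value[unit];warn;crit;min;max
  return perfData.trim().split(/\s+/).map(entry => {
    const [label, rest] = entry.split('=');
    const [value, warn, crit, min, max] = rest.split(';');
    return { label, value: parseFloat(value), warn, crit, min, max };
  });
}

parsePerfData('load1=0.000;5.000;10.000;0; load5=0.000;4.000;6.000;0; load15=0.000;3.000;4.000;0;');
// => [ { label: 'load1', value: 0, warn: '5.000', crit: '10.000', min: '0', max: '' }, ... ]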
Keep in mind this is something that is sort of built in to Nagios XI (not in an exportable format like you're requesting) but the metrics component does allow you to easily drill down and take a look at some metric specific data.
Hope this helps!

How to extract grouped results from an array inside a collection in Mongodb

I am working with the Foursquare API using NodeJS and Mongodb on the backend side. I have all the user information and checkin history stored in a collection. So the collection looks similar to this:
{
_id: ...,
foursquareId: ...
personalInfo: {},
checkins: [
{
id: ...,
createdAt: 123456789, // seconds since epoch
venue: {},
...
},
{
id: ...,
createdAt: 123456789, // seconds since epoch
venue: {},
...
},
...
]
}
For this question I am only interested in the checkins array. I need to return the number of checkins by month and year, but I am not sure of the best way to approach this. I think the result would look something like this (I am not totally convinced, though):
{
'2016': {
'January': 43,
'February': 38,
'March': 40,
'April': 48,
'May': 50,
'June': 41,
'July': 39,
'August': 38,
'September': 30,
'October': 29,
'November': 38,
'December': 41
},
'2017': {
'January': 55,
'February': 20
}
}
I am not too concerned about the way I receive the information on the frontend. I want to know if it is possible to do this in MongoDB, because I couldn't find a way to do it in their documentation or in any other example here. Otherwise I might need to do it on the frontend (not a good idea... I could have around 7k results or more in this array).
Using the aggregation framework should get you what you want.
db.collectionName.aggregate([
{$unwind:'$checkins'},
{
$project: {
id: 1,
'checkins.createdAt' : 1,
newDate : {
$add : [ new Date(0), {
$multiply : [ "$checkins.createdAt", 1000 ]
}]
}
}
},
{$project : {
year: {$year: "$newDate"},
month: {$month: "$newDate"}
}},
{$group: {_id:{year:"$year", month:"$month"}, count:{$sum:1}}},
{$group: {_id:{year:"$_id.year"}, monthTotals: { $push: { month: "$_id.month", count: "$count" } }}}
])
This produces documents like the following:
{
"_id" : {
"year" : NumberInt(2016)
},
"monthTotals" : [
{"month" : NumberInt(1),"count" : NumberInt(2)}
{"month" : NumberInt(2),"count" : NumberInt(3)}
]
}
The second step (first $project step) may need to be adjusted depending on how your date since epoch value is stored, but this should get you generally what you need.
There's not a way to get the data exactly as you've outlined without some post processing of the results, but it should be simple enough to modify the result.
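For example, a small post-processing sketch in Node.js (the month-name mapping and the collection name are mine, the pipeline is the one above) that reshapes those documents into the object from the question:
const monthNames = ['January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December'];

const results = await db.collection('collectionName')
  .aggregate([ /* the pipeline shown above */ ])
  .toArray();

const byYear = {};
for (const doc of results) {
  const year = String(doc._id.year);
  byYear[year] = byYear[year] || {};
  for (const m of doc.monthTotals) {
    byYear[year][monthNames[m.month - 1]] = m.count; // $month is 1-based
  }
}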

Finding duplicates in Elasticsearch

I'm trying to find entries in my data which are equal in more than one aspect. I currently do this using a complex query which nests aggregations:
{
  "size": 0,
  "aggs": {
    "duplicateFIELD1": {
      "terms": {
        "field": "FIELD1",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateFIELD2": {
          "terms": {
            "field": "FIELD2",
            "min_doc_count": 2
          },
          "aggs": {
            "duplicateFIELD3": {
              "terms": {
                "field": "FIELD3",
                "min_doc_count": 2
              },
              "aggs": {
                "duplicateFIELD4": {
                  "terms": {
                    "field": "FIELD4",
                    "min_doc_count": 2
                  },
                  "aggs": {
                    "duplicate_documents": {
                      "top_hits": {}
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This works to an extent, as the result I get when no duplicates are found looks something like this:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 27524067,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"duplicateFIELD1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 27524027,
"buckets" : [
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
...
I'm skipping some of the output which looks rather similar.
I can now scan through this complex deeply nested data structure and find that no documents are stored in all of these nested buckets. But this seems rather cumbersome. I guess there might be a better (more straight-forward) way of doing this.
Also, if I want to check more than four fields, this nested structure will grow and grow and grow. So it does not scale very well and I want to avoid this.
Can I improve my solution so that I get a simple list of all documents which are duplicates? (Maybe with the ones which are duplicates of each other grouped together somehow.) Or is there a completely different approach (such as one without aggregation) which does not have the drawbacks I described here?
EDIT: I found an approach using the script feature of ES here, but in my version of ES this returns just an error message. Maybe someone can point out to me how to do it in ES 5.0? My trials up to now did not work.
EDIT: I found a way to use a script for my approach which uses the modern way (language "painless"):
{
"size": 0,
"aggs": {
"duplicateFOO": {
"terms": {
"script": {
"lang": "painless",
"inline": "doc['FIELD1'].value + doc['FIELD2'].value + doc['FIELD3'].value + doc['FIELD4'].value"
},
"min_doc_count": 2
}
}
}
}
This seems to work for very small amounts of data and results in an error for realistic amounts of data (circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]). Any idea on how I can fix this? Probably adjust some configuration of the ES to make it use larger internal buffers or similar?
There does not seem to be a proper solution for my situation which avoids the nesting in a general way.
Fortunately, three of my four fields have a very limited value range; the first can only be 1 or 2, the second can be 1, 2, or 3, and the third can be 1, 2, 3, or 4. Since these are just 24 combinations, I currently filter out one 24th of the complete data set before applying the aggregation, which then runs over just one field (the remaining fourth one). I then have to apply all actions 24 times (once for each combination of the three limited fields mentioned above), but this is still more feasible than handling the complete data set at once.
The query (i.e. one of the 24 queries) I send now looks something like this:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "FIELD1": 2 } },
        { "match": { "FIELD2": 3 } },
        { "match": { "FIELD3": 4 } }
      ]
    }
  },
  "aggs": {
    "duplicateFIELD4": {
      "terms": {
        "field": "FIELD4",
        "min_doc_count": 2
      }
    }
  }
}
The results for this of course are not nested anymore. But this cannot be done if more than one field holds arbitrary values of a larger range.
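For illustration, this is roughly how the 24 queries can be issued from Node.js (a sketch; the index name and the legacy elasticsearch client setup are placeholders, and the request body matches the query above):
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function findDuplicates() {
  const hits = [];
  for (const f1 of [1, 2]) {
    for (const f2 of [1, 2, 3]) {
      for (const f3 of [1, 2, 3, 4]) {
        const res = await client.search({
          index: 'my-index',
          body: {
            size: 0,
            query: { bool: { must: [
              { match: { FIELD1: f1 } },
              { match: { FIELD2: f2 } },
              { match: { FIELD3: f3 } }
            ] } },
            aggs: { duplicateFIELD4: { terms: { field: 'FIELD4', min_doc_count: 2 } } }
          }
        });
        // Every bucket here stands for a group of duplicates within this combination
        hits.push({ f1, f2, f3, buckets: res.aggregations.duplicateFIELD4.buckets });
      }
    }
  }
  return hits;
}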
I also found out that, if nesting must be done, the fields with the most limited value range (e.g. just two values like "1 or 2") should be innermost, and the one with the largest value range should be outermost. This improves performance greatly (but still not enough in my case). Getting it wrong can leave you with an unusable query (no response within hours, and finally an out-of-memory error on the server side).
I now think that aggregating properly is the key to solving a problem like mine. The approach using a script to get a flat bucket list (as described in my question) is bound to overload the server, as it cannot distribute the task in any way. If no duplicate is found at all, it has to hold a bucket for each document in memory (with just one document in it). Even if only a few duplicates exist, this cannot be done for larger data sets. If nothing else is possible, one will need to split the data set into groups artificially. E.g. one can create 16 sub-data sets by building a hash out of the relevant fields and using the last 4 bits to put the document into one of the 16 groups. Each group can then be handled separately; duplicates are bound to fall into the same group using this technique, as sketched below.
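A sketch of that artificial grouping, just to make the idea concrete (the hash function and field names are illustrative; this is plain Node.js, not an ES feature):
const crypto = require('crypto');

function groupOf(doc) {
  const key = [doc.FIELD1, doc.FIELD2, doc.FIELD3, doc.FIELD4].join('|');
  const hash = crypto.createHash('md5').update(key).digest();
  return hash[hash.length - 1] & 0x0f; // last 4 bits -> one of 16 groups
}
// Documents that agree on all four fields always produce the same key and therefore
// the same group, so duplicates can never be split across two groups.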
But independently of these general thoughts, the ES API really should provide some means to paginate through the results of aggregations. It's a pity that there is no such option (yet).
Your last approach seems to be the best one, and you can update your Elasticsearch settings as follows:
indices.breaker.request.limit: "75%"
indices.breaker.total.limit: "85%"
I have chosen 75% because the default is 60%, which corresponds to the 5.9gb limit in your Elasticsearch, while your query is reaching ~6.3gb, which is around 71.1% based on your log.
circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]
And finally, indices.breaker.total.limit must be greater than indices.breaker.fielddata.limit according to the Elasticsearch documentation.
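These limits are dynamic cluster settings, so instead of editing elasticsearch.yml you should also be able to apply them at runtime; a sketch with the legacy JavaScript client (the host is a placeholder):
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

await client.cluster.putSettings({
  body: {
    persistent: {
      'indices.breaker.request.limit': '75%',
      'indices.breaker.total.limit': '85%'
    }
  }
});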
An idea that might work in a Logstash scenario is using copy fields:
Copy all combinations to separate fields and concatenate them:
mutate {
add_field => {
"new_field" => "%{oldfield1} %{oldfield2}"
}
}
Then aggregate over the new field.
Have a look here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html
I don't know if add_field supports arrays (others do if you look at the documentation). If it does not, you could try to add several new fields and use merge to end up with just one field.
If you can do this at index time it would certainly be better.
You only need the combinations (A_B), not all permutations (A_B, B_A).
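Sketched in JavaScript just to make that point concrete (in Logstash itself this would be a ruby filter, or you would do it when building the document at index time): sorting the two values before concatenating makes A_B and B_A collapse into one key.
function comboKey(a, b) {
  // Order-insensitive key: 'A','B' and 'B','A' both become 'A_B'
  return [a, b].sort().join('_');
}

comboKey('A', 'B'); // 'A_B'
comboKey('B', 'A'); // 'A_B'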

Arangodb freeze when page fault increased

I'm using ArangoDB with Node.js and the arangojs driver; one of the collections has 10,000,000 documents.
Sometimes the page fault count goes up (to 150 or 500), ArangoDB freezes and doesn't respond to query requests, and the ArangoDB web panel freezes as well.
My server config is:
RAM: 6 GB
CPU: 8 core
(From the web panel, ArangoDB is using 4.76 GB (83.90%) of the 6 GB of RAM.)
UPDATE1
This is the output of /_api/collection/AdsStatics/figures:
{
"id": "191689719157",
"name": "AdsStatics",
"isSystem": false,
"doCompact": true,
"isVolatile": false,
"journalSize": 33554432,
"keyOptions": {
"type": "traditional",
"allowUserKeys": true
},
"waitForSync": false,
"indexBuckets": 8,
"count": 7816780,
"figures": {
"alive": {
"count": 7815806,
"size": 3563838968
},
"dead": {
"count": 306,
"size": 167464,
"deletion": 0
},
"datafiles": {
"count": 104,
"fileSize": 3530743672
},
"journals": {
"count": 1,
"fileSize": 33554432
},
"compactors": {
"count": 0,
"fileSize": 0
},
"shapefiles": {
"count": 0,
"fileSize": 0
},
"shapes": {
"count": 121,
"size": 56520
},
"attributes": {
"count": 24,
"size": 56
},
"indexes": {
"count": 3,
"size": 1660594864
},
"lastTick": "10044860034955",
"uncollectedLogfileEntries": 985,
"documentReferences": 0,
"waitingFor": "-",
"compactionStatus": {
"message": "checked datafiles, but no compaction opportunity found",
"time": "2016-02-24T08:29:27Z"
}
},
"status": 3,
"type": 2,
"error": false,
"code": 200
}
Thanks
It seems that your system is running out of memory. The datafiles for the one collection are 3,530,743,672 bytes in size, the indexes are 1,660,594,864. That is about 5.1 GB for this one collection alone.
arangod will need further memory for its WAL, the V8 contexts and temporary query results in order to operate properly.
Provided the system has 6 GB of total RAM and the OS and other processes need some RAM, too, it looks like you're running out of memory.
I am wondering if you're seeing some sort of swapping activity, which would explain why (all) operations would get extremely slow.
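As a quick sanity check on those numbers (plain arithmetic over the figures output above, nothing ArangoDB-specific):
// Sizes taken from the /_api/collection/AdsStatics/figures output in the question
const datafiles = 3530743672; // figures.datafiles.fileSize
const journals  = 33554432;   // figures.journals.fileSize
const indexes   = 1660594864; // figures.indexes.size

const totalBytes = datafiles + journals + indexes;
console.log((totalBytes / 1024 ** 3).toFixed(2) + ' GiB'); // ≈ 4.87 GiB (≈ 5.2 decimal GB)
// On a 6 GB machine that leaves very little headroom for the WAL, V8 contexts and query results.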
