My collection is described as follows:
{ "_id" : ObjectId("5474af69d4b28042fb63b856"), "name" : "XXXX", "action" : "accept", "source" : "127.0.0.1", "srcport" : "80", "destination" : "192.168.0.13", "dstport" : "53213", "service" : "443", "service_id" : "https", "unixtime" : NumberLong("1412774569000"), "segment" : "MySegment", "direction" : "INCOMING", "location" : "US" }
I currently have ~5.5 million entries in my collection, and the base query is always:
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "unixtime": {"$gte": 1412774000000, "$lte": 1412774900000}})
action, direction and unixtime always appear in my query, but their values are dynamic. Optional parameters (also with dynamic values) are:
location
segment
service_id
For example:
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "location":"US","segment":"mySegment", "unixtime": {"$gte": 1412774000000, "$lte": 1412774900000}})
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "service_id":"https", "unixtime": {"$gte": 1412774000000, "$lte": 1412774500000}})
I created the following indexes:
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 })
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 , location:1})
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 , service_id:1})
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 , segment:1})
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 , location:1, service_id: 1})
db.collection.createIndex( {unixtime: 1, action: 1, direction: 1 , location:1, segment: 1})
My query without an index took ~8 seconds; with an index it takes ~6 seconds, which is still quite slow.
How can I speed up the whole thing? Note that at the moment I'm only counting the matches, not looking for a specific entry.
Additional Info:
I'm currently trying to optimize these queries directly in the mongo shell, but in the end I'm querying via Node.js (I don't know whether this is relevant to the solution).
The indexes don't make much sense this way. Range conditions like $gte and $lte should come last - not only in the query, but also in the index. Putting unixtime at position 1 in the index is generally a bad idea (unless you need the set of distinct actions within a single second and the number of actions in a single second is so large that they need indexing, which is unlikely).
Try reversing the indexes and make sure the field order of the index matches the order in the query.
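A sketch of what the reversed ordering could look like, using the field names from the question (whether you need the variants with optional fields depends on which filters you actually combine):

db.collection.createIndex({ action: 1, direction: 1, unixtime: 1 })

// variants that also cover the optional equality filters, if they turn out to be selective enough
db.collection.createIndex({ action: 1, direction: 1, location: 1, unixtime: 1 })
db.collection.createIndex({ action: 1, direction: 1, service_id: 1, unixtime: 1 })

The equality fields come first and the unixtime range comes last, so the range scan is confined to one contiguous slice of the index.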
If location, segment and service_id have low selectivity, try without an index on these fields first. More indexes cost more RAM and slow down inserts and updates, and with low selectivity the gain for queries is sometimes negligible. In the query, it might make sense to put the optional fields last, after all the other conditions - if the candidate set is small enough after the required criteria and the unixtime interval, scanning the remaining documents shouldn't hurt performance too badly. If it does and the selectivity is high, move them further forward.
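One way to check whether a given count actually uses the intended index is to explain the equivalent find in the shell (a sketch; the filter values are placeholders taken from the sample document):

db.collection.find({
    action: "accept",
    direction: "INCOMING",
    unixtime: { $gte: 1412774000000, $lte: 1412774900000 }
}).explain()

Depending on the server version, the plan should report an index scan (BtreeCursor / IXSCAN) rather than a full collection scan (BasicCursor / COLLSCAN).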
Related
I'm a bit of a MongoDB noob, so I'd appreciate some help figuring out the best solution/format/structure for storing some data.
Basically, the stored data will be updated every second with a name, value and timestamp for a given meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name, and the level and temperature will be read and stored every second. Overall, there will be hundreds of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options for how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
_id : "id",
name : "name"
}
Values : {
_id : "id",
item_id : "item_id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The more document-DB, denormalized method:
This method involves one collection of items each with an array of timestamped values
Items : {
_id : "id",
name : "name"
values : [{
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}]
}
A collection for each item:
Save all the values in a collection named after that item.
ItemName : {
_id : "id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The majority of read queries will retrieve the timestamped values for a specified time period for an item (i.e. tank) and display them in a graph. For this, the first option makes more sense to me, as I don't want to retrieve millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third option with a collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp if you are not overriding the ObjectId: the ObjectId itself already embeds a timestamp, so dropping the redundant field saves a lot of space.
MongoDB Id Documentation
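For example, in the mongo shell you can read the creation time back out of an existing _id, and build boundary ObjectIds for a time-range query. This is only a sketch: objectIdFromDate is a hypothetical helper (not a built-in), and the collection name values is a placeholder.

// creation time embedded in an existing _id
db.values.findOne()._id.getTimestamp()

// hypothetical helper: an ObjectId whose embedded timestamp matches a given Date
function objectIdFromDate(date) {
    return ObjectId(Math.floor(date.getTime() / 1000).toString(16) + "0000000000000000");
}

// range query on _id instead of a separate timestamp field
db.values.find({
    _id: {
        $gte: objectIdFromDate(new Date("2014-10-08T00:00:00Z")),
        $lt: objectIdFromDate(new Date("2014-10-09T00:00:00Z"))
    }
})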
If you don't need to keep the previous data, you can use an update query in MongoDB to overwrite the fields every second instead of storing a new document each time.
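A sketch of that update-in-place variant, with placeholder collection and field names (upsert creates the document on the first write):

db.items.update(
    { name: "tank-1" },
    { $set: { level: 42, temperature: 21.5 } },
    { upsert: true }
)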
If you do want to keep every reading, then instead of updating, store each one in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments
I have a collection of documents representing jobs to be processed by another system. I look up 5 of them like this:
Work.find({status: 'new'})
.sort({priority: 'desc'})
.limit(5)
.exec(function (err, work){})
There is another field on these documents which determines that only one job with a given unique value can be run at a time.
For example these 3 jobs:
{uniqueVal: 1, priority: 1, type: "scrape", status:"new"}
{uniqueVal: 1, priority: 1, type: "scrape", status:"new"}
{uniqueVal: 2, priority: 1, type: "scrape", status:"new"}
There are 2 records with the uniqueVal of 1. Is there anything I can do to only pull one record with the value 1?
Note: These values are not predetermined values like in the example, they are ObjectIds of other documents.
I've looked into distinct(), but it seems like it only returns the actual unique values, not the documents themselves.
I think the best choice is to use aggregation.
You can $group by uniqueVal
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#pipe._S_group
And use $first for the other values
http://docs.mongodb.org/manual/reference/operator/aggregation/first/#grp._S_first
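Putting that together, a sketch of the pipeline (assuming the underlying collection is called works, since Mongoose pluralizes the Work model by default):

db.works.aggregate([
    { $match: { status: "new" } },
    { $sort: { priority: -1 } },                     // highest priority first
    { $group: { _id: "$uniqueVal",                   // one bucket per uniqueVal
                job: { $first: "$$ROOT" } } },       // keep the highest-priority doc per bucket
    { $sort: { "job.priority": -1 } },
    { $limit: 5 }                                    // at most 5 jobs, one per uniqueVal
])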
I'm using Mongoose for Node.js to interface with the mongo driver, so my query looks like:
db.Deal
.find({})
.select({
_id: 1,
name: 1,
opp: 1,
dateUploaded: 1,
status: 1
})
.sort({ dateUploaded: -1 })
And get: "too much data for sort() with no index. add an index or specify a smaller limit".
The number of documents in the Deal collection is quite small, maybe ~500, but each one contains many embedded documents. The fields returned in the query above are all primitive, i.e. they aren't documents themselves.
I currently don't have any indexes set up other than the default ones - I've never had an issue until now. Should I try adding a compound index on:
{ _id: 1, name: 1, opp: 1, status: 1, dateUploaded: -1 }
Or is there a smarter way to perform the query? First time using mongodb.
From the MongoDB documentation on limits and thresholds:
MongoDB will only return sorted results on fields without an index if the combined size of all documents in the sort operation, plus a small overhead, is less than 32 megabytes.
The embedded documents are probably what pushes you over that limit; add an index on the sort field dateUploaded if you want to keep running the same query.
Otherwise you can limit your query and paginate the results.
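A sketch of both options (the deals collection name assumes Mongoose's default pluralization of the Deal model, and the skip/limit values are placeholders):

// one-off, in the mongo shell: index the sort field
db.deals.createIndex({ dateUploaded: -1 })

// or keep the unindexed sort under the limit by paginating in Mongoose:
db.Deal
  .find({})
  .select({ _id: 1, name: 1, opp: 1, dateUploaded: 1, status: 1 })
  .sort({ dateUploaded: -1 })
  .skip(0)     // page offset
  .limit(50)   // page size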
I have a MongoDB document with quite a large embedded array:
name : "my-dataset"
data : [
{country : "A", province: "B", year : 1990, value: 200}
... 150 000 more
]
Let us say I want to return data objects where country == "A".
What is the proper way of doing this, for example via Node.js?
Given 150 000 entries with 200 matches, how long should the query take approximately?
Would it be better (performance/structure wise) to store data as documents and the name as a property of each document?
Would it be more efficient to use MySQL for this?
A) Just find them with a query.
B) If the compound index {name: 1, "data.country": 1} is built, finding the matching document should be fast. But because all the data is stored in one array, the $unwind operator has to be used to pull out the matching entries, so the query could still be slow.
C) It will be better. If you store the data like:
{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...
With the compound index {name: 1, country: 1}, the query time should be < 10 ms (see the sketch below).
D) MySQL vs MongoDB 1000 reads
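A sketch of option C with the flattened layout above (collection and field names taken from the examples in the question):

db.collection.createIndex({ name: 1, country: 1 })

db.collection.find(
    { name: "my-dataset", country: "A" },
    { _id: 0, province: 1, year: 1, value: 1 }
)

The second argument is a projection that returns only the per-entry fields, matching what the original array entries contained.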
1. You can use the MongoDB aggregation framework:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}}
])
This will return a document for each data entry where the country is "A". If you want to regroup them back into datasets, add a $group stage:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}},
{$group: {_id: "$_id", data: {$addToSet: "$data"}}}
])
(I didn't test it on a proper dataset, so it might be buggy.)
2. 150,000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).
3. As long as you are sure that your document stays under 16 MB, the maximum BSON document size (which is kind of hard to guarantee), it should be fine. But the queries would be simpler if you stored your data as individual documents with the dataset name as a property, which is generally better for performance.
I am experimenting to see whether ArangoDB might be suitable for our use case.
We will have large collections of documents with the same schema (like an sql table).
To try some queries I have inserted about 90K documents, which is low, as we expect document counts on the order of 1 million or more.
Now I want to get a simple page of these documents, without filtering, but with descending sorting.
So my aql is:
for a in test_collection
sort a.ARTICLE_INTERNALNR desc
limit 0,10
return {'nr': a.ARTICLE_INTERNALNR}
When I run this in the AQL Editor, it takes about 7 seconds, while I would expect a couple of milliseconds or something like that.
I have tried creating a hash index and a skiplist index on it, but that didn't have any effect:
db.test_collection.getIndexes()
[
{
"id" : "test_collection/0",
"type" : "primary",
"unique" : true,
"fields" : [
"_id"
]
},
{
"id" : "test_collection/19812564965",
"type" : "hash",
"unique" : true,
"fields" : [
"ARTICLE_INTERNALNR"
]
},
{
"id" : "test_collection/19826720741",
"type" : "skiplist",
"unique" : false,
"fields" : [
"ARTICLE_INTERNALNR"
]
}
]
So, am I missing something, or is ArangoDB not suitable for these cases?
If ArangoDB needs to sort all the documents, this will be a relatively slow operation (compared to not sorting). So the goal is to avoid the sorting at all.
ArangoDB has a skiplist index, which keeps indexed values in sorted order, and if that can be used in a query, it will speed up the query.
There are a few gotchas at the moment:
AQL queries without a FILTER condition won't use an index.
the skiplist index is fine for forward-order traversals, but it has no backward-order traversal facility.
Both these issues seem to have affected you.
We hope to fix both issues as soon as possible.
At the moment there is a workaround that forces use of the index in forward order, via an AQL query like the following:
FOR a IN SKIPLIST(test_collection, { ARTICLE_INTERNALNR: [ [ '>', 0 ] ] }, 0, 10)
    RETURN { nr: a.ARTICLE_INTERNALNR }
The above picks up the first 10 documents via the index on ARTICLE_INTERNALNR with a condition "value > 0". I am not sure if there is a solution for sorting backwards with limit.