I have a MongoDB document with quite a large embedded array:
{
  name : "my-dataset",
  data : [
    {country : "A", province: "B", year : 1990, value: 200},
    ... 150 000 more
  ]
}
Let us say I want to return data objects where country == "A".
What is the proper way of doing this, for example via Node.js?
Given 150 000 entries with 200 matches, how long should the query take approximately?
Would it be better (performance/structure wise) to store data as documents and the name as a property of each document?
Would it be more efficient to use MySQL for this?
A) Just find them with a query.
B) If the compound index {name: 1, "data.country": 1} is built, the initial match will be fast. But since you store all the data in one array, the $unwind op has to be used, so the overall query could still be slow.
C) It will be better. If you store the data like:
{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...
With compound index {name:1, country:1}, the query time should be < 10ms.
D) MySQL vs MongoDB 1000 reads
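The restructuring suggested in (C) can be sketched as a plain data transformation, independent of any driver (field names follow the question; the helper function name is illustrative):

```python
def flatten_dataset(doc):
    """Turn one dataset document with an embedded `data` array into
    flat per-entry documents that carry the dataset name."""
    return [dict(entry, name=doc["name"]) for entry in doc["data"]]

dataset = {
    "name": "my-dataset",
    "data": [
        {"country": "A", "province": "B", "year": 1990, "value": 200},
        {"country": "B", "province": "B", "year": 1990, "value": 300},
    ],
}

flat = flatten_dataset(dataset)
# Each flat document can now be matched directly by a compound
# index on {name: 1, country: 1}, with no $unwind needed.
```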
1. You can use the MongoDB aggregation framework:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}}
])
This will return a document for each data entry where the country is "A". If you want to regroup the datasets, add a $group stage:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}},
{$group: {_id: "$_id", data: {$addToSet: "$data"}}}
])
(I didn't test it on a proper dataset, so it might be buggy.)
2. 150,000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).
3. As long as you are sure that your document stays smaller than 16 MB, the maximum BSON document size (which is kind of hard to guarantee), it should be fine. But the queries would be simpler if you stored your data as documents with the dataset name as a property, which is generally better for performance.
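What the $match / $unwind / $match pipeline does can be mimicked in plain Python, which is a handy way to sanity-check it (sample data follows the question's shape):

```python
collection = [
    {"_id": 1, "name": "my-dataset", "data": [
        {"country": "A", "province": "B", "year": 1990, "value": 200},
        {"country": "C", "province": "D", "year": 1991, "value": 300},
    ]},
]

# $match on name: keep only the wanted dataset document
matched = [d for d in collection if d["name"] == "my-dataset"]
# $unwind "$data": one output element per array entry
unwound = [entry for d in matched for entry in d["data"]]
# $match on data.country: keep only entries for country "A"
result = [e for e in unwound if e["country"] == "A"]
```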
I have a database with a collection of about 90k documents. Each document is as follows:
{
'my_field_name': "a", # Or "b" or "c" ...
'p1': Array[30],
'p2': Array[10000]
}
There are about 9 unique values for the field name. When there were ~30k documents in the collection:
>>> db.collection.distinct("my_field_name")
["a", "b", "c"]
However, now with 90k documents, db.collection.distinct() returns an empty list.
>>> db.collection.distinct("my_field_name")
[]
Is there a maxTimeMS setting for db.collection.distinct? If so, how could I set it to a higher value? If not, what else could I investigate?
One thing you can do to immediately speed up your query's execution time is to index the field on which you are running the 'distinct' operation (if the field is not already indexed).
That being said, if you want to set a maxTimeMS, one workaround is to rewrite your query as an aggregation and set the operation timeout on the returned cursor. For example:
db.collection.aggregate([
{ $group: { _id: '$my_field_name' } },
]).maxTimeMS(10000);
Note, however, that unlike distinct, the above query returns a cursor rather than an array.
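Conceptually, the $group stage collects the distinct values the same way a set comprehension would (in-memory sketch with made-up documents of the question's shape):

```python
docs = [
    {"my_field_name": "a", "p1": [], "p2": []},
    {"my_field_name": "b", "p1": [], "p2": []},
    {"my_field_name": "a", "p1": [], "p2": []},
]

# {$group: {_id: "$my_field_name"}} keeps exactly one bucket
# per distinct value of the field, which is what a set does:
distinct_values = sorted({d["my_field_name"] for d in docs})
```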
I have this aggregate query :
cr = db.last_response.aggregate([
{"$unwind": '$blocks'},
{"$match": {"sender_id": "1234", "page_id": "563921", "blocks.tag": "pay1"}},
{"$project": {"block": "$blocks.block"}}
])
Now I want to get the number of elements it returned (i.e. whether the cursor is empty or not).
This is how I did it. I defined an empty array:
x = []
Then I iterated through the cursor, appending to x:
for i in cr:
    x.append(i['block'])
print("length of the result of aggregation cursor:", len(x))
My question is: is there a faster way to get the number of results of an aggregate query, like the count() method of a find() query?
Thanks
The faster way is to avoid transferring all the data from mongod to your application. To do this, you may add a final $group stage to count the documents:
{"$group": {"_id": None, "count": {"$sum": 1}}}
This means mongod performs the aggregation and returns just the count of documents as the result.
There is no way to get the count of results without executing the aggregation pipeline.
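The counting stage is equivalent to this reduction over the pipeline's output (in-memory sketch; the documents are made up):

```python
pipeline_output = [
    {"block": "b1"}, {"block": "b2"}, {"block": "b3"},
]

# {"$group": {"_id": None, "count": {"$sum": 1}}} puts every
# document in one bucket and adds 1 per document:
count_doc = {"_id": None, "count": sum(1 for _ in pipeline_output)}
```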
I'm a bit of a noob with MongoDB, so would appreciate some help with figuring out the best solution/format/structure in storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name and then the level and temperature will be read and stored every second. Overall, there will be 100's of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options as how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
_id : "id",
name : "name"
}
Values : {
_id : "id",
item_id : "item_id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The more document db denormalized method:
This method involves one collection of items each with an array of timestamped values
Items : {
_id : "id",
name : "name",
values : [{
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
_id : "id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The majority of read queries will be to retrieve the timestamped values for a specified time period of an item (i.e. tank) and display in a graph. And for this, the first option makes more sense to me as I don't want to retrieve the millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third options, with one collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp field if you are not modifying the ObjectId, as the ObjectId itself has a timestamp embedded in it. So you will save a lot of storage that way.
MongoDB Id Documentation
In case you don't require the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you want to store the updated data each time, then instead of updating, store it in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
  "name" : "ItemName",
  "value" : "ValueOfItem",
  "created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments
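The point above about the ObjectId embedding a timestamp can be checked without a driver: the first 4 bytes of the 12-byte id are the creation time as a big-endian Unix timestamp (the example id here is arbitrary, not from the question):

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex):
    """The first 8 hex chars of an ObjectId encode the creation
    time in seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_timestamp("507f1f77bcf86cd799439011")
```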
I am using node-mongodb-native to fire mongodb queries using node js.
There is a collection named 'locations', which has the following fields:
sublocality1, sublocality2, sublocality3, city.
I want to fetch overall distinct values from these fields.
Eg:
Documents:
{
'sublocality1':'a',
'sublocality2':'a',
'sublocality3': 'b',
'city': 'c'
}
{
'sublocality1':'b',
'sublocality2':'a',
'sublocality3': 'b',
'city': 'a'
}
The query should return
['a' , 'b', 'c']
I tried the following:
Run distinct queries for each of the fields:
collection.distinct('sublocality1',..){},
collection.distinct('sublocality2',..){},
collection.distinct('sublocality3',..){},
collection.distinct('city',..){}
Then I inserted the results from these queries into a list and searched for distinct items across the list.
Can I optimize this? Is it possible with a single query?
You could aggregate it on the database server as below:
Group individual documents, to get the values of each intended field in an array.
Project a field named values as the union of all the intended field values, using the $setUnion operator.
Unwind values.
Group all the records, to get the distinct values.
Code:
Collection.aggregate([
{$group:{"_id":"$_id",
"sublocality1":{$push:"$sublocality1"},
"sublocality2":{$push:"$sublocality2"},
"sublocality3":{$push:"$sublocality3"},
"city":{$push:"$city"}}},
{$project:{"values":{$setUnion:["$sublocality1",
"$sublocality2",
"$sublocality3",
"$city"]}}},
{$unwind:"$values"},
{$group:{"_id":null,"distinct":{$addToSet:"$values"}}},
{$project:{"distinct":1,"_id":0}}
],function(err,resp){
// handle response
})
Sample output:
{ "distinct" : [ "c", "a", "b" ] }
If you want the results to be sorted, you could apply a sort stage in the pipeline before the final project stage.
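The pipeline's effect can be sanity-checked in plain Python using the two sample documents from the question (the per-document $setUnion and the cross-document $addToSet together amount to one big set union):

```python
docs = [
    {"sublocality1": "a", "sublocality2": "a", "sublocality3": "b", "city": "c"},
    {"sublocality1": "b", "sublocality2": "a", "sublocality3": "b", "city": "a"},
]

fields = ["sublocality1", "sublocality2", "sublocality3", "city"]

# union of all four field values across all documents
distinct = sorted({d[f] for d in docs for f in fields})
```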
I am experimenting to see whether ArangoDB might be suitable for our use case.
We will have large collections of documents with the same schema (like an sql table).
To try some queries I have inserted about 90K documents, which is low, as we expect document counts in the order of 1 million or more.
Now I want to get a simple page of these documents, without filtering, but with descending sorting.
So my aql is:
for a in test_collection
sort a.ARTICLE_INTERNALNR desc
limit 0,10
return {'nr': a.ARTICLE_INTERNALNR}
When I run this in the AQL Editor, it takes about 7 seconds, while I would expect a couple of milliseconds or something like that.
I have tried creating a hash index and a skiplist index on it, but that didn't have any effect:
db.test_collection.getIndexes()
[
{
"id" : "test_collection/0",
"type" : "primary",
"unique" : true,
"fields" : [
"_id"
]
},
{
"id" : "test_collection/19812564965",
"type" : "hash",
"unique" : true,
"fields" : [
"ARTICLE_INTERNALNR"
]
},
{
"id" : "test_collection/19826720741",
"type" : "skiplist",
"unique" : false,
"fields" : [
"ARTICLE_INTERNALNR"
]
}
]
So, am I missing something, or is ArangoDB not suitable for these cases?
If ArangoDB needs to sort all the documents, this will be a relatively slow operation (compared to not sorting). So the goal is to avoid the sorting at all.
ArangoDB has a skiplist index, which keeps indexed values in sorted order, and if that can be used in a query, it will speed up the query.
There are a few gotchas at the moment:
AQL queries without a FILTER condition won't use an index.
the skiplist index is fine for forward-order traversals, but it has no backward-order traversal facility.
Both these issues seem to have affected you.
We hope to fix both issues as soon as possible.
At the moment there is a workaround to enforce using the index in forward-order using an AQL query as follows:
FOR a IN SKIPLIST(test_collection, { ARTICLE_INTERNALNR: [ [ '>', 0 ] ] }, 0, 10)
  RETURN { nr: a.ARTICLE_INTERNALNR }
The above picks up the first 10 documents via the index on ARTICLE_INTERNALNR with a condition "value > 0". I am not sure if there is a solution for sorting backwards with limit.
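Why the index helps can be sketched in plain Python: a skiplist keeps its keys sorted, so serving a page with a range condition is a bisect plus a slice instead of a full sort of the collection (this is an in-memory stand-in, not the ArangoDB API):

```python
import bisect

# Pretend this sorted list is the skiplist index on ARTICLE_INTERNALNR
index_keys = sorted([5, 1, 9, 3, 7, 2, 8, 4, 6, 0, 10])

def skiplist_range(keys, lower_exclusive, skip, limit):
    """Emulate SKIPLIST(col, {field: [['>', lower]]}, skip, limit):
    find the first key strictly greater than the bound, then slice."""
    start = bisect.bisect_right(keys, lower_exclusive)
    return keys[start + skip : start + skip + limit]

page = skiplist_range(index_keys, 0, 0, 10)
```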