Timeout for db.collection.distinct()? - python-3.x

I have a database with a collection of about 90k documents. Each document is as follows:
{
    'my_field_name': "a",  # Or "b" or "c" ...
    'p1': Array[30],
    'p2': Array[10000]
}
There are about 9 unique values for my_field_name. When there were ~30k documents in the collection:
>>> db.collection.distinct("my_field_name")
["a", "b", "c"]
However, now with 90k documents, db.collection.distinct() returns an empty list.
>>> db.collection.distinct("my_field_name")
[]
Is there a maxTimeMS setting for db.collection.distinct? If so, how can I set it to a higher value? If not, what else could I investigate?

One thing you can do to immediately speed up the query is to index the field on which you are running the distinct operation (if it is not already indexed).
That said, if you want to set a maxTimeMS, one workaround is to rewrite the query as an aggregation and set the operation timeout on the returned cursor, e.g.:
db.collection.aggregate([
    { $group: { _id: '$my_field_name' } },
]).maxTimeMS(10000);
However, unlike distinct, this query returns a cursor rather than a list.
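Since the question is tagged python-3.x, here is a minimal pymongo sketch of both approaches, assuming a local mongod and hypothetical database/collection names; in current pymongo versions distinct() forwards extra keyword arguments such as maxTimeMS to the server command, and aggregate() accepts maxTimeMS as well:
from pymongo import MongoClient

client = MongoClient()                   # assumes a local mongod; adjust the URI as needed
coll = client["mydb"]["collection"]      # hypothetical database/collection names

# Option 1: raise the server-side time limit on distinct itself (10 s here).
values = coll.distinct("my_field_name", maxTimeMS=10000)

# Option 2: the aggregation workaround; the result comes back as a cursor of documents.
cursor = coll.aggregate(
    [{"$group": {"_id": "$my_field_name"}}],
    maxTimeMS=10000,
)
values = [doc["_id"] for doc in cursor]  # unpack the cursor into a plain list of values
If the time limit is exceeded, pymongo raises pymongo.errors.ExecutionTimeout rather than returning a partial result.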

Related

documents that contain value(s) other than the ones specified in the solr query

I am working on a Solr query that aims to retrieve documents whose array field contains values other than certain specified ones.
Example
{
    id: 1,
    field: ["a", "b", "c"]
},
{
    id: 2,
    field: ["a", "b"]
}
If I ask for documents that contain values other than "a" and "b", I expect this result from Solr:
{
    id: 1,
    field: ["a", "b", "c"]
}
If I ask for documents that contain values other than "a", "b" and "c", I expect no result from Solr.
I tried to solve the problem with the queries below. I get the right result for the first case, but for the second, Solr still returns the document with id 1 even though its field array contains no value other than "a", "b" and "c".
# works
field: * AND (*:* NOT field:("a" OR "b"))
# doesn't work
field: * AND (*:* NOT field:("a" OR "b" OR "c"))
Does anyone have a solution for this type of query in Solr? :(

Get the size of the result of aggregate method MongoDB

I have this aggregate query:
cr = db.last_response.aggregate([
    {"$unwind": '$blocks'},
    {"$match": {"sender_id": "1234", "page_id": "563921", "blocks.tag": "pay1"}},
    {"$project": {"block": "$blocks.block"}}
])
Now I want to get the number of elements it returned (i.e. whether the cursor is empty or not).
This is how I did it:
I defined an empty list:
x = []
Then I iterated through the cursor and appended to x:
for i in cr:
    x.append(i['block'])
print("length of the result of aggregation cursor:", len(x))
My question is: is there a faster way to get the number of results of an aggregate query, like the count() method of a find() query?
Thanks
The faster way is to avoid transferring all the data from mongod to your application. To do this, you can add a final $group stage that counts the documents:
{"$group": {"_id": None, "count": {"$sum": 1}}},
This means mongod performs the aggregation and returns only the count of documents as the result.
There is no way to get the count of the result without executing the aggregation pipeline.
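Applied to the pipeline from the question, a minimal sketch (the single result document carries the count; an empty result means zero matches):
pipeline = [
    {"$unwind": "$blocks"},
    {"$match": {"sender_id": "1234", "page_id": "563921", "blocks.tag": "pay1"}},
    {"$project": {"block": "$blocks.block"}},
    {"$group": {"_id": None, "count": {"$sum": 1}}},  # count server-side instead of in Python
]
result = list(db.last_response.aggregate(pipeline))
count = result[0]["count"] if result else 0           # empty list means zero matches
print("length of the result of aggregation cursor:", count)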

MongoDB/Mongoose query to filter all the value in an array based on their presence in a collection

I have an array, let's say [1,2,3], and a collection called 'Numbers' that has a field called 'value'. I need to retain all the values in the array that are present in the 'value' field of some document in the collection.
Example,
Test array - [1,2,3]
Numbers collection - [{value: 1}, {value: 3}]
Result should be - [1,3]
The result is that way because '2' is not present in the 'value' field of any document in the 'Numbers' collection.
How do I do this?
You can try the distinct query below, specifying the field and a query filter:
db.Numbers.distinct( "value", { "value": { $in: [1,2,3] } } )
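For illustration, a minimal pymongo sketch of the same idea (the question uses Mongoose, but the query shape is identical); distinct only returns values that actually occur, so the $in filter effectively intersects the test array with the collection:
test_array = [1, 2, 3]

# Only values from test_array that occur in some document's 'value' field come back.
present = db.Numbers.distinct("value", {"value": {"$in": test_array}})
print(present)  # [1, 3] for the example data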

Increase performance for this MongoDB query

I have a MongoDB document with quite a large embedded array:
name : "my-dataset"
data : [
    {country : "A", province: "B", year : 1990, value: 200}
    ... 150 000 more
]
Let us say I want to return data objects where country == "A".
What is the proper way of doing this, for example via NodeJs?
Given 150 000 entries with 200 matches, how long should the query take approximately?
Would it be better (performance/structure wise) to store data as documents and the name as a property of each document?
Would it be more efficient to use MySQL for this?
A) Just find them with a query.
B) If the compound index {name:1, data.country:1} is built, the query should be fast. But since you store all the data in one array, the $unwind operator has to be used, and as a result the query could be slow.
C) It would be better. If you store the data like:
{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...
With compound index {name:1, country:1}, the query time should be < 10ms.
D) MySQL vs MongoDB 1000 reads
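A minimal pymongo sketch of option (C), assuming the flattened schema and a hypothetical entries collection:
# One document per data entry, with the dataset name as an ordinary field.
db.entries.create_index([("name", 1), ("country", 1)])

# With the compound index in place this is a simple indexed find.
for doc in db.entries.find({"name": "my-dataset", "country": "A"}):
    print(doc["province"], doc["year"], doc["value"])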
1. You can use the MongoDB aggregation framework:
db.collection.aggregate([
    {$match: {name: "my-dataset"}},
    {$unwind: "$data"},
    {$match: {"data.country": "A"}}
])
This will return a document for each data entry where the country is "A". If you want to regroup the datasets, add a $group stage:
db.collection.aggregate([
    {$match: {name: "my-dataset"}},
    {$unwind: "$data"},
    {$match: {"data.country": "A"}},
    {$group: {_id: "$_id", data: {$addToSet: "$data"}}}
])
(Didn't test it on a proper dataset, so it might be bugged)
2. 150,000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).
3. As long as you are sure your document will stay smaller than 16 MB (the maximum BSON document size), it should be fine, but the queries would be simpler if you stored the data as separate documents with the dataset name as a property, which is generally better for performance.

Index multiple MongoDB fields, make only one unique

I've got a MongoDB database of metadata for about 300,000 photos. Each has a native unique ID that needs to be unique to protect against duplicate insertions. It also has a timestamp.
I frequently need to run aggregate queries to see how many photos I have for each day, so I also have a date field in the format YYYY-MM-DD. This is obviously not unique.
Right now I only have an index on the id property, like so (using the Node driver):
collection.ensureIndex(
    { id: 1 },
    { unique: true, dropDups: true },
    function(err, indexName) { /* etc etc */ }
);
The group query for getting the photos by date takes quite a long time, as one can imagine:
collection.group(
    { date: 1 },
    {},
    { count: 0 },
    function(curr, result) {
        result.count++;
    },
    function(err, grouped) { /* etc etc */ }
);
I've read through the indexing strategy docs, and I think I also need to index the date property. But I don't want to make it unique, of course (though I suppose it's fine to make it unique in combination with the unique id). Should I create a regular compound index, or can I chain .ensureIndex() calls and only specify uniqueness for the id field?
MongoDB does not have "mixed" type indexes which can be partially unique. On the other hand, why not use _id instead of your id field, if possible? It's already indexed and unique by definition, so it will prevent you from inserting duplicates.
Mongo can only use a single index per query clause, which is important to consider when creating indexes. For this particular query and these requirements, I would suggest a separate unique index on the id field, which you get for free if you use _id. Additionally, you can create a non-unique index on the date field only. If you run a query like this:
db.collection.find({"date": "01/02/2013"}).count();
Mongo will be able to answer the query from the index alone (a covered query), which is the best performance you can get.
Note that Mongo won't be able to use a compound index on (id, date) if you are searching by date only. Your query has to match an index prefix first; i.e. if you search by id, then the (id, date) index can be used.
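As an illustration only, a minimal pymongo sketch of that indexing setup (the question uses the Node driver, but the calls map directly); the per-day count can then be satisfied from the date index:
db.photos.create_index("id", unique=True)   # unique index prevents duplicate inserts
db.photos.create_index("date")              # non-unique index for date queries

# Counting photos for one day only needs the date index.
count = db.photos.count_documents({"date": "01/02/2013"})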
Another option is to pre-aggregate in the schema itself: keep a counter per day and increment it whenever you insert a photo. This way you don't need to run any aggregation jobs. You can also run some tests to determine whether this approach is more performant than aggregation.
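A minimal sketch of that pre-aggregation pattern, again in pymongo for consistency with the rest of this page and assuming a hypothetical daily_counts collection keyed by the YYYY-MM-DD date string:
def insert_photo(db, photo):
    # The unique index on 'id' rejects duplicate photos.
    db.photos.insert_one(photo)
    # Upsert a per-day counter document so daily totals are always precomputed.
    db.daily_counts.update_one(
        {"_id": photo["date"]},   # e.g. "2013-02-01"
        {"$inc": {"count": 1}},
        upsert=True,
    )

# Reading the count for a day is then a single document lookup.
doc = db.daily_counts.find_one({"_id": "2013-02-01"})
count = doc["count"] if doc else 0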
