I am using node-mongodb-native to fire MongoDB queries from Node.js.
There is a collection named 'locations', which has the following fields:
sublocality1, sublocality2, sublocality3, city.
I want to fetch the overall distinct values across these fields.
Eg:
Documents:
{
'sublocality1':'a',
'sublocality2':'a',
'sublocality3': 'b',
'city': 'c'
}
{
'sublocality1':'b',
'sublocality2':'a',
'sublocality3': 'b',
'city': 'a'
}
The query should return
['a' , 'b', 'c']
I tried the following:
Run a distinct query for each of the fields:
collection.distinct('sublocality1',..){},
collection.distinct('sublocality2',..){},
collection.distinct('sublocality3',..){},
collection.distinct('city',..){}
Insert the results from these queries into a list, and search for distinct items across the list.
Can I optimize this? Is it possible with a single query?
You could aggregate it on the database server as below:
Group each individual document, to get the values of each intended field
in an array.
Project a field named values as the union of all the intended field
values, using the $setUnion operator.
Unwind values.
Group all the records, to get the distinct values.
Code:
Collection.aggregate([
    // Group each document by its _id, pushing each intended field into an array
    {$group:{"_id":"$_id",
             "sublocality1":{$push:"$sublocality1"},
             "sublocality2":{$push:"$sublocality2"},
             "sublocality3":{$push:"$sublocality3"},
             "city":{$push:"$city"}}},
    // Project the union of the four arrays into a single "values" field
    {$project:{"values":{$setUnion:["$sublocality1",
                                    "$sublocality2",
                                    "$sublocality3",
                                    "$city"]}}},
    // Unwind values so each value becomes its own document
    {$unwind:"$values"},
    // Group all the records to collect the distinct values
    {$group:{"_id":null,"distinct":{$addToSet:"$values"}}},
    {$project:{"distinct":1,"_id":0}}
],function(err,resp){
    // handle response
})
Sample output:
{ "distinct" : [ "c", "a", "b" ] }
If you want the results to be sorted, you could apply a sort stage in the pipeline before the final project stage.
Related
Assume I have the following Cosmos DB container with the possible doc type partitions:
{
"id": <string>,
"partitionKey": <string>, // Always "item"
"name": <string>
}
{
"id": <string>,
"partitionKey": <string>, // Always "group"
"items": <array[string]> // Always an array of ids for items in the "item" partition
}
I have the id of a "group" document, but I do not have the document itself. What I would like to do is perform a query which gives me all "item" documents referenced by the "group" document.
I know I can perform two queries: 1) retrieve the "group" document, 2) perform a query with an IN clause on the "item" partition.
As I don't care about the "group" document other than getting the list of ids, is it possible to construct a single query to get me all the "item" documents I want with just the "group" document id?
You'll need to perform two queries, as there are no joins between separate documents. Even though there is support for subqueries, only correlated subqueries are currently supported (meaning, the inner subquery is referencing values from the outer query). Non-correlated subqueries are what you'd need.
Note that, even though you don't want all of the group document, you don't need to retrieve the entire document: you can project just the items property, which can then be used in your 2nd query via array_contains(). Something like:
SELECT VALUE g.items
FROM g
WHERE g.id="1"
AND g.partitionKey="group"
SELECT VALUE i.name
FROM i
WHERE array_contains(<items-from-prior-query>,i.id)
AND i.partitionKey="item"
This documentation page clarifies the two subquery types and support for only correlated subqueries.
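If you're doing this from Node.js, the two round trips can be chained with the @azure/cosmos SDK. A rough sketch (the client/container setup and the getGroupItems helper are assumptions for illustration, not part of the question):
// Sketch only: assumes a configured CosmosClient and container handle
const { CosmosClient } = require("@azure/cosmos");

async function getGroupItems(container, groupId) {
  // Query 1: project only the items array from the "group" document
  const { resources: groups } = await container.items.query({
    query: "SELECT VALUE g.items FROM g WHERE g.id = @id AND g.partitionKey = 'group'",
    parameters: [{ name: "@id", value: groupId }]
  }).fetchAll();

  // Query 2: fetch the referenced "item" documents by id
  const { resources: items } = await container.items.query({
    query: "SELECT VALUE i.name FROM i WHERE ARRAY_CONTAINS(@ids, i.id) AND i.partitionKey = 'item'",
    parameters: [{ name: "@ids", value: groups[0] || [] }]
  }).fetchAll();

  return items;
}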
I have a database with a collection of about 90k documents. Each document is as follows:
{
'my_field_name': "a", # Or "b" or "c" ...
'p1': Array[30],
'p2': Array[10000]
}
There are about 9 unique values for 'my_field_name'. When there were ~30k documents in the collection:
>>> db.collection.distinct("my_field_name")
["a", "b", "c"]
However, now with 90k documents, db.collection.distinct() returns an empty list.
>>> db.collection.distinct("my_field_name")
[]
Is there a maxTimeMS setting for db.collection.distinct? If so how could I set it to a higher value. If not what else could I investigate?
One thing you can do to immediately speed up your query's execution time is to index the field on which you are running the 'distinct' operation (if the field is not already indexed).
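In the mongo shell that would be something like the following (assuming the collection really is named collection, as in the question):
// Ascending index on the field used by distinct
db.collection.createIndex({ my_field_name: 1 })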
That being said, if you want to set a maxTimeMS, one workaround is to rewrite your query as an aggregation and set the operation timeout on the returned cursor, e.g.:
db.collection.aggregate([
{ $group: { _id: '$my_field_name' } },
]).maxTimeMS(10000);
However, unlike distinct, the above query returns a cursor rather than a plain array of values.
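If you want an array of values like distinct gives you, you can drain the cursor yourself; a small shell sketch, assuming the same collection and field names as above:
db.collection.aggregate([
  { $group: { _id: '$my_field_name' } },
]).maxTimeMS(10000).toArray().map(function (doc) { return doc._id; });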
I have a couchbase document as
{
"last": 123,
"data": [
[0, 1.1],
[1, 2.3]
]
}
I currently have code that upserts the document to change the last property and add values to the data array; however, I cannot find a way to insert only unique values. I'd like to avoid fetching the whole document and doing the filtering in JavaScript. Is there any way to do this in Couchbase?
arrayAddUnique will fail, because there are floats in the subarrays, per the Couchbase docs.
.mutateIn(`document`)
.upsert("last", 234)
.arrayAppend("data", newDataArray)
.execute( ... )
I have an array, let's say [1,2,3], and a collection called 'Numbers' which has a field called 'value'. I need to retain all the values in the array which are present against the 'value' field in any document in the collection.
Example,
Test array - [1,2,3]
Numbers collection - [{value: 1}, {value: 3}]
Result should be - [1,3]
The result is that way because '2' was not present against the 'value' field in any document within the 'Numbers' collection.
How do I do this?
You can try the distinct query below with a query filter.
db.Numbers.distinct( "value", { "value": { $in: [1,2,3] } } )
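If you're using the Node.js driver, the equivalent is roughly as follows (a sketch; assumes db is a connected Db instance):
db.collection('Numbers').distinct('value', { value: { $in: [1, 2, 3] } }, function (err, values) {
  // values should be [1, 3] for the sample data above
});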
I have a MongoDB document with quite a large embedded array:
name : "my-dataset"
data : [
{country : "A", province: "B", year : 1990, value: 200}
... 150 000 more
]
Let us say I want to return data objects where country == "A".
What is the proper way of doing this, for example via NodeJs?
Given 150 000 entries with 200 matches, how long should the query take approximately?
Would it be better (performance/structure wise) to store data as documents and the name as a property of each document?
Would it be more efficient to use MySQL for this?
A) Just find them with a query.
B) If the compound index {name:1, "data.country":1} is built, the query should be fast. But since you store all the data in one array, the $unwind op has to be used, so the query could be slow.
C) It will be better. If you store the data like:
{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...
With the compound index {name:1, country:1}, the query time should be < 10ms (see the sketch after this list).
D) MySQL vs MongoDB 1000 reads
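A minimal sketch of option C in the mongo shell (the collection name datasets is an assumption):
// Compound index on the dataset name and country, then a plain find
db.datasets.createIndex({ name: 1, country: 1 })
db.datasets.find({ name: "my-dataset", country: "A" })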
1. You can use the MongoDB aggregation framework:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}}
])
This will return a document for each data entry where the country is "A". If you want to regroup the datasets, add a $group stage:
db.collection.aggregate([
{$match: {name: "my-dataset"}},
{$unwind: "$data"},
{$match: {"data.country": "A"}},
{$group: {_id: "$_id", data: {$addToSet: "$data"}}}
])
(Didn't test it on a proper dataset, so it might be bugged)
2. 150,000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).
3. As long as you are sure that your document stays smaller than 16MB, the maximum BSON document size (kinda hard to say), it should be fine, but the queries would be simpler if you stored your data as documents with the dataset name as a property, which is generally better for performance.