Distinct count of multiple fields values using mongodb aggregation

I'm trying to count the distinct values of multiple fields with a single MongoDB aggregation query.
Here's my data:
{
"_id":ObjectID( "617b0dbacda6cbd1a0403f68"),
"car_type": "suv",
"color": "red",
"num_doors": 4
},
{
"_id":ObjectID( "617b0dbacda6cbd1a04078df"),
"car_type": " suv ",
"color": "blue",
"num_doors": 4
},
{
"_id":ObjectID( "617b0dbacda6cbd1a040ld45"),
"car_type": "wagon",
"color": "red",
"num_doors": 4
},
{
"_id":ObjectID( "617b0dbacda6cbd1a0403dcd"),
"car_type": "suv",
"color": "blue",
"num_doors": 4
},
{
"_id":ObjectID( "617b0dbacda6cbd1a0403879"),
"car_type": " wagon ",
"color": "red",
"num_doors": 4
},
{
"_id":ObjectID( "617b0dbacda6cbd1a0405478"),
"car_type": "wagon",
"color": "red",
"num_doors": 4
}
I want a distinct count of each color by car_type:
"car_type": "suv"
"red":2,
"blue":2
I was able to get a distinct count of all colors, but I couldn't break the counts down by car_type.

Query
Group on the most specific key first (car_type + color) to count how often each color occurs.
Then group on the less specific key (car_type) to collect every color and its count for each car_type.
Finally, $project to fix the structure and use $arrayToObject to turn the colors into keys and the counts into values.
*The query assumes that " wagon " was a typing mistake (the extra spaces, I mean). If your collection has that problem, use $trim to clean those values up.
*The query was updated to also include the sum, as requested in a comment.
aggregate(
[{"$group":
{"_id": {"car_type": "$car_type", "color": "$color"},
"count": {"$sum": 1}}},
{"$group":
{"_id": "$_id.car_type",
"colors": {"$push": {"k": "$_id.color", "v": "$count"}}}},
{"$set": {"sum": {"$sum": "$colors.v"}}},
{"$project":
{"_id": 0,
"sum": 1,
"car_type": "$_id",
"colors": {"$arrayToObject": ["$colors"]}}},
{"$replaceRoot": {"newRoot": {"$mergeObjects": ["$colors", "$$ROOT"]}}},
{"$project": {"colors": 0}}])
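The two grouping stages can be mirrored in plain Python to check the expected shape of the output. This is only a sketch of the logic, not the MongoDB API: the function name color_counts_by_type is made up for illustration, and the str.strip() call stands in for the $trim cleanup mentioned above.

```python
from collections import Counter, defaultdict

# Documents from the question (only the fields the pipeline uses),
# including the values with stray spaces.
cars = [
    {"car_type": "suv", "color": "red"},
    {"car_type": " suv ", "color": "blue"},
    {"car_type": "wagon", "color": "red"},
    {"car_type": "suv", "color": "blue"},
    {"car_type": " wagon ", "color": "red"},
    {"car_type": "wagon", "color": "red"},
]


def color_counts_by_type(docs):
    # Stage 1: group on (car_type, color) and count each pair.
    # strip() plays the role of $trim here (an assumption about the cleanup).
    pair_counts = Counter((d["car_type"].strip(), d["color"]) for d in docs)

    # Stage 2: regroup on car_type, collecting color -> count.
    grouped = defaultdict(dict)
    for (car_type, color), count in pair_counts.items():
        grouped[car_type][color] = count

    # Final shape: colors become keys, plus the per-type sum.
    return [
        {"car_type": car_type, "sum": sum(colors.values()), **colors}
        for car_type, colors in grouped.items()
    ]


result = color_counts_by_type(cars)
```

With the sample data as posted, this gives one red and two blue for suv, and three red for wagon.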

Related

Aggregate a set of unique values from all documents in CouchDB

I'm trying to create a view in CouchDB that returns groups of unique values. E.g., a list of all unique brands and categories.
map function
function (doc) {
emit("brands", [doc.brand]);
emit("categories", [doc.category]);
}
reduce function
function (keys, values, rereduce) {
return values.reduce(function(acc, value) {
if (acc.indexOf(value[0]) === -1) {
return acc.concat(value);
}
return acc;
});
}
Then I call that view with group=true and group_level=2. The grouping is correct, but the values aren't unique: the value is an array containing duplicates.
What I'm trying to achieve is basically having the key be the group name, e.g., brands, and the value be the aggregated unique values, e.g., ["Brand A", "Brand B"].
Given the following documents
[
{
"_id": "1",
"brand": "Brand A",
"category": "Category A",
"colors": [
"Red",
"White"
]
},
{
"_id": "2",
"brand": "Brand B",
"category": "Category B",
"colors": [
"Blue",
"White"
]
},
{
"_id": "3",
"brand": "Brand A",
"category": "Category B",
"colors": [
"Green",
"Red"
]
}
]
When I query the view in CouchDB, I'd like to get the following result back:
{
"brands": ["Brand A", "Brand B"],
"categories": ["Category A", "Category B"],
"colors": ["Red", "White", "Blue", "Green"]
}
Note: The result above is just a demonstration of what I expect the view to return. It does not have to be structured as such (not even sure it's possible).

I'm going to answer this myself.
First, we define a map function that emits the group name as the key and the value wrapped in an array (which makes rereduce easier).
map function
function (doc) {
emit("brands", [doc.brand]);
emit("categories", [doc.category]);
doc.colors.forEach(function(color) {
emit("colors", [color]);
})
}
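Then we define a custom reduce function
function (keys, values, rereduce) {
  // Each value is an array, both on the first pass and on rereduce.
  return values.reduce(function (acc, value) {
    value.forEach(function (v) {
      if (acc.indexOf(v) === -1) {
        acc.push(v);
      }
    });
    return acc;
  }, []); // start from an empty array so the first value is deduplicated too
}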
Now, calling the view with group=true and group_level=1 will yield the following result:
+------------+-----------------------------------+
| key | value |
+------------+-----------------------------------+
| brands | ["Brand A", "Brand B"] |
| categories | ["Category A", "Category B"] |
| colors | ["Red", "White", "Blue", "Green"] |
+------------+-----------------------------------+
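To see why wrapping each value in an array makes the reduce step uniform, here is a rough Python simulation of the view. It is not CouchDB code: map_doc, reduce_unique and run_view are illustrative names, and the grouping that CouchDB does by key is imitated with a dictionary.

```python
from collections import defaultdict

docs = [
    {"_id": "1", "brand": "Brand A", "category": "Category A", "colors": ["Red", "White"]},
    {"_id": "2", "brand": "Brand B", "category": "Category B", "colors": ["Blue", "White"]},
    {"_id": "3", "brand": "Brand A", "category": "Category B", "colors": ["Green", "Red"]},
]


def map_doc(doc):
    # Mirrors the map function: every emitted value is a one-element list.
    yield "brands", [doc["brand"]]
    yield "categories", [doc["category"]]
    for color in doc["colors"]:
        yield "colors", [color]


def reduce_unique(values):
    # Works the same on map output and on earlier reduce output,
    # because both are lists of lists.
    acc = []
    for value in values:
        for v in value:
            if v not in acc:
                acc.append(v)
    return acc


def run_view(documents):
    # Imitates group=true with group_level=1: one reduce call per key.
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            emitted[key].append(value)
    return {key: reduce_unique(values) for key, values in emitted.items()}


view = run_view(docs)
```

Feeding two partial results back into reduce_unique, as a rereduce would, again yields a list without duplicates.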

PySpark: How to create a nested JSON from spark data frame?

I am trying to create a nested JSON from my Spark DataFrame, which has data in the following structure. The code below only creates a flat JSON of keys and values. Could you please help?
df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
Update1:
As per @MaxU's answer, I converted the Spark DataFrame to pandas and used groupby. It puts the last two fields in a nested array. How can I first put the category and its count in a nested array, and then put the subcategory and its count inside that array?
Sample text data:
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
j = (data_pd.groupby(['vendor_name','vendor_Cnt','Category','Category_cnt'], as_index=False)
.apply(lambda x: x[['Subcategory','subcategory_cnt']].to_dict('r'))
.reset_index()
.rename(columns={0:'subcategories'})
.to_json(orient='records'))
[{
"vendor_name": "Vendor 1",
"count": 10,
"categories": [{
"name": "Category 1",
"count": 4,
"subCategories": [{
"name": "Sub Category 1",
"count": 1
},
{
"name": "Sub Category 2",
"count": 1
},
{
"name": "Sub Category 3",
"count": 1
},
{
"name": "Sub Category 4",
"count": 1
}
]
}]

You need to restructure the whole DataFrame for that.
"subCategories" is a struct type.
from pyspark.sql import functions as F
df.withColumn(
"subCategories",
F.struct(
F.col("subCategories").alias("name"),
F.col("subcategory_count").alias("count")
)
)
Then groupBy and use F.collect_list to create the array.
At the end, you need to have only 1 record in your dataframe to get the result you expect.

The easiest way to do this in Python/pandas is probably a series of nested generators built on groupby:
def split_df(df):
    # One dict per vendor, with its categories nested inside.
    for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
        yield {
            "vendor_name": vendor,
            "count": count,
            "categories": list(split_category(df_vendor)),
        }

def split_category(df_vendor):
    # One dict per category within a single vendor.
    for (category, count), df_category in df_vendor.groupby(
        ["Categories", "Category_Count"]
    ):
        yield {
            "name": category,
            "count": count,
            "subCategories": list(split_subcategory(df_category)),
        }

def split_subcategory(df_category):
    # Iterate over the rows of this category only, not the whole frame.
    for row in df_category.itertuples():
        yield {"name": row.Subcategory, "count": row.Subcategory_Count}

list(split_df(df))
[
{
"vendor_name": "Vendor1",
"count": 10,
"categories": [
{
"name": "Category 1",
"count": 4,
"subCategories": [
{"name": "Sub Category 1", "count": 1},
{"name": "Sub Category 2", "count": 2},
{"name": "Sub Category 3", "count": 3},
{"name": "Sub Category 4", "count": 4},
],
}
],
}
]
To export this to JSON, you'll need a way to serialize the np.int64 values.
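One possible way to handle that, sketched here under the assumption that the offending values are NumPy scalars (which expose an .item() method returning the built-in Python value), is to pass a fallback to json.dumps. The names to_builtin and dump_nested are made up for this example.

```python
import json


def to_builtin(value):
    """Fallback for json.dumps: unwrap scalar objects that expose .item()."""
    # NumPy scalars such as np.int64 provide .item(), which returns the
    # equivalent built-in Python value.
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


def dump_nested(records):
    # json.dumps calls `default` only for objects it cannot serialize itself.
    return json.dumps(records, default=to_builtin)
```

It would be called as dump_nested(list(split_df(df))).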

query mongo : find count of array in all the documents of a collection

I want to query Mongo for the count of the cars array in each document of the company collection.
I am new to mongo, I
db.company.find() --> but then how do I select the arrays and that too for all of them
collection company : {
{
"_id": "5b8ed214b460e7c17c5a33f9",
"company_location": "USA",
"company_name": "gmc",
"__v": 0,
"cars": [{
"_id": "5b8ed214044b2509466eca2e",
"model": "TERRAIN",
"year": 2013,
"PriceInINR": 3851710,
"trim": "SLE2 FWD",
"engine": "SPORT UTILITY 4-DR",
"body": "2.4L L4 DOHC 16V FFV",
"color": "Yellow",
"transmission_type": "Manual",
"dealer_id": "5b8e7ce7065fa50bee095072"
},
{------},
{------}
}

First, assign the query result to a variable, then take the length of the cars array, like below:
var obj = db.company.find({_id: "5b8ed214b460e7c17c5a33f9"}).toArray();
var count = obj[0].cars.length;
Because the query filters by _id, it matches at most one document, so the index into obj is always 0.
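The snippet above counts the cars of a single company, while the question asks about every document. The per-document logic looks like this in plain Python; the sample documents and the function name car_counts are hypothetical. On the server side, the $size aggregation operator returns the length of an array field and could serve the same purpose.

```python
# Hypothetical documents shaped like the company collection in the question.
companies = [
    {"_id": "a1", "company_name": "gmc", "cars": [{"model": "TERRAIN"}, {"model": "X"}]},
    {"_id": "b2", "company_name": "other", "cars": [{"model": "Y"}]},
    {"_id": "c3", "company_name": "empty"},  # no cars field at all
]


def car_counts(docs):
    # Map each document's _id to the length of its cars array,
    # treating a missing field as an empty array.
    return {doc["_id"]: len(doc.get("cars", [])) for doc in docs}


counts = car_counts(companies)
```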

MongoDB Get concatenated values in a find method [duplicate]

Let's say I have a collection called 'people' with the following documents:
{
"name": "doug",
"colors": ["blue", "red"]
}
{
"name": "jack",
"colors": ["blue", "purple"]
}
{
"name": "jenny",
"colors": ["pink"]
}
How would I get a concatenated array of all the colors subarrays, i.e.:
["blue", "red", "blue", "purple", "pink"]

Try this; it should work for you. Note that distinct returns each value only once, so the second "blue" is dropped:
db.people.distinct("colors")

Try to use aggregate:
db.people.aggregate([
{$unwind:"$colors"},
{$group:{_id:null, clrs: {$push : "$colors"} }},
{$project:{_id:0, colors: "$clrs"}}
])
Result:
{
"result" : [
{
"colors" : [
"blue",
"red",
"blue",
"purple",
"pink"
]
}
],
"ok" : 1
}
Updated
If you want to get unique values in result's array, you could use $addToSet operator instead of $push in the $group stage.
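The difference between the two accumulators can be illustrated in plain Python. This is a sketch of the semantics only: the order-preserving deduplication below is a convenience of this example, whereas $addToSet does not guarantee any particular order.

```python
from itertools import chain

people = [
    {"name": "doug", "colors": ["blue", "red"]},
    {"name": "jack", "colors": ["blue", "purple"]},
    {"name": "jenny", "colors": ["pink"]},
]

# Like $unwind followed by $push: every element is kept, duplicates included.
all_colors = list(chain.from_iterable(p["colors"] for p in people))

# Like $addToSet: each value appears once.
unique_colors = list(dict.fromkeys(all_colors))
```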

Mongoose: Returning unique result set with no duplicate entries

I am using Mongoose in a MEAN environment. How can I make sure to not have any duplicate results in my result set?
Example: my database contains 10 (partly duplicate) names:
Allan
Allan Fourier
Allan
Allan Maxwell
Allan
Allan Foo
Allan Whatever
Allan Whoever
Allan Smith
Allan Rogers
When querying this database for 'Allan' or maybe even just 'all' (using .find(regex...)) and limiting the number of returned results to 5, I get this:
Allan
Allan Fourier
Allan
Allan Maxwell
Allan
Having three duplicate entries of 'Allan', we waste a lot of result-diversity (talking about an autocomplete function for a search input field). I need the returned result set free of duplicates, such as:
Allan
Allan Fourier
Allan Maxwell
Allan Foo
Allan Whatever
How can that be achieved using mongoose, if at all?

You can use find to establish the query and then chain a call to distinct on the resulting query object to get the unique names in the result:
var search = 'Allan';
Name.find({name: new RegExp(search)}).distinct('name').exec(function(err, names) {...});
Or you can combine it all into a call to distinct on the model, providing the query object as the second parameter:
var search = 'Allan';
Name.distinct('name', {name: new RegExp(search)}, function(err, names) {...});
In both cases, names is an array of just the distinct names, not full document objects.
You can also do this with aggregate which would then let you directly limit the number of results:
Name.aggregate([
{$match: {name: new RegExp(search)}},
{$group: {_id: '$name'}},
{$limit: 5}
])
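The match, group and limit stages can be imitated in plain Python to check the expected autocomplete result. This is a sketch only: the function name autocomplete is made up, and keeping the original order is an assumption of the example, since $group does not guarantee output order unless a $sort stage is added.

```python
import re

names = [
    "Allan", "Allan Fourier", "Allan", "Allan Maxwell", "Allan",
    "Allan Foo", "Allan Whatever", "Allan Whoever", "Allan Smith", "Allan Rogers",
]


def autocomplete(all_names, search, limit=5):
    pattern = re.compile(search)
    # $match: keep names that match the pattern.
    # $group on the name: dict.fromkeys drops duplicates, keeping first-seen order.
    unique = dict.fromkeys(n for n in all_names if pattern.search(n))
    # $limit: cut the result only after deduplicating.
    return list(unique)[:limit]


suggestions = autocomplete(names, "Allan")
```

Limiting after the deduplication step is what keeps the five slots from being taken up by repeats.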

You can use MongoDB's distinct() query to find only distinct values (i.e., unique) in your set. Per the API docs, distinct can be used with Mongoose.
Their example:
{ "_id": 1, "dept": "A", "item": { "sku": "111", "color": "red" }, "sizes": [ "S", "M" ] }
{ "_id": 2, "dept": "A", "item": { "sku": "111", "color": "blue" }, "sizes": [ "M", "L" ] }
{ "_id": 3, "dept": "B", "item": { "sku": "222", "color": "blue" }, "sizes": "S" }
{ "_id": 4, "dept": "A", "item": { "sku": "333", "color": "black" }, "sizes": [ "S" ] }
With this data, db.inventory.distinct( "dept" ) returns [ "A", "B" ].

You can also filter duplicates out of the search result, which is an array, using the method suggested here:
Delete duplicate from Array
