CouchDB - Map Reduce similar to SQL Group by - couchdb

Consider the following sample documents stored in CouchDB:
{
"_id":....,
"rev":....,
"type":"orders",
"Period":"2013-01",
"Region":"East",
"Category":"Stationary",
"Product":"Pen",
"Rate":1,
"Qty":10,
"Amount":10
}
{
"_id":....,
"rev":....,
"type":"orders",
"Period":"2013-02",
"Region":"South",
"Category":"Food",
"Product":"Biscuit",
"Rate":7,
"Qty":5,
"Amount":35
}
Consider the following SQL query:
SELECT Period, Region,Category, Product, Min(Rate),Max(Rate),Count(Rate), Sum(Qty),Sum(Amount)
FROM Sales
GROUP BY Period,Region,Category, Product;
Is it possible to create map/reduce views in CouchDB that are equivalent to the above SQL query and produce output like this?
[
{
"Period":"2013-01",
"Region":"East",
"Category":"Stationary",
"Product":"Pen",
"MinRate":1,
"MaxRate":2,
"OrdersCount":20,
"TotQty":1000,
"Amount":1750
},
{
...
}
]

Up front, I believe @benedolph's answer is best-practice and best-case-scenario. Each reduce should ideally return 1 scalar value to keep the code as simple as possible.
However, it is true you'd have to issue multiple queries to retrieve the full result set described by your question. If you don't have the option to run queries in parallel, or it is really important to keep the number of queries down, it is possible to do it all at once.
Your map function will remain pretty simple:
function (doc) {
emit([ doc.Period, doc.Region, doc.Category, doc.Product ], doc);
}
The reduce function is where it gets lengthy:
function (key, values, rereduce) {
// helper function to sum all the values of a specified field in an array of objects
function sumField(arr, field) {
return arr.reduce(function (prev, cur) {
return prev + cur[field];
}, 0);
}
// helper function to create an array of just a single property from an array of objects
// (this function came from underscore.js, at least its name and concept)
function pluck(arr, field) {
return arr.map(function (item) {
return item[field];
});
}
// rereduce made this more challenging, and I could not thoroughly test this right now
// see the CouchDB wiki for more information
if (rereduce) {
// a rereduce handles transitionary values
// (so the "values" below are the results of previous reduce functions, not the map function)
return {
OrdersCount: sumField(values, "OrdersCount"),
MinRate: Math.min.apply(Math, pluck(values, "MinRate")),
MaxRate: Math.max.apply(Math, pluck(values, "MaxRate")),
TotQty: sumField(values, "TotQty"),
Amount: sumField(values, "Amount")
};
} else {
var rates = pluck(values, "Rate");
// This takes a group of documents and gives you the stats you were asking for
return {
OrdersCount: values.length,
MinRate: Math.min.apply(Math, rates),
MaxRate: Math.max.apply(Math, rates),
TotQty: sumField(values, "Qty"),
Amount: sumField(values, "Amount")
};
}
}
I was not able to test the "rereduce" branch of this code at all, so you'll have to do that on your end (but it should work). See the wiki for information about reduce vs rereduce.
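One way to exercise the rereduce branch without a running CouchDB is to call the function by hand in Node.js. The following is only a sketch: the function name reduceFn and the three sample documents are made up for illustration, and the body is a condensed copy of the reduce function above. It checks that reducing everything in one pass gives the same result as reducing two chunks and then combining them with rereduce set to true.

```javascript
// Condensed copy of the reduce function above so it can run outside CouchDB.
function reduceFn(key, values, rereduce) {
  function sumField(arr, field) {
    return arr.reduce(function (prev, cur) { return prev + cur[field]; }, 0);
  }
  function pluck(arr, field) {
    return arr.map(function (item) { return item[field]; });
  }
  if (rereduce) {
    // values are the outputs of earlier reduce calls
    return {
      OrdersCount: sumField(values, "OrdersCount"),
      MinRate: Math.min.apply(Math, pluck(values, "MinRate")),
      MaxRate: Math.max.apply(Math, pluck(values, "MaxRate")),
      TotQty: sumField(values, "TotQty"),
      Amount: sumField(values, "Amount")
    };
  }
  // values are the documents emitted by the map function
  var rates = pluck(values, "Rate");
  return {
    OrdersCount: values.length,
    MinRate: Math.min.apply(Math, rates),
    MaxRate: Math.max.apply(Math, rates),
    TotQty: sumField(values, "Qty"),
    Amount: sumField(values, "Amount")
  };
}

// Hypothetical order documents that all share one group key.
var docs = [
  { Rate: 1, Qty: 10, Amount: 10 },
  { Rate: 2, Qty: 5, Amount: 10 },
  { Rate: 1.5, Qty: 4, Amount: 6 }
];

// One reduce pass over everything...
var direct = reduceFn(null, docs, false);

// ...versus two partial reduces combined by a rereduce.
var partials = [
  reduceFn(null, docs.slice(0, 2), false),
  reduceFn(null, docs.slice(2), false)
];
var combined = reduceFn(null, partials, true);

console.log(direct);
console.log(combined);
```

If the two printed objects match for a few different ways of splitting the input, that is reasonable evidence that the rereduce branch is consistent with the plain reduce branch. It does not replace a test against a real database.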
The helper functions I added at the top actually made the code overall much shorter and easier to read, they're largely influenced by my experience with Underscore.js. However, you can't include CommonJS modules in reduce functions, so it has to be written manually.
Again, the best-case scenario is to have each aggregated field get its own map/reduce index, but if that isn't an option for you, the above code should get you what you've described in the question.

I will propose a very simple solution that requires one view per variable you want to aggregate in your "select" clause. While it is certainly possible to aggregate all variables in a single view, the reduce function would be far more complex.
The design document looks like this:
{
"_id": "_design/ddoc",
"_rev": "...",
"language": "javascript",
"views": {
"rates": {
"map": "function(doc) {\n emit([doc.Period, doc.Region, doc.Category, doc.Product], doc.Rate);\n}",
"reduce": "_stats"
},
"qty": {
"map": "function(doc) {\n emit([doc.Period, doc.Region, doc.Category, doc.Product], doc.Qty);\n}",
"reduce": "_stats"
}
}
}
Now, you can query <couchdb>/<database>/_design/ddoc/_view/rates?group_level=4 to get the statistics about the "Rate" variable. The result should look like this:
{"rows":[
{"key":["2013-01","East","Stationary","Pen"],"value":{"sum":4,"count":3,"min":1,"max":2,"sumsqr":6}},
{"key":["2013-01","North","Stationary","Pen"],"value":{"sum":1,"count":1,"min":1,"max":1,"sumsqr":1}},
{"key":["2013-01","South","Stationary","Pen"],"value":{"sum":0.5,"count":1,"min":0.5,"max":0.5,"sumsqr":0.25}},
{"key":["2013-02","South","Food","Biscuit"],"value":{"sum":7,"count":1,"min":7,"max":7,"sumsqr":49}}
]}
For the "Qty" variable, the query would be <couchdb>/<database>/_design/ddoc/_view/qty?group_level=4.
With the group_level property you can control over which levels the aggregation is to be performed. For example, querying with group_level=2 will aggregate up to "Period" and "Region".
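Because each view returns only one variable, the client has to stitch the rows back together to get the shape asked for in the question. A minimal sketch of that step follows, assuming both views were queried with the same group_level. The function name mergeStats and the qty sample row are made up for illustration; the rates row is taken from the result above.

```javascript
// Combine rows from the "rates" and "qty" views into one object per group key.
function mergeStats(ratesRows, qtyRows) {
  var byKey = {};
  ratesRows.forEach(function (row) {
    // The key array is serialized so it can be used as an object property name.
    byKey[JSON.stringify(row.key)] = {
      Period: row.key[0],
      Region: row.key[1],
      Category: row.key[2],
      Product: row.key[3],
      MinRate: row.value.min,
      MaxRate: row.value.max,
      OrdersCount: row.value.count
    };
  });
  qtyRows.forEach(function (row) {
    var entry = byKey[JSON.stringify(row.key)];
    if (entry) {
      entry.TotQty = row.value.sum;
    }
  });
  return Object.keys(byKey).map(function (k) { return byKey[k]; });
}

var ratesRows = [
  { key: ["2013-01", "East", "Stationary", "Pen"],
    value: { sum: 4, count: 3, min: 1, max: 2, sumsqr: 6 } }
];
// Hypothetical qty result for the same key.
var qtyRows = [
  { key: ["2013-01", "East", "Stationary", "Pen"],
    value: { sum: 25, count: 3, min: 5, max: 10, sumsqr: 225 } }
];

var merged = mergeStats(ratesRows, qtyRows);
console.log(merged);
```

A third view emitting doc.Amount could be merged in the same way to fill the Amount column.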

Related

CouchDB Count Reduce with timestamp filtering

Let's say I have documents like so:
{
_id: "a98798978s978dd98d",
type: "signature",
uid: "u12345",
category: "cat_1",
timestamp: UNIX_TIMESTAMP
}
My goal is to be able to count all signatures created by a certain uid, while being able to filter by timestamp.
Thanks to Alexis, I've gotten to this far with a reduce _count function:
function (doc) {
if (doc.type === "signature") {
emit([doc.uid, doc.timestamp], 1);
}
}
With the following queries:
start_key=[null,lowerTimestamp]
end_key=[{},higherTimestamp]
reduce=true
group_level=1
Response:
{
"rows": [
{
"key": [ "u11111" ],
"value": 3
},
{
"key": [ "u12345" ],
"value": 26
}
]
}
It counts the uid correctly but the filter doesn't work properly. At first I thought it might be a CouchDB 2.2 bug, but I tried on Cloudant and I got the same response.
Does anyone have any ideas on how I could get this to work while being able to filter by timestamp?
When using compound keys in MapReduce (i.e. the key is an array of things), you cannot query a range of keys with a "leading" array element missing. That is, you can query a range of uids and get the results ordered by timestamp, but your use case is the other way round: you want to query uids by time.
I'd be tempted to put time first in the array, but Unix timestamps are not so good for grouping ;). I don't know the ins and outs of your application, but you could index a date instead of a timestamp like so:
function (doc) {
if (doc.type === "signature") {
// assumes doc.timestamp is in milliseconds; multiply by 1000 if it is in seconds
var date = new Date(doc.timestamp);
var datestr = date.toISOString().split('T')[0];
emit([datestr, doc.uid], 1);
}
}
This would allow you to query a range of dates (to the resolution of a whole day):
?startkey=["2018-01-01"]&endkey=["2018-02-01"]&group_level=2
albeit with your uids grouped by day.
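Since the rows come back as one entry per day and uid, getting a single count per uid for the whole date range needs one more step on the client. A minimal sketch, assuming the response rows have a [date, uid] key and a numeric count as value; the function name totalsPerUid and the sample rows are made up for illustration.

```javascript
// Rows as returned with group_level=2: key is [date, uid], value is that day's count.
function totalsPerUid(rows) {
  var totals = {};
  rows.forEach(function (row) {
    var uid = row.key[1];
    totals[uid] = (totals[uid] || 0) + row.value;
  });
  return totals;
}

// Hypothetical rows for a three-day range.
var rows = [
  { key: ["2018-01-01", "u11111"], value: 2 },
  { key: ["2018-01-01", "u12345"], value: 20 },
  { key: ["2018-01-02", "u11111"], value: 1 },
  { key: ["2018-01-03", "u12345"], value: 6 }
];

var totals = totalsPerUid(rows);
console.log(totals);
```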

A find() statement with possible null parameters

I'm trying to figure out how Mongoose and MongoDB work. I'm really new to them, and I can't figure out how to return values from a find statement when some of the given query parameters may be null. Is there an attribute I can set for this or something?
To explain it further, I have a web page that has different input fields that are used to search for a company, however they're not all mandatory.
var Company = mongoose.model('Company');
Company.find({companyName: req.query.companyName, position: req.query.position,
areaOfExpertise: req.query.areaOfExpertise, zip: req.query.zip,
country: req.query.country}, function(err, docs) {
res.json(docs);
});
By filling out all the input fields on the webpage I get a result back, but only that specific one which matches. Let's say I only fill out country, it returns nothing because the rest are empty, but I wish to return all rows which are e.g. in Germany. I hope I expressed myself clearly enough.
You need to wrap the queries with the $or logic operator, for example
var Company = mongoose.model('Company');
Company.find(
{
"$or": [
{ "companyName": req.query.companyName },
{ "position": req.query.position },
{ "areaOfExpertise": req.query.areaOfExpertise },
{ "zip": req.query.zip },
{ "country": req.query.country }
]
}, function(err, docs) {
res.json(docs);
}
);
Another approach would be to construct a query that checks for empty parameters and includes only those that are not null. For example, you can use the req.query object itself as your query, assuming its keys are the same as your document's fields, as in the following:
/*
the req.query object will only have two parameters/keys e.g.
req.query = {
position: "Developer",
country: "France"
}
*/
var Company = mongoose.model('Company');
Company.find(req.query, function(err, docs) {
if (err) throw err;
res.json(docs);
});
In the above, the req.query object acts as the query and has an implicit logical AND operation since MongoDB provides an implicit AND operation when specifying a comma separated list of expressions. Using an explicit AND with the $and operator is necessary when the same field or operator has to be specified in multiple expressions.
If you are after a query that combines logical AND and OR, for example one that matches position AND country, OR any of the other fields, then you may tweak the query to:
var Company = mongoose.model('Company');
Company.find(
{
"$or": [
{ "companyName": req.query.companyName },
{
"position": req.query.position,
"country": req.query.country
},
{ "areaOfExpertise": req.query.areaOfExpertise },
{ "zip": req.query.zip }
]
}, function(err, docs) {
res.json(docs);
}
);
but then again this could be subject to what query parameters need to be joined as mandatory etc.
I simply ended up deleting the parameters in the query in case they were empty. It seems all the text fields in the submit are submitted as "" (empty). Since there are no such values in the database, it would return nothing. So simple it never crossed my mind...
Example:
if (req.query.companyName === "") {
delete req.query.companyName;
}
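The same idea can be applied to every search field at once instead of writing one if statement per field. The following is a minimal sketch: the helper name buildQuery and the SEARCH_FIELDS list are made up for illustration, with the field names taken from the question. It copies only non-empty string values, so empty inputs and unexpected parameters never reach the query.

```javascript
// Field names taken from the question; anything else in the input is ignored.
var SEARCH_FIELDS = ["companyName", "position", "areaOfExpertise", "zip", "country"];

// Build a query object containing only the fields the user actually filled in.
function buildQuery(params) {
  var query = {};
  SEARCH_FIELDS.forEach(function (field) {
    var value = params[field];
    if (typeof value === "string" && value.trim() !== "") {
      query[field] = value.trim();
    }
  });
  return query;
}

// Only "country" was filled in on the form.
var query = buildQuery({ companyName: "", position: "", country: "Germany", zip: "" });
console.log(query);
```

The result could then be passed to the find call from the question, as in Company.find(buildQuery(req.query), callback).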

Using MapReduce for results of geospatial indexes in Cloudant

I am using a geospatial index in Cloudant for retrieving all documents inside a polygon. Now I want to calculate some basic statistics for those documents (e.g. average age and sum of earnings in a region).
Is it possible to query the geo index and then pass the result on to the MapReduce function?
How can I achieve this, preferably inside the database? Can I avoid querying for the document ids inside the polygon first and then sending the retrieved ids for performing the MapReduce (I am working with large data sets)?
What is working so far is querying the index as well as using the view (separately).
My geo index
function (doc) {
if (doc.geometry && doc.geometry.coordinates) {
st_index(doc.geometry);
}
}
My view
function (doc) {
var beitrag = doc.properties.beitrag;
var schadenaufwand = doc.schadenaufwand;
if(beitrag !== null && typeof beitrag === 'number' ) {
emit(doc._id, doc.properties.beitrag);
}
}
A sample geoJson document (original data looks similar)
{
"_id": "01bff77f642fc4249e787d2ded011504",
"_rev": "1-25a9a1a15939d5b21af3fbcc5c2d6ed1",
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
7.2316,
40.99
]
},
"properties": {
"age": 34,
"earnings": 982.7
}
}
This question is similar, but did not really help me: Cloudant - apply a view/mapReduce to a geospatial query
This demo could be something in the right direction: https://examples.cloudant.com/simplegeo_places/_design/geo/index.html
It seems like it would be a useful feature, but the answer to this is 'no'. The Geo indexer can't perform aggregations over the data.
I think you'll have to do as you were thinking -- use the returned list of doc ids to distribute the calculation in another map-reduce system.
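For modest result sets, the calculation can also be done directly on the documents returned for the polygon. This is only a sketch of that client-side step, assuming the matching documents are already available as an array shaped like the sample GeoJSON document in the question. The function name summarize and the sample values are made up for illustration.

```javascript
// Compute the average age and total earnings over an array of GeoJSON features.
function summarize(docs) {
  var ageSum = 0;
  var ageCount = 0;
  var earningsSum = 0;
  docs.forEach(function (doc) {
    var props = doc.properties || {};
    if (typeof props.age === "number") {
      ageSum += props.age;
      ageCount += 1;
    }
    if (typeof props.earnings === "number") {
      earningsSum += props.earnings;
    }
  });
  return {
    count: docs.length,
    averageAge: ageCount ? ageSum / ageCount : null,
    totalEarnings: earningsSum
  };
}

// Hypothetical documents; the last one has no age and is skipped for the average.
var stats = summarize([
  { properties: { age: 34, earnings: 982.5 } },
  { properties: { age: 40, earnings: 17.5 } },
  { properties: { earnings: 100 } }
]);
console.log(stats);
```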

Mongoose query returning repeated results

The query receives a pair of coordinates, a maximum Distance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations according to the position given. There is no visible error in my code, however, when I call the query again, it returns repeated results. "skip" variable is updated according to the results returned.
Example:
1) I make query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) Query is called again now, skip = 10, limit = 10. I receive another 10 locations with repeated results from the first query.
QUERY
Locations.find({ coordinates :
{ $near : [ x , y ],
$maxDistance: maxDistance }
})
.sort('date_created')
.skip(skip)
.limit(limit)
.exec(function(err, locations) {
console.log("[+]Found Locations");
callback(locations);
});
SCHEMA
var locationSchema = new Schema({
date_created: { type: Date },
coordinates: [],
text: { type: String }
});
I have tried looking everywhere for a solution. Could the problem be the versions I am using? I use mongoose 4.x.x, and I believe mongodb is something like 2.5.6. Any ideas?
There are a couple of things to consider here in the sort of results that you want, with the first consideration being that you have a "secondary" sort criteria in the "date_created" to deal with.
The basic problem there is that the $near operator and similar operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location, and simply "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are therefore other options for this.
The second case is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. So it's better to select data based on a "range" where it occurs rather than "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The aggregation command of $geoNear is good for this, and especially since we want to do other sorting:
var seenIds = [],
lastDistance = null,
lastDate = null;
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance,
"distanceField": "dist",
"limit": 10
}},
{ "$sort": { "dist": 1, "date_created": -1 } }
],
function(err,results) {
results.forEach(function(result) {
if ( ( result.dist != lastDistance ) || ( result.date_created != lastDate ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
That is the first iteration of your results where you fetch the first 10. Noting the logic inside the loop, where each document in the results is inspected for either a change in the "date_created" or the projected "dist" field now present in the document and where this occurs the "seenIds" array is wiped of all current entries. The general action is that all the variables are tested and possibly updated on each iteration and where there is no change then items are added to the list of "seenIds".
All those three variables being worked on need to be stored somewhere awaiting the next request. For web applications the session store is ideal, but different approaches vary. You just want those values to be recalled when we start the next request, as on the next and subsequent iterations we alter the query a bit:
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance,
"minDistance": lastDistance,
"distanceField": "dist",
"limit": 10,
"query": {
"_id": { "$nin": seenIds },
"date_created": { "$lt": lastDate }
}
}},
{ "$sort": { "dist": 1, "date_created": -1 } }
],
function(err,results) {
results.forEach(function(result) {
if ( ( result.dist != lastDistance ) || ( result.date_created != lastDate ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
So there the "minDistance" parameter is entered, as you want to exclude any of the "nearer" results that have already been seen. The additional checks are placed in the query: the "date_created" needs to be "less than" the "lastDate" recorded, since we are in descending order of sort, and the final "sure" filter excludes any "_id" values that were recorded within the list because the values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow as generally you are not going to find things all at the same distance, but it is a general process of paging a sorted list of data like this, so it is worth understanding the concept.
So if you want to be able to use a secondary field to sort on with geospatial data and also considering the "near" distance then this is the general approach, by projecting a distance value into the document results as well as storing the last seen values before any changes that would not make them unique.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.
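The bookkeeping done inside the result loop can be pulled out into a small pure function, which makes it easier to test and to store between requests. This is a sketch under some assumptions: the name advanceCursor and the cursor object are made up for illustration, and the date is kept as a millisecond number rather than a Date object, so that two equal dates compare as equal and the value survives being serialized into a session.

```javascript
// Return the paging state to remember after one page of results.
// cursor: { seenIds: [...], lastDistance: number|null, lastDate: number|null }
function advanceCursor(cursor, results) {
  var next = {
    seenIds: cursor.seenIds.slice(),
    lastDistance: cursor.lastDistance,
    lastDate: cursor.lastDate
  };
  results.forEach(function (result) {
    var created = result.date_created.getTime();
    // A change in either sort field starts a new run of "tied" documents.
    if (result.dist !== next.lastDistance || created !== next.lastDate) {
      next.seenIds = [];
      next.lastDistance = result.dist;
      next.lastDate = created;
    }
    next.seenIds.push(result._id);
  });
  return next;
}

// Hypothetical first page: the last two documents tie on distance and date.
var page = [
  { _id: "a", dist: 1, date_created: new Date(Date.UTC(2020, 0, 2)) },
  { _id: "b", dist: 2, date_created: new Date(Date.UTC(2020, 0, 1)) },
  { _id: "c", dist: 2, date_created: new Date(Date.UTC(2020, 0, 1)) }
];
var cursor = advanceCursor({ seenIds: [], lastDistance: null, lastDate: null }, page);
console.log(cursor);
```

When building the next query from this state, the stored number would need to be turned back into a date with new Date(cursor.lastDate) before it is used in the "$lt" condition.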

$addToSet and return all new items added?

Is it possible to $addToSet and determine which items were added to the set?
i.e. $addToSet tags to a post and return which ones were actually added
Not really, and not with a single statement. The closest you can get is to use the findAndModify() method and compare the original form of the document to the fields that you submitted in your $addToSet statement.
So considering an initial document:
{
"fields": [ "B", "C" ]
}
And then processing this code:
var setInfo = [ "A", "B" ];
var matched = [];
var doc = db.collection.findAndModify(
{ "_id": "myid" },
{
"$addToSet": { "fields": { "$each": setInfo } }
}
);
doc.fields.forEach(function(field) {
if ( setInfo.indexOf(field) != -1 ) {
matched.push(field);
}
});
return matched;
So that is a basic JavaScript abstraction of the methods and not actually nodejs general syntax for either the native node driver or the Mongoose syntax, but it does describe the basic premise.
So as long as you are using a "default" implementation method that returns the "original" state of the document before it was modified, then you can play "spot the difference" as it were, as is shown in the code example.
But doing this over general "update" operations is just not possible, as they are designed to possibly affect one or more objects and never return this detail.
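The "spot the difference" step can also be written to return the items that were newly added, which is what the question asks for. A minimal sketch, assuming the original array (before the update) is available and the set members are primitives such as strings, since indexOf compares with strict equality. The function name addedItems is made up for illustration.

```javascript
// Given the array as it was before the update and the values passed to $each,
// return the values that were not already present.
function addedItems(originalFields, requested) {
  var added = [];
  requested.forEach(function (item) {
    var alreadyThere = originalFields.indexOf(item) !== -1;
    var alreadyCounted = added.indexOf(item) !== -1;
    if (!alreadyThere && !alreadyCounted) {
      added.push(item);
    }
  });
  return added;
}

// Same data as the example above: the document held ["B", "C"], the update sent ["A", "B"].
var added = addedItems(["B", "C"], ["A", "B"]);
console.log(added);
```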

Resources