Using MapReduce for results of geospatial indexes in Cloudant - node.js

I am using a geospatial index in Cloudant for retrieving all documents inside a polygon. Now I want to calculate some basic statistical values for those documents (e.g. average age and sum of earnings in a region).
Is it possible to query the geo index and then pass the result on to the MapReduce function?
How can I achieve this, preferably inside the database? Can I avoid querying for the document ids inside the polygon first and then sending the retrieved ids off to perform the MapReduce (I am working with large data sets)?
What is working so far is querying the index as well as using the view (separately).
My geo index
function (doc) {
  if (doc.geometry && doc.geometry.coordinates) {
    st_index(doc.geometry);
  }
}
My view
function (doc) {
  // guard against documents without a properties object
  if (doc.properties && typeof doc.properties.beitrag === 'number') {
    emit(doc._id, doc.properties.beitrag);
  }
}
A sample GeoJSON document (original data looks similar)
{
  "_id": "01bff77f642fc4249e787d2ded011504",
  "_rev": "1-25a9a1a15939d5b21af3fbcc5c2d6ed1",
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      7.2316,
      40.99
    ]
  },
  "properties": {
    "age": 34,
    "earnings": 982.7
  }
}
This question is similar, but did not really help me: Cloudant - apply a view/mapReduce to a geospatial query
This demo could be something in the right direction: https://examples.cloudant.com/simplegeo_places/_design/geo/index.html

It seems like it would be a useful feature, but the answer to this is 'no'. The Geo indexer can't perform aggregations over the data.
I think you'll have to do as you were thinking -- use the returned list of doc ids to distribute the calculation in another map-reduce system.
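If you do chain the two queries client-side, a minimal Node.js sketch could look like the following. The database URL, design document and index names are placeholders, Node 18+'s global fetch is assumed, and the exact shape of the geo response may differ; the view queried here is the one above, which emits doc._id as its key:
const DB_URL = 'https://myaccount.cloudant.com/mydb'; // placeholder

async function sumBeitragInPolygon(polygonWkt) {
  // 1. geo query: collect the ids of all documents inside the polygon
  //    (response shape assumed: { rows: [ { id: ... }, ... ] })
  const geo = await fetch(
    DB_URL + '/_design/geodd/_geo/geoidx?g=' + encodeURIComponent(polygonWkt)
  ).then(res => res.json());
  const ids = geo.rows.map(row => row.id);

  // 2. view query: POST those ids as keys to restrict the view to them
  const view = await fetch(DB_URL + '/_design/viewdd/_view/beitrag', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ keys: ids })
  }).then(res => res.json());

  // 3. aggregate client-side
  return view.rows.reduce((sum, row) => sum + row.value, 0);
}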

Related

CouchDB Count Reduce with timestamp filtering

Let's say I have documents like so:
{
  _id: "a98798978s978dd98d",
  type: "signature",
  uid: "u12345",
  category: "cat_1",
  timestamp: UNIX_TIMESTAMP
}
My goal is to be able to count all signatures created by a certain uid while being able to filter by timestamp.
Thanks to Alexis, I've gotten this far with a map function used with the built-in _count reduce:
function (doc) {
  if (doc.type === "signature") {
    emit([doc.uid, doc.timestamp], 1);
  }
}
With the following queries:
start_key=[null,lowerTimestamp]
end_key=[{},higherTimestamp]
reduce=true
group_level=1
Response:
{
  "rows": [
    {
      "key": [ "u11111" ],
      "value": 3
    },
    {
      "key": [ "u12345" ],
      "value": 26
    }
  ]
}
It counts the uid correctly but the filter doesn't work properly. At first I thought it might be a CouchDB 2.2 bug, but I tried on Cloudant and I got the same response.
Does anyone have any ideas on how I could get this to work while being able to filter by timestamp?
When using compound keys in MapReduce (i.e. the key is an array of things), you cannot query a range of keys with a "leading" array element missing: you can query a range of uids (and, within a single uid, a range of timestamps), but your use-case is the other way round - you want to filter all uids by time.
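For example, with the [uid, timestamp] key above, counting one uid's signatures within a time window does work, because the leading element is fixed:
start_key=["u12345",lowerTimestamp]
end_key=["u12345",higherTimestamp]
reduce=true
group_level=1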
I'd be tempted to put time first in the array, but unix timestamps are not so good for grouping ;). I don't know the ins and outs of your application, but if you were to index a date instead of a timestamp like so:
function (doc) {
  if (doc.type === "signature") {
    // note: multiply by 1000 if doc.timestamp is in seconds,
    // since the Date constructor expects milliseconds
    var date = new Date(doc.timestamp);
    var datestr = date.toISOString().split('T')[0];
    emit([datestr, doc.uid], 1);
  }
}
This would allow you to query a range of dates (to the resolution of a whole day):
?startkey=["2018-01-01"]&endkey=["2018-02-01"]&group_level=2
albeit with your uids grouped by day.
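Illustratively (the counts here are made up), the grouped response would then look like:
{
  "rows": [
    { "key": ["2018-01-02", "u11111"], "value": 2 },
    { "key": ["2018-01-02", "u12345"], "value": 5 }
  ]
}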

Node.js/MongoDB - querying dates

I'm having a bit of an issue understanding how to query dates; I think the issue might be with how my data is structured. Here is a sample document on my database.
{
  "phone_num": 12553,
  "facilities": [
    "flat-screen",
    "parking"
  ],
  "surroundings": [
    "ping-pong",
    "pool"
  ],
  "rooms": [
    {
      "room_name": "Standard Suite",
      "capacity": 2,
      "bed_num": 1,
      "price": 50,
      "floor": 1,
      "reservations": [
        {
          "checkIn": {
            "$date": "2019-01-10T23:23:50.000Z"
          },
          "checkOut": {
            "$date": "2019-01-20T23:23:50.000Z"
          }
        }
      ]
    }
  ]
}
I'm trying to query the dates to check if a specific room is available in a certain date range, but no matter what I do I can't seem to get a proper result; either my query 404's or it returns an empty array.
I really tried everything; right now, for simplicity, I'm just trying to get the query to work with checkIn so I can figure out what I'm doing wrong. I tried 100 variants of the code below but couldn't get it to work at all.
.find({"rooms": { "reservations": { "checkIn" : {"$gte": { "$date": "2019-01-09T00:00:00.000Z"}}}}})
Am I misunderstanding how the .find method works, or is something wrong with how I'm storing my dates? (I keep seeing people mention ISODate, but I'm not too sure what that is or how to implement it.)
Thanks in advance.
I think the query you posted is not correct. For example, if you want to query for rooms with check-in times in a certain range, the query should be like this:
.find({"rooms.reservations.checkIn": {$gte: new Date("2019-01-06T13:11:50+06:00"), $lt: new Date("2019-01-06T14:12:50+06:00")}})
Now you can do the same with the checkOut time to get the proper filtering and find the rooms available within a date range.
A word of advice though, the way you've designed your collection is not sustainable in the long run. For example, the date query you're trying to run will give you the correct documents, but not the rooms inside each document that satisfy your date range. You'll have to do it yourself on the server side (assuming you're not using aggregation). This will block your server from handling other pending requests which is not desirable. I suggest you break the collection down and have rooms and reservations in separate collections for easier querying.
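If you do reach for aggregation, a sketch like the following (the hotels collection name is a placeholder) would return the individual reservations that match a date filter, rather than whole documents:
db.hotels.aggregate([
  { "$unwind": "$rooms" },              // one document per room
  { "$unwind": "$rooms.reservations" }, // one document per reservation
  { "$match": {
    "rooms.reservations.checkIn": { "$gte": new Date("2019-01-09T00:00:00Z") }
  }}
])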
Recently I was working on date queries. First of all, we need to understand how dates are stored in the MongoDB database. Say I have stored data using the UTC time format, like 2020-07-21T09:45:06.567Z,
and my JSON structure is:
[
  {
    "dateOut": "2020-07-21T09:45:06.567Z",
    "_id": "5f1416378210c50bddd093b9",
    "customer": {
      "isGold": true,
      "_id": "5f0c1e0d1688c60b95360565",
      "name": "pavel_1",
      "phone": 123456789
    },
    "movie": {
      "_id": "5f0e15412065a90fac22309a",
      "title": "hello world",
      "dailyRentalRate": 20
    }
  }
]
and I want to perform a query that returns all data for just that date (2020-07-21). How can we do that? First we need to understand the basic idea:
let result = await Rental.find({
  dateOut: {
    $lt: new Date('2020-07-22').toISOString(),
    $gt: new Date('2020-07-21').toISOString()
  }
})
We want the 21st's data, so our query is "greater than the 21st and less than the 22nd", because times like 2020-07-21T00:45:06.567Z or 2020-07-21T01:45:06.567Z are greater than the start of the 21st but less than the start of the 22nd.
var mydate1 = new Date();           // the current date and time as a Date object
var mytime1 = new Date().getTime(); // the current time in milliseconds since the epoch
ObjectId.getTimestamp()
Returns the timestamp portion of the ObjectId() as a Date.
Example
The following example calls the getTimestamp() method on an ObjectId():
ObjectId("507c7f79bcf86cd7994f6c0e").getTimestamp()
This will return the following output:
ISODate("2012-10-15T21:26:17Z")
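Going the other way, you can build an ObjectId from a time and use it to range-query on _id. A small sketch (ObjectId.createFromTime() is available in the Node.js driver and mongosh and takes seconds since the epoch; the collection name is a placeholder):
// find documents whose ObjectId was generated on or after the given date
var start = ObjectId.createFromTime(new Date("2021-07-12").getTime() / 1000);
db.mycollection.find({ _id: { $gte: start } });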
If you're using timestamp data to query, e.g. "createdAt": "2021-07-12T16:06:34.949Z":
const start = req.params.id; // e.g. "2021-07-12"
const data = await Model.find({
  "createdAt": {
    '$gte': `${start}T00:00:00.000Z`,
    '$lt': `${start}T23:59:59.999Z`
  }
});
console.log(data);
It will show the data for that particular date, i.e. in this case "2021-07-12".

Cloudant Query 2.0 unexpected behavior

I create a search index using the function below:
function (doc) {
  if (doc.type === 'Property') {
    if (doc.Beds_Max) {
      try {
        index("Beds_Max", parseInt(doc.Beds_Max));
      } catch (err) {
        //ooopss
      }
    }
    if (doc.YearBuilt) {
      try {
        index("YearBuilt", parseInt(doc.YearBuilt));
      } catch (err) {
        //ooopss
      }
    }
  }
}
using the Cloudant dashboard's Design Documents -> New Search Index, and after the index is built I can issue queries like
"YearBuilt": [2010 TO Infinity]
But if I try to query the same index using Cloudant Query, I see weird behavior. If I go to Cloudant Dashboard -> Query and pass something like
{"limit": 5,
"selector": {
"_id": {
"$gt": null
},
"Beds_Max": {"$gte": 7}
},
fields: ["_id"]}
I see a huge spike in data transmission: it keeps receiving huge amounts of data, even for the most unusual queries which are only supposed to return 1 or 2 results, and then it hangs my computer, which most probably is not right. When I use the pouchdb-find npm module, which has support for Cloudant Query 2.0, and issue the same selector as above, I see inconsistent behavior, e.g. sometimes it returns 0 rows and sometimes it gives an ETIMEOUT error. If I change the index and exclude parseInt, I can query using the same pouchdb-find and even Cloudant Dashboard -> Query and get results, but in that case I lose the ability to use inequality operators, which is a no-go for me.
I'm open to work-arounds and even altogether different features to achieve the desired result.

Mongoose query returning repeated results

The query receives a pair of coordinates, a maximum distance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations for the given position. There is no visible error in my code; however, when I call the query again, it returns repeated results. The "skip" variable is updated according to the number of results returned.
Example:
1) I make a query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) The query is called again, now with skip = 10, limit = 10. I receive another 10 locations, with repeated results from the first query.
QUERY
Locations.find({ coordinates:
    { $near: [x, y],
      $maxDistance: maxDistance }
  })
  .sort('date_created')
  .skip(skip)
  .limit(limit)
  .exec(function (err, locations) {
    console.log("[+]Found Locations");
    callback(locations);
  });
SCHEMA
var locationSchema = new Schema({
  date_created: { type: Date },
  coordinates: [],
  text: { type: String }
});
I have tried looking everywhere for a solution. Could it be down to the versions of Mongo? I use mongoose 4.x.x and mongodb 2.5.6, I believe. Any ideas?
There are a couple of things to consider here in the sort of results that you want, the first being that you have a "secondary" sort criterion in "date_created" to deal with.
The basic problem there is that the $near operator and like operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location, and simply just "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are therefore other options for this.
The second case is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. So it's better to select data based on the "range" where it occurs rather than "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The aggregation command of $geoNear is good for this, and especially since we want to do other sorting:
var seenIds = [],
    lastDistance = null,
    lastDate = null;

Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "distanceField": "dist",
      "limit": 10
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function (err, results) {
    results.forEach(function (result) {
      // compare dates by value (+date gives epoch ms); Date objects
      // compared with != are always unequal by reference
      if ( ( result.dist != lastDistance ) || ( +result.date_created != +lastDate ) ) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
)
That is the first iteration of your results where you fetch the first 10. Noting the logic inside the loop, where each document in the results is inspected for either a change in the "date_created" or the projected "dist" field now present in the document and where this occurs the "seenIds" array is wiped of all current entries. The general action is that all the variables are tested and possibly updated on each iteration and where there is no change then items are added to the list of "seenIds".
All those three variables being worked on need to be stored somewhere awaiting the next request. For web applications the session store is ideal, but different approaches vary. You just want those values to be recalled when we start the next request, as on the next and subsequent iterations we alter the query a bit:
Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "minDistance": lastDistance,
      "distanceField": "dist",
      "limit": 10,
      "query": {
        "_id": { "$nin": seenIds },
        "date_created": { "$lt": lastDate }
      }
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function (err, results) {
    results.forEach(function (result) {
      if ( ( result.dist != lastDistance ) || ( +result.date_created != +lastDate ) ) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
)
So there the "minDistance" parameter is entered, as you want to exclude any of the "nearer" results that have already been seen, and additional checks are placed in the query: the "date_created" needs to be "less than" the "lastDate" recorded, since we are in descending order of sort, with the final "sure" filter excluding any "_id" values that were recorded in the list because the values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow as generally you are not going to find things all at the same distance, but it is a general process of paging a sorted list of data like this, so it is worth understanding the concept.
So if you want to sort geospatial results on a secondary field while also considering the "near" distance, this is the general approach: project a distance value into the document results, and store the last-seen values and ids so that tied results are not repeated on the next page.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.

CouchDB - Map Reduce similar to SQL Group by

Consider the following sample documents stored in CouchDB:
{
  "_id": ....,
  "rev": ....,
  "type": "orders",
  "Period": "2013-01",
  "Region": "East",
  "Category": "Stationary",
  "Product": "Pen",
  "Rate": 1,
  "Qty": 10,
  "Amount": 10
}
{
  "_id": ....,
  "rev": ....,
  "type": "orders",
  "Period": "2013-02",
  "Region": "South",
  "Category": "Food",
  "Product": "Biscuit",
  "Rate": 7,
  "Qty": 5,
  "Amount": 35
}
Consider the following SQL query:
SELECT Period, Region,Category, Product, Min(Rate),Max(Rate),Count(Rate), Sum(Qty),Sum(Amount)
FROM Sales
GROUP BY Period,Region,Category, Product;
Is it possible to create map/reduce views in CouchDB equivalent to the above SQL query, producing output like:
[
  {
    "Period": "2013-01",
    "Region": "East",
    "Category": "Stationary",
    "Product": "Pen",
    "MinRate": 1,
    "MaxRate": 2,
    "OrdersCount": 20,
    "TotQty": 1000,
    "Amount": 1750
  },
  {
    ...
  }
]
Up front, I believe #benedolph's answer is best-practice and best-case-scenario. Each reduce should ideally return 1 scalar value to keep the code as simple as possible.
However, it is true you'd have to issue multiple queries to retrieve the full resultset described by your question. If you don't have the option to run queries in parallel, or it is really important to keep the number of queries down, it is possible to do it all at once.
Your map function will remain pretty simple:
function (doc) {
  emit([doc.Period, doc.Region, doc.Category, doc.Product], doc);
}
The reduce function is where it gets lengthy:
function (key, values, rereduce) {
  // helper function to sum all the values of a specified field in an array of objects
  function sumField(arr, field) {
    return arr.reduce(function (prev, cur) {
      return prev + cur[field];
    }, 0);
  }
  // helper function to create an array of just a single property from an array of objects
  // (this function came from underscore.js, at least its name and concept)
  function pluck(arr, field) {
    return arr.map(function (item) {
      return item[field];
    });
  }

  // rereduce made this more challenging, and I could not thoroughly test this right now
  // see the CouchDB wiki for more information
  if (rereduce) {
    // a rereduce handles transitionary values
    // (so the "values" below are the results of previous reduce functions, not the map function)
    return {
      OrdersCount: sumField(values, "OrdersCount"),
      MinRate: Math.min.apply(Math, pluck(values, "MinRate")),
      MaxRate: Math.max.apply(Math, pluck(values, "MaxRate")),
      TotQty: sumField(values, "TotQty"),
      Amount: sumField(values, "Amount")
    };
  } else {
    var rates = pluck(values, "Rate");
    // this takes a group of documents and gives you the stats you were asking for
    return {
      OrdersCount: values.length,
      MinRate: Math.min.apply(Math, rates),
      MaxRate: Math.max.apply(Math, rates),
      TotQty: sumField(values, "Qty"),
      Amount: sumField(values, "Amount")
    };
  }
}
I was not able to test the "rereduce" branch of this code at all, you'll have to do that on your end. (but this should work) See the wiki for information about reduce vs rereduce.
The helper functions I added at the top actually made the code overall much shorter and easier to read, they're largely influenced by my experience with Underscore.js. However, you can't include CommonJS modules in reduce functions, so it has to be written manually.
Again, the best-case scenario is to have each aggregated field get its own map/reduce index, but if that isn't an option for you, the above code should get you what you've described in the question.
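(To match the SQL GROUP BY, you would query this combined view grouped on the full compound key; the design document and view names here are placeholders:)
GET /mydb/_design/ddoc/_view/orders?group_level=4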
I will propose a very simple solution that requires one view per variable you want to aggregate in your "select" clause. While it is certainly possible to aggregate all variables in a single view, the reduce function would be far more complex.
The design document looks like this:
{
  "_id": "_design/ddoc",
  "_rev": "...",
  "language": "javascript",
  "views": {
    "rates": {
      "map": "function(doc) {\n  emit([doc.Period, doc.Region, doc.Category, doc.Product], doc.Rate);\n}",
      "reduce": "_stats"
    },
    "qty": {
      "map": "function(doc) {\n  emit([doc.Period, doc.Region, doc.Category, doc.Product], doc.Qty);\n}",
      "reduce": "_stats"
    }
  }
}
Now, you can query <couchdb>/<database>/_design/ddoc/_view/rates?group_level=4 to get the statistics about the "Rate" variable. The result should look like this:
{"rows":[
{"key":["2013-01","East","Stationary","Pen"],"value":{"sum":4,"count":3,"min":1,"max":2,"sumsqr":6}},
{"key":["2013-01","North","Stationary","Pen"],"value":{"sum":1,"count":1,"min":1,"max":1,"sumsqr":1}},
{"key":["2013-01","South","Stationary","Pen"],"value":{"sum":0.5,"count":1,"min":0.5,"max":0.5,"sumsqr":0.25}},
{"key":["2013-02","South","Food","Biscuit"],"value":{"sum":7,"count":1,"min":7,"max":7,"sumsqr":49}}
]}
For the "Qty" variable, the query would be <couchdb>/<database>/_design/ddoc/_view/qty?group_level=4.
With the group_level parameter you can control the level at which the aggregation is performed. For example, querying with group_level=2 will aggregate up to "Period" and "Region".
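Based on the rows above, a group_level=2 query on the rates view would collapse the keys to their first two elements:
{"rows":[
{"key":["2013-01","East"],"value":{"sum":4,"count":3,"min":1,"max":2,"sumsqr":6}},
{"key":["2013-01","North"],"value":{"sum":1,"count":1,"min":1,"max":1,"sumsqr":1}},
{"key":["2013-01","South"],"value":{"sum":0.5,"count":1,"min":0.5,"max":0.5,"sumsqr":0.25}},
{"key":["2013-02","South"],"value":{"sum":7,"count":1,"min":7,"max":7,"sumsqr":49}}
]}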
