How to perform multi-dimensional queries in CouchDB/Cloudant - couchdb

I'm a beginner with CouchDB/Cloudant and I would like some expert advice on the most appropriate method of performing multidimensional queries.
Example...
My documents are like this
{
  "_id": "79f14b64c57461584b152123e3924516",
  "lat": -71.05204477,
  "lng": 42.36674199,
  "time": 1531500769,
  "tileX": 5,
  "tileY": 10,
  "lod": 7,
  "val1": 200.1,
  "val2": 101.5,
  "val3": 50
}
lat, lng, and time are the query parameters and they will be queried as ranges.
For example fetch all the documents that have
lat_startkey = -70 & lat_endkey = -72 AND
lng_startkey = 50 & lng_endkey = 40 AND
time_startkey = 1531500769 & time_endkey = 1530500000
I will also query using time as a range, and tileX, tileY, lod as exact values
For example
tileX = 5 AND
tileY = 10 AND
lod = 7 AND
time_startkey = 1531500769 & time_endkey = 1530500000
I've been reading about Views (map/reduce), and I guess for the first type of query I could create a separate View for each of time, lat, and lng. My client could then perform three separate range queries, one against each View, and intersect (inner join) the resulting document IDs on the client. However, this obviously moves some of the processing outside of CouchDB, and I was hoping I could do it all within CouchDB itself.
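For concreteness, here is a minimal sketch of that per-field View plus client-side intersection approach (the view name and the shape of the fetched rows are assumptions, not code I have running):

// Map function for a hypothetical "by_time" view; "by_lat" and "by_lng" would be analogous.
function (doc) {
  if (doc.time !== undefined) {
    emit(doc.time, null);
  }
}

// Client side (Node.js): after fetching each view with ?startkey=...&endkey=...,
// intersect the three lists of rows on their document ids.
function intersectIds(rowsA, rowsB, rowsC) {
  const b = new Set(rowsB.map(r => r.id));
  const c = new Set(rowsC.map(r => r.id));
  return rowsA.map(r => r.id).filter(id => b.has(id) && c.has(id));
}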
I have also just found that Cloudant Search (JSON/Lucene) and N1QL exist... would these be of any help?

You should be able to handle queries like this with the N1QL query language with no problems. Note, however, that N1QL is only available for Couchbase, not for the CouchDB project that Couchbase grew out of.
For example, if I understand your first query there, you could write it like this in N1QL:
SELECT *
FROM datapoints
WHERE lat BETWEEN -72 AND -70 AND
      lng BETWEEN 40 AND 50 AND
      time BETWEEN 1530500000 AND 1531500769
To run such a query efficiently, you'll need an index, like this:
CREATE INDEX lat_long_time_idx ON datapoints(lat, lng, time)
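Your second query pattern (exact tileX/tileY/lod values plus a time range) would take the same shape. A sketch, again assuming the data lives in a keyspace called datapoints:

SELECT *
FROM datapoints
WHERE tileX = 5 AND
      tileY = 10 AND
      lod = 7 AND
      time BETWEEN 1530500000 AND 1531500769

-- with a matching index:
CREATE INDEX tile_time_idx ON datapoints(tileX, tileY, lod, time)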
You can find out more about N1QL here:
https://query-tutorial.couchbase.com/tutorial/#1

Sadly, CouchDB is extremely poor at handling these sorts of multi-dimensional queries. You can have views on any of the axes, but there is no easy way to retrieve the intersection, as you describe.
However, an extension called GeoCouch was written in the early days of that project to handle geospatial (lat/long) queries, and it has been included in the Cloudant platform that you seem to be using. That means you can query directly on the lat/long combination (using the GeoJSON format), just not on the time axis: https://console.bluemix.net/docs/services/Cloudant/api/cloudant-geo.html#cloudant-nosql-db-geospatial
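Roughly, such a geospatial index is declared in a design document and queried with a bounding box. The design document, index, and database names below are placeholders, so check the linked docs for the exact syntax:

{
  "_id": "_design/geodd",
  "st_indexes": {
    "geoidx": {
      "index": "function (doc) { if (doc.lat && doc.lng) { st_index({ type: \"Point\", coordinates: [doc.lng, doc.lat] }); } }"
    }
  }
}

GET /mydb/_design/geodd/_geo/geoidx?bbox=40,-72,50,-70&include_docs=true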
However, Cloudant also has another query system, Cloudant Query: https://console.bluemix.net/docs/services/Cloudant/api/cloudant_query.html#query
Under this system you can build an arbitrary index over your documents and then query for documents matching certain criteria. For example, this query selector will find documents with years in the range 1900-1903:
{
  "selector": {
    "year": {
      "$gte": 1900,
      "$lte": 1903
    }
  }
}
So it looks to me as if you could index the three values you care about (lat, lng, and time) and build a three-axis query in Cloudant. I have not tried that myself, however.
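A minimal sketch of what that might look like with Cloudant Query, using the field names and ranges from the question (the database name mydb is a placeholder, and I have not run this):

POST /mydb/_index
{
  "index": { "fields": ["lat", "lng", "time"] },
  "type": "json"
}

POST /mydb/_find
{
  "selector": {
    "lat": { "$gte": -72, "$lte": -70 },
    "lng": { "$gte": 40, "$lte": 50 },
    "time": { "$gte": 1530500000, "$lte": 1531500769 }
  }
}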

Related

MongoDB sort by custom calculation in Node.JS mongodb driver

I'm using the Node.js MongoDB driver. I have a collection of job listings with salary and number of vacancies, and I want to sort them according to one rule: if either the salary or the number of vacancies is greater, they get top priority in the sort. I came up with this simple formula:
( salary / 100 ) + num_of_vacancies
eg:
Top priority ones
{ salary: 5000 , num_of_vacancies: 500 } // value is 550
{ salary: 50000 , num_of_vacancies: 2 } // value is 502
And lower priority for
{ salary: 5000 , num_of_vacancies: 2 } // value is 52
But my problem is that, as far as I know, MongoDB's sort only takes a property to sort on and a direction (ascending or descending). How do I sort by a custom expression?
The data in MongoDB looks like this (not the full version):
{
title:"job title",
description:"job description",
salary:5000,
num_of_vacancy:50
}
This is just one option; adjust it for your MongoDB driver.
With $addFields we create the field to sort on, named toSortLater purely for semantic purposes.
Then add a $sort stage and sort high values first; change -1 to 1 for the opposite behaviour.
db.collection.aggregate([
  {
    $addFields: {
      toSortLater: {
        $add: [
          { $divide: ["$salary", 100] },
          "$num_of_vacancies"
        ]
      }
    }
  },
  { $sort: { "toSortLater": -1 } }
])
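Adapting that to the Node.js MongoDB driver might look roughly like this (the connection string, database, and collection names are assumptions):

const { MongoClient } = require('mongodb');

async function jobsByPriority() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    // Same pipeline as above: compute (salary / 100) + num_of_vacancies, then sort descending.
    return await client.db('mydb').collection('jobs').aggregate([
      { $addFields: { toSortLater: { $add: [{ $divide: ['$salary', 100] }, '$num_of_vacancies'] } } },
      { $sort: { toSortLater: -1 } }
    ]).toArray();
  } finally {
    await client.close();
  }
}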

Mongoose Query on Index Slow

I've got a query that runs quite often and that I've identified as slow. There is an index on every field I query, though no compound indexes.
The query looks something like:
var s = new Date(); // start timestamp, used to measure the query time below
ExternalLead.find({
  'price': { $gte: 3, $lt: 6 },
  'campaign.id': 'an id',
  createdOn: {
    $gte: new Date(moment().subtract(10, 'days')),
    $lte: new Date(moment().subtract(5, 'min'))
  }
}).limit(10).sort({ _id: -1 }).select('_id').exec(function (err, docs) {
  if (err) console.log(err);
  var st = new Date();
  console.log(st - s);
});
Simple query; there are about 50k records matching it if I remove the price condition. Price is indexed, I'm 100% sure; I've verified it multiple ways. If I remove price this query finishes in about 200ms; with price it takes about 20 seconds. I've tested multiple price ranges, and the first 10 documents it scans should be matches. Is there something about this query that prevents it from using the indexes?
Also, the server is about 3x what this database needs right now, so it's not a server issue. The entire database is loaded into RAM.
Node 6.11.2,
Mongoose: 4.10.8,
mongodb-core: 2.1.1
MongoDB: 3.4
Turns out we needed a compound index on price and createdOn.
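For reference, such a compound index could be declared either on the Mongoose schema or directly in the mongo shell (the schema variable and collection name below are assumptions):

// In the Mongoose schema definition for ExternalLead:
ExternalLeadSchema.index({ price: 1, createdOn: 1 });

// Or in the mongo shell:
db.externalleads.createIndex({ price: 1, createdOn: 1 });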

How should I attack a large GroupBy recordset in a JavaScript heavy stack?

I'm currently using Node.js and Firebase on a project, and I love both. My challenge is that I need to store millions of sales order rows that would look something like this:
{ companyKey: 'xxx',
orderKey : 'xxx',
rowKey : 'xxx',
itemKey : 'xxx',
orderQty: '5',
orderDate: '12/02/2015'
}
I'd like to query these records like the pseudocode below:
Select sum(orderQty) from mydb where companyKey = 'xxx' and itemKey = 'xxx' group by orderDate
For various reasons, such as those discussed in Firebase count group by, group-by in general can be a tough nut to crack. I've done it before using Oracle Materialized Views, but I'd like to use some kind of service that does all of that backend work for me, so I can CRUD those sales orders without worrying about maintaining the aggregation. I read in another Stack Overflow post that Keen.io might be a good approach to this problem.
How would the internet experts attack this problem if they were using a JavaScript heavy stack and they wanted an outside service to do aggregation by day for them?
A couple of points I'm considering. I'll update as they come up:
1) It seems I might have to take Keen.io off the list. It's $125 for 1M rows, and I don't need all the power Keen.io provides, only aggregation by day.
2) Going the Sequelize + PostgreSQL route seems to be a decent compromise: I can still use JavaScript with an ORM to alleviate the pain, and PostgreSQL hosting is usually cheap (see the sketch below).
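A minimal sketch of what that option 2 aggregation might look like with Sequelize (the model name, connection string, and column types are assumptions):

const { Sequelize, DataTypes } = require('sequelize');

const sequelize = new Sequelize('postgres://user:pass@localhost:5432/mydb');

const SalesOrderRow = sequelize.define('SalesOrderRow', {
  companyKey: DataTypes.STRING,
  orderKey: DataTypes.STRING,
  rowKey: DataTypes.STRING,
  itemKey: DataTypes.STRING,
  orderQty: DataTypes.INTEGER,
  orderDate: DataTypes.DATEONLY
});

// Equivalent of: select sum(orderQty) ... where companyKey = ? and itemKey = ? group by orderDate
function qtyByDay(companyKey, itemKey) {
  return SalesOrderRow.findAll({
    attributes: ['orderDate', [Sequelize.fn('SUM', Sequelize.col('orderQty')), 'totalQty']],
    where: { companyKey, itemKey },
    group: ['orderDate']
  });
}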
It sounds like you want to show a trend in sales of an item over time. That's a very good fit for an event data platform because showing trends over time is really native to the query language. In Keen IO, the idea of "grouping by time" is instead expressed as the concept of "timeframe" (e.g. previous_7_days) and "interval" (e.g. daily).
Here's how you would run that with a simple sum query in Keen:
var sum = new Keen.Query("sum", {
  event_collection: "sales",
  target_property: "orderQty",
  timeframe: "previous_12_weeks",
  interval: "weekly",
  filters: [
    {
      property_name: "companyKey",
      operator: "eq",
      property_value: "xxx"
    },
    {
      property_name: "itemKey",
      operator: "eq",
      property_value: "yyy"
    }
  ]
});
In fact you could calculate the sum for ALL of your companies and products in a single query by using group_by.
var sum = new Keen.Query("sum", {
  event_collection: "sales",
  target_property: "orderQty",
  timeframe: "previous_12_weeks",
  interval: "weekly",
  group_by: ["companyKey", "itemKey"]
});
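To actually execute either query you would run it through a Keen client, roughly like this (the project ID and read key are placeholders):

var client = new Keen({
  projectId: "YOUR_PROJECT_ID",
  readKey: "YOUR_READ_KEY"
});

client.run(sum, function (err, res) {
  if (err) return console.error(err);
  console.log(res.result); // one summed value per interval (and per group, if group_by was used)
});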
Keen recently updated their pricing. Depending on how often you query, a workload like this would be pretty light, in the tens of dollars per month even with millions of new transactions monthly.

Parse GeoPoint query slow and timed out using javascript sdk in node.js

I have the following parse query which times out when the number of records is large.
var query = new Parse.Query("UserLocation");
query.withinMiles("geo", geo, MAX_LOCATION_RADIUS);
query.ascending("createdAt");
if (createdAt !== undefined) {
  query.greaterThan("createdAt", createdAt);
}
query.limit(1000);
It runs OK if the UserLocation table is small, but the query times out from time to time when the table has ~100k records:
[2015-07-15 21:03:30.879] [ERROR] [default] - Error while querying for locations: [latitude=39.959064, longitude=-75.15846]: {"code":124,"message":"operation was slow and timed out"}
The UserLocation table has a latitude/longitude pair and a radius. Given a geo point (latitude, longitude), I'm trying to find the list of UserLocations whose circle (lat, long) + radius covers the given geo point. It doesn't seem like I can use the value from another column in the table as the distance for the query (something like query.withinMiles("geo", inputGeo, "radius"), where "geo" and "radius" are the column names for the GeoPoint and the radius). There is also the limit that "limit" combined with "skip" can return a maximum of 10,000 records (1,000 records at a time, skipping up to 10 times). So I had to do an almost full table scan by using "createdAt" as a filter criterion and keep querying until no more results come back, as sketched below.
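For reference, the paging workaround described above looks roughly like this (a sketch, not the exact code in use; the per-row radius check is elided):

async function scanUserLocations(geo) {
  var createdAt; // undefined on the first page
  var all = [];
  while (true) {
    var query = new Parse.Query("UserLocation");
    query.withinMiles("geo", geo, MAX_LOCATION_RADIUS);
    query.ascending("createdAt");
    if (createdAt !== undefined) {
      query.greaterThan("createdAt", createdAt);
    }
    query.limit(1000);
    var results = await query.find();
    if (results.length === 0) break; // no more pages
    all = all.concat(results);
    createdAt = results[results.length - 1].createdAt; // cursor for the next page
  }
  return all;
}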
Is there any way I can improve the algorithm so that it doesn't time out on a large data set?

MongoDB - too much data for sort() with no index. Full collection

I'm using Mongoose for Node.js to interface with the MongoDB driver, so my query looks like:
db.Deal
  .find({})
  .select({
    _id: 1,
    name: 1,
    opp: 1,
    dateUploaded: 1,
    status: 1
  })
  .sort({ dateUploaded: -1 })
And I get: "too much data for sort() with no index. add an index or specify a smaller limit"
The number of documents in the Deal collection is quite small, maybe ~500, but each one contains many embedded documents. The fields returned in the query above are all primitive, i.e. they aren't documents.
I currently don't have any indexes set up other than the default ones, and I've never had any issue until now. Should I try adding a compound key on:
{ _id: 1, name: 1, opp: 1, status: 1, dateUploaded: -1 }
Or is there a smarter way to perform the query? This is my first time using MongoDB.
From the MongoDB documentation on limits and thresholds:
MongoDB will only return sorted results on fields without an index if the combined size of all documents in the sort operation, plus a small overhead, is less than 32 megabytes.
Probably all the embedded documents push you over that limit; you should add an index on the sort field dateUploaded if you want to run the same query.
Otherwise you can limit your query and start paginating the results.
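A minimal sketch of adding that index, either via the Mongoose schema or directly in the mongo shell (the schema variable and collection name are assumptions):

// In the Mongoose schema definition for Deal:
DealSchema.index({ dateUploaded: -1 });

// Or in the mongo shell:
db.deals.createIndex({ dateUploaded: -1 });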
