CouchDB view map function that doesn't segregate reduce keys

Here is the doc "schema":
{
  type: "offer",
  product: "xxx",
  price: "14",
  valid_from: [2012, 7, 1, 0, 0, 0]
}
There are a lot of such documents, with valid dates in the past and the future, and often two or three offers in the same month. I can't find a way to write the following view: given a date, give me a list of the products and their running offer for that date.
I think I need to emit the valid_from field so I can set the endkey of the query to the given date, and then I need to reduce on the max of this field, which means I can't emit it.
Have I got it wrong? I am totally new to the map/reduce concept. Any suggestions on how to do it?

I'm really thrown by your comments about wanting to reduce; based on your requirements you want just a map function, no reduce. Here's the map function based on what you asked for:
function(d) {
  if (d.type === 'offer') {
    var dd = d.valid_from;
    dd[1] = ('0' + (dd[1] + 1)).slice(-2); // remove the +1 if 7 means July, not August
    dd[2] = ('0' + dd[2]).slice(-2);
    emit(dd.slice(0, 3).join('-'), null); // emit takes (key, value); no value is needed here
  }
}
Then to show all offers valid for a given day you'd query this view with params like:
endkey="2012-08-01"&include_docs=true
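The endkey query returns every offer that started on or before the given date, with rows sorted ascending by the date key, so picking the running offer per product takes one small client-side pass. A sketch (the function name and the truncated response shape are illustrative, not part of CouchDB's API):

```javascript
// Client-side sketch: rows arrive sorted by the date key, so the last row
// seen for each product is its offer running on the queried date.
function runningOffers(viewResponse) {
  const latest = {};
  for (const row of viewResponse.rows) {
    latest[row.doc.product] = row.doc; // later rows overwrite earlier ones
  }
  return latest;
}

// Example shape of a (truncated) view response, queried with include_docs=true:
const response = {
  rows: [
    { key: "2012-05-01", doc: { product: "xxx", price: "12" } },
    { key: "2012-06-15", doc: { product: "yyy", price: "9" } },
    { key: "2012-07-01", doc: { product: "xxx", price: "14" } }
  ]
};
console.log(runningOffers(response));
// → { xxx: { product: 'xxx', price: '14' }, yyy: { product: 'yyy', price: '9' } }
```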

Increment a field conditioned by a WHERE

I can't seem to figure out how to do this in Sequelize. I have an instance from findOne, and I want to increment one of its fields using an expression, and only under certain conditions. Something such as:
UPDATE Account SET balance = balance - 10 WHERE balance >= 10;
I want the db to calculate the expression, as this isn't happening in a transaction. So I can't do a SET balance = 32. (I could do SET balance = 32 WHERE balance = 42, but that's not as effective.) I don't want to put a CHECK in there, as there are other places where I do want to allow a negative balance.
(Our Sequelize colleague has left, and I can't figure out how to do this).
I see instance.increment and instance.decrement, but it doesn't look like they take a where object.
I don't see how to express setting balance = balance - 10, nor how to express the condition in the where object.
You are probably looking for Model.decrement instead of instance.decrement. instance.decrement updates one specific record, so a where clause doesn't make sense there.
Model.decrement: https://sequelize.org/master/class/lib/model.js~Model.html#static-method-decrement
The example in the link shows a scenario similar to yours.
============================================
Update:
This translates to your example.
const Op = require('sequelize').Op;

Account.decrement('balance', {
  by: 10,
  where: {
    balance: {
      [Op.gte]: 10
    }
  }
});
Based on @Emma's comments, here's what I have working. amountToAdd is just that; cash_balance is the field I'm incrementing. The check on cash_balance is meant to ensure that when I'm decrementing (that is, amountToAdd < 0), the balance doesn't go below 0. I need to muck around with that some.
const options = {
  where: {
    user_id: userId,
    cash_balance: {
      [Op.gte]: amountToAdd,
    },
  },
  by: amountToAdd,
};
const incrResults = await models.User.increment('cash_balance', options);
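To see why pushing the guard into the WHERE clause matters: the database evaluates the check and the update atomically, so no other writer can sneak in between them. A plain-JavaScript model of the semantics (illustrative only; `guardedDecrement` is not a Sequelize API):

```javascript
// Illustrative model of the guarded decrement: the update applies only when
// the balance can cover the amount; otherwise the row is left untouched.
// In SQL this check-and-update happens atomically inside the database.
function guardedDecrement(balance, amount) {
  return balance >= amount ? balance - amount : balance;
}

console.log(guardedDecrement(42, 10)); // 32 — guard passes
console.log(guardedDecrement(5, 10));  // 5  — guard fails, no change
```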

CouchDB Count Reduce with timestamp filtering

Let's say I have documents like so:
{
  _id: "a98798978s978dd98d",
  type: "signature",
  uid: "u12345",
  category: "cat_1",
  timestamp: UNIX_TIMESTAMP
}
My goal is to be able to count all signatures created by a certain uid while being able to filter by timestamp.
Thanks to Alexis, I've gotten this far with a _count reduce function:
function (doc) {
  if (doc.type === "signature") {
    emit([doc.uid, doc.timestamp], 1);
  }
}
With the following queries:
start_key=[null,lowerTimestamp]
end_key=[{},higherTimestamp]
reduce=true
group_level=1
Response:
{
"rows": [
{
"key": [ "u11111" ],
"value": 3
},
{
"key": [ "u12345" ],
"value": 26
}
]
}
It counts the uid correctly but the filter doesn't work properly. At first I thought it might be a CouchDB 2.2 bug, but I tried on Cloudant and I got the same response.
Does anyone have any ideas on how I could get this to work while being able to filter timestamps?
When using compound keys in MapReduce (i.e. the key is an array of things), you cannot query a range of keys with a "leading" array element missing. That is, you can query a range of uids and get the results ordered by timestamp, but your use case is the other way round: you want to query uids by time.
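A quick way to see this: compound keys compare element by element, left to right, so the second element only matters when the first elements tie. A simplified sketch of that ordering (CouchDB's real collation also handles mixed types and Unicode; `cmp` here is only an illustration):

```javascript
// Simplified sketch of compound-key ordering: compare element by element,
// left to right; shorter arrays sort before their extensions.
function cmp(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// Keys emitted as [uid, timestamp]:
const key = ["u12345", 50];       // timestamp 50, outside the wanted range
const startKey = ["u00000", 100]; // want timestamps >= 100
const endKey = ["zzzzzz", 200];   // ...and <= 200

// The key still falls inside the range, because ordering is decided by the
// uid first; the timestamp only breaks ties between identical uids.
console.log(cmp(startKey, key) < 0 && cmp(key, endKey) < 0); // true
```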
I'd be tempted to put time first in the array, but unix timestamps are not so good for grouping ;). I don't know the ins and outs of your application, but if you were to index a date instead of a timestamp, like so:
function (doc) {
  if (doc.type === "signature") {
    // if doc.timestamp is in seconds, use new Date(doc.timestamp * 1000)
    var date = new Date(doc.timestamp);
    var datestr = date.toISOString().split('T')[0];
    emit([datestr, doc.uid], 1);
  }
}
This would allow you to query a range of dates (to the resolution of a whole day):
?startkey=["2018-01-01"]&endkey=["2018-02-01"]&group_level=2
albeit with your uids grouped by day.
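The reason this works is that ISO date strings collate in the same order as the dates they represent, so startkey/endkey string ranges behave as date ranges. A small sketch (`toDateKey` is an illustrative helper; note the seconds-vs-milliseconds caveat for JS dates):

```javascript
// "YYYY-MM-DD" keys work for range queries because ISO date strings sort
// lexicographically in the same order as the dates themselves.
// If the source timestamp is in seconds, multiply by 1000 first.
function toDateKey(tsMillis) {
  return new Date(tsMillis).toISOString().split('T')[0];
}

const a = toDateKey(Date.UTC(2018, 0, 15)); // "2018-01-15"
const b = toDateKey(Date.UTC(2018, 1, 1));  // "2018-02-01"
console.log(a < b); // true — string order matches date order
```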

ArangoDB Faceted Search Performance

We are evaluating ArangoDB performance in space of facets calculations.
There are number of other products capable of doing the same, either via special API or query language:
MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting etc
We understand there is no special API in Arango to calculate facets explicitly.
But in practice that isn't needed; thanks to the comprehensive AQL, it can be easily achieved with a simple query, like:
FOR a IN Asset
  COLLECT attr = a.attribute1 INTO g
  RETURN { value: attr, count: LENGTH(g) }
This query calculates a facet on attribute1 and yields frequencies in the form of:
[
  { "value": "test-attr1-1", "count": 2000000 },
  { "value": "test-attr1-2", "count": 2000000 },
  { "value": "test-attr1-3", "count": 3000000 }
]
It says that, across the entire collection, attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided.
Essentially we run a DISTINCT query and aggregate the counts.
Looks simple and clean. With only one, but really big, issue: performance.
The query above runs for 31 (!) seconds on a test collection of only 8M documents.
We have experimented with different index types and storage engines (with RocksDB and without) and investigated explain plans, to no avail.
The test documents we use are very concise, with only three short attributes.
We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB simply is not designed to perform in this particular area.
By the way, the ultimate goal would be to run something like the following in sub-second time:
LET docs = (
  FOR a IN Asset
    FILTER a.name LIKE 'test-asset-%'
    SORT a.name
    RETURN a
)
LET attribute1 = (
  FOR a IN docs
    COLLECT attr = a.attribute1 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute2 = (
  FOR a IN docs
    COLLECT attr = a.attribute2 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute3 = (
  FOR a IN docs
    COLLECT attr = a.attribute3 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute4 = (
  FOR a IN docs
    COLLECT attr = a.attribute4 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: { from: 0, to: 5, total: LENGTH(attribute1) },
      attribute2: { from: 5, to: 10, total: LENGTH(attribute2) },
      attribute3: { from: 0, to: 1000, total: LENGTH(attribute3) },
      attribute4: { from: 0, to: 1000, total: LENGTH(attribute4) }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
  facets: {
    attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
    attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
    attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
    attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
  }
}
Thanks!
It turns out the main thread happened on the ArangoDB Google Group.
Here is a link to the full discussion
Here is a summary of the current solution:
Run a custom build of Arango from a specific feature branch where a number of performance improvements have been made (hopefully they make it into a main release soon)
No indexes are required for facet calculations
MMFiles is the preferred storage engine
AQL should be written as "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"
AQL should be split into smaller pieces and run in parallel (we are using Java 8's Fork/Join to spread the facet AQLs and then join them into a final result)
One AQL to filter/sort and retrieve the main entity (if required; while sorting/filtering, add a corresponding skiplist index)
The rest are small AQLs, one per facet's value/frequency pairs
In the end we gained a >10x performance improvement compared to the original AQL above.

Mongoose query returning repeated results

The query receives a pair of coordinates, a maxDistance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations for the given position. There is no visible error in my code; however, when I call the query again, it returns repeated results. The "skip" variable is updated according to the number of results returned.
Example:
1) I make the query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) The query is called again, now with skip = 10, limit = 10. I receive another 10 locations, with results repeated from the first query.
QUERY
Locations.find({
  coordinates: {
    $near: [x, y],
    $maxDistance: maxDistance
  }
})
  .sort('date_created')
  .skip(skip)
  .limit(limit)
  .exec(function (err, locations) {
    console.log("[+]Found Locations");
    callback(locations);
  });
SCHEMA
var locationSchema = new Schema({
  date_created: { type: Date },
  coordinates: [],
  text: { type: String }
});
I have tried looking everywhere for a solution. Could it be the versions of Mongo? I use Mongoose 4.x.x and MongoDB 2.5.6, I believe. Any ideas?
There are a couple of things to consider here in the sort of results that you want, with the first consideration being that you have a "secondary" sort criteria in the "date_created" to deal with.
The basic problem there is that the $near operator and like operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location, and simply just "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are therefore other options for this.
The second case is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. It's better to select data based on a "range" where it occurs rather than "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The aggregation command of $geoNear is good for this, and especially since we want to do other sorting:
var seenIds = [],
    lastDistance = null,
    lastDate = null;

Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "distanceField": "dist",
      "limit": 10
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function (err, results) {
    results.forEach(function (result) {
      if ((result.dist != lastDistance) || (result.date_created != lastDate)) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
);
That is the first iteration of your results where you fetch the first 10. Noting the logic inside the loop, where each document in the results is inspected for either a change in the "date_created" or the projected "dist" field now present in the document and where this occurs the "seenIds" array is wiped of all current entries. The general action is that all the variables are tested and possibly updated on each iteration and where there is no change then items are added to the list of "seenIds".
All those three variables being worked on need to be stored somewhere awaiting the next request. For web applications the session store is ideal, but different approaches vary. You just want those values to be recalled when we start the next request, as on the next and subsequent iterations we alter the query a bit:
Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "minDistance": lastDistance,
      "distanceField": "dist",
      "limit": 10,
      "query": {
        "_id": { "$nin": seenIds },
        "date_created": { "$lt": lastDate }
      }
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function (err, results) {
    results.forEach(function (result) {
      if ((result.dist != lastDistance) || (result.date_created != lastDate)) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
);
So there the "minDistance" parameter is entered as you want to exclude any of the "nearer" results that have already been seen, and additional checks are placed in the query: "date_created" needs to be "less than" the "lastDate" recorded, since we are in descending order of sort, with the final "sure" filter excluding any "_id" values recorded in the list because the values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow as generally you are not going to find things all at the same distance, but it is a general process of paging a sorted list of data like this, so it is worth understanding the concept.
So if you want to be able to use a secondary field to sort on with geospatial data and also considering the "near" distance then this is the general approach, by projecting a distance value into the document results as well as storing the last seen values before any changes that would not make them unique.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.

Creating a couchdb view that returns an array of unique values from a document set

I have a couchdb database filled with time-stamped documents so the format of a given document is something like this:
{
  id: "uniqueid",
  year: 2011,
  month: 3,
  day: 31,
  foo: "whatever",
  bar: "something else"
}
I would like to construct a set of views such that a given key will return an array of year, month or day values for which documents exist. For example given the view name Days, I would like the following view url
/db/_design/designdoc/_view/Days?key=[2011,3]
to return an array of all the days in March of 2011 for which documents exist. For example, if March 2011 had some number of documents falling on six days, it might look like:
[1, 2, 5, 15, 27, 31]
Similarly,
/db/_design/designdoc/_view/Months?key=2011
If 2011 had some number of documents falling on April, May, and September, it might look like:
[4, 5, 9]
And
/db/_design/designdoc/_view/Years
will return an array of years in the whole database. If the documents have this year and last, it might look like:
[2010, 2011]
I gather it is difficult to write a reduce function that returns an array, because you run into reduce overflow errors as the document count increases. I know this because I wrote a reduce function that worked but started throwing reduce overflow errors after I loaded it up with documents.
One solution I have examined is just creating a view without a reduce that creates an array key [year, month, day] and then using startkey and endkey parameters on the view to return documents. The problem with this approach is how it scales. Say my database has thousands of documents spread over two years; using this view, I would need to iterate over the entire set of documents just to discover those two years.
I believe this question is trying to ask the same thing though I am not quite sure so I figured I'd add a new question. Also, the answers given on that question do not avoid reduce overflow errors for larger document sets, as far as I could tell with my limited view writing skills.
I think for this you need to construct your views not only with maps, but also with reduces.
Disregarding eventual scaling problems, there are two solutions. I will cover only Days, since the answers for Months and Years are similar.
Solution 1:
view Days:
map:
function(doc) {
  if (doc.year && doc.month && doc.day) {
    emit([doc.year, doc.month, doc.day], 1);
  }
}
reduce:
function(keys, values) {
  return sum(values);
}
list listDays:
function(head, req) {
  start({
    "headers": {
      "Content-Type": "text/plain"
    }
  });
  var row;
  var days = [];
  while (row = getRow()) {
    days.push(row.key[2]);
  }
  var daysString = days.join(',');
  send('[' + daysString + ']');
}
http call:
http://couch/db/_design/db/_list/listDays/Days?group=true&group_level=3&startkey=[2011,3]&endkey=[2011,3,{}]
Solution 2:
view Days:
map:
function(doc) {
  if (doc.year && doc.month && doc.day) {
    emit([doc.year, doc.month, doc.day], null);
  }
}
list listDays:
function(head, req) {
  start({
    "headers": {
      "Content-Type": "text/plain"
    }
  });
  var row;
  var days = [];
  while (row = getRow()) {
    if (days.indexOf(row.key[2]) == -1) { days.push(row.key[2]); }
  }
  var daysString = days.join(',');
  send('[' + daysString + ']');
}
http call:
http://couch/db/_design/db/_list/listDays/Days?startkey=[2011,3]&endkey=[2011,3,{}]
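As a side note, instead of a list function you could let the client do the final step: query the view with the built-in _count reduce and group_level=3, so each row key is a full [year, month, day] array, and extract the distinct days from the rows. A sketch (the function name and response shape are illustrative):

```javascript
// Client-side sketch: given grouped view rows (group_level=3, so each key is
// a full [year, month, day] array), collect the day values; grouping already
// made them distinct, one row per day.
function daysFromRows(rows) {
  return rows.map(function (row) { return row.key[2]; });
}

// Example shape of a grouped response for March 2011:
const rows = [
  { key: [2011, 3, 1], value: 4 },
  { key: [2011, 3, 5], value: 1 },
  { key: [2011, 3, 27], value: 2 }
];
console.log(daysFromRows(rows)); // [1, 5, 27]
```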
