From http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
The CouchDB reduce function is defined as:
function (keys, values, rereduce) {
return sum(values);
}
keys will be an array whose elements are arrays of the form [key, id], and
values will be an array of the values emitted for the respective elements in keys,
i.e. reduce([ [key1,id1], [key2,id2], [key3,id3] ], [value1,value2,value3], false)
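On a rereduce the function is called with keys set to null and values set to the results of previous reduce calls,
i.e. reduce(null, [reduced1, reduced2, reduced3], true)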
I am having trouble understanding when/why the array of keys would contain different key values. If the array of keys does contain different key values, how would I deal with it?
As an example, assume that my database contains movements between accounts of the form:
{"amount":100, "CreditAccount":"account_number", "DebitAccount":"account_number"}
I want a view that gives the balance of an account.
My map function does:
function (doc) {
  emit(doc.CreditAccount, doc.amount);
  emit(doc.DebitAccount, -doc.amount);
}
My reduce function does:
return sum(values);
I seem to get the expected results, however I can't reconcile this with the possibility that my reduce function gets different key values.
Is my reduce function supposed to group key values first? What kind of result would I return in that case?
By default, Futon "groups" your results, which means you get a fresh reduce per key—in your case, an account. The group feature is for exactly this situation.
Over the raw HTTP API, you will get one total reduce for all accounts, which is probably not useful. So remember to use group=true in your own application to be sure you get summaries per account.
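For example, a minimal sketch of querying such a view from Node 18+ with fetch (the database, design document, and view names here are placeholders):
async function getBalances() {
  // group=true makes CouchDB run the reduce once per distinct key, i.e. per account.
  const url = 'http://localhost:5984/accounting/_design/balances/_view/by_account?group=true';
  const res = await fetch(url);
  const { rows } = await res.json();
  // rows looks like [{ key: "account_number", value: balance }, ...]
  return rows;
}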
I want to map a timestamp t and an identifier id to a certain state of an object. I can do so by mapping a tuple (t,id) -> state_of_id_in_t. I can use this mapping to access one specific (t,id) combination.
However, sometimes I want to know all states (with matching timestamps t) of a specific id (i.e. id -> a set of (t, state_of_id_in_t)) and sometimes all states (with matching identifiers id) of a specific timestamp t (i.e. t -> a set of (id, state_of_id_in_t)). The problem is that I can't just put all of these in a single large matrix and do a linear search based on what I want. The number of (t,id) tuples for which I have states is very large (1M+) and very sparse (some timestamps have many states, others none, etc.). How can I make such a dict, which can deal with accessing its contents by partial keys?
I created two distinct dicts, dict_by_time and dict_by_id, which are dicts of dicts. dict_by_time maps a timestamp t to a dict of ids, each of which points to a state. Similarly, dict_by_id maps an id to a dict of timestamps, each of which points to a state. This way I can access a state or a set of states however I like. Notice that the 'leaves' of both dicts (dict_by_time and dict_by_id) point to the same objects, so it's just the way I access the states that differs; the states themselves are the same Python objects.
dict_by_time = {'t_1': {'id_1': 'some_state_object_1',
                        'id_2': 'some_state_object_2'},
                't_2': {'id_1': 'some_state_object_3',
                        'id_2': 'some_state_object_4'}}
dict_by_id = {'id_1': {'t_1': 'some_state_object_1',
                       't_2': 'some_state_object_3'},
              'id_2': {'t_1': 'some_state_object_2',
                       't_2': 'some_state_object_4'}}
Again, notice the leaves are shared across both dicts.
I don't think it is good to do it using two dicts, simply because maintaining both of them when adding new timestamps or identifiers results in double work and could easily lead to inconsistencies when I do something wrong. Is there a better way to solve this? Complexity is very important, which is why I can't just do manual searching and need to use some sort of HashMap magic.
You can always trade insertion complexity for lookup complexity. Instead of using a single dict, you can create a class with an add method and a lookup method. Internally, you can keep track of the data using three different dictionaries: one keyed by the (t, id) tuple, one keyed by t, and one keyed by id. Depending on the arguments given to lookup, you return the result from the corresponding dictionary.
Sorry if my terminology is wrong, but I have a list of feed hashes,
e.g. feed:1, feed:2, feed:3. Inside those hashes I have some keys and values, e.g. inside feed:1 I have likes:300.
I have a list called feeds:fid which lists all the feed ids. So if I want to grab all the feeds I can just use a method like this in my node.js code:
module.getObjects = function(keys, callback) {
helpers.multiKeys(redisClient, 'hgetall', keys, callback);
};
I am not sure how I can sort them so that I get all feed items sorted by most liked. Ideally I just want to get the "hottest" feed items.
How can I go about this in Redis?
This would be difficult to do with your current data layout.
You can, however, use a single sorted set to store likes along with feed ids.
So, whenever a like happens, you store the like in your hash, but also do a ZINCRBY operation on the same feed id in the sorted set.
-- At any point in time the sorted set will contain the feed ids as members, with the number of likes on each feed as its score.
-- To get the top or hottest feeds, you just do a ZREVRANGE operation, which will give you the top N items with the most likes.
-- To keep both operations atomic, use a Redis transaction (MULTI/EXEC) so that the hash and the sorted set always stay in sync, as in the sketch below.
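A minimal sketch with the node_redis callback API (the key names feeds:likes and feed:<id> and the helper names are placeholders, not from the question):
// Record a like atomically in both the feed hash and the sorted set of like counts.
function addLike(redisClient, feedId, callback) {
  redisClient.multi()
    .hincrby('feed:' + feedId, 'likes', 1)   // bump the like count in the hash
    .zincrby('feeds:likes', 1, feedId)       // bump the score in the sorted set
    .exec(callback);
}
// Fetch the n hottest feed ids (highest like counts first), with their scores.
function getHottestFeeds(redisClient, n, callback) {
  redisClient.zrevrange('feeds:likes', 0, n - 1, 'WITHSCORES', callback);
}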
With a CouchDB view, we get results ordered by key. I have been using this to get the value associated with the highest key. For example, take this result (in key: value form):
{1:'sam'}
{2:'jim'}
{4:'joan'}
{5:'jill'}
CouchDB will sort those according to the key. (It could be helpful to think of the key as the "score".) I want to find out who has the highest or lowest score.
I have written a reduce function like so:
function(keys, values) {
var len = values.length;
return values[len - 1];
}
I know there's _stats and the like, but these are not possible in my application (this is a slimmed-down, hypothetical example).
Usually when I run this reduce, I will get either 'sam' or 'jill' depending on whether descending is set. This is what I want. However, in large data sets, I sometimes get someone from the middle of the list.
I suspect this is happening on rereduce. I had assumed that when rereduce has been run, the order of results is preserved. However, I can find no assurances that this is the case. I know that on rereduce, the key is null, so by the normal sorting rules they would not be sorted. Is this the case?
If so, any advice on how to get my highest scorer?
Yeah, I don't think sorting order is guaranteed, probably because it cannot be guaranteed in clustered environments. I suspect the way you're using map/reduce here is a little iffy, but you should post your view code if you really want a good answer here.
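A sketch of one rereduce-safe approach: carry the score along with the name so the maximum can be recomputed at every level instead of relying on row order (the field names here are illustrative):
function (keys, values, rereduce) {
  // First pass: keys is [[score, docid], ...] and values are the emitted names.
  // Rereduce: keys is null and values are the objects returned below.
  var best = null;
  if (!rereduce) {
    for (var i = 0; i < keys.length; i++) {
      if (best === null || keys[i][0] > best.score) {
        best = { score: keys[i][0], name: values[i] };
      }
    }
  } else {
    for (var j = 0; j < values.length; j++) {
      if (best === null || values[j].score > best.score) {
        best = values[j];
      }
    }
  }
  // best.name is the highest scorer no matter how CouchDB splits the rows.
  return best;
}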
CouchDB's map functions emit key/value pairs:
function(doc) {
emit(doc.date, 1);
}
Potentially, there could be many key/value pairs with the same key. Setting group=true while querying a view groups key/value pairs with the same key into the same reduce:
function(keys, values, rereduce) {
return sum(values);
}
Does this mean that with group=true (or for any group_level > 0), there will be exactly one reduce per key?
Or does the grouping only guarantee that all reduces will have homogeneous keys, and that there could still be one or more rereduces?
I am working with a reduce function that is not commutative, but which will not have a large number of records per key. I was hoping that I would be able to set group=true and then control the order of operation within a single reduce. If there will be rereduces, then that plan does not make sense.
group=true roughly means "Hey, Couch! Group this map so that every distinct key gets its own reduce, and don't miss any of them!" and is effectively equivalent to group_level=999 (see the docs).
While with group_level you might guess wrong and strip away parts of the key (which only makes sense when the key is an array), group takes care of this for you, and rereduce wouldn't be applied.
Also, your reduce function could be replaced with the built-in _sum, which is implemented in Erlang and is much faster.
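For illustration, a design document using the built-in _sum (the names example and by_date are placeholders), meant to be queried with group=true so each distinct key gets exactly one row:
{
  "_id": "_design/example",
  "views": {
    "by_date": {
      "map": "function (doc) { emit(doc.date, 1); }",
      "reduce": "_sum"
    }
  }
}
GET /db/_design/example/_view/by_date?group=true then returns one row per distinct doc.date, with the value being the count of documents emitted for that date.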
I have some documents with a "status" field of "Green", "Red", "Amber".
I'm sure it's possible to use MapReduce to produce a grouped response containing three keys (one for each status), each with a value containing an array of all the documents with that key. However, I'm struggling with how to use the (re)reduce functions.
Map function:
function(doc) {
emit(doc.status, doc);
}
Reduce function: ???
This is not a problem that reduce is intended to solve; reduce in CouchDB is for aggregation.
If I understand you correctly, you want this:
Map:
function(doc) {
  emit(doc.status, null);
}
You can then find all docs of status Green with:
/_design/foo/_view/bar?key="Green"&include_docs=true
This will return a list of all docs with that status. If you wish to find docs of more than one status in a single query, use an HTTP POST with a body of this form:
{"keys":["Green", "Red"]}
HTH,
B.
Generally speaking, you will not use a reduce function to obtain your list of documents. A reduce is meant to take a list and reduce it to a single value; in fact, there is an upper limit to the size of a reduce value, and using entire documents will trigger a reduce_overflow error. Examples of reduces are counts, sums, averages, etc. Stick with the map query, and you will have your values collated and sorted by the status value.
On another, possibly unrelated note, I would not emit the document with your view. You can just use the include_docs view query parameter and achieve the same effect, while saving disk space in the process. The trade-off is that internally the docs have to be retrieved one by one, but since they are already indexed by _id, it's usually a negligible difference.