Referencing external doc in CouchDB view - couchdb

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records that differ.
However, I cannot quite figure out how to pull another doc into the view. Everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare them. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  // lookup() is hypothetical: it stands for "fetch another doc by id"
  if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = Object.keys(doc);
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0, 1) !== '_') && (key.slice(0, 1) !== '$') && (key !== 'expires')) {
          // compare serialized values, since JavaScript has no built-in deep equality
          if (JSON.stringify(otherDoc[key]) !== JSON.stringify(doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}

Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities:
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option better, but only if you don't plan to use the database as is in production, i.e., you wouldn't want to carry around all that extra data in each record.
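As a rough sketch of that last option, a map function over the merged docs might look like the following. The field names batch_one and batch_two come from the structure above. The emit stub, the rows array, the sameValue helper and the sample docs are illustrative assumptions so the snippet can run outside CouchDB. The JSON.stringify comparison also assumes both batches serialize nested fields in the same order.

```javascript
// Stand-in for CouchDB's emit(), so the map function can be tried in Node.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical helper: compares two JSON values by their serialized form.
// Assumes both batches write object fields in the same order.
function sameValue(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Map function: emits the _id of every doc whose two batches disagree.
function mismatchMap(doc) {
  if (!doc.batch_one || !doc.batch_two) {
    return; // only one scrape has landed so far
  }
  var keys = Object.keys(doc.batch_one).concat(Object.keys(doc.batch_two));
  var differs = keys.some(function (key) {
    if (key === 'expires') return false; // volatile field, skipped as in the question
    return !sameValue(doc.batch_one[key], doc.batch_two[key]);
  });
  if (differs) {
    emit(doc._id, 1);
  }
}

// Sample docs, made up for illustration.
[
  { _id: 'a', batch_one: { n: 1, expires: 10 }, batch_two: { n: 1, expires: 20 } },
  { _id: 'b', batch_one: { n: 1 }, batch_two: { n: 2 } }
].forEach(mismatchMap);
```

Inside a design document the helper would have to be defined within the map function itself, since each view function is stored as a single function expression.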
Hope that helps.
Cheers.

Related

Couchdb filter using reduce functions/linked documents

Considering:
doc profile
{
_id:"1",
name:"john",
likes: ["2222","1111"]
}
doc likes
{
_id:"2222",
value:"true"
}
{
_id:"1111",
value:"false"
}
I have a filter in my Xamarin app to get the profile, and it works well, but I need to include the "children" (linked) docs. I can do this with a view by setting include_docs=true, but I want CouchDB to filter so I can use replication.
It would also be possible to accomplish the same result if I could use a reduce function to filter data, but I can't make the filter use the reduce function. Any ideas?
the expected result would be:
doc profile
{
_id:"1",
name:"john",
likes: [
{_id:"2222",
value:"true"},
{_id:"1111",
value:"false"}
]
}
Thanks!
I can do this with a view setting include_docs=true but I want couchdb to filter so I can use replication
You might already know this but you can use couchdb views as filters.
Also, it would be possible to accomplish the same result if I could use a reduce function to filter data
The reduce function is for "reducing" the values that are returned by the map function. The map function returns a key and a value like so:
emit(key,value)
The reduce function only gets the keys and the values that are returned from a map function. For example if you call a view with
?key=abc
and it returns results like
[{
_id:...,
type: abc
},
{
_id:...,
type:abc
}
....
]
You already have all the documents filtered by the key "abc". The reduce function gets as inputs the keys, the values and a rereduce parameter. If you use the reduce function as a post-map processing step to further filter the results from the view, there will be two problems:
There is no way to pass a parameter to a reduce. The keys that you specify will only be used by the map function and then passed as they are to reduce.
It is not a good idea anyway. With reduce you want to return a small value that aggregates the results you get from a view. Taking the above example, if the map function emits an integer as its value (the value in emit(key, value)), the reduce function may return a sum or another aggregate of those values. But returning a modified document is not what a reduce function is for. From the docs:
"A reduce function must reduce the input values to a smaller output value. If you are building a composite return structure in your reduce, or only transforming the values field, rather than summarizing it, you might be misusing this feature. "
List functions might be more suited to what you are trying to do. If you want to process the results of the view query before returning them, they are the way to go.
In list functions you get a set of results returned by the view function. You can even pass additional parameters if you'd like to apply complex filters on them. But you won't be able to use list functions for replication.
Finally, replication works at the document level. Each document has a _rev field that the replicator uses to check which revision it holds before replicating. So you won't be able to replicate the results returned by a view; only the documents themselves are replicated.
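To illustrate the earlier point that reduce should only summarize, here is a minimal sketch of a summing reduce. It assumes the map function emits an integer value. The sum helper is defined here as a stand-in for the one available inside CouchDB view functions, and the sample inputs are made up.

```javascript
// Stand-in for the sum() helper available inside CouchDB view functions.
function sum(values) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

// Reduce function: collapses many integer values into one number.
// On the first pass `values` holds values emitted by map; on rereduce it
// holds the outputs of earlier reduce calls. Summing works for both cases.
function sumReduce(keys, values, rereduce) {
  return sum(values);
}

// First pass over two chunks of map output, then a rereduce over the partial sums.
var partialA = sumReduce([['abc', 'id1'], ['abc', 'id2']], [1, 1], false);
var partialB = sumReduce([['abc', 'id3']], [1], false);
var total = sumReduce(null, [partialA, partialB], true);
```

The output stays a single number however many rows go in, which is the shape of result reduce is meant for. For plain sums CouchDB also offers a built-in _sum reduce.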

CouchDB: Single document vs "joining" documents together

I'm trying to decide the best approach for a CouchApp (no middleware). Since there are similarities to my idea, let's assume we have a Stack Overflow page stored in CouchDB. In essence it consists of the actual question on top, answers and comments. Those are basically three layers.
There are two ways of storing it. Either within a single document containing a suitable JSON representation of the data, or store each part of the entry within a separate document combining them later through a view (similar to this: http://www.cmlenz.net/archives/2007/10/couchdb-joins)
Now, both approaches may be fine, yet both have massive downsides from my current point of view. Storing a busy document (many changes by multiple users are expected) as a single entity would cause conflicts. If user A stores his/her changes to the document, user B would receive a conflict error once he/she is finished typing his/her update. I can imagine it's possible to fix this without the user's knowledge by re-downloading the document before retrying.
But what if the document is rather big? I expect them to become rather bloated over time, which would put a noticeable delay on the save process, especially if the retry has to happen multiple times because many users are updating a document at the same time.
Another problem I'd see is editing. Every user should be allowed to edit his/her contributions. Now, if they're stored within one document it might be hard to write a solid auth handler.
OK, now let's look at the multiple documents approach. Question, answers and comments would be stored within their own documents. Advantage: only the actual owner of the document can cause conflicts, something that won't happen too often. Being rather small elements of the whole, re-downloading wouldn't take much time. Furthermore, the auth routine should be quite easy to implement.
Now here's the downside. The single document is really easy to query and display. Having a lot of unsorted snippets lying around seems messy, since I didn't manage to get the view to present me with a 100% ready-to-use JSON object containing the entire item in an ordered and structured format.
I hope I've been able to communicate the actual problem. I try to decide which solution would be more suitable for me, which problems easier to overcome. I imagine the first solution to be the prettier one in terms of storage and querying, yet the second one the more practical one solvable through better key management within the view (I'm not entirely into the principle of keys yet).
Thank you very much for your help in advance :)
Go with your second option. It's much easier than having to deal with the conflicts. Here are some example docs how I might structure the data:
{
  _id: '12345',
  type: 'question',
  slug: 'couchdb-single-document-vs-joining-documents-together',
  markdown: 'Im tryting to decide the best approach for a CouchApp (no middleware). Since there are similarities to...',
  user: 'roman-geber',
  date: 1322150148041,
  'jquery.couch.attachPrevRev': true
}
{
  _id: '23456',
  type: 'answer',
  question: '12345',
  markdown: 'Go with your second option...',
  user: 'ryan-ramage',
  votes: 100,
  date: 1322151148041,
  'jquery.couch.attachPrevRev': true
}
{
  _id: '45678',
  type: 'comment',
  question: '12345',
  answer: '23456',
  markdown: 'I really like what you have said, but...',
  user: 'somedude',
  date: 1322151158041,
  'jquery.couch.attachPrevRev': true
}
To store revisions of each one, I would store the old versions as attachments on the doc being edited. If you use the jquery client for couchdb, you get it for free by adding the jquery.couch.attachPrevRev = true. See Versioning docs in CouchDB by jchris
Create a view like this
fullQuestion : {
map : function(doc) {
if (doc.type == 'question') emit([doc._id, null, null], null);
if (doc.type == 'answer') emit([doc.question, doc._id, null], null);
if (doc.type == 'comment') emit([doc.question, doc.answer, doc._id], null) ;
}
}
And query the view like this
http://localhost:5984/so/_design/app/_view/fullQuestion?startkey=["12345"]&endkey=["12345",{},{}]&include_docs=true
(Note: I have not url encoded this query, but it is more readable)
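For reference, the encoded form of such a query could be built along these lines. This is only a sketch: viewUrl is a hypothetical helper, not part of any CouchDB client, and it assumes the document IDs are strings. The key parameters have to be valid JSON (double-quoted strings) before they are URL-encoded.

```javascript
// Parameters whose values CouchDB expects as JSON.
var JSON_PARAMS = ['key', 'startkey', 'endkey'];

// Hypothetical helper: builds a view query URL with encoded parameters.
function viewUrl(base, params) {
  var query = Object.keys(params).map(function (name) {
    var value = params[name];
    var text = JSON_PARAMS.indexOf(name) !== -1 ? JSON.stringify(value) : String(value);
    return name + '=' + encodeURIComponent(text);
  });
  return base + '?' + query.join('&');
}

var url = viewUrl('http://localhost:5984/so/_design/app/_view/fullQuestion', {
  startkey: ['12345'],
  endkey: ['12345', {}, {}],
  include_docs: true
});
```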
This will get you all of the related documents for the question that you will need to build the page. The only thing is that they will not be sorted by date. You can sort them on the client side (in javascript).
EDIT: Here is an alternative option for the view and query
Based on your domain, you know some facts. You know an answer can't exist before its question exists, and a comment on an answer can't exist before the answer exists. So let's make a view that might make it faster to create the display page, respecting the order of things:
fullQuestion : {
map : function(doc) {
if (doc.type == 'question') emit([doc._id, doc.date], null);
if (doc.type == 'answer') emit([doc.question, doc.date], null);
if (doc.type == 'comment') emit([doc.question, doc.date], null);
}
}
This will keep all the related docs together, and keep them ordered by date. Here is a sample query
http://localhost:5984/so/_design/app/_view/fullQuestion?startkey=["12345"]&endkey=["12345",{}]&include_docs=true
This will get back all the docs you will need, ordered from oldest to newest. You can now zip through the results, knowing that the parent objects will be before the child ones, like this:
function addAnswer(doc) {
$('.answers').append(answerTemplate(doc));
}
function addCommentToAnswer(doc) {
$('#' + doc.answer).append(commentTemplate(doc));
}
$.each(results.rows, function(i, row) {
if (row.doc.type == 'question') displayQuestionInfo(row.doc);
if (row.doc.type == 'answer') addAnswer(row.doc);
if (row.doc.type == 'comment') addCommentToAnswer(row.doc);
});
So then you don't have to perform any client-side sorting.
Hope this helps.

How do I design a couchdb view for following case ?

I am migrating an application from MySQL to CouchDB. (Okay, please don't pass judgement on this.)
There is a function with signature
getUserBy($column, $value)
Now you can see that in case of SQL it is a trivial job to construct a query and fire it.
However, as far as CouchDB is concerned, I am supposed to write views with map functions.
Currently I have many views such as
get_user_by_name
get_user_by_email
and so on. Can anyone suggest a better and yet scalable way of doing this ?
Sure! One of my favorite views, for its power, is by_field. It's a pretty simple map function.
function(doc) {
// by_field: map function
// A single view for every field in every document!
var field, key;
for (field in doc) {
key = [field, doc[field]];
emit(key, 1);
}
}
Suppose your documents have a .name field for their name, and .email for their email address.
To get users by name (ex. "Alice" and "Bob"):
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Alice"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Bob"]
To get users by email, from the same view:
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","alice@gmail.com"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","bob@gmail.com"]
The reason I like to emit 1 is so you can write reduce functions later to use sum() to easily add up the documents that match your query.
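To see how one view can stand in for a generic getUserBy($column, $value), here is a small simulation that can be run outside CouchDB. The emit stub, the in-memory index, the sample users and the JavaScript getUserBy function are all illustrative assumptions. In CouchDB itself the lookup is simply the key=[column, value] query shown above.

```javascript
// Stand-in for CouchDB's emit(); records the id of the doc being mapped.
var index = [];
var currentId = null;
function emit(key, value) {
  index.push({ id: currentId, key: key, value: value });
}

// The by_field map function from above.
function byField(doc) {
  var field;
  for (field in doc) {
    emit([field, doc[field]], 1);
  }
}

// Made-up sample users.
var users = [
  { _id: 'u1', name: 'Alice', email: 'alice@example.com' },
  { _id: 'u2', name: 'Bob', email: 'bob@example.com' }
];
users.forEach(function (doc) {
  currentId = doc._id;
  byField(doc);
});

// Rough equivalent of getUserBy($column, $value): exact match on [column, value].
function getUserBy(column, value) {
  var wanted = JSON.stringify([column, value]);
  return index
    .filter(function (row) { return JSON.stringify(row.key) === wanted; })
    .map(function (row) { return row.id; });
}
```

Note that the map function also emits rows for _id and any other field present on the doc, so the index grows with the number of fields per document.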

CouchDB emit with lookup key that is array, such that order of array elements are ignored

When indexing a couchdb view, you can emit an array as the key such as:
emit(["one", "two", "three"], doc);
I appreciate the fact that when searching the view, the order is important, but sometimes I would like the view to ignore it. I have thought of a couple of options.
1. By convention, just emit the contents in alphabetical order, and ensure that looking up uses the same convention.
2. Somehow hash in a manner that disregards the order, and emit/search based on that hash. (This is fairly easy, if you simply hash each one individually, "sum" the hashes, then mod.)
Note: I'm sure this may be covered somewhere in the authoritative guide, but I was unsuccessful in finding it.
It looks like the correct approach is to pick a conventional ordering for the keys, emit them in that order, and be sure to query with the same ordering enforced. Otherwise we would need to emit all n! permutations of the keys, which gets bad quickly once n is greater than 3.
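A minimal sketch of that convention follows. canonicalKey is a hypothetical helper name, and doc.tags in the comment is a made-up field. Any fixed ordering works as long as the map function and the client agree on it.

```javascript
// Returns a copy of the parts in a fixed (sorted) order, so the same set of
// values always yields the same array key regardless of input order.
function canonicalKey(parts) {
  return parts.slice().sort();
}

// In the map function:  emit(canonicalKey(doc.tags), null);
// When querying:        key=<JSON of canonicalKey(userInput)>
var emitted = canonicalKey(['one', 'two', 'three']);
var queried = canonicalKey(['three', 'one', 'two']);
```

Both calls produce the same array, so a lookup succeeds whatever order the caller supplies.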
CouchDB will always maintain the array keys in order. Have you considered emitting all sequence variations as part of the view? Something along the lines of:
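function(doc) {
  function computeAllKeyVariations(fromKey) {
    // returns array of key arrays
  }
  // startingKey is the array you would otherwise emit as the key
  var allKeys = computeAllKeyVariations(startingKey);
  // iterate over the key arrays themselves; for...in would yield only the indexes
  allKeys.forEach(function(key) {
    emit(key, doc); // or emit(key, null)
  });
}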
Side note: You also have the option to use emit(['one','two','three'], null) instead of emitting the document. This will avoid having CouchDB store the full document in the view index (more than once). To get the same results as before just make use of &include_docs=true
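The computeAllKeyVariations helper is left as a stub above. One possible way to fill it in, offered only as a sketch, is a plain recursive permutation generator:

```javascript
// Returns every ordering of the items in `key` as an array of arrays.
function computeAllKeyVariations(key) {
  if (key.length <= 1) {
    return [key.slice()];
  }
  var result = [];
  key.forEach(function (item, i) {
    // All items except the one at position i.
    var rest = key.slice(0, i).concat(key.slice(i + 1));
    computeAllKeyVariations(rest).forEach(function (perm) {
      result.push([item].concat(perm));
    });
  });
  return result;
}

var variations = computeAllKeyVariations(['one', 'two', 'three']);
```

A key with n distinct items produces n! rows per document, and repeated items produce duplicate rows, so this is only practical for short keys.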

Creating a pagination index in CouchDB?

I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.
I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all start with a different file, though I seem to get the correct number of files emitted).
function(doc) {
if (doc.type == 'log') {
if (!pageIndex || pageIndex > 50) {
pageIndex = 1;
emit(doc.timestamp, null);
}
pageIndex++;
}
}
What am I doing wrong here? How would a CouchDB expert build this view?
Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all), I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.
Thanks!
View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)
Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.
function(doc) {
if (doc.type == 'log') {
emit(doc.timestamp, null);
}
}
How can you get every 50th document? The simplest way is to add a skip=0 or skip=50, or skip=100 parameter. However that is not ideal (see below).
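As a small sketch of the skip approach, a friendly ?page=N value can be translated into query parameters like this. pageParams is a hypothetical helper, and the page size of 50 is taken from the question.

```javascript
// Translates a 1-based page number into limit/skip query parameters.
function pageParams(page, pageSize) {
  var size = pageSize || 50;
  return 'limit=' + size + '&skip=' + (page - 1) * size;
}

var firstPage = pageParams(1); // 'limit=50&skip=0'
var fifthPage = pageParams(5); // 'limit=50&skip=200'
```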
A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)
function() {
var ddoc = this,
pageIndex = 0,
row;
send("[");
while(row = getRow()) {
if(pageIndex % 50 == 0) {
send(JSON.stringify(row));
}
pageIndex += 1;
}
send("]");
}
This will work for many situations; however, it is not perfect. Here are some considerations. They are not necessarily showstoppers, but it depends on your specific situation.
There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click to page 2? If the data is changing a lot, there is no perfect user experience; the user must somehow feel the data changing.
The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.
The _list example has a similar problem, however you front-load all the work, running through the entire view from start to finish, and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton) that will be an extremely slow query which is not cached.
In summary, for small data sets you can get away with the page=1, page=2 form; however, you will bump into problems as your data set gets big. With the release of BigCouch, CouchDB is even better for log storage and analysis, so (if that is what you are doing) you will definitely want to consider how high to scale.
