Compare two couchdb databases

I have a couchdb instance with database a and database b. They should contain identical sets of documents, except that the _rev property will be different, which, AIUI, means I can't use replication.
How do I verify that the two databases really do contain the same documents which are all otherwise 'equal'?
I've tried using the Python-based couchdb-dump tool with a lot of sed magic to get rid of the _rev fields and the MD5/ETag headers, but it still seems that property order in the JSON structure is somewhat random, which means I still can't easily compare the output with something like diff.
Is there a better approach here? Have other people wanted to solve a similar problem?

If you want to make sure they're exactly the same, write a map job that emits the document path as the key and the document's hash (generated any way you like) as the value. Do not include the _rev field in the hash generation.
You cannot reduce to a single hash because order is not guaranteed, but you can feed the resultant JSON document to a good diff program.
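A minimal sketch of such a map function (CouchDB views are plain JavaScript). The view server has no hashing library built in, so this emits a canonical JSON string instead, with keys sorted recursively and _rev dropped, which works just as well as a hash for diffing:

function (doc) {
  // Serialize a value with object keys in sorted order, so that two
  // equal documents always produce the same string.
  function canonical(value) {
    if (value !== null && typeof value === 'object') {
      if (Array.isArray(value)) {
        return '[' + value.map(canonical).join(',') + ']';
      }
      return '{' + Object.keys(value).sort().map(function (k) {
        return JSON.stringify(k) + ':' + canonical(value[k]);
      }).join(',') + '}';
    }
    return JSON.stringify(value);
  }
  var copy = {};
  for (var k in doc) {
    if (k !== '_rev') copy[k] = doc[k]; // drop the only field allowed to differ
  }
  emit(doc._id, canonical(copy));
}

Query this view on both databases with the same parameters and diff the two result sets; any differing row points at a document that doesn't match.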

Related

CouchDB - human readable id

I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future it's not planned to change that. While I can remove most of the cases where a short, auto-increment-like ID (it can be sparse, but shouldn't look random) is required, there remains one place where the user actually needs to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand, and so on.
However, in the database it can be stored with whatever ID pleases CouchDB (the default auto-generated UUID), as long as it can also be identified by a number. What I have thought about is creating a document that holds an array of all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that appends the new unique ID to the end of said document. To obtain a product's short ID I'd then query the array and, client-side, use indexOf to get its index.
I don't know if this is feasible. From the performance point of view I can say the following: there are more queries that go numerical ID -> UUID than UUID -> numerical ID, and there will be at most 7000 new entries a year in the database. Also, there is currently no use case where a product can be deleted, but I'd rather not rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view it seems to be working: I can do both conversions, number <-> UUID, and it seems to go well. I don't know how well this plays with replication and the like, but since everything lives in said array, I guess it should be fine, right?
You have two choices here:
Set your human-readable id as the _id field. Basically you can just set it in the create-document calls to the DB, and it will be accepted. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients that try to create a document but instead overwrite an existing one.
It can only contain alphanumeric characters or a few special characters. In my experience, allowing extra character types is asking for trouble.
It cannot be longer than a practical string length limit (CouchDB doesn't define one, but you should). Long ids will bloat your views (indexes) badly, and might make them slower.
If these things are no problem for you, then you should go with this solution.
As you said yourself, let the _id be a UUID, and put the human-readable id in another field. To reach the document by the human-readable id, you can just create a view emitting the human-readable id as a key, and then either emit the document as the value or fetch the document via the include_docs=true option (a sketch follows below). Whenever the view is queried, CouchDB will update it incrementally and return you the list. This is really the same as you creating a document with an array/object of ids inside it, except that with a CouchDB view you get more performance.
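A minimal sketch of such a map function; the humanId field name is an assumption for illustration:

function (doc) {
  if (doc.humanId) {
    // Query with ?key="..."&include_docs=true to fetch the document.
    emit(doc.humanId, null);
  }
}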
This might also be slightly slower on querying and inserting. If the ids are inserted sequentially it's fine; if not, CouchDB will take slightly more time to insert each one at the right place. Non-sequential ids don't cope well with huge volumes of inserts hitting the DB.
Querying shouldn't be more than 10% slower than the first option, and even 10% is really an overestimate; it will most probably be less than 5%. In my own CouchDB application I switched from reading by _id to reading from a view by key, and the slowdown was so small that, from the user's end, it wasn't noticeable even when making 100 queries at the same time.
This is how people query documents by fields other than the id, for example looking up a user document by email when the user logs in.
If you don't know how CouchDB views work, you should read the views chapter of CouchDB: The Definitive Guide.
Also, make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4GB per document. I remember having such documents, and query times were really long because the view had to iterate over each array item. In the end I created one document per array item instead; it was way faster.

Generating lexicographically ascending unique IDs

I want to generate IDs for use with CouchDB. I'd like the IDs to be lexicographically ascending by time, so that I can sort on id without maintaining a separate timestamp field. I know that CouchDB will generate ids with this property, but I don't want the performance hit of querying the database; I'd rather just run an algorithm on my servers. I'd go with an implementation of RFC 4122 except that the results aren't lexicographically ascending. Is there any good reason I shouldn't just do:
Date.now() + 'x' + Math.round(Math.random() * 1e18)
(I'm using node.js.) Are there any costs to using a non-standard UUID, or to relying on JavaScript's built-in random function?
You have some choices when it comes to uuids.
The first choice is whether you want the _id generated client-side (node, browser, etc.) or by CouchDB. It sounds like you want to generate your own UUID on the client side. That is fine. Just stick the result of your function into the _id field of the doc you save to CouchDB, and Couch will use that.
You could also have CouchDB create the id; it only generates an _id if you don't choose one yourself. By default CouchDB uses a 'sequential' UUID generation algorithm. You can change the algorithm via Futon and the config: there is a section called 'uuids' with a key of 'algorithm'. You can see the source for these algorithms here:
https://github.com/apache/couchdb/blob/master/src/couchdb/couch_uuids.erl
with descriptions of them here:
http://wiki.apache.org/couchdb/HttpGetUuids?highlight=%28utc%5C_random%29
As you can see, the utc_random function is very similar to your suggestion. If you were so inclined, you could also add your own algorithm on the server side and recompile CouchDB.
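For reference, here is a node sketch in the spirit of utc_random (which, per the CouchDB wiki, puts a 14-hex-character timestamp before 18 random hex characters); this approximation uses milliseconds rather than microseconds, so the widths here are illustrative assumptions. The key idea is a fixed-width timestamp prefix, so that string order matches time order, plus a random suffix to keep ids minted at the same instant unique:

function utcRandomId() {
  // Fixed-width hex timestamp: lexicographic order == chronological order.
  var prefix = ('00000000000000' + Date.now().toString(16)).slice(-14);
  // Random hex suffix to disambiguate ids created in the same millisecond.
  var suffix = '';
  for (var i = 0; i < 18; i++) {
    suffix += Math.floor(Math.random() * 16).toString(16);
  }
  return prefix + suffix;
}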
The second part of your question is about the performance of choosing different algorithms. I am going to quote Dave Cottlehuber from a user list post:
CouchDB will have the best insert time when your doc ids are continually increasing, as this minimises rewrites to the b-tree. This will also help your view build time for the same reason, and it minimises wasted doc space, although that would also be recovered during compaction.
So both your algorithm and utc_random should be fine, as the doc ids are continually increasing thanks to the helpfully one-directional nature of time.
I would recommend sticking with the UUIDs that CouchDB generates for you, but you can configure the server to use utc_random, which prefixes a timestamp you can sort your records by.
http://wiki.apache.org/couchdb/HttpGetUuids
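For example, on CouchDB 1.x the algorithm can be switched through the config API (host and port assumed):

curl -X PUT http://localhost:5984/_config/uuids/algorithm -d '"utc_random"'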

Using different types/format of _id ok?

I'm using one database for all (users, files and comments).
I was wondering if I can/should use
twitter user id for user doc _id's
md5 hash (of file) for file doc _id's
provided uuid for comment doc _id's
It feels weird to mix those different types of id's.
What speaks against this scenario? Should I stick to the CouchDB UUIDs for consistency?
Use any format of id, or combination of formats, as you see fit. You might wish to add a prefix to ensure there are no overlaps between them, though.
twitter:#rnewson
md5:86f646c11b3bc7d434d06c077aee43d8
And so on.
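A small sketch of what creating such documents looks like from node, using the nano client (the client setup, database name, and type field are assumptions):

var nano = require('nano')('http://localhost:5984');
var db = nano.db.use('app');

// Each document type lives in its own id namespace thanks to the prefix.
db.insert({ _id: 'twitter:rnewson', type: 'user' });
db.insert({ _id: 'md5:86f646c11b3bc7d434d06c077aee43d8', type: 'file' });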

redis performance, store json object as a string

I need to save a User model, something like:
{ "nickname": "alan",
"email": ...,
"password":...,
...} // and a couple of other fields
Today, I use a Set: users
In this Set, I have members like user:alan
Under each such key, I store the hash above
This is working fine, but I was just wondering whether, instead of the above approach, it could make sense to use the following one:
Still use users Set (to easily get the users (members) list)
Beyond this set, only use plain key/value storage like:
key: alan
value : the stringify version of the above user hash
Retrieving a record would then be easier (I would just have to parse it as JSON).
I'm very new to Redis and I am not sure which would be best. What do you think?
You can use the Redis hash data structure to store your JSON object's fields and values. For example, your users set can still be used to list all users, while each individual JSON object is stored in a hash like this:
db.hmset("user:id", jsonObj); // hmset takes field/value pairs, not a JSON string; node_redis accepts a flat object directly
Now you can list all users by key, or fetch only a specific one (and get/set only the fields/values you need).
EDIT: (sorry I didn't realize that we talked about this earlier)
Retrieving a record would then be easier (I will then have to Parse it with JSON).
This is true, but with the hash data structure you can get/set only the field/value you need to work with. Retrieving the entire JSON object can hurt performance (depending on how often you do it) if you only want to change part of the object; another downside is that you will need to stringify/parse the object every time.
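A minimal sketch of both layouts with the node_redis client (key names are illustrative):

var redis = require('redis');
var client = redis.createClient();

var user = { nickname: 'alan', email: 'alan@example.com' };

// Option 1: a Redis hash, one field per property.
client.sadd('users', 'alan');            // keep the member list
client.hmset('user:alan', user);         // store fields individually
client.hget('user:alan', 'email', function (err, email) {
  console.log(email);                    // read one field without fetching the rest
});

// Option 2: one JSON string per user.
client.set('user:alan:json', JSON.stringify(user));
client.get('user:alan:json', function (err, raw) {
  var obj = JSON.parse(raw);             // must parse the whole object every time
  console.log(obj.email);
});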
One additional merit of JSON over hashes is that it preserves types: in a hash, 123.3 becomes the string "123.3", and depending on the library, null/None can accidentally be cast to the string "null".
Either way is a bit tedious, as it requires writing a transformer for extracting the strings and converting them back to their expected types.
For space/memory consumption I've started leaning towards storing just the values as a JSON list, ["my_type_version", 123.5, null, ...], to avoid the overhead of N * sum(len(JSON key names)), which in my case was +60% of Redis's used memory footprint.
Bear in mind: hashes cannot store nested objects, while JSON can.
Truthfully, either way works fine. The way you store it is a design decision you will need to make. It depends on how you want to retrieve the user information, etc.
In terms of performance, storing the JSON encoded version of the user object will use less memory and take less time for storage/retrieval. That is, JSON parsing is probably faster than retrieving each field from Redis. And, even if not, it is probably more memory efficient. The difference in performance is probably minimal anyway.

couchdb hovercraft limitations, storing arbitrary erlang terms into Couchdb

So I've been messing with Hovercraft and ran into some annoying limitations, which are probably there because internally CouchDB treats the key/value pairs associated with a document as opaque (JSON) strings.
Namely:
- doc _id's can only be binary strings (UTF-8): no complex Erlang terms allowed here
- key/value pairs can only be binary strings, atoms, or lists (no tuples or arbitrary binaries allowed)
I was looking forward to storing arbitrary Erlang terms in there without encoding them as JSON first. Yes, this is possible, but then the entire view system (and the HTTP API, notifications, verification, indexing) just stops working.
That too is fine; I could code around it: skip Futon, map/reduce over documents manually, and store the results as documents (which is actually better, since those results can then be replicated to other DBs/nodes, unlike view results, which don't replicate; correct me if I'm wrong).
The real problem seems to be that without views one cannot get a list of all the keys stored in a db, at least not via the current Hovercraft API. That is a show-stopper for map/reducing manually over an entire db without knowing beforehand what the doc _id's are.
Any ideas as to how I can get a list of these keys in a db via Erlang calls, possibly into the internals of CouchDB?
It's even more obvious to me now that the direct Erlang API for CouchDB was a total afterthought.
As the author of Hovercraft, I agree with the statement "the direct Erlang API for CouchDB was a total afterthought."
You should only use Hovercraft if you are converting CouchDB from an HTTP server to, say, an SMTP server. HTTP will scale much better than Hovercraft.
It should be possible to use the internal _changes API to iterate over all the docs in the database and maintain a secondary index incrementally.
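The same feed is also exposed over HTTP; a rough node sketch of walking it once (database name and port are assumptions), where each row carries a doc id so no view is required:

var http = require('http');

http.get('http://localhost:5984/mydb/_changes', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    JSON.parse(body).results.forEach(function (row) {
      console.log(row.id); // one entry per doc in the database
    });
  });
});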
As for storing non-JSON data in CouchDB, that sounds risky, as no one will be looking out to make sure we don't break your use case.
But if you are having fun, by all means continue. And I love getting patches to Hovercraft, so any little thing will probably get rolled in.
Thanks,
Chris
