Why does CouchDB use a random offset between generated UUIDs?

I'm working on an application that stores data in CouchDB and I want to generate UUIDs for new documents within the application, i.e. without using the _uuid API or relying on UUIDs to be generated by CouchDB when the documents are inserted. To do this, I'm going to recreate the default algorithm that CouchDB uses to generate UUIDs, which works as follows:
A generated UUID consists of two parts: a random prefix and a monotonically increasing suffix. The same prefix is used for as long as the suffix doesn't overflow; when it does, a new random prefix is chosen. The suffix starts at zero and is increased in random steps between 1 and 0xffe.
This all seems reasonable, especially the part about the random but constant prefix that allows documents that are inserted at the same time to be stored near each other in the document B-Tree. What I don't understand is why the suffix is increased in random steps instead of just 1 each time. What is the explanation, or a possible explanation, for this decision?

Related

Long IDs vs Short IDs

Currently in my Node.js app I use the node-uuid module to give unique IDs to my database objects.
Using uuid.v1() function from that module I get something like
81a0b3d0-e4d0-11e3-ac56-73f5c88681be
Now, my requests are quite long, sometimes hundreds of nodes and edges in one query. So you can imagine they become huge, because every node and edge has to have a unique ID.
Do you know if I could use a shorter ID system so that I don't run into problems as the number of my items grows? I know I could get away with just the first 8 characters (since there are 36^8 > 2 trillion combinations), but how well will that perform when they are randomly generated? As the number of my nodes increases, what is the chance that a randomly generated ID will collide with an existing one?
Thank you!
If you're really concerned about uuid collisions you can always simply do a lookup to make sure you don't already have a row with that uuid. The point is that there is always a very low but non-zero chance of a collision given current uuid generators, especially with shorter strings.
Here's a post that discusses it in more detail
https://softwareengineering.stackexchange.com/questions/130261/uuid-collisions
One alternative would be to use a sequential ID system (autoincrement) instead of uuids.
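The "very low but non-zero chance" can be quantified with the standard birthday-problem approximation, P(collision) ≈ 1 − exp(−n(n−1)/2N) for n IDs drawn from N possibilities. A quick sketch:

```javascript
// Birthday-problem estimate: probability of at least one collision
// among n IDs drawn uniformly at random from an idSpace of N values.
function collisionProbability(n, idSpace) {
  return 1 - Math.exp(-n * (n - 1) / (2 * idSpace));
}

// 8 alphanumeric characters => 36^8 ≈ 2.8 * 10^12 possible IDs.
const N = Math.pow(36, 8);
```

For example, `collisionProbability(1e6, N)` is already around 16%, which is why truncating to 8 characters is risky once you have on the order of a million objects.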

Readable row key cassandra

We are investigating migrating a system from an RDBMS to Cassandra and are having trouble finding a way to replace an auto-increment column. We actually have no need for the key to be sequential at all; it can even contain characters, but it must be short (ideally under 8 chars) and globally unique. An ideal value would look something like
AB123456
First part of the question is should we be generating this key in application code or in Cassandra?
Second part:
If Cassandra, how?
If Application code, is it an acceptable pattern to generate a candidate code then attempt an insert, if collision occurs then regenerate key candidate and retry?
The common way to do this in Cassandra is to use a uuid (or timeuuid if the IDs should be time ordered). But these must be long to get uniqueness - they are 16 bytes long. (uuids are unique because the probability of a collision is so low; timeuuids are guaranteed unique since they contain information about the generating host and include time.)
If you need a shorter key, you can't reliably avoid collisions by checking before inserting: there will always be race conditions where this fails without external coordination. Coming in Cassandra 2.0 is compare-and-set, which will let you do this, but at a performance cost.
If you use a random 8 character string containing only numbers and letters, there are 36^8 possible keys, with collisions becoming likely after about sqrt(36^8) = 36^4 ≈ 1.7 million operations (the birthday bound). You can improve this by allowing any byte value, giving 256^8 possible keys, with collisions likely after about sqrt(256^8) = 2^32 ≈ 4 billion operations. This is probably still too low, so it would be better to use longer IDs.

Generating lexicographically ascending unique IDs

I want to generate IDs for use with CouchDB. I'd like the IDs to be lexicographically ascending by time so that I can sort on id without maintaining a separate timestamp field. I know that CouchDB will generate ids with this property, but I don't want the performance hit of querying the database; I'd rather just run an algorithm on my servers. I'd go with an implementation of RFC 4122 except that the results aren't lexicographically ascending. Is there any good reason I shouldn't just do:
(Date.now()) + 'x' + Math.round(Math.random() *1E18)
(I'm using nodejs). Are there any costs of using a non-standard uuid, or of relying on javascript's built in random function?
You have some choices when it comes to uuids.
The first choice is whether you want the _id generated client side (Node, browser, etc.) or by CouchDB. It sounds like you want to generate your own uuid on the client side. That is fine: just stick the result of your function into the _id field of the doc you save, and CouchDB will use it.
You could also have CouchDB create the id. CouchDB only generates an _id if you don't choose one yourself. By default it uses a 'sequential' uuid generation algorithm. You can change the algorithm to others via Futon and the config: there is a section called 'uuids' with a key of 'algorithm'. You can see the source for these algorithms here:
https://github.com/apache/couchdb/blob/master/src/couchdb/couch_uuids.erl
With descriptions about them here:
http://wiki.apache.org/couchdb/HttpGetUuids?highlight=%28utc%5C_random%29
As you can see, the utc_random algorithm is very similar to your suggestion. If you were so inclined, you could add your own algorithm on the server side and recompile CouchDB.
The second part of your question is about the performance of choosing different algorithms. I am going to quote Dave Cottlehuber from a user list post:
CouchDB will have the best insert time when your doc ids are continually increasing, as this minimises rewrites to the b-tree. This will also help your view build time for the same reason, and also minimises wasted doc space, although that would also be recovered during compaction.
So both your algorithm and utc_random should be fine, as the doc ids are continually increasing due to the helpfully one-directional nature of time.
I would recommend sticking with the UUID that CouchDB generates for you, but you can configure the server to use utc_random, which prefixes a timestamp that you can sort your records by.
http://wiki.apache.org/couchdb/HttpGetUuids

Search strategy

I'm writing a java program that needs to find possible matches for specified strings. Strings will generally be in the form of
onetwothree one.two.three
onesomethingtwoblah onesomething
where one, two and three are parts of an actual title. Candidate matches from the database are in the form one+two+three. The method I have come up with is to compare each token from the database candidates against the entire specified string using regex. A counter of the number of database token matches will determine the rank of possible matches.
My concern is the accuracy of matches presented and the method's ability to successfully find matches if they do exist. Is this method efficient?
It depends: if you have a lot of database records and large strings to compare against, the search may end up being quite expensive, since it needs a pass over the entire input string for each record.
You could instead consider doing a single pass over the input string and searching tokens against the database. A smart search index could help speed this up.
When matching multiple tokens you would need a way of knowing when to stop scanning and advance to the next token. Partial matches could help here: store one+two+three also as separate one, two and three. Or, if the order matters, store it also as one, one+two and one+two+three.
Basically as you scan you have a list of candidate DB entries that gets smaller and smaller, comparable to a facet search.
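The order-preserving variant of that idea (index one, one+two, one+two+three) can be sketched as follows. The function names and the `+`-joined record format are taken from the question; everything else is an assumed illustration:

```javascript
// Index every order-preserving token prefix of each record, so a scan
// can narrow the candidate set as it consumes more of the input.
function buildPrefixIndex(records) {
  const index = new Map(); // prefix string -> Set of full records
  for (const record of records) {
    const tokens = record.split('+');
    for (let i = 1; i <= tokens.length; i++) {
      const prefix = tokens.slice(0, i).join('+');
      if (!index.has(prefix)) index.set(prefix, new Set());
      index.get(prefix).add(record);
    }
  }
  return index;
}

// Walk increasing prefixes of the input tokens; the longest indexed
// prefix yields the (smallest) surviving candidate set.
function matchCandidates(index, inputTokens) {
  let best = new Set();
  for (let i = 1; i <= inputTokens.length; i++) {
    const prefix = inputTokens.slice(0, i).join('+');
    if (index.has(prefix)) best = index.get(prefix);
    else break; // no record shares this longer prefix; stop scanning
  }
  return [...best];
}
```

This mirrors the facet-search behaviour described above: each additional matched token shrinks the candidate set rather than rescanning every record.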

How are session identifiers generated?

Most web applications depend on some kind of session with the user (for instance, to retain login status). The session id is kept as a cookie in the user's browser and sent with every request.
To make the next user's session hard to guess, these session ids need to be sparse and somewhat random. They also have to be unique.
The question is - how to efficiently generate session ids that are sparse and unique?
This question has a good answer for unique random numbers, but it doesn't seem scalable to a large range of numbers, simply because the array would end up taking a lot of memory.
EDIT:
GUIDs are considered unsafe as far as security (and randomness) go.
The core problem is making sure the numbers are unique, i.e. they don't repeat and making it efficient.
If you want them to be unique and not easily guessable, why not combine these?
Take a counter (generates unique value for new session) and append random bits generated by a CSPRNG. Make sure to get the minimum number of bits required right.
This should work on a farm as well without hitches: just prefix the counter that is local to a server with an id that is unique to that server.
SSSSCCCCCRRRRRR
Where S is server id that created the session, C is the server local counter and R is a crypto random.
(Disclaimer: the number of letters do not correspond to the number of digits/bits you should use in any way. :)
Unique, secure.
You could take a look at the RNGCryptoServiceProvider if you are using .NET.
http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=775
This is a cryptographically secure way of generating random numbers.