Long IDs vs Short IDs - node.js

Currently in my Node.js app I use the node-uuid module to give unique IDs to my database objects.
Using the uuid.v1() function from that module I get something like
81a0b3d0-e4d0-11e3-ac56-73f5c88681be
Now, my requests are quite large, sometimes hundreds of nodes and edges in a single query, so you can imagine how big they get: every node and edge has to carry a unique ID.
Do you know if I could use a shorter ID scheme without running into problems as the number of my items grows? I know I could get away with just the first 8 symbols (with a 36-character alphabet there are 36^8 > 2 trillion combinations), but how well does that hold up when the IDs are randomly generated? As the number of my nodes grows, what is the chance that a newly generated ID collides with one that already exists?
Thank you!

If you're really concerned about uuid collisions you can always simply do a lookup to make sure you don't already have a row with that uuid. The point is that there is always a very low but non-zero chance of a collision given current uuid generators, especially with shorter strings.
Here's a post that discusses it in more detail
https://softwareengineering.stackexchange.com/questions/130261/uuid-collisions
One alternative would be to use a sequential ID system (autoincrement) instead of uuids.
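To put a number on the collision risk: by the birthday approximation, the probability of at least one collision among n random IDs drawn from N possibilities is about 1 - exp(-n^2 / (2N)). With 8 base-36 symbols, N = 36^8 ≈ 2.8 trillion, so one million IDs already give roughly a 16% chance of at least one collision. A minimal check-and-retry sketch in Node.js (the idExists lookup is a stand-in for whatever your datastore supports, not code from the question):

    const crypto = require('node:crypto');
    const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz';

    // Random 8-character base-36 ID (36^8 ≈ 2.8e12 possibilities).
    function shortId(len = 8) {
      let id = '';
      for (const b of crypto.randomBytes(len)) {
        id += ALPHABET[b % 36]; // slight modulo bias, harmless here
      }
      return id;
    }

    // idExists(id) -> Promise<boolean>, a hypothetical lookup helper.
    async function uniqueShortId(idExists) {
      for (;;) {
        const id = shortId();
        if (!(await idExists(id))) return id; // retry on collision
      }
    }

Note that check-then-insert can still race under concurrency; a unique index on the ID column, plus catching the duplicate-key error on insert, closes that gap.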

Related

How do I quickly get a weighted random instance of a Django model based on a weight field on that model?

I'm using Postgres as the database backend but I don't think that will matter. Also, I'd like to still use sqlite3 as the local development database, so ideally any approach will work for both.
By "weighted", I mean that some items in that database table are more likely to appear than others based on a heuristic value ranging from 0 to +inf where 0 is "never going to be picked", 1 is "as equal to be picked as any other instance, and 2 is "twice as likely to be picked than any other instance".
I've read other SO posts about pulling random instances of models, but as far as I've seen there are no ways to do this quickly involving weights.
My model:
Has millions of instances.
Has a weight DecimalField which can be updated at any time, even during runtime.
Is not referenced anywhere except for using this random selection algorithm (so, for example, it can be deleted and recreated at any time with no problems).
What I'm after is a way to do this that is faster than the solutions I've tried, or an explanation of why one of those solutions is already as fast as I can get.
Avoiding XY problem
I want to select "fresh" stuff from a database table but still give older content a chance of being seen. If some content has been viewed too frequently, or is not well-received, it should appear less often. Ideally I would be able to control how frequent that is: "ah, so this will appear 1.5 times more often than other content on the site."
Stuff I've tried
Selecting randomly and trying again based on a probability roll
Count the total number of instances of the model. E.g.: 100.
Select a random instance of the model.
Roll a die, keeping the item with probability proportional to weight / instances_count; otherwise throw it back and make a new random choice.
This seems pretty slow, and the random nature of the "throw back" means it might never terminate. In general this is really janky and I would hate to use it. I'm putting it first because it is pretty "simple" and I'm most likely to dismiss it.
SELECTing every element ID and weight and using a random weight algorithm to select an ID
SELECT all of the rows.
Assign each row a range of IDs based on the weights.
Add up all of the weights dynamically.
Roll a sum_of_all_weights-sided dice.
Pick the row whose assigned ID range contains the rolled value.
Problem is, this algorithm seems slow for millions of rows. It is the "naive" solution (sketched below).
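A sketch of that naive algorithm in plain JavaScript, since the idea is language-agnostic (the row shape { id, weight } is an assumption; in Django you would be materializing a queryset of millions of rows to do this):

    // Naive weighted pick over all rows: O(n) time and memory per draw.
    function weightedPick(rows) {
      const total = rows.reduce((sum, r) => sum + r.weight, 0);
      let roll = Math.random() * total; // the sum_of_all_weights-sided die
      for (const r of rows) {
        roll -= r.weight;
        if (roll < 0) return r.id;
      }
      return rows[rows.length - 1].id; // guard against float round-off
    }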
Assigning a range of IDs depending on the weight, dynamically deleting/re-creating instances
Whenever something is added or its weight changes, delete all rows carrying that item's unique details and insert weight copies of it with the same meta-information.
Pick a random instance normally.
A caveat with this is that only integer-based weightings are possible. Also, the performance problem is offloaded from the SELECT to the INSERT and DELETE operations.
Rethink the entire model and introduce a "throw-back" value
Add a throw_back_probability field instead of weight.
If the probability is 0.0 it will never be thrown back. Otherwise, roll a die and throw back if required depending on the throw_back_probability.
Limit the algorithm to 3 "throw-back"s (or some arbitrary number).
This ultimately solves the problem but still probably requires more database calls and is an indirect solution.
It's SO, so I'm sure there are clever annotation-based solutions to this (or similar) that I'm overlooking. Thanks for the help in advance!
You can combine sharding with any of the approaches you list. Choose a number of shards (preferably with number of rows / number of shards significantly greater than log(number of rows) to avoid empty shards with high probability), assign each row a uniform random shard ID, and make the shard ID the first entry of the primary key so that the table is sorted by shard. To sample, choose a uniform random shard and then sample within the shard. This is inaccurate to the extent that the shard totals are unbalanced, but if the shards are large enough, then the law of large numbers will kick in. (If the shards are too large, though, then that starts to defeat the point of sharding.)
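A sketch of that two-stage sampling, with a generic async query() helper standing in for the ORM (the shard count, table name, and column names are assumptions for illustration):

    const NUM_SHARDS = 1024; // keep rows-per-shard well above log(total rows)

    async function sampleWeighted(query) {
      for (;;) {
        // Stage 1: uniform random shard. The table is clustered on
        // shard_id, so this reads one small contiguous slice.
        const shard = Math.floor(Math.random() * NUM_SHARDS);
        const rows = await query(
          'SELECT id, weight FROM items WHERE shard_id = $1', [shard]);
        const total = rows.reduce((s, r) => s + Number(r.weight), 0);
        if (total === 0) continue; // empty or zero-weight shard: redraw
        // Stage 2: the naive weighted pick, now over a small row set.
        let roll = Math.random() * total;
        for (const r of rows) {
          roll -= Number(r.weight);
          if (roll < 0) return r.id;
        }
        return rows[rows.length - 1].id; // float round-off guard
      }
    }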

How to insert an auto increment / sequential number value in CouchDB documents?

I'm currently playing with CouchDB a bit and have the following scenario:
I'm implementing an issue tracker. The requirement is that each issue document has (besides its document _id) a unique sequential number, in order to refer to it in a more convenient way.
My first approach was to have a view which simply returns the count of unique issue documents currently stored. Increment that value on the client side by 1, assign it to my new issue and insert that.
That turned out to be a bad idea when inserting multiple issues via ajax calls, or when multiple clients add issues at the same time. In the latter case it wouldn't even be possible without communication between the clients.
Ideally I want the sequential number to be generated on the Couch side, which, as far as I know, is not possible due to conflicting states in distributed systems.
Is there any good pattern one could use (maybe on the client side) to approach this? I feel like this is a standard kind of use case (thinking of invoice numbers, etc).
Thanks in advance!
You could use a separate, otherwise empty document that consists only of its id and rev. The rev prefix is always an integer, so you can use it as your auto-incrementing number.
Just write an update to that document; a successful save increments the rev and returns it, and you can use the integer prefix of that generated value for your purpose.
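A sketch of that trick with Node 18+'s built-in fetch (the database and document names are made up; the original answer doesn't include code):

    // Bump the rev of an otherwise-empty counter document and return
    // the integer prefix of the new rev as the next issue number.
    async function nextIssueNumber() {
      const url = 'http://localhost:5984/issues/counter';
      const doc = await (await fetch(url)).json(); // read current rev
      const res = await fetch(url, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ _id: 'counter', _rev: doc._rev }),
      });
      if (!res.ok) return nextIssueNumber(); // 409: lost a race, retry
      const saved = await res.json(); // { ok: true, id: 'counter', rev: '42-...' }
      return parseInt(saved.rev.split('-')[0], 10);
    }

Because CouchDB rejects a write whose _rev is stale, two clients racing for the same number can't both win: the loser gets a 409 conflict and retries, so every client ends up with a distinct rev prefix.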
Alternative way:
Create a separate document, consisting of value and lock. Then execute something like: "IF lock == true THEN return ELSE set lock = true AND increase value by 1", then do a GET to retrieve the new value and finally set lock = false.
I agree with you that using a view that gives you a document count is not a great idea, and it is the reason that CouchDB uses UUIDs instead.
I'm not aware of a sequential id feature in CouchDB, but I think it's quite easy to write. I'd consider either:
An RPC call (e.g. via RabbitMQ) to a single service, to avoid concurrency issues. You can then store the latest number in a dedicated document on a specific non-distributed CouchDB, or somewhere else. This may not scale particularly well, but you'll have written a heck of an issue-tracking system before that becomes a problem.
If you can tolerate missing numbers, set the uuid algorithm on your Couch to sequential and you are at least good until the first overflow of the sequential suffix. See more info at: http://couchdb.readthedocs.org/en/latest/config/misc.html#uuids-configuration
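For reference, per that configuration page this is a two-line setting in local.ini (double-check the section name against your CouchDB version):

    [uuids]
    algorithm = sequential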

Why does CouchDB use a random offset between generated UUIDs?

I'm working on an application that stores data in CouchDB, and I want to generate UUIDs for new documents within the application, i.e. without using the _uuids API or relying on UUIDs generated by CouchDB when the documents are inserted. To do this, I'm going to recreate the default algorithm that CouchDB uses to generate UUIDs, which works as follows:
A generated UUID consists of two parts: a random prefix and a monotonically increasing suffix. The same prefix is used for as long as the suffix doesn't overflow; when it does, a new random prefix is chosen. The suffix starts at zero and is increased in random steps between 1 and 0xffe.
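For concreteness, a minimal Node.js sketch of the described scheme; the 26-hex-character prefix and 6-hex-character suffix split is an assumption based on CouchDB's documented sequential UUID format, not taken from its source:

    const { randomBytes } = require('node:crypto');

    let prefix = null; // 26 hex characters of randomness
    let suffix = 0;    // rendered as 6 hex characters

    function nextUuid() {
      const step = 1 + Math.floor(Math.random() * 0xffe); // random step in [1, 0xffe]
      if (prefix === null || suffix + step > 0xffffff) {
        prefix = randomBytes(13).toString('hex'); // fresh prefix on overflow
        suffix = 0;
      }
      suffix += step;
      return prefix + suffix.toString(16).padStart(6, '0');
    }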
This all seems reasonable, especially the part about the random but constant prefix that allows documents that are inserted at the same time to be stored near each other in the document B-Tree. What I don't understand is why the suffix is increased in random steps instead of just 1 each time. What is the explanation, or a possible explanation, for this decision?

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that the machine ID and process ID were merged into a single random 5-byte value instead. That might make it harder to figure out where a document came from, but it also simplifies generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer. In terms of whether they can be guessed: they are not hashes, and they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set the ranges of valid ObjectIDs should be fairly easy to work out, since they are all prefixed with a timestamp (unless you are manipulating this in some way).
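To make the timestamp point concrete, here's how trivially the creation time falls out of a (pre-3.4) ObjectId; the sample id is made up for illustration:

    // The first 4 bytes of an ObjectId are a big-endian Unix timestamp.
    function objectIdTimestamp(hexId) {
      const seconds = parseInt(hexId.slice(0, 8), 16);
      return new Date(seconds * 1000);
    }

    console.log(objectIdTimestamp('5349b4ddd2781d08c09890f3'));
    // -> 2014-04-12T21:49:17.000Z

An attacker who knows roughly when a document was created has therefore already pinned down a third of its id.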
Mongo's ObjectIds were never meant to be protection against brute-force attacks (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user just because that user shouldn't know its id.
For actual protection of your resources, employ other techniques.
To defend against unauthorized access, put authorization logic in your app (allow access to legitimate users, deny everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.

Should I implement auto-incrementing in MongoDB?

I'm making the switch to MongoDB from MySQL. A familiar architecture to me, for a very basic users table, would auto-increment the uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in IRC Freenode's #mongodb suggested creating a range of IDs and caching them. I'm unsure of how to actually implement that, or whether there's another route I can go. I don't necessarily even need the _id itself to be incremented this way. As long as the users all have a unique numerical uid within the document, I would be happy.
I strongly disagree with the selected answer's claim that there is "no auto-increment id in MongoDB and there are good reasons". We don't know the reasons why 10gen didn't encourage auto-incremented IDs; that's speculation. I think 10gen made this choice because it's simply easier to ensure uniqueness of 12-byte IDs in a clustered environment. It's a default solution that fits most newcomers and therefore increases product adoption, which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in commercial environment.
I'm building a social network. We have roughly 6M users, and each user has roughly 20 friends.
Now imagine we have a collection which stores relationships between users (who follows whom). It looks like this
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have a unique composite index {user_id, followee_id}. We can estimate the size of this index: 12 bytes * 2 * 6M * 20 ≈ 2.9GB of key data alone. That's the index for fast look-up of the people I follow. For fast look-up of the people who follow me, I need the reverse index. That's another ~2.9GB.
And this is just the beginning. I have to carry these IDs everywhere. We have an activity cluster where we store your news feed, that is, every event you or your friends produce. Imagine how much space that takes.
And finally, one of our engineers made an unthinking decision and stored references as strings representing ObjectIds, which doubles their size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow, lock contention goes up, and writes get slower as well. Seeing lock contention around 80% is no longer a shock to me.
Before you know it, you've ended up with a 460GB cluster that you have to split into shards and that is quite hard to manipulate.
Facebook uses a 64-bit long as the user id :) There is a reason for that. You can generate sequential IDs:
using 10gen's advice;
using MySQL as the storage for counters (if you're concerned about speed, take a look at HandlerSocket);
using an ID-generating service you build yourself, or something like Twitter's Snowflake.
So here is my general advice to everyone: please, please make your data as small as possible. When you grow, it will save you lots of sleepless nights.
Josh,
There is no auto-increment id in MongoDB, and there are good reasons for that.
I would say go with ObjectIds, which are unique across the cluster.
You can add auto-increment via a sequence collection, using findAndModify to get the next id to use. This will definitely add complexity to your application and may also affect your ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post for more info about this question in the dedicated google group for MongoDB:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks
So, there's a fundamental problem with "auto-increment" IDs: when you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL this is generally pretty easy, as you just have one server accepting writes. But big deployments of MongoDB run sharding, which has no such "central authority".
MongoDB uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness is that you technically have to do two writes for each insert, which may not scale very well; you probably want to avoid it for data with a high insertion rate. It may work for users; it probably won't work for tracking clicks.
There is nothing like auto-increment in MongoDB, but you can store your own counters in a dedicated collection and $inc the relevant counter value as needed (see the sketch below). Since $inc is an atomic operation, you won't see duplicates.
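A sketch of that counter pattern in classic mongo shell syntax (the collection and sequence names are illustrative; this mirrors the pattern from the documentation linked above):

    // One counter document per sequence; $inc is atomic, and findAndModify
    // returns the post-increment document, so no duplicates are handed out.
    function getNextSequence(name) {
      var ret = db.counters.findAndModify({
        query: { _id: name },
        update: { $inc: { seq: 1 } },
        new: true,    // return the updated document
        upsert: true  // create the counter on first use
      });
      return ret.seq;
    }

    db.users.insert({ _id: getNextSequence("userid"), name: "Ada" });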
The default Mongo ObjectId, the one used in the _id field, is incrementing.
Mongo uses a timestamp (seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar to (if not exactly the same as) the composition of a Version 1 UUID. And that ObjectId is generated at insert time (if no other kind of _id is provided by the user/client).
Thus the ObjectId is ordinal in nature; further, the default sort is based on this incrementing timestamp.
One might consider it an updated version of the auto-incrementing (index++) ids used in many DBMSes.
