MongoDB: Architecture and security best practices - node.js

Assuming we build a web page that features some sort of profile page for users, every user is supposed to be identified by a unique id. In MongoDB this is most likely the _id field of a document. When handling the link to a user's profile, I don't feel comfortable exposing an actual database ID in the link, such as
https://www.foopage.com/userprofile/5612afeedd2387aeeebbdd211100ccdd
Instead, I'd like to introduce a shorter identifier, as can be seen on YouTube, where every video has a unique 10-digit alphanumeric code.
My question now is:
1) From a security point of view: is it worth hiding the actual ID of a document from the public, or does it in fact not enhance system security?
2) I am expecting performance drawbacks from using my own identifier, as the native _id of MongoDB is most likely indexed and "optimized", whereas my identifier always requires some sort of data mapping or additional indexing. What is your experience? Is there any documented difference between using the native index and a manually created one in MongoDB?

1) From a security point of view: is it worth hiding the actual ID of a document from the public, or does it in fact not enhance system security?
Usually an attacker gains no benefit from knowing ObjectIDs, unless you build some security vulnerability into your application that can be exploited by knowing an ID.
Another thing you need to be aware of is that ObjectIDs are not entirely random. They also leak a machine identifier, the process id of the process which generated them, and the date and time the ID was generated. See the documentation for more information. None of that information is really security-critical (although the creation time is something you might not want to reveal in some cases), but it might be valuable to an attacker trying to exploit some other vulnerability in your system.
That means hiding ObjectIDs as part of a defense-in-depth strategy can be useful, but it is only security through obscurity. Do not expect ObjectIDs to be secret.
2) I am expecting performance drawbacks from using my own identifier, as the native _id of MongoDB is most likely indexed and "optimized", whereas my identifier always requires some sort of data mapping or additional indexing.
Did you know that your _id values don't need to be ObjectIDs? You can use any value of any type you want as long as it is unique at the collection level. So if your data already has a unique identifier, you can use that as _id and save some bytes and an index. To do this, simply set the _id field of a document to the desired value before you insert it into the database.
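For illustration, here is a minimal sketch using the official mongodb Node.js driver; the connection string, database and collection names are assumptions, not something from the question.

const { MongoClient } = require('mongodb');

async function run() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumed local instance
  await client.connect();
  const users = client.db('app').collection('users');

  // Use your own unique value as _id instead of letting the driver generate an ObjectID.
  await users.insertOne({ _id: 'IgorP', email: 'igor@example.com' });

  // Lookups by _id use the collection's built-in unique index.
  console.log(await users.findOne({ _id: 'IgorP' }));

  await client.close();
}

run().catch(console.error);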
If you want to retain the _id:ObjectID pattern, you can just create a unique index on your application-specific ID field. You then have at least two indexes which are updated on every insert, but the performance penalty for this should not be severe unless your application is very write-heavy (which would be unusual for a web application).
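If you keep the ObjectID _id, the secondary index could be set up roughly like this, reusing the users handle from the sketch above and run inside an async function; the field name shortId is an assumption.

// One-time setup: a unique index on the application-specific short id.
await users.createIndex({ shortId: 1 }, { unique: true });

// Documents keep their ObjectID _id and carry the short id alongside it.
await users.insertOne({ shortId: 'aX3k9QwZ_p', name: 'Igor' });

// URL handlers then look users up by the short id instead of the _id.
const user = await users.findOne({ shortId: 'aX3k9QwZ_p' });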
By the way: the URL example.com/userprofile/5612afeedd2387aeeebbdd211100ccdd is not very search-engine friendly. It would be better to have URLs like example.com/userprofile/IgorP, which are more readable for both search engines and human users.

Related

What are the efficiency costs associated with using a custom ID in MongoDB

I plan on using this NPM package (shortid) to produce shorter IDs, primarily for use in URLs. I wish to use them, as directed, as the MongoDB id (at least for certain collections).
What are the costs associated with using custom IDs? Will it affect lookup time, write time, etc. in any significant way?
These types of questions can quickly wander off into a battle of opinions, so rather than stating an opinion I think it makes more sense to provide some pros and cons and let you decide which is better for this application.
Assuming the format of the "shortid" will be stored as a string, I think a response by Abigail Watson to a similar question on Google Groups sums up some of the larger points. Her response is primarily aimed at Meteor apps, so some of her pros and cons relate to design decisions made by the Meteor team, but you can see that deciding whether to use an ObjectId or a "shortid" is an application-based decision.
Her entire response:
ObjectId Pros
it has an embedded timestamp in it.
it's the default Mongo _id type; ubiquitous
interoperability with other apps and drivers
ObjectId Cons
it's an object, and a little more difficult to manipulate in practice.
there will be times when you forget to wrap your string in new ObjectId()
it requires server side object creation to maintain _id uniqueness
which makes generating them client-side by minimongo problematic
String Pros
developers can create domain specific _id topologies
String Cons
developer has to ensure uniqueness of _ids
findAndModify() and getNextSequence() queries may be invalidated
Meteor's choice to go with a string, as I understand it, basically boils down to latency compensation and being able to generate the _id on the client side in minimongo. The default ObjectId implementation didn't lend itself to being generated on the client as part of the latency compensation framework, so they decided to roll their own _id scheme.
Personally, I find the embedded timestamps in ObjectIds to be invaluable later in an application's lifecycle. They are more difficult to manipulate, and they add more debugging time to an application's development cycle. But the extra 10 or 20 hours you put into debugging the ObjectIds can return 10x or 100x savings down the road. Example: at work, we just salvaged a year's worth of production data because of the embedded timestamps, which has saved us probably hundreds of thousands of dollars of R&D time and effort.
ObjectIds are great if you can ensure that there's one central authority for generating them. They're also the preferred index type for any kind of time-series data. And while it may seem tempting to make a one-or-the-other decision for your entire app, I find that choosing a string vs. an ObjectId (vs. some other index scheme) really boils down to the topology of the data in the collection.
Some useful questions to maybe ask when choosing the _id for a collection:
Does the data in the collection need latency compensation?
Is it time-series data?
Will other applications or worker utilities be accessing the collection?
What is the topology of the data in the collection?
https://groups.google.com/d/msg/meteor-talk/f-ljBdZOwPk/oQYZQxCAKN8J
My two cents to throw into the mix: if the main reason to use a "shortid" is shorter URLs, why not create a URL property that is also indexed and used only for fetching documents by their URL id? You get to keep the ObjectId, so you don't have to worry about sharding or dependency issues down the road, while also having a shorter URL ID value.
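A rough sketch of that idea with the shortid package, run inside an async function; the posts collection, the urlId field and the Express-style req.params are assumptions.

const shortid = require('shortid');

// One-time setup: keep the default ObjectID _id, add a separate indexed URL id.
await posts.createIndex({ urlId: 1 }, { unique: true });

// On insert, generate the short URL id alongside the ObjectID.
const urlId = shortid.generate(); // e.g. 'S1gq-2hWd'
await posts.insertOne({ urlId, title: 'Hello' });

// A route handler for /posts/:urlId fetches by the short id only.
const post = await posts.findOne({ urlId: req.params.urlId });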

Visible User ID in Address Bar

Currently, to pass a user id to the server on certain views I use the raw user id.
http://example.com/page/12345 //12345 Being the users id
Although there is no real security risk in my specific application from exposing this data, I can't help feeling a little dirty about it. What is the proper solution? Should I somehow be disguising the data?
Maybe a better way to phrase my question is to ask what the standard approach is. Is it common for applications to use user ids in plain view if it's not a security risk? If it is a security risk, how is it handled? I'm just looking for a point in the right direction here.
There's nothing inherently wrong with that. Lots of sites do it. For instance, Stack Overflow users can be enumerated using URLs of the form:
http://stackoverflow.com/users/123456
Using a normalized form of the user's name in the URL, either in conjunction with the ID or as an alternative to it, may be a nicer solution, though, e.g.:
http://example.com/user/yourusername
http://example.com/user/12345/yourusername
If you go with the former, you'll need to ensure that the normalized username is set up as a unique key in your user database.
If you go with the latter, you've got a choice: if the normalized username in the database doesn't match the one in the URL, you can either redirect to the correct URL (like Stack Overflow does), or return a 404 error.
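As a sketch, an Express route implementing the redirect variant might look like this; the route shape, the slug field, the users handle and the profile template are assumptions.

// Canonical URL: /user/:id/:slug  (slug = normalized username)
app.get('/user/:id/:slug?', async (req, res) => {
  const user = await users.findOne({ _id: Number(req.params.id) });
  if (!user) return res.status(404).send('Not found');

  // Wrong or missing slug: redirect to the canonical URL, as Stack Overflow does.
  if (req.params.slug !== user.slug) {
    return res.redirect(301, `/user/${user._id}/${user.slug}`);
  }
  res.render('profile', { user });
});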
In addition to duskwuff's great suggestion to use the username instead of the ID itself, you could use UUIDs instead of integers. They are 128 bits long, so infeasible to enumerate, and they also avoid disclosing exactly how many users you have. As an added benefit, your site is future-proofed against user id limits if it becomes massively popular.
For example, with integer ids, an attacker could find out the largest user_id on day one, and come back a week or a month later to find out what the largest user_id is then. They can continually do this to monitor the rate of growth of your site - perhaps not a biggie for your example - but many organisations consider this sort of information commercially sensitive. It also helps avoid social engineering, e.g. it makes it significantly harder for an attacker to email you asking to reset their password "because I've changed email providers and I've forgotten my old password but I remember my user id!". Give an attacker an inch and they'll run a mile.
I prefer to use Version/Type 4 (random) UUIDs, but you could also use Version/Type 5 (SHA-1-based), so you could call UUID.fromName(12345) and get a UUID derived from the integer value, which is useful if you want to migrate existing data and need to update a bunch of foreign key values. Most major languages support UUIDs natively or via popular libraries (C & C++), although some database software might require some tweaking - I've used them with Postgres myself and the transition was easy.
The downside is that UUIDs are significantly longer and not memorable, but it doesn't sound like you need users to be able to type the URLs manually. You do also need to check whether the UUID already exists when creating a user, and if it does, just keep generating until an unused UUID is found - in practice, given the size of the numbers, with Version 4 random UUIDs you have a better chance of winning the lottery than of hitting a collision, so it's not something that will impact performance.
Example URL: http://example.com/page/4586A0F1-2BAD-445F-BFC6-D5667B5A93A9
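In Node, a sketch of generating such an id and retrying on the practically impossible collision could look like this; crypto.randomUUID needs Node 14.17+, and the users collection with a unique _id is an assumption.

const crypto = require('crypto');

// Keep generating v4 UUIDs until the insert succeeds; with random UUIDs
// this loop effectively always runs exactly once.
async function createUser(users, doc) {
  for (;;) {
    try {
      return await users.insertOne({ ...doc, _id: crypto.randomUUID() });
    } catch (err) {
      if (err.code !== 11000) throw err; // 11000 = MongoDB duplicate key error
    }
  }
}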

CouchDB - human readable id

I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future there are no plans to change that. While I can remove most of the cases where a short, auto-incremental-like ID is required (it can be sparse, but not random), there remains one place where users actually need to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand, and so on.
However, in the database it can be stored with whatever ID pleases CouchDB (using the default auto-generated UUID), but it should be possible to identify it by a number as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that appends the new unique ID to the end of that document. To obtain the product's ID I'd then query the array and, client side, use indexOf to get the index as a short ID.
I don't know if this is feasible. From a performance point of view I can say the following: there are more queries that go numerical ID -> uuid than uuid -> numerical ID. There will be at most 7000 new entries a year in the database. Also, there is no use case yet where a product can be deleted, but I'd rather not rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view: it seems to be working. I can do both conversions, number <-> uuid, and it seems to go well. I don't know if this works well with replication and so on, but since there is said array I guess it should, right?
You have two choices here:
Set your human-readable id as the _id field. Basically you can just set it in the create-document call to the DB, and it will be accepted. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients that try to create documents but instead overwrite existing ones.
It can only contain alphanumeric characters or a few special characters. In my experience, allowing other character types is asking for trouble.
It cannot be longer than a theoretical string length limit (CouchDB doesn't define one, but you should). Long ids will increase the size of your views (indexes) quite badly, and might make them slower.
If these things are not a problem for you, then you should go with this solution.
As you said yourself, let the _id be a UUID, and set the human-readable id in another field. To look up a document by the human-readable id, you can just create a view emitting the human-readable id as a key, and then either emit the document as the value or get the document via the include_docs=true option. Whenever the view is queried, CouchDB updates it incrementally and returns the list. This is really the same as creating a document with an array/object of ids inside it, except that with a CouchDB view you get better performance.
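As a sketch, such a view could live in a design document like this; the design document name, the shortId field and the local URLs are assumptions.

// Stored once, e.g. via PUT http://localhost:5984/products/_design/products
const designDoc = {
  _id: '_design/products',
  views: {
    by_short_id: {
      // Emit the human readable id as the key; the document itself is
      // fetched with include_docs=true instead of being emitted as a value.
      map: "function (doc) { if (doc.shortId) { emit(doc.shortId, null); } }"
    }
  }
};

// Query:
// GET http://localhost:5984/products/_design/products/_view/by_short_id?key="1234"&include_docs=true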
This second option might be slightly slower on querying and inserting. If the ids are inserted sequentially it's fine; if not, CouchDB will take slightly more time to insert them at the right place. This doesn't work well with huge volumes of inserts hitting the DB.
Querying shouldn't take more than 10% longer than the first option, and I think 10% is really a high estimate; it will most probably be less than 5%. I remember that in my CouchDB application I switched from reading by _id to reading from a view by key, and the slowdown was so small that, from the user's point of view, it wasn't noticeable even when making 100 queries at the same time.
This is how people query documents by fields other than the id, for example looking up a user document by email when the user is logging in.
If you don't know how CouchDB views work, you should read the views chapter of the CouchDB definitive guide.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4GB per document. I remember having really long query times because the view had to iterate over each array item. In the end I created one document per array item instead. It was way faster.

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer. In terms of whether they can be guessed or not: they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources, ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set the range of valid ObjectIDs should be fairly easy to work out, since they are all prefixed with a timestamp (unless you are manipulating this in some way).
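To make the timestamp prefix concrete, here is a small sketch with the Node.js driver's ObjectId; the id is just the first 24 hex characters of the example from the first question above.

const { ObjectId } = require('mongodb');

// Every ObjectID carries its creation time in the first 4 bytes.
const id = new ObjectId('5612afeedd2387aeeebbdd21');
console.log(id.getTimestamp()); // 2015-10-05T17:14:22.000Z

// Conversely, knowing roughly when a document was created lets an attacker
// construct the lower bound of the candidate id range.
const lowerBound = ObjectId.createFromTime(Date.UTC(2015, 9, 5) / 1000);
console.log(lowerBound); // 5611bd80... with the remaining bytes zero-filled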
Mongo's ObjectIds were never meant to be protection against brute force attacks (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user just because that user shouldn't know its id.
For actual protection of your resources, employ other techniques.
If you are defending against unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny it to everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.

Should I implement auto-incrementing in MongoDB?

I'm making the switch to MongoDB from MySQL. A familiar architecture to me for a very basic users table would have auto-incrementing of the uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in Freenode's #mongodb IRC channel suggested creating a range of IDs and caching them. I'm unsure how to actually implement that, or whether there's another route I can go. I don't necessarily need the _id itself to be incremented this way; as long as all users have a unique numerical uid within the document, I would be happy.
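For reference, the "range of IDs" idea from IRC could be sketched roughly like this; the counters collection, the field names and the 4.x/5.x Node.js driver result shape are assumptions.

// Atomically reserve a block of numeric ids, then hand them out from memory.
async function allocateIdBlock(db, name, blockSize = 100) {
  const result = await db.collection('counters').findOneAndUpdate(
    { _id: name },
    { $inc: { seq: blockSize } },
    { upsert: true, returnDocument: 'after' }
  );
  const end = result.value.seq; // highest id in the reserved block
  return { next: end - blockSize + 1, end };
}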
I strongly disagree with the author of the selected answer that there is "no auto-increment id in MongoDB and there are good reasons". We don't know the reasons why 10gen didn't encourage the use of auto-incremented IDs; that's speculation. I think 10gen made this choice because it's just easier to ensure the uniqueness of 12-byte IDs in a clustered environment. It's a default solution that fits most newcomers and therefore increases product adoption, which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in a commercial environment.
I'm building a social network. We have roughly 6M users and each user has roughly 20 friends.
Now imagine we have a collection which stores the relationships between users (who follows whom). It looks like this:
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have a unique composite index {user_id, followee_id}. We can estimate the size of this index to be 12*2*6M*20 ≈ 2.9GB. Now that's the index for fast look-up of the people I follow. For fast look-up of the people that follow me I need the reverse index. That's another 2.9GB.
And this is just the beginning. I have to carry these IDs everywhere. We have an activity cluster where we store your News Feed, that is, every event you or your friends perform. Imagine how much space that takes.
And finally, one of our engineers made an unconscious decision and decided to store references as strings that represent the ObjectId, which doubles their size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow. Lock contention goes up. Writes get slower as well. Seeing lock contention around 80% is no longer a shock to me.
Before you know it, you end up with a 460GB cluster which you have to split into shards and which is quite hard to manipulate.
Facebook uses a 64-bit long as the user id :) There is a reason for that. You can generate sequential IDs:
using 10gen's advice.
using MySQL as storage for counters (if you are concerned about speed, take a look at HandlerSocket).
using an ID-generating service you built, or something like Snowflake by Twitter.
So here is my general advice to everyone: please, please make your data as small as possible. When you grow, it will save you lots of sleepless nights.
Josh,
There is no auto-increment id in MongoDB, and there are good reasons for that.
I would say go with ObjectIds which are unique in the cluster.
You can add auto-increment by using a sequence collection and findAndModify to get the next id to use. This will definitely add complexity to your application and may also affect your ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post in the dedicated Google group for MongoDB for more info about this question:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks
So, there's a fundamental problem with "auto-increment" IDs: when you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL, this is generally pretty easy because you just have one server accepting writes. But big deployments of MongoDB run sharding, which doesn't have this "central authority".
MongoDB uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness here is that you technically have to do two writes for each insert. This may not scale very well; you probably want to avoid it for data with a high insertion rate. It may work for users, but it probably won't work for tracking clicks.
There is nothing like auto-increment in MongoDB, but you may store your own counters in a dedicated collection and $inc the related counter value as needed. Since $inc is an atomic operation you won't see duplicates.
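A minimal sketch of that counter pattern; the counters collection, the getNextSequence helper and the 4.x/5.x Node.js driver result shape are assumptions.

// Atomically increment and return the next value of a named counter.
async function getNextSequence(db, name) {
  const result = await db.collection('counters').findOneAndUpdate(
    { _id: name },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: 'after' }
  );
  return result.value.seq;
}

// Usage: assign the next numeric uid when creating a user.
const uid = await getNextSequence(db, 'userid');
await db.collection('users').insertOne({ _id: uid, name: 'Sarah C.' });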
The default Mongo ObjectId -- the one used in the _id field -- is incrementing.
Mongo uses a timestamp (seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar to (if not exactly the same as) the composition of a Version 1 UUID. And that ObjectId is generated at insert time (if no other kind of _id is provided by the user/client).
Thus the ObjectId is ordinal in nature; furthermore, the default sort is based on this incrementing timestamp.
One might consider it an updated take on the auto-incrementing (index++) ids used in many DBMSs.
