Can I rely on the assumption that each generated id is random, and there's no way to user can guess? So that I don't need to do more double check of ownership and what not, on my application.
Generally you will only get random uuids if you configure your instance to use the random algorithm. When looking at the source you can see that it then uses Erlang's crypto:rand_bytes/1. I'm not sure how predictable its results are, but you should note that there would also be crypto:strong_rand_bytes/1 - not sure why it hasn't been used.
I think the decision mostly depends on other circumstances.
What information are you trying to secure?
Why do you want to avoid ownership-checks as an additional security measure?
How do you ensure that the _ids are not leaked somehow (eg through CouchDB's /_log API)?
I wouldn't rely on that, no. A malicious user could make a lot of random guesses and could eventually hit something. Best to let identifiers identify things and security restrictions secure them.
Related
Currently, to pass a user id to the server on certain views I use the raw user id.
http://example.com/page/12345 //12345 Being the users id
Although there is no real security risk in my specific application by exposing this data, I can't help but feeling a little dirty about it. What is the proper solution? Should I somehow be disguising the data?
Maybe a better way to propose my question is to ask what the standard approach is. Is it common for applications to use user id's in plain view if it's not a security risk? If it is a security risk how is it handled? I'm just looking for a point in the right direction here.
There's nothing inherently wrong with that. Lots of sites do it. For instance, Stack Overflow users can be enumerated using URLs of the form:
http://stackoverflow.com/users/123456
Using a normalized form of the user's name in the URL, either in conjunction with the ID or as an alternative to it, may be a nicer solution, though, e.g:
http://example.com/user/yourusername
http://example.com/user/12345/yourusername
If you go with the former, you'll need to ensure that the normalized username is set up as a unique key in your user database.
If you go with the latter, you've got a choice: if the normalized username in the database doesn't match the one in the URL, you can either redirect to the correct URL (like Stack Overflow does), or return a 404 error.
In addition to duskwuff's great suggestion to use the username instead of the ID itself, you could use UUIDs instead of integers. They are 128-bit in length so infeasible to enumerate, and also avoid disclosing exactly how many users you have. As an added benefit, your site is future proofed against user id limits if it becomes massively popular.
For example, with integer ids, an attacker could find out the largest user_id on day one, and come back in a week or months time and find what the largest user_id is now. They can continually do this to monitor the rate of growth on your site - perhaps not a biggie for your example - but many organisations consider this sort of information commercially sensitive. Also helps avoid social engineering, e.g. makes it significantly harder for an attacker to email you asking to reset their password "because I've changed email providers and I've forgotten my old password but I remember my user id!". Give an attack an inch and they'll run a mile.
I prefer to use Version/Type 4 (Random) UUIDs, however you could also use Version/Type 5 (SHA-1-based) so you could go UUID.fromName(12345) and get a UUID derived from the integer value, which is useful if you want to migrate existing data and need to update a bunch of foreign key values. Most major languages support UUIDs natively or are included in popular libraries (C & C++), although some database software might require some tweaking - I've used them with postgres and myself and are easy transitions.
The downside is UUIDs are significantly longer and not memorable, but it doesn't sound like you need the ability for the user to type in the URLs manually. You do also need to check if the UUID already exists when creating a user, and if it does, just keep generating until an unused UUID is found - in practice given the size of the numbers, using Version 4 Random UUIDs you will have a better chance at winning the lottery than dealing with a collision, so it's not something that will impact performance etc.
Example URL: http://example.com/page/4586A0F1-2BAD-445F-BFC6-D5667B5A93A9
I'm implementing a referral system, and I'm wondering if I NEED to use a hash as the referrer_id, or if I can just use the user's id? I don't want to dogmatically use hashes for EVERYTHING, so can you give me some examples of potential pitfalls?
It depends on what is important to you. If you need to know for sure that the referrer_id originated from the user who provided the refer, then you will need some sort of hash combined with token that cannot be guessed. If all you do is hash the user id, you aren't providing any real security because an attacker could simply guess a user ID and hash it to make your server happy.
It is important to know that hashing does not produce any unpredictability. It will always produce the same output for the same input. Hashing is valuable because it prevents tampering (a small change in input produces a large change in output) and it normalizes the size of the data, but it should never be confused with encryption.
How much do you care about Confidentiality, Integrity, and Availability of the referrer_id? The answers to those questions will determine how much effort you need to exert to protect the referrer_id.
I've heard a few people say that you should never expose your internal ids to the outside world (for instance an auto_increment'ng primary key).
Some suggest having some sort of uuid column that you use instead for lookups.
I'm wondering really why this would be suggested and if it's truly important.
Using a uuid instead is basically just obfuscating the id. What's the point? The only thing I can think of is that auto_incrementing integers obviously point out the ordering of my db objects. Does it matter if an outside user knows that one thing was created before/after another?
Or is it purely that obfuscating the ids would prevent "guessing" at different operations on specific objects?
Is this even an issue I should thinking about when designing an external facing API?
Great answers, I'll add another reason to why you don't want to expose your internal auto incremented ID.
As a competitive company I can easily instrument how many new users/orders/etc you get every week/day/hour. I just need to create a user and/or order and subtract the new ID from what I got last time.
So not only for security reasons, it's business reasons as well.
Any information that you provide a malicious user about your application and its layout can and will be used against your application. One of the problems we face in (web) application security is that seemingly innocuous design decisions taken at the infancy of a project become achilles heels when the project scales larger. Letting an attacker make informed guesses about the ordering of entities can come back to haunt you in the following, somewhat unrelated ways:
The ID of the entity will inevitably be passed as a parameter at some point in your application. This will result in hackers eventually being able to feed your application arguments they ordinarily should not have access to. I've personally been able to view order details (on a very popular retailer's site) that I had no business viewing, as a URL argument no less. I simply fed the app sequential numbers from my own legitimate order.
Knowing the limits or at least the progression of primary key field values is invaluable fodder for SQL injection attacks, scope of which I can't cover here.
Key values are used not only in RDBMS systems, but other Key-Value mapping systems. Imagine if the JSESSION_ID cookie order could be predetermined or guessed? Everybody with opposable thumbs will be replaying sessions in web apps.
And many more that I'm sure other ppl here will come up with.
SEAL team 6 doesn't necessarily mean there are 6 seal teams. Just keeps the enemy guessing. And the time spent guessing by a potential attacker is more money in your pocket any way you slice it.
As with many security-related issues, it's a subtle answer - kolossus gives a good overview.
It helps to understand how an attacker might go about compromising your API, and how many security breaches occur.
Most security breaches are caused by bugs or oversights, and attackers look for those. An attacker who is trying to compromise your API will firstly try to collect information about it - as it's an API, presumably you publish detailed usage documentation. An attacker will use this document, and try lots of different ways to make your site crash (and thereby expose more information, if he's lucky), or react in ways you didn't anticipate.
You have to assume the attacker has lots of time, and will script their attack to try every single avenue - like a burglar with infinite time, who goes around your house trying every door and window, with a lock pick that learns from every attempt.
So, if your API exposes a method like getUserInfo(userid), and userID is an integer, the attacker will write a script to iterate from 0 upwards to find out how many users you have. They'll try negative numbers, and max(INT) + 1. Your application could leak information in all those cases, and - if the developer forgot to handle certain errors - may expose more data than you intended.
If your API includes logic to restrict access to certain data - e.g. you're allowed to execute getUserInfo for users in your friend list - the attacker may get lucky with some numbers because of a bug or an oversight, and he'll know that the info he is getting relates to a valid user, so they can build up a model of the way your application is designed. It's the equivalent of a burglar knowing that all your locks come from a single manufacturer, so they only need to bring that lock pick.
By itself, this may be of no advantage to the attacker - but it makes their life a tiny bit easier.
Given the effort of using UUIDs or another meaningless identifier, it's probably worth making things harder for the attacker. It's not the most important consideration, of course - it probably doesn't make the top 5 things you should do to protect your API from attackers - but it helps.
If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer, in terms of answering whether they could be guessed or not, they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first, read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set, the ranges of valid ObjectIDs should be fairly easy to work out since they are all prefixed with a timestamp (unless you are manipulating this in some way).
Mongo's ObjectId were never meant to be a protection from brute force attack (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user because this user should not know its id.
For an actual protection of your resources, employ other techniques.
If you defend against an unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny for everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.
Is it recommended to create a column (unique key) that is a hash.
When people view my URL, it is currently like this:
url.com/?id=2134
But, people can look over this and data-mine all the content, right?
Is it RECOMMENDED to go 1 extra step to make this through hash?
url.com?id=3fjsdFNHDNSL
Thanks!
The first and most important step is to use some form of role-based security to ensure that no user can see data they aren't supposed to see. So, for example, if a user should only see their own information, then you should check that the id belongs to the logged-in user before you display it.
As a second level of protection, it's not a bad idea to have a unique key that doesn't let you predict other keys (a hash, as you suggest, or a UUID). However, that still means that, for example, a malicious user who obtained someone else's URL (e.g. by sniffing a network, by reading a log file, by viewing the history in someone's browser) could see that user's information. You need authentication and authorization, not simply obfuscating the IDs.
It sort of depends on your situation, but off hand I think if you think you need to hash you need to hash. If someone could data mine by, say, iterating through:
...
url.com?id=2134
url.com?id=2135
url.com?id=2136
...
Then using a hash for the id is necessary to avoid this, since it will be much harder to figure out the next one. Keep in mind, though, that you don't want to make the hash too obvious, so that a determined attacker would easily figure it out, e.g. just taking the MD5 of 2134 or whatever number you had.
Well, the problem here is that an actual Hash is technically one way. So if you hash the data you won't be able to recover it on the receiving side. Without knowing what technology you are using to create your web page it's hard to make any concrete suggestions, but if you must have sensitive information in your query string then I would recommend that you at least use a symmetric encryption algorithm on it to keep people from simply reading off the values and reverse engineering things.
Of course if you have the option - it's probably better to not have that information in the query string at all.