Hashing SSNs and other limited-domain information - security

I'm currently working on an application where we receive private health information. One of the biggest concerns is with the SSN. Currently, we don't use the SSN for anything, but in the future we'd like to be able to use it to uniquely identify a patient across multiple facilities. The only way I can see to do that reliably is through the SSN. However, we (in addition to our customers) REALLY don't want to store the SSN.
So naturally, I thought of just SHA hashing it since we're just using it for identification. The problem with that is that if an attacker knows the problem domain (an SSN), then they can focus on that domain. So it's much easier to calculate the billion SSNs rather than a virtually unlimited number of passwords. I know I should use a site salt and a per-patient salt, but is there anything else I can do to prevent an attacker from revealing the SSN? Instead of SHA, I was planning on using BCrypt, since Ruby has a good library and it handles scalable complexity and salting automagically.
It's not going to be used as a password. Essentially, we get messages from many facilities, and each describes a patient. The only thing close to a globally unique identifier for a patient is the SSN. We are going to use the hash to identify the same patient at multiple facilities.
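To make the limited-domain concern concrete, here is a minimal sketch of reversing an unsalted SHA-256 of a 9-digit value by plain enumeration. The number is made up and the search range is narrowed so the demo finishes instantly; both are assumptions for illustration only, and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    """Unsalted SHA-256 of an ASCII string, as a hex digest."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def brute_force_ssn(target_hash: str, start: int, stop: int) -> Optional[str]:
    """Try every 9-digit candidate in [start, stop) until one matches target_hash."""
    for n in range(start, stop):
        candidate = f"{n:09d}"  # zero-padded to 9 digits
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Hypothetical leaked hash of a made-up number. The full domain is at most
# 10**9 candidates; only a 100,000-wide slice is searched here.
leaked = sha256_hex("078051120")
print(brute_force_ssn(leaked, 78_000_000, 78_100_000))
```

A salt or a slower hash raises the cost per guess, but the number of guesses needed stays bounded by the size of the domain.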

The algorithm for generating Social Security Numbers was created before the concept of a modern hacker, and as a consequence they are extremely predictable. Using an SSN for authentication is a very bad idea; it really doesn't matter what cryptographic primitive you use or how large your salt value is. At the end of the day the "secret" that you are trying to protect doesn't have much entropy.
If you never need to know the plain text then you should use SHA-256. SHA-256 is a very good function to use for passwords.

If you seriously want to hash a social security number in a secure way, do this:
Find out how much entropy is in an SSN (hint: there is very little. Far less than a randomly chosen 9 digit number).
Use any hashing algorithm.
Keep fewer (half?) bits than there is entropy in an SSN.
Result:
Pro: Secure hash of an SSN because of a large number of hash collisions.
Pro: Your hashes are short and easy to store.
Con: Hash collisions.
Con: You can't use it for a unique identifier because of Con#1.
Pro: That's good because you really really need to not be using SSNs as identifiers unless you are the Social Security Administration.
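The truncation idea above could be sketched as follows. The choice of SHA-256 and of 15 bits is an assumption: a 9-digit number has at most about 30 bits of entropy (log2 of 10^9), so 15 is roughly "half". The function name is hypothetical.

```python
import hashlib


def truncated_hash(ssn: str, bits: int = 15) -> int:
    """Hash the SSN and keep only the top `bits` bits of the digest."""
    digest = hashlib.sha256(ssn.encode("ascii")).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits)  # discard everything but the leading bits


buckets = 2 ** 15
print(truncated_hash("078051120"))  # made-up number, for illustration
print(10 ** 9 / buckets)            # average count of 9-digit strings per bucket
```

With this many candidates sharing each bucket, a stored value does not pin down one SSN, which is also why it cannot act as a unique identifier.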

First, much applause and praise for storing a hash of the SSN.
It appears as if you're reserving the SSNs as a sort of 'backup username.' In this case, you need another form of authentication besides the username - a password, a driver's license number, a passport number, proof of residence, etcetera.
Additionally, if you're concerned that an attacker is going to predict the top 10,000 SSNs for a patient born in 1984 in Arizona, and attempt each of them, then you can put in an exponentially increasing rate limiter in your application.* For additional defense, build in a notification system that alerts a sys-admin when it appears that there is an unusually high number of failed login attempts.**
*Example exponentially increasing rate limiter:
After each failed request, delay the next request by (1.1^N) seconds, where N is the number of failed requests from that IP. Track IP and failed login attempts in a DB table; shouldn't add too much load, depending on the audience of your application (do you work for Google?).
**In the case where an attacker has access to multiple IPs, the notification will alert a sys-admin who can use his or her judgment to see if you have an influx of stupid users or it's a malicious attempt.
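A minimal sketch of the 1.1^N delay described above. The in-memory dictionary stands in for the DB table, and resetting the counter on success is an assumption not stated in the answer; the function names are hypothetical.

```python
from collections import defaultdict

BASE = 1.1  # growth factor from the formula above

# In-memory stand-in for the DB table of failed attempts per IP.
failed_attempts = defaultdict(int)


def record_failure(ip: str) -> float:
    """Count one more failed request from `ip`; return the delay in seconds before the next one."""
    failed_attempts[ip] += 1
    return BASE ** failed_attempts[ip]


def record_success(ip: str) -> None:
    """Assumed policy: a successful request clears the counter for that IP."""
    failed_attempts.pop(ip, None)


for _ in range(3):
    print(round(record_failure("203.0.113.5"), 3))
```

The delay stays small for a handful of mistakes and grows quickly for an automated guesser.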

Related

Full text search on encrypted data

Suppose I have a server storing encrypted text (end-to-end: server never sees plain text).
I want to be able to do full text search on that text.
I know this is tricky, but my idea is to use the traditional full text design ("list" and "match" tables where words are stored and matched with ids from the content table). When users submit the encrypted text, they also send a salted MD5 of the words and respective matches. The salt used is unique for each user and is recovered from their password.
(in short: the only difference is that the "list" table will contain hashed words)
Now, how vulnerable would this system be?
Note that I said "how vulnerable" instead of "how safe", because I acknowledge that it can't be totally safe.
I DO understand the tradeoff between features (full text search) and security (disclosing some information from the word index). For example, I understand that an attacker able to get the list and match tables could get information about the original, encrypted text and possibly be able to decipher some words with statistical analysis (however, being the salt unique for each user, this would need to be repeated for each user).
How serious would this threat be? And would there be any other serious threats?
DISCLAIMER
What I'm trying to build (and with the help of a cryptographer for actual implementation; right now I'm just trying to understand whether this will be possible) is a consumer-grade product which will deal with confidential yet not totally secret data.
My goal is just to provide something safe enough, so that it would be easier for an attacker to try stealing users' passwords (e.g. breaking into clients - they're consumers, after all) rather than spending a huge amount of time and computing power trying to brute force the index or run complicated statistical analysis.
Comments in response to @Matthew
(may be relevant for anyone else answering)
As you noted, other solutions are not viable. Storing all the data inside the client means that users cannot access their data from other clients. Server-side encryption would work, but then we won't be able to give users the added security of client-side encryption. The only "true alternative" is just to not implement search: while this is not a required feature, it's very important to me/us.
The salt will be protected in exactly the same way as the users' decryption key (the one used to decrypt stored texts). Thus, if someone was able to capture the salt, he or she would likely also be able to capture the key, creating a much bigger issue. To be precise, the key and the salt will be stored encrypted on the server. They will be decrypted by the client locally with the user's password and kept in memory; the server never sees the decrypted key and salt. Users can change passwords, then, and they just need to re-encrypt the key and the salt, and not all stored texts. This is a pretty standard approach in the industry, to my knowledge.
Actually, the design of the database will be as follows (reporting relevant entries only). This design is like you proposed in your comment. It disallows proximity searches (not very relevant to us) and makes frequency less accurate.
Table content, containing all encrypted texts. Columns are content.id and content.text.
Table words, containing the list of all hashes. Columns are words.id and words.hash.
Table match, that matches texts with hashes/words (in a one-to-many relationship). Columns are match.content_id and match.word_id.
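On the client, producing the rows for words and match from one message might look roughly like this. It follows the salted MD5 mentioned in the question; the tokenizer, the lowercasing, and prepending the salt are assumptions, and both function names are hypothetical.

```python
import hashlib
import re


def word_hashes(plaintext: str, user_salt: bytes) -> set:
    """Return one salted hash per distinct word, so repeats within a text are not reported."""
    words = set(re.findall(r"\w+", plaintext.lower()))
    return {hashlib.md5(user_salt + w.encode("utf-8")).hexdigest() for w in words}


def search_token(term: str, user_salt: bytes) -> str:
    """Hash a search term the same way, so the server can match it against words.hash."""
    return hashlib.md5(user_salt + term.lower().encode("utf-8")).hexdigest()


salt = b"per-user-salt"  # placeholder; in the design above it is decrypted client-side
index = word_hashes("the cat and the hat", salt)
print(len(index))                         # distinct words only
print(search_token("Cat", salt) in index)
```

The server only ever sees the hex digests, but identical words from the same user always produce identical digests, which is what the frequency-analysis concern is about.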
We would have to implement features like removing stopwords etc. Sure. That is not a big issue (will, of course, be done on the client). Eventually, those lists have always been of limited utility for international (i.e. non English-speaking) users.
We expect the lookup/insert ratio to be pretty high (i.e. many lookups, but rare inserts and mostly in bulk).
Decrypting the whole hash database is certainly possible, but requires a brute force attack.
Suppose the salt is kept safe (as per point 2 above). If the salt is long enough (you cited 32 bits... but why not 320? - just an example) that would take A LOT of time.
To conclude... You confirmed my doubts about the possible risk of frequency analysis. However, I feel like this risk is not so high. Can you confirm that?
Indeed, first of all the salt would be unique for each user. This means that users must be attacked one at a time.
Second, by reporting words only once per text (no matter how many times they appear), frequency analysis becomes less reliable.
Third... Frequency analysis on hashed words doesn't sound as effective as frequency analysis on a Caesar shift, for example. There are 250,000 words in English alone (and, again, not all our users will be English-speaking), and even if some words are more common than others, I believe it'd be hard to do this attack anyway.
PS: The data we'll be storing is messages, like instant messages. These are short, contain a lot of abbreviations, slang, etc. And every person has a different style in writing texts, further reducing the risk (in my opinion) of frequency attacks.
TL;DR: If this needs to be secure enough that it requires per-user end-to-end encryption: Don't do it.
Too long for a comment, so here goes - if I understand correctly:
You have encrypted data submitted by the user (client side encrypted, so not using the DB to handle).
You want this to be searchable to the user (without you knowing anything about this - so an encrypted block of text is useless).
Your proposed solution to this is to also store a list (or perhaps paragraph) of hashed words submitted from the client as well.
So the data record would look like:
Column 1: Encrypted data block
Column 2: (space) delimited hashed, ordered, individual words from the above encrypted text
Then to search you just hash the search terms and treat the hashed terms as words to search the paragraph(s) of "text" in column 2. This will definitely work - just consider searching nonsense text with nonsense search terms. You would even still be able to do some proximity ranking of terms with this approach.
Concerns:
The column with the individually hashed words as text will be incredibly weak in comparison to the encrypted text. You are greatly weakening your solution: not only are there limited words to work from, the resultant text will also be susceptible to word frequency analysis, etc.
If you do this: separately store a salt unrelated to the password. Given that a rainbow table will be easy to create if your salt is captured (dictionary words only) store it encrypted somewhere.
You will lose many benefits of FTS like ignoring words like 'the' - you will need to re-implement this functionality on your own if you want it (i.e. remove these terms on the client side before submitting the data / search terms).
Other approaches that you imply are not acceptable/workable:
Implement searching client side (all data has to exist on the client to search)
Centralized encryption leveraging the databases built in functionality
I understand the argument being that your approach provides the user with the only access to their data (i.e. you cannot see/decrypt it). I would argue that this hashed approach weakens the data sufficiently that you could reasonably work out a user's data (that is, you have lowered the effort required to the point that it is very plausible you could decrypt a user's information without any knowledge of their keys/salts). I wouldn't quite lower the bar to describe this as just obfuscation, but you should really think through how significant this is.
If you are sure that weakening your system to implement searching like this makes sense, and that another approach is not sufficient, one thing that could help is to store the hashes of words in the text as a list of uniquely occurring words only (i.e. no frequency or proximity information would be available). This would reduce the attack surface area of your implementation a little, but would also lose the benefits you are implying you want by describing the approach as FTS. You could get very fast results like this though, as the hashed words essentially become tags attached to all the records that include them. The search lookup then could become very fast (at the expense of your inserts).
*Just to be clear - I would want to be REALLY sure my business needs demanded something like this before I implemented it...
EDIT:
Quick example of the issues - say I know you are using 32-bit salts and are hashing common words like "the". 2^32 possible salts = about 4 billion possible salts (that is, not that many if you only need to hash a handful of words for the initial attack). Assuming the salt is either appended or prepended, this is still only about 8 billion entries to pre-calculate. Even if it is less common words you do not need to create too many lists to ensure you will get hits (if this is not the case your data would not be worth searching).
Then look up the highest-frequency hashed words from a given block of text in each of our pre-calculated salt tables and use the match to see if the candidate salt correctly decrypts other words in the text. Once you have a plausible candidate, generate the 250,000 word English language rainbow table for that salt and decrypt the text.
I would guess you could decrypt the hashed data in the system in hours to days with access to the database.
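A toy version of this attack, scaled down so it runs instantly: the salt is 2 bytes (65,536 values) rather than 32 bits, is assumed to be prepended, and MD5 is used as in the question. All names and the sample index are hypothetical.

```python
import hashlib
from typing import Optional


def salted_md5(salt: int, word: str, salt_bytes: int) -> str:
    """MD5 of a fixed-width big-endian salt followed by the word."""
    return hashlib.md5(salt.to_bytes(salt_bytes, "big") + word.encode("utf-8")).hexdigest()


def recover_salt(index_hashes: set, known_word: str, salt_bytes: int) -> Optional[int]:
    """Try every salt until hashing a guessed common word hits something in the index."""
    for salt in range(256 ** salt_bytes):
        if salted_md5(salt, known_word, salt_bytes) in index_hashes:
            return salt
    return None


# Toy victim index built with a salt the attacker does not know.
secret_salt = 48879
index = {salted_md5(secret_salt, w, 2) for w in ["the", "meeting", "is", "at", "noon"]}

print(recover_salt(index, "the", 2) == secret_salt)
```

Once the salt is known, hashing a dictionary under it maps the remaining index entries back to words. A longer salt increases the work in the first loop, which is the trade-off the asker raises with "why not 320?".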
First, you have all of the normal vulnerabilities of password-based cryptography, which stem from users picking predictable passwords. It is common to crack more than 50% of passwords from real-world applications in offline attacks with less than two hours of desktop computing time.
I assume the full text encryption key is derived from the password, or is encrypted by a password-derived key. So an attacker can test guesses against a selection of hashed index keys, and once she finds the password, decrypt all of the documents.
But, even if a user picks a high-entropy password, frequency analysis on the index could potentially reveal a lot about the plain text. Although word order is lost in indexing (if you don't support proximity searches), you are essentially creating an electronic code book for each user. This index would be vulnerable to centuries of well-developed cryptanalytical techniques. Modern encryption protocols avoid ECB, and provide "ciphertext indistinguishability"—the same plain text yields different cipher text each time it's encrypted. But that doesn't work with indexes.
A less vulnerable approach would be to index and search on the client. The necessary tables would be bundled as a single message and encrypted on the client, then transported to the server for storage. The obvious tradeoff is the cost of transmission of that bundle on each session. Client-side caching of index fragments could mitigate this cost somewhat.
In the end, only you can weigh the security cost of a breach against the performance costs of client-side indexing. But the statistical analysis enabled by an index is a significant vulnerability.
MSSQL Enterprise TDE encrypts the Full-Text index as well as other indices when you set whole-database encryption (since 2008). In practice, it works pretty well, without a huge performance penalty. Can't comment on how, because it's a proprietary algorithm, but here are the docs:
https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption-tde
It doesn't cover any of your application stack besides your DB, but your FTS indices will work like normal and won't exist in plain text like they do in MySQL or PostgreSQL. MariaDB and of course Oracle have their own implementations as well, from what I remember. MySQL and PostgreSQL do not.
As for passwords, TDE in all the implementations uses AES keys, which can be rotated (though not always easily) - so the password vulnerability falls on the DBAs.
The problem is you need to pay for full enterprise licensing for MSSQL TDE (i.e. features not available in "standard" or "basic" cloud and on-premise editions), and you probably do for TDE in Oracle as well. But if what you need is a quick solution and you have the cash for enterprise licensing (probably cheaper than developing your own implementation), implementations are out there.

Does adding a constant string to the user's password before hashing it make it more secure?

Does adding a constant string that is stored in the code to the password before hashing make it harder for an attacker to figure out the original password?
This constant string is in addition to a salt. So, Hash(password + "string in code added to every password" + randomSaltForEachPassword)
Normally, if an attacker gets their hands on the database, they can possibly figure out someone's password by brute force. The database contains the salts corresponding to each password, so they would know what to salt their brute force attempts with. But, with the constant string in code, the attacker would also have to obtain the source code to know what to append to each of their brute force attempts.
I think it would be more secure, but I wanted to get other people's thoughts, and also make sure I'm not inadvertently making it less secure.
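For reference, the scheme being asked about could be sketched like this. The question only says "Hash", so SHA-256 is an assumption here, as are the function names; the constant is the placeholder string from the formula above.

```python
import hashlib
import hmac
import os

# Constant compiled into the application, identical for every password.
PEPPER = b"string in code added to every password"


def hash_password(password: str, salt: bytes) -> str:
    """Hash(password + constant string + per-password random salt)."""
    return hashlib.sha256(password.encode("utf-8") + PEPPER + salt).hexdigest()


def verify(password: str, salt: bytes, stored: str) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt), stored)


salt = os.urandom(16)  # stored in the database next to the hash
stored = hash_password("correct horse", salt)
print(verify("correct horse", salt, stored), verify("wrong guess", salt, stored))
```

An attacker holding only the database is missing PEPPER; an attacker who also has the code is back to an ordinary salted hash.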
Given that you already have a random salt, appending some other string neither adds nor detracts from the security level.
Basically, it's just a waste of time.
update
This was getting a little long to use the comments.
First off, if the attacker has the database and the only thing you've encrypted is the password then it's game over anyhow. They have the data, which is the truly important part.
Second, the salt means that they have to create a larger rainbow table to encompass the larger password length possibilities. The time this takes becomes impractical depending on salt length and the resources available to the cracker. See this question for a bit more info:
How to implement password protection for individual files?
update 2
;)
It is true that users reuse passwords (as some of the latest hacked sites reveal) and it's good that you want to prevent your data loss from impacting them. However, once you finish reading this update you'll see why that's not entirely possible.
The other questions will have to be taken together. The entire purpose of a salt is to ensure that two identical passwords result in different hash values. Each salt value would require a rainbow table to be created encompassing all of the password hash possibilities.
Therefore not using a salt value means that a single global rainbow table can be referenced. It also means that if you use just one salt value for all passwords on the site, then, again, they can create a single rainbow table and grab all of the passwords at once.
However, when each password has a separate salt value this means they have to create a rainbow table for each salt value. Rainbow tables take time and resources to build. One thing that can help limit the time it takes to create a table is knowing the password length restrictions. For example, if your passwords must be between 7 and 9 characters then the hacker only has to compute hash values in that range.
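As a rough illustration of the length-restriction point, the number of candidates can be counted directly. The 62-character alphabet (letters and digits) is an assumption, and the function name is hypothetical.

```python
def keyspace(alphabet_size: int, min_len: int, max_len: int) -> int:
    """Number of distinct strings with a length between min_len and max_len inclusive."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


restricted = keyspace(62, 7, 9)    # passwords known to be 7 to 9 characters
unrestricted = keyspace(62, 1, 9)  # anything up to 9 characters

print(restricted)
print(unrestricted - restricted)   # candidates the attacker gets to skip
```

Because the longest allowed length dominates the total, the upper bound on length matters far more to the attacker than the lower bound.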
Now the salt value has to be available to the function that is going to hash a password attempt. Generally speaking you could hide this value elsewhere; but quite frankly if they've stolen the database then they'll be able to track it down pretty easily. So, placing the values next to the actual password has zero impact on security.
Adding an extra bit of characters that is common to ALL passwords adds nothing to the mix. Once a hacker cracks the first one it will be obvious that the others have this value and they can code their rainbow table generator accordingly. Meaning that it essentially saves no time. Further, it leads to a false sense of security on your part which can lead to you making bad choices.
Which leads us back to the purpose of salting passwords. The purpose is not to make it impossible, as anyone with time and resources can crack them. The purpose is to make it difficult and time consuming. The time consuming part is to allow you the time to detect the break in, notify everyone you have to, and enforce password changes in your system.
In other words, once the database is lost then all users should be notified so that they can take the appropriate action of changing their passwords on yours and other systems. The salt is just buying you and them time to do this.
The reason I mentioned "impractical" before with regards to cracking them is that the question is really one of the hacker determining the value of the passwords versus the cost in cracking them. Using reasonable salt values you can drive the computational costs up enough that very few hackers would bother. They tend to be low hanging fruit kind of people; unless you have a reason to be a target. At which point you should look into other forms of authentication.
This only helps if your threat model includes a situation in which your attacker somehow obtains your password database, but cannot read the secret key stored in your code. For most, this isn't a terribly likely scenario, so it's not worth catering for.
Even in that limited case, it doesn't gain you a great deal of additional security, as the attacker can simply take their own password, and iterate over all possible secret key values. Once they find the right one (because it hashes their own password correctly), they can use that to attack all the other passwords in the database as they would normally.
If you're concerned about storing passwords securely, you should use a standard scheme like PBKDF2, which uses key stretching to make brute forcing much less practical.
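A minimal PBKDF2 sketch using Python's standard library. The iteration count, salt length, and function names are assumptions for illustration, not recommendations taken from this answer.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune to your hardware


def derive(password: str, salt: bytes, iterations: int = ITERATIONS) -> bytes:
    """Stretch the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive and compare in constant time."""
    return hmac.compare_digest(derive(password, salt), expected)


salt = os.urandom(16)  # random per-user salt, stored alongside the derived key
stored = derive("correct horse", salt)
print(verify("correct horse", salt, stored), verify("wrong guess", salt, stored))
```

Every guess an attacker makes costs the same number of iterations, which is where the slowdown comes from.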

Increasing security of web-based login

Right now my login system is the following:
Password must be at least 8 characters long, and contain at least one upper and lowercase letter, a number and a symbol.
Password can't contain the username as its substring.
Username, salted+hashed (using SHA2) password stored on db.
The nonce (salt) is unique for each user and stored as plaintext along with the username and password.
The whole login process can only be made over TLS
How would you rank the effectiveness of the following measures to increase security?
Increase password length
Force the user to change the password every X period of time, and the new password can't be any of the last Y previous passwords
Increase nonce size from 32 bytes to 64 bytes (removed for uselessness)
Encrypt the salt using AES, with the key available only to the application doing authentication
Rehash the password multiple times
Use a salt that's a combination of a longer, application-wide salt + unique user salt on the db.
I am not very fond of 1 and 2 because it can inconvenience the user though.
4 and 6 of course are only effective when an attacker has compromised the db (eg: via SQL injection) but not the filesystem where the application is in.
The answers may depend somewhat on the nature of the website, its users and attackers. For instance, is it the kind of site where crackers might target specific accounts (perhaps "admin" accounts) or is it one where they'd just want to get as many accounts as possible? How technical are the users and how motivated are they to keep their own account secure? Without knowing the answers, I'll assume they're not technical and not motivated.
Measures that might make a difference
5) Rehash the password multiple times. This can slow down all brute force attacks significantly - hash 1000 times and brute force attacks become 1000 times slower.
4) Encrypt the salt using AES, with the key available only to the application doing authentication. How would you make it available only to the application? It has to be stored somewhere and, chances are, if the app is compromised the attacker can get it. There might be some attacks directly against the DB where this makes a difference, so I wouldn't call this useless, but it's probably not worthwhile. If you do make the effort, you might as well encrypt the password itself and any other sensitive data in the DB.
6) Use a salt that's a combination of a longer, application-wide salt + unique user salt on the db. If you're only concerned about the password then yes, this would be a better way of achieving the same result as 4) and, of course, it's very easy to implement.
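Measure 5 (rehashing) could be sketched naively as below. SHA-256 stands in for the "SHA2" mentioned in the question, and mixing the salt into every round is an assumption; the function name is hypothetical.

```python
import hashlib


def stretched_hash(password: str, salt: bytes, rounds: int = 1000) -> bytes:
    """Hash once, then feed the digest back through the hash rounds - 1 more times."""
    digest = hashlib.sha256(salt + password.encode("utf-8")).digest()
    for _ in range(rounds - 1):
        digest = hashlib.sha256(salt + digest).digest()
    return digest


print(stretched_hash("hunter2", b"per-user-salt").hex())
```

The round count has to be fixed (or stored per user), since login must repeat exactly the same number of rounds to reproduce the stored value.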
Ineffective measures
3) Increase nonce size from 32 bytes to 64 bytes. Computing rainbow tables is already completely impractical with any salt, so this would only make a difference if the salt was not known to the attacker. However, if they can get the hashed password they could also get the salt.
Ineffective and annoying measures
1) Increase password length. Increasing password length beyond 8 won't make a practical difference to the brute force time.
2) Force the user to change the password. I agree, this will always be worked around. In fact, it may make the site less secure, because people will write down the password somewhere!
Increasing password length adds a few bits of entropy to the password.
Requiring frequent password changes will generally force the users to use less secure passwords. They will need to figure out what the password is in May, June, July. Some#05x, Some#06x, Some#07x.
Can't say for sure, but I would expect the password length to be more significant in your case.
Slightly more secure. But if someone gains access to your data, they can likely gain access to the key.
Other than increasing CPU costs, you won't gain anything.
There are a number of well tried one-way password encryption algorithms which are quite secure. I would use one of them rather than inventing my own. Your original items 1, 2, and 5 are all good. I would drop 3, and 4.
You could allow pass phrases to ease password length issues.
I would suggest that you read http://research.microsoft.com/en-us/um/people/cormac/papers/2009/SoLongAndNoThanks.pdf
This paper discusses part of the reason it is hard to get users to follow good security advice; in short, the costs lie with the users and they experience little or no benefit.
Increasing the password length and forcing more complex passwords can reduce security by leading to one or both of: reused passwords between sites/applications and writing down of passwords.
3 Increase nonce size from 32 bytes to 64 bytes
4 Encrypt the salt using AES, with the key available only to the application doing authentication
5 Rehash the password multiple times
These steps only affect situations where the password file (DB columns) is stolen and visible to the attacker. The nonce only defeats pre-hashing (rainbow tables), but that's still a good thing and should be kept.
(Again, under the assumption you're trying to minimize the impact of a compromised DB.) Encrypting the nonce means the attacker has an extra brute-force step, but you didn't say where the encryption key for the nonce is stored. It's probably safe to assume that if the DB is compromised the nonce will be plaintext or trivially decrypted. So, the attacker's effort is once again a brute-force against each hash.
Rehashing just makes a brute-force attack take longer, but possibly not much more so depending on your assumptions about the potential attacker's cracks/second.
Regardless of your password composition requirements a user can still make a "more guessable" password like "P#ssw0rd" that adheres to the rule. So, brute force is likely to succeed for some population of users in any case. (By which I mean to highlight taking steps to prevent disclosure of the passwords.)
The password storage scheme sounds pretty good in terms of defense against disclosure. I would make sure other parts of the auth process are also secure (rate limiting login attempts, password expiration, SQL injection countermeasures, hard-to-predict session tokens, etc.) rather than over-engineering this part.
For existing:
e1: I see where you're coming from, but these rules are pretty harsh - it certainly increases security, but at the expense of user experience. As vulkanino mentions this is going to deter some users (depends on your audience - if this is an intranet application they have no choice... but they'll have a yellow sticky with their password on their monitor - cleaners and office loiterers are going to be your biggest issue).
e2: Is a start, but you should probably check against a list of bad passwords (eg: 'password', 'qwerty', the site URL)... there are several lists on the net to help you with this. Given your e1 conditions such a scan might be moot - but then surely users aren't going to have a username with 8 chars, upper+lower, a symbol and a number?
e3: Good call - prevent rainbow attacks.
e4: Unique salt prevents identification of multiple users with the same password, but there are other ways to make it unique - by using the username as a secondary salt+hash for example.
e5: Solid, although TLS has built in fall-backs, the lower end TLS protocols aren't very secure so you may want to check you're not allowing these connections.
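The e1 and e2 checks, plus the bad-password list suggested under e2, could be combined roughly as follows. The three-entry list is a placeholder, "symbol" is taken to mean ASCII punctuation, and the username test is case-insensitive; all three are assumptions, and the function name is hypothetical.

```python
import string

# Placeholder; a real deployment would load one of the published lists.
BAD_PASSWORDS = {"password", "qwerty", "letmein"}


def password_problems(username: str, password: str) -> list:
    """Return a list of rule violations; an empty list means the password is accepted."""
    problems = []
    if len(password) < 8:
        problems.append("shorter than 8 characters")
    if not any(c.islower() for c in password):
        problems.append("no lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("no number")
    if not any(c in string.punctuation for c in password):
        problems.append("no symbol")
    if username and username.lower() in password.lower():
        problems.append("contains the username")
    if password.lower() in BAD_PASSWORDS:
        problems.append("on the bad-password list")
    return problems


print(password_problems("alice", "Alice#2024x"))
```

Returning every violation at once, rather than stopping at the first, spares the user from fixing one rule per attempt.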
New ideas:
n1+n2: e1 is already painful enough.
n3: No discernible benefit
n4: No discernible benefit - whatever the encryption process is would be available in the code, and so also likely compromised. That is unless your DB and App servers are two different machines hardened for their own tasks - in this case anything you can avoid storing with the password is helpful in the event the DB is compromised (in this case dropping unique salt from the database will help).
n5: Rehashing decreases brute force attack speed through your application - a worthwhile idea in many ways (within reason - a user won't notice a quarter second login delay, but will notice a 5 second delay... note this is also a moving target as hardware gets better/faster/stronger/work it)
Further points:
Your security is only as good as the systems it is stored on and processed through. Any system that could be compromised, or already has a back door (think: number of users who can access the system - server admins, DBAs, coders, etc) is a weak link.
Attack detection scripts in your application could be beneficial - but you should be aware of Denial of Service (DoS) attacks. Tracking failed logins and source is a good start - but be aware that if you lock the account at 5 failures, someone could DoS a known account (including the admin account). Being unable to use the App may be as bad as losing control of your account. Multi-hash (n5) slows down this process, picking a slower hash algorithm is a good step too, and perhaps building in re-attempt delays will help too (1 second on first fail, 2 on second, etc.) - but again, be DoS aware. Two basic things you might want to filter: (1) multi attacks from the same source/IP (slow down, eventually prevent access from that IP - but only temporarily since it could be a legitimate user), perhaps further testing for multiple sets of multi attacks. (2) Multi attacks from different IPs - the first approach only locks a single user/source, but if someone uses a bot-net or an anonymizing service you'll need to look for another type of suspicious activity.
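Filter (1) above could be sketched as a per-source tracker whose block expires on its own. The threshold, the window length, and the class name are assumptions; keying on the source rather than the account is what keeps a stranger from locking out a known account.

```python
import time

MAX_FAILURES = 5      # assumed threshold
WINDOW_SECONDS = 300  # assumed window; failures older than this are forgotten


class FailureTracker:
    """Tracks failed logins per source (e.g. IP) and throttles temporarily."""

    def __init__(self):
        self._failures = {}  # source -> list of failure timestamps

    def record_failure(self, source, now=None):
        now = time.monotonic() if now is None else now
        self._failures.setdefault(source, []).append(now)

    def is_throttled(self, source, now=None):
        now = time.monotonic() if now is None else now
        # Keep only failures inside the window, so the block lifts by itself.
        recent = [t for t in self._failures.get(source, []) if now - t < WINDOW_SECONDS]
        self._failures[source] = recent
        return len(recent) >= MAX_FAILURES


tracker = FailureTracker()
for second in range(5):
    tracker.record_failure("203.0.113.5", now=second)
print(tracker.is_throttled("203.0.113.5", now=5), tracker.is_throttled("203.0.113.5", now=400))
```

This does nothing for filter (2); attempts spread across many sources each stay under the threshold and need a different signal.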
Is it possible to piggy-back off another system? You could use an LDAP or Active Directory server in your domain, or use OpenID or OAuth or something similar. Save yourself all these headaches by offloading the work ;) {Inter-system security still needs to be addressed if you're a middle man} Plus, the fewer passwords users have to remember (and rules attached to each), the more likely they are to have a good password that isn't written down, etc.
I don't consider any of those things to increase your password security. The security of the password stored in the database is only relevant if you expect someone to obtain a copy of the database. Using a (perceived) stronger hash function in the database only obfuscates your application. In fact a salted MD5 would be fine (I am aware of the attacks on MD5, and don't believe any of them to be relevant to password hashing).
You might be better off relaxing the password rules to get better security: if you require at least one upper and one lower case LATIN letter, you effectively force non-Latin keyboard users to use alien letters (try typing upper and lower case Latin letters on a Cyrillic keyboard). This makes them more likely to write it down.
The best security would be to avoid using passwords in their entirety. If it is an enterprise application in a corporate that uses Active Directory, consider delegating authentication instead of writing your own. Other approaches can include using an Information Card by making your application claims-aware.
How about hashing the password in the client browser with MD5/SHA, then treating the hash as the user's password on the server side? This way the password isn't in plain text when it travels over SSL/TLS, nor is it ever in plain text on the server. Thus even if it is stolen by attackers at any point (man-in-the-middle, server/db hacks), it cannot be used to gain access to other web services where the user might have the same email/username+password combo (yes, it's very common...)
It doesn't help with YOUR site login security directly, but it certainly would stop hacked password lists spreading around the net if some server has been hacked. It might work to your advantage as well: if another, hacked site applies the same approach, your site's users aren't compromised.
It also guarantees all users will have decent alphanumeric password with enough length and complexity, you can perhaps then relax your requirements for password strength a little :-)

Ultimate Hash Protection - Discussion of Concepts

Ok, so the whole problem with hashes is that users don't enter passwords over 15 characters long. Most only use 4-8 characters making them easy for attackers to crack with a rainbow table.
Solution, use a user salt to make the hash input more complex and over 50 chars so that they will never be able to generate a table (way too big for strings that size). Plus, they will have to create a new table for each user. Problem: if they download the db they will get the user salt, so you are back to square one if they care enough.
Solution, use a site "pepper" plus the user salt, then even if they get the DB they will still have to know the config file. Problem: if they can get into your DB chances are they might also get into your filesystem and discover your site pepper.
So, with all of this known - let's assume that an attacker makes it into your site and gets everything, EVERYTHING. So what do you do now?
At this point in the discussion, most people reply with "who cares at this point?". But that is just a cheap way of saying "I don't know what to do next so it can't be that important". Sadly, everywhere else I have asked this question that has been the reply. Which shows that most programmers miss a very important point.
Let's imagine that your site is like the other 95% of sites out there and the user data - or even full server access - isn't worth squat. The attacker happens to be after one of your users, "Bob", because he knows that Bob uses the same password on your site as he does on his bank's site. He also happens to know Bob has his life savings in there. Now, if the attacker can just crack our site's hashes, the rest will be a piece of cake.
So here is my question - How do you extend the length of the password without any traceable path? Or how do you make the hashing process too complex to duplicate in a timely manner? The only thing that I have come up with is that you can re-hash a hash several thousand times and increase the time it would take to create the final rainbowtable by a factor of 1,000. This is because the attacker must follow that same path when creating his tables.
Any other ideas?
"Solution, use a user salt to make the hash input more complex and over 50 chars so that they will never be able to generate a table (way too big for strings that size). Plus, they will have to create a new table for each user. Problem: if they download the db they will get the user salt, so you are back to square one if they care enough."
This reasoning is fallacious.
A rainbow table (which is a specific implementation of the general dictionary attack) trades space for time. However, generating a dictionary (rainbow or otherwise) takes a lot of time. It is only worthwhile when it can be used against multiple hashes. Salt prevents this. The salt does not need to be secret, it just needs to be unpredictable for a given password. This makes the chance of an attacker having a dictionary generated for that particular salt negligibly small.
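To see why a non-secret salt is enough, here is a small sketch. It uses a plain dictionary of precomputed SHA-256 hashes as a stand-in for a real rainbow table (which uses hash chains to save space). The candidate list and helper names are made up for illustration.

```python
import hashlib
import os

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Precomputed once by the attacker and reusable against every unsalted hash.
CANDIDATES = ["123456", "password", "letmein", "secret"]
PRECOMPUTED = {sha256_hex(p.encode("utf-8")): p for p in CANDIDATES}

def lookup(stored_hash: str):
    """Return the password for a stored hash if the table contains it, else None."""
    return PRECOMPUTED.get(stored_hash)

salt = os.urandom(16)  # stored next to the hash; it does not need to be secret
unsalted_hash = sha256_hex(b"letmein")
salted_hash = sha256_hex(salt + b"letmein")

print(lookup(unsalted_hash))  # the precomputed table finds the password
print(lookup(salted_hash))    # the same table is useless against the salted hash
```

The attacker can still guess passwords against the salted hash, but the work has to be redone for each salt instead of being shared across all users.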
"The only thing that I have come up with is that you can re-hash a hash several thousand times and increase the time it would take to create the final rainbowtable by a factor of 1,000."
Isn't that exactly what the Blowfish-based BCrypt hash is about? Increasing the time it takes to compute a hash so that brute force cracking (and rainbow table creation) becomes infeasible?
"We present two algorithms with adaptable cost (...)"
More about adaptable cost hashing algorithms: http://www.usenix.org/events/usenix99/provos.html
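bcrypt needs a third-party library, so the sketch below swaps in PBKDF2 from Python's standard library to show the same idea of a tunable work factor. It is not the bcrypt algorithm. The iteration count is an illustrative assumption that would be tuned to the hardware, and the function names are hypothetical.

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 100_000):
    """Return (salt, iterations, digest); raising iterations raises the cost per guess."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def verify_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and cost, then compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

Storing the iteration count next to the salt lets the cost be raised later without invalidating older records.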
How about taking the "pepper" idea and implementing it on a separate server dedicated to hashing passwords - and locked down except for this one simple and secure-as-possible service - possibly even with rate-limits to prevent abuse. Gives the attacker one more hurdle to overcome, either gaining access to this server or reverse engineering the pepper, custom RNG and cleartext extension algorithm.
Of course, if they have access to EVERYTHING they could just eavesdrop on user activity for a little while.
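One way to sketch the pepper idea is a keyed hash (HMAC) whose key lives only on the dedicated server. The `PEPPER` value and the function name below are placeholders, and the network service and rate limiting around it are left out.

```python
import hashlib
import hmac

# Hypothetical secret; in the setup described above it would exist only on the hashing server.
PEPPER = b"replace-with-a-long-random-secret"

def peppered_hash(password: str, salt: bytes, pepper: bytes = PEPPER) -> bytes:
    """Mix in the secret pepper with HMAC, then apply a slow salted hash."""
    keyed = hmac.new(pepper, password.encode("utf-8"), hashlib.sha256).digest()
    return hashlib.pbkdf2_hmac("sha256", keyed, salt, 100_000)
```

With this arrangement a stolen database alone is not enough to start guessing offline, because every guess also needs the pepper.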
uhmm... Okay, my take on this:
You can't get the original password back from a hash. If I have your hash, I may find a password that fits that hash, but I cannot log in to any other site that uses this password, assuming they all use salting. So no real issue here.
If someone gets your DB or even your site to get your config, you're screwed anyway.
For Admin or other Super Accounts, implement a second means of verification, i.e. limit logins to certain IP ranges, use client-side SSL certificates, etc.
For normal users, you won't have much chance. Everything you do with their password needs to be stored in some config or database, so if I have your site, I have your magic snake oil as well.
Strong Password limitations don't always work. Some sites require passwords to have a numeric character - and as a result, most users add 1 to their usual password.
So I'm not entirely sure what you want to achieve here? Adding a salt to the front of the user's password and protecting Admin accounts with a second means of authentication seems to be the best way, given the fact that users simply don't pick proper passwords and can't be forced to either.
I was hoping that someone might have a solution, but sadly I am no better off than when I first posted the question. It seems that there is nothing that can be done but to find a time-costly algorithm or re-hash 1,000's of times to slow down the whole process of generating rainbow tables for (or brute-forcing) a hash.

Crypto, hashes and password questions, total noob?

I've read several stackoverflow posts about this topic, particularly this one:
Secure hash and salt for PHP passwords
but I still have a few questions, I need some clarification, please let me know if the following statements are true and explain your comments:
If someone has access to your database/data, then they would still have to figure out your hashing algorithm and your data would still be somewhat secure, depending on your algorithm? All they would have is the hash and the salt.
If someone has access to your database/data and your source code, then it seems like no matter what you do, your hashing algorithm can be reverse engineered, the only thing you would have on your side would be how complex and time consuming your algorithm is?
It seems like the weakest link is: how secure your own systems are and who has access to it?
Lasse V. Karlsen ... brings up a good point, if your data is compromised then game over ... my follow up question is: what types of attacks are these hashes trying to protect against? I've read about rainbow table and dictionary attacks (brute force), but how are these attacks administered?
The security of cryptographic algorithms is always in their secret input. Reasonable cryptanalysis is based on an assumption that any attacker knows what algorithm you use. Good cryptographic hashes are non-invertible and collision resistant. This means that there's still a lot of work to do going from a hash to the value that generated it, regardless of whether you know the algorithm applied.
If you used a secure hash, access to the hash, salt, and algorithm will still leave a lot of work for a would-be attacker.
Yes, a secure hash puts a very hard to invert algorithm on your side. Note that this inversion is not 'reverse-engineering'.
The weak link is probably the processes and procedures that get those password hashes into the database. There are all sorts of ways to screw up and store sensitive data in the clear.
As I noted in a comment, there are attacks that these measures defend against. First, knowing the password may lead to authorization to do things beyond what the contents of the database suggest. Second, those passwords may be used elsewhere, and you expose your users to risk by revealing their passwords as a result of a break-in. Third, with hashing, an insider can't exploit read-only access to the database (subject to less auditing, etc.) to impersonate a user.
Dictionaries and rainbow tables are techniques for accelerating hash inversion.
Your question is about using passwords as an authentication mechanism and how to securely store these passwords in a database using a hash. As you probably already know, the goal is to be able to verify passwords without storing these passwords in clear text in the database. In this context let me try to answer each of your questions:
If someone has access to your database/data, then they would still have to figure out your hashing algorithm and your data would still be somewhat secure, depending on your algorithm? All they would have is the hash and the salt.
The basic idea of hashing passwords is that the attacker has knowledge of the hashing algorithm and has access to both the hash and the salt. By selecting a cryptographic strong hash function and a suitable salt value that is different for each password the computational effort required to guess the password is so high that the cost exceeds the possible gain the attacker can get from guessing the password. So to answer your question, hiding the hash function does not improve the security.
If someone has access to your database/data and your source code, then it seems like no matter what you do, your hashing algorithm can be reverse engineered, the only thing you would have on your side would be how complex and time consuming your algorithm is?
You should always use a well-known (and suitably strong) hashing algorithm, and reverse engineering this algorithm is not meaningful as there is nothing hidden in your code. If you didn't mean reverse engineer but actually reverse then, yes, the passwords are protected by the complexity of reversing the hash function (or guessing a password that matches a hash value). Good hash functions make this very hard.
It seems like the weakest link is: how secure your own systems are and who has access to it?
In general this is true, but when it comes to securing passwords by storing them as hashes you should still assume that the attacker has full access to the hashes and design your system accordingly by choosing an appropriate hash function and using salts.
What types of attacks are these hashes trying to protect against? I've read about rainbow table and dictionary attacks (brute force), but how are these attacks administered?
The basic attack that password hashing protects against is when the attacker gets access to your database. The clear text password cannot be read from the database and the password is protected.
A more sophisticated attacker can generate a list of possible passwords and compute the hash using the same algorithm as you. He can then compare the computed hash to the stored hash and if he finds a match he has a valid password. This is a brute force attack and it is generally assumed that the attacker has "offline" access to your database. By requiring the users to use long and complex passwords the effort required to "brute force" a password is significantly increased.
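The offline guessing described above can be sketched as follows. This is a toy illustration with a made-up word list and a fixed example salt. A single fast SHA-256 is used only to keep it short.

```python
import hashlib

def guess_password(stored_hash: str, salt: bytes, candidates):
    """Try each candidate offline; return the first one that matches, else None."""
    for candidate in candidates:
        digest = hashlib.sha256(salt + candidate.encode("utf-8")).hexdigest()
        if digest == stored_hash:
            return candidate
    return None

# Values the attacker would read from a stolen database (example data only).
salt = b"\x01\x02\x03\x04"
stored = hashlib.sha256(salt + b"letmein").hexdigest()
wordlist = ["123456", "password", "letmein"]

print(guess_password(stored, salt, wordlist))
```

The cost of the attack is the size of the candidate list times the cost of one hash, which is why long passwords and slow hash functions both help.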
When the attacker wants to attack not one password but all the passwords in the database, a large table of password and hash value pairs can be precomputed and further improved by using what is called hash chains. Rainbow tables are an application of this idea and can be used to brute force many passwords simultaneously without increasing the effort significantly. However, if a unique salt is used to compute the hash for each password, a precomputed table becomes useless as it is different for each salt and cannot be reused.
To sum it up: Security by obscurity is not a good strategy for protecting sensitive information and modern cryptography allows you to secure information without having to resort to obscurity.
what types of attacks are these hashes trying to protect against?
The type where someone gets your password from a poorly secured site, reverses it, and then tries to access your bank/PayPal/etc. account. It happens all the time, and many people are still using the same (and often weak) passwords everywhere.
As a side note, from what I've read, key derivation functions (PBKDF2/scrypt/bcrypt) are considered better/more secure (#1, #2) than plain salted SHA-1/SHA-2 hashes by crypto people.
If you have just a hash, no salt, then once they know your data (and algorithm) they can get your password via a rainbow table lookup. If you have a hash and a salt, they can get your password by burning a lot of CPU cycles and building a rainbow table.
If your salt is the same for all your data, they only need to burn a lot of CPU cycles once to build the table and then they have all the passwords. If your salt is not always the same, they need to burn through the CPU cycles to make a unique rainbow table for each record.
If the salt is long enough, the CPU cycles they need become very cost-prohibitive.
If you know your data security is breached, of course, you need to reset all the passwords immediately anyway, because as far as you know the attacker is willing to spend that time.
If someone has access to your database/data, then they would still have to figure out your hashing algorithm and your data would still be somewhat secure, depending on your algorithm? All they would have is the hash and the salt.
This might be all a really dedicated opponent would need. Much of this answer depends on how valuable the data is, which would tell you how motivated the opponent is. Credit card numbers are going to be extremely valuable, and criminal attackers seem to have plenty of time and accomplices to do their dirty work. Some bad guys have been known to farm out key decryption tasks to botnets!
If someone has access to your database/data and your source code, then it seems like no matter what you do, your hashing algorithm can be reverse engineered, the only thing you would have on your side would be how complex and time consuming your algorithm is?
If they have access to your source and all the data, the question is going to be "how did you load your key into the memory of the server in the first place?" If it's embedded in the data or in the program code, it's game over and you've lost. If it was hand-keyed by an operator at the machine's boot time, it should be as secure as your trust in your operator. If it is stored in an HSM*, it should still be secure.
And if they have root-level authority access to your running machine, then they can probably trigger and recover a memory dump that will reveal the secret key.
It seems like the weakest link is: how secure your own systems are and who has access to it?
This is true. But there are alternatives that help improve security.
For bank-like protection, the kind that passes security and industry audits, it's recommended that you use a *Hardware Security Module (HSM) to perform key storage and encryption/decryption functions. The commercial strength HSMs we're looking at cost 10s of thousands of dollars or more each, depending on capacity. But I have seen hardware encryption cards that plug into a PCI slot that cost substantially less.
The idea behind an HSM is that the encryption happens on a secure, hardened platform that nobody has access to without the secret keys. Most of them have cabinets with intrusion detection switches, trip wires, epoxied chips, and memory that will self-destruct if tampered with. Not even the legitimate owner or the factory should be able to recover the database key from an HSM without the set of authorized crypto keys (usually carried on smart cards.)
For a very small installation, an HSM can be as simple as a smart card. Smart cards aren't high performance encryption devices, though, so you can't pump more than about one decryption transaction per second through them. Systems using smart cards usually just store the root key, then decrypt the working database key on the smart card and send it back to the database accessing system. These will still yield the working database key if the attacker can access running memory, or if the attacker can sniff the USB traffic to and from the smart card.
And I have no experience getting TPM chips to work (yet), but theoretically they can be used to securely store keys on a machine. Again, it is still no defense against an attacker taking a memory dump while the key is loaded in memory, but they would prevent a stolen hard drive containing code and data from revealing its secrets.
A hash cannot be reversed. Conceptually, think of a hash as taking the value to be hashed as the seed to a random number generator, then taking the 500th number that it generates. This is a repeatable process, but it is not a reversible process.
If you store a hashed password in your database, when your user logs in, you take his password from the input to the login page, you apply the same hash to it, and then you compare the result of that operation to what you have stored in the database. If they match, the user typed the right password. (Or, in theory, they could have typed something that happens to hash to the same value, but in practice, you can completely ignore this.)
The purpose of the salt is so that even if users have the same password, you can't tell, and also lots of other things which are equivalent to this idea. If the user's password is "secret", and the salt is "abc", then instead of making a hash of "secret", you hash "secretabc" and store the results of that in your database. You also store the salt, but this is perfectly safe to store -- you can't figure out any information about the password from it.
The only reason to safeguard the hashed passwords and salt is that if an attacker has a copy of it, he can test passwords offline on his own machine, rather than repeatedly trying to log in to your server, which you would probably lock him out after three attempts or something like that. Even if you don't lock him out, it's much faster to test locally than to wait for the network round-trip.
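The login check described above, using the text's own "secret" plus "abc" example, might look like the sketch below. The record layout and function names are assumptions for illustration, and a single SHA-256 is used only to mirror the explanation.

```python
import hashlib
import hmac

def hash_with_salt(password: str, salt: str) -> str:
    # Mirrors the text's example: the salt is appended to the password.
    return hashlib.sha256((password + salt).encode("utf-8")).hexdigest()

# What the database would hold for this user: the salt and the hash, never the password.
user_record = {"salt": "abc", "hash": hash_with_salt("secret", "abc")}

def check_login(typed_password: str, record: dict) -> bool:
    """Hash the typed password with the stored salt and compare with the stored hash."""
    candidate = hash_with_salt(typed_password, record["salt"])
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(candidate, record["hash"])
```

Here `check_login("secret", user_record)` succeeds while any other input fails, and the server never has to keep the password itself.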
( OP )
brings up a good point, if your data is compromised then game over ... my follow up question is: what types of attacks are these hashes trying to protect against? I've read about rainbow table and dictionary attacks (brute force), but how are these attacks administered?
( discussion )
It's not a game, except to the attacker. Research these terms:
Sarbanes-Oxley
Gramm-Leach-Bliley Act (GLBA)
HIPAA
Digital Millennium Copyright Act (DMCA)
PATRIOT Act
Then tell us ( as thought provocation for you ) how do we protect against whom? For one thing, it is the efforts of innocents vis-a-vis intruders - and for another it is data-recovery if part of the system fails.
It is an interesting experiment that the original intent of tcp/ip and so on is advertised as being a weapon of war, survivability under attacks. Okay, so passwords are hashed - no one can recover them ...
Which, duh, includes the owner-operator of the system.
So you build a robust record locking tool that implements key controls, then political pressures force the use of brand-x tools.
You can read Federal Information Security Management Act (FISMA) and by the time you have read it some governmental entity somewhere will have had an entire disk either stolen or compromised.
How would you protect that disk if it was your personal identity information on that disk.
I can tell you from the caliber of Martin Liversage and jadeters they will be paying attention.
Here are my thoughts to your points:
If people have access to your database you have bigger security concerns than your hash algorithm and salt phrase. Hashes are somewhat secure, however there are problems such as hash collisions and hash lookups.
Hashes are one-way, so unless they can guess the input there is no way to reverse out the original text even with the algorithm and salt; hence the name one-way hash.
Security is about obscurity and layers of defense. If you layer your defenses and make it hard to determine what those defenses are, you stand a much better chance of staving off an attack than if you relied on a single approach to security such as password hashing and running OS/network hardware updates. Throw in some curveballs like obfuscation of the web server platform and clear boundaries between the prod web and database environments. Layers and hiding implementation details buy you valuable time.
When hashing a password, it is one way. So it is very difficult to get the password even if you have the salt, the source and a lot of CPU cycles to burn.

Resources