Suppose I have a server storing encrypted text (end-to-end: server never sees plain text).
I want to be able to do full text search on that text.
I know this is tricky, but my idea is to use the traditional full text design ("list" and "match" tables where words are stored and matched with IDs from the content table). When users submit the encrypted text, they also send salted MD5 hashes of the words, together with the respective matches. The salt is unique for each user and is recovered from their password.
(in short: the only difference is that the "list" table will contain hashed words)
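To make the idea concrete, here is a minimal client-side sketch of the indexing step (Python; the helper name and salt handling are illustrative, and MD5 is used only because it is what this question proposes):

import hashlib

def index_tokens(plain_text, user_salt):
    # Split the text into words, deduplicate, and hash each word with the
    # per-user salt; only these hashes (plus the ciphertext) reach the server.
    words = set(plain_text.lower().split())
    return [hashlib.md5(user_salt + w.encode()).hexdigest() for w in sorted(words)]

# The salt would be recovered client-side from the user's password (not shown).
hashes = index_tokens("meet me at the usual place", b"per-user-salt")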
Now, how vulnerable would this system be?
Note that I said "how vulnerable" instead of "how safe", because I acknowledge that it can't be totally safe.
I DO understand the tradeoff between features (full text search) and security (disclosing some information from the word index). For example, I understand that an attacker able to get the list and match tables could extract information about the original, encrypted text and possibly decipher some words with statistical analysis (however, since the salt is unique for each user, this attack would have to be repeated for each user).
How serious would this threat be? And would there be any other serious threats?
DISCLAIMER
What I'm trying to build (with the help of a cryptographer for the actual implementation; right now I'm just trying to understand whether this will be possible) is a consumer-grade product which will deal with confidential, yet not totally secret, data.
My goal is just to provide something safe enough that it would be easier for an attacker to steal users' passwords (e.g. by breaching the clients; they're consumers, after all) than to spend a huge amount of time and computing power brute-forcing the index or running complicated statistical analysis.
Comments in response to @Matthew
(may be relevant for anyone else answering)
As you noted, other solutions are not viable. Storing all the data inside the client means that users cannot access their data from other clients. Server-side encryption would work, but then we wouldn't be able to give users the added security of client-side encryption. The only "true alternative" is simply not to implement search: while it is not a required feature, it's very important to me/us.
The salt will be protected in exactly the same way as the users' decryption key (the one used to decrypt stored texts). Thus, if someone were able to capture the salt, he or she would likely be able to capture the key as well, creating a much bigger issue. To be precise, the key and the salt will be stored encrypted on the server. They will be decrypted locally by the client with the user's password and kept in memory; the server never sees the decrypted key and salt. Users can then change passwords and only need to re-encrypt the key and the salt, not all the stored texts. This is a pretty standard approach in the industry, to my knowledge.
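As a rough sketch of that wrapping scheme (Python with the third-party cryptography package; all names and parameters here are illustrative, not our actual design):

import base64, hashlib, os
from cryptography.fernet import Fernet

def wrap_keys(password, data_key, index_salt, kdf_salt):
    # Derive a key-encryption key (KEK) from the password; the server only
    # stores the wrapped blobs, never the KEK or the plain key/salt.
    kek = hashlib.pbkdf2_hmac("sha256", password.encode(), kdf_salt, 200_000)
    f = Fernet(base64.urlsafe_b64encode(kek))
    return f.encrypt(data_key), f.encrypt(index_salt)

# A password change just re-runs wrap_keys; the stored texts stay untouched.
wrapped = wrap_keys("correct horse", os.urandom(32), os.urandom(32), os.urandom(16))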
Actually, the design of the database will be as follows (reporting relevant entries only; a schema sketch follows the list). This design is like the one you proposed in your comment. It disallows proximity searches (not very relevant to us) and makes frequency information less accurate.
Table content, containing all encrypted texts. Columns are content.id and content.text.
Table words, containing the list of all hashes. Columns are words.id and words.hash.
Table match, that matches texts with hashes/words (in a one-to-many relationship). Columns are match.content_id and match.word_id.
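A minimal sketch of that schema (SQLite via Python, purely for illustration):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, text BLOB);         -- encrypted texts
CREATE TABLE words   (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);  -- salted word hashes
CREATE TABLE "match" (content_id INTEGER REFERENCES content(id),  -- one-to-many links
                      word_id    INTEGER REFERENCES words(id));
""")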
We would have to implement features like stopword removal ourselves. Sure. That is not a big issue (it will, of course, be done on the client). In any case, those lists have always been of limited utility for international (i.e. non-English-speaking) users.
We expect the lookup/insert ratio to be pretty high (i.e. many lookups, but rare inserts and mostly in bulk).
Decrypting the whole hash database is certainly possible, but requires a brute force attack.
Suppose the salt is kept safe (as per point 2 above). If the salt is long enough (you cited 32 bits... but why not 320? - just an example) that would take A LOT of time.
To conclude... You confirmed my doubts about the possible risk of frequency analysis. However, I feel like this risk is not so high. Can you confirm that?
Indeed, first of all, the salt will be unique per user. This means that users must be attacked one at a time.
Second, since words are reported only once per text (no matter how many times they appear), frequency analysis becomes less reliable.
Third... frequency analysis on hashed words doesn't sound nearly as effective as frequency analysis on a Caesar shift, for example. There are 250,000 words in English alone (and, again, not all our users will be English-speaking), and even if some words are more common than others, I believe this attack would be hard to carry out anyway.
PS: The data we'll be storing is messages, like instant messages. These are short, contain a lot of abbreviations, slang, etc. And every person has a different style in writing texts, further reducing the risk (in my opinion) of frequency attacks.
TL;DR: If this needs to be secure enough that it requires per-user end-to-end encryption: Don't do it.
Too long for a comment, so here goes - if I understand correctly:
You have encrypted data submitted by the user (client side encrypted, so not using the DB to handle).
You want this to be searchable to the user (without you knowing anything about this - so an encrypted block of text is useless).
Your proposed solution to this is to also store a list (or perhaps paragraph) of hashed words submitted from the client as well.
So the data record would look like:
Column 1: Encrypted data block
Column 2: space-delimited hashes of the individual, ordered words from the above encrypted text
Then to search, you just hash the search terms and treat the hashed terms as words to search against the paragraph(s) of "text" in column 2. This will definitely work; just think of it as searching nonsense text with nonsense search terms. You would even still be able to do some proximity ranking of terms with this approach.
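A rough sketch of that lookup (Python; MD5 per the question, and the record layout is the two-column design above):

import hashlib

def search(rows, query, user_salt):
    # rows: (encrypted_blob, hashed_words_text) pairs from the table.
    # Hash the search terms exactly as the client hashed the document words,
    # then match them against the space-delimited "text" in column 2.
    needles = {hashlib.md5(user_salt + w.encode()).hexdigest()
               for w in query.lower().split()}
    return [enc for enc, hashed_text in rows if needles & set(hashed_text.split())]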
Concerns:
The column with the individually hashed words as text will be incredibly weak in comparison to the encrypted text. You are greatly weakening your solution: not only are there limited words to work from, but the resultant text will also be susceptible to word frequency analysis, etc.
If you do this: separately store a salt unrelated to the password. Given that a rainbow table will be easy to create if your salt is captured (dictionary words only), store the salt encrypted somewhere.
You will lose many benefits of FTS, such as ignoring words like 'the'; you will need to re-implement this functionality on your own if you want it (i.e. remove these terms on the client side before submitting the data / search terms).
Other approaches that you imply are not acceptable/workable:
Implement searching client side (all data has to exist on the client to search)
Centralized encryption leveraging the databases built in functionality
I understand the argument to be that your approach gives the user sole access to their data (i.e. you cannot see/decrypt it). I would argue that this hashed approach weakens the data sufficiently that you could reasonably work out a user's data (that is, you have lowered the effort required to the point that it is very plausible you could decrypt a user's information without any knowledge of their keys/salts). I wouldn't quite lower the bar to describe this as just obfuscation, but you should really think through how significant this is.
If you are sure that weakening your system to implement searching like this makes sense, and another approach is not sufficient, one thing that could help is to store the hashes of words in the text as a list of uniquely occurring words only (i.e. no frequency or proximity information would be available). This would reduce the attack surface of your implementation a little, but would also lose the benefits you are implying you want by describing the approach as FTS. You could get very fast results this way, though, as the hashed words essentially become tags attached to all the records that include them. The search lookup could then become very fast (at the expense of your inserts).
*Just to be clear - I would want to be REALLY sure my business needs demanded something like this before I implemented it...
EDIT:
Quick example of the issues: say I know you are using 32-bit salts and are hashing common words like "the". 2^32 possible salts = 4 billion possible salts (that is, not that many if you only need to hash a handful of words for the initial attack). Assuming the salt is appended or prepended, this is still only 8 billion entries to pre-calculate. Even with less common words, you do not need to create too many lists to ensure you will get hits (if this were not the case, your data would not be worth searching).
Then take the highest-frequency hashed words in a given block of text, look them up in each of our pre-calculated salt tables, and use any match to see if that salt correctly decrypts other words in the text. Once you have a plausible candidate, generate the 250,000-word English-language rainbow table for that salt and decrypt the text.
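To put rough numbers on the pre-computation step (the hash rate is an assumption for a single modern GPU):

# Cost of building the "common word + 32-bit salt" candidate tables.
salts = 2 ** 32      # every possible 32-bit salt
words = 5            # a handful of very common words ("the", "and", ...)
positions = 2        # salt appended or prepended
rate = 5e9           # assumed salted-MD5 rate, hashes/second on one GPU
print(f"{salts * words * positions / rate:.0f} seconds of hashing")
# => roughly 9 seconds of compute; the 8 billion entries are mainly a storage problem.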
I would guess you could decrypt the hashed data in the system in hours to days with access to the database.
First, you have all of the normal vulnerabilities of password-based cryptography, which stem from users picking predictable passwords. It is common to crack more than 50% of passwords from real-world applications in offline attacks with less than two hours of desktop computing time.
I assume the full text encryption key is derived from the password, or is encrypted by a password-derived key. So an attacker can test guesses against a selection of hashed index keys, and once she finds the password, decrypt all of the documents.
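A sketch of that guessing loop (Python; the key-derivation details are assumptions, since the question leaves them open):

import hashlib

def crack(candidate_passwords, kdf_salt, common_word, observed_hashes):
    # Simplified: assumes the per-user index salt is derived directly from
    # the password. Each guess is verified against the stored hash of a very
    # common word; a hit confirms the password without touching the ciphertext.
    for pw in candidate_passwords:
        salt = hashlib.pbkdf2_hmac("sha256", pw.encode(), kdf_salt, 100_000)
        if hashlib.md5(salt + common_word).hexdigest() in observed_hashes:
            return pw  # every document of this user is now decryptable
    return None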
But, even if a user picks a high-entropy password, frequency analysis on the index could potentially reveal a lot about the plain text. Although word order is lost in indexing (if you don't support proximity searches), you are essentially creating an electronic code book for each user. This index would be vulnerable to centuries of well-developed cryptanalytical techniques. Modern encryption protocols avoid ECB, and provide "ciphertext indistinguishability"—the same plain text yields different cipher text each time it's encrypted. But that doesn't work with indexes.
A less vulnerable approach would be to index and search on the client. The necessary tables would be bundled as a single message and encrypted on the client, then transported to the server for storage. The obvious tradeoff is the cost of transmission of that bundle on each session. Client-side caching of index fragments could mitigate this cost somewhat.
In the end, only you can weigh the security cost of a breach against the performance costs of client-side indexing. But the statistical analysis enabled by an index is a significant vulnerability.
MSSQL Enterprise TDE has encrypted full-text indexes, along with all other indexes, when you enable whole-database encryption (since 2008). In practice it works pretty well, without a huge performance penalty. I can't comment on how, because it's a proprietary algorithm, but here are the docs.
https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption-tde
It doesn't cover any of your application stack besides your DB, but your FTS indexes will work like normal and won't exist in plain text like they do in MySQL or Postgres. MariaDB and of course Oracle have their own implementations as well, from what I remember. MySQL and PostgreSQL do not.
As for passwords, TDE in all these implementations uses AES keys, which can be rotated (though not always easily), so the password vulnerability falls on the DBAs.
The problem is that you need to pay for full Enterprise licensing for MSSQL TDE (i.e. the feature is not available in "standard" or "basic" cloud and on-premises editions), and probably for TDE in Oracle as well. But if what you need is a quick solution and you have the cash for Enterprise licensing (probably cheaper than developing your own implementation), implementations are out there.
Related
Salt is used when storing passwords in databases in order to protect against dictionary attacks and rainbow tables.
However, let's assume we need to store unique and random (sensitive) information about users. Is there still an advantage in salting this information before hashing it?
Wouldn't a salt, in this case, just add randomness to already-random data (unlike human-typed passwords)?
It depends on how confidential your information is and what the consequences are if this data is compromised. Is it PII, like an SSN or date of birth?
You mentioned that your data is random and unique, which means it is difficult to identify a pattern. If the data is random enough, then salting it may not be required. If you do go with salting, you take on the added responsibility of protecting those salts as well.
I would recommend using a low-privileged account, hardening the server, and applying authentication and authorization controls to protect your data and minimize the attack surface.
Again, you should come to a conclusion after classifying your data according to the CIA principles.
This depends very heavily on the size of the search space. For example, we could pretend that social security numbers are both random and unique (they're not actually either, but for the purpose of this discussion we will pretend they are). If you're hashing SSNs, not only do you need a salt, but a salt isn't sufficient. Why? Because there are fewer than 10 billion SSNs in existence. Creating a rainbow table for those is trivial. Even with a salt, it isn't that hard to brute force, even if the values are unique and random.
So to protect a random and unique value that lives in a small search space we have to use a stretching algorithm like PBKDF2, not just a hash. The point of a stretching algorithm is to make computing the hash very slow.
Stretching algorithms always include a salt. But it doesn't have to be a random salt. It could be deterministic (some database identifier + the user id for example, "com.example.mygreatapp:alice"). But for a small search space, you still need it to be unique per user because there are so few items in the search space.
On the other hand, if your random and unique data represents a large search space (not less than 2^64, and ideally at least 2^80), and that search space is sparse (you only use a very small fraction of legal elements), then salting and stretching is likely not required.
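A sketch of the stretched, deterministically salted variant described above (Python; the salt format and iteration count are illustrative):

import hashlib

def protected_id(ssn, user_id):
    # Deterministic per-user salt: reproducible for lookups, yet it forces a
    # separate brute-force run per user.
    salt = f"com.example.mygreatapp:{user_id}".encode()
    # Stretching: each guess now costs ~100,000 hash iterations.
    return hashlib.pbkdf2_hmac("sha256", ssn.encode(), salt, 100_000).hex()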
I am designing a microservice architecture using Node.js and MongoDB. I have a use case to save a driver's license number, which will also be used to validate the user. Now, as the DL number is PII, I don't want to save the string as-is; I want to encrypt it before saving. I can use encryption logic that generates the same encrypted string every time, so I can encrypt the DL number and do a lookup in the DB. But I am worried that hackers could decrypt all DL numbers if they work out the encryption logic for one. Can someone suggest the best approach for this kind of use case?
Ignoring the index for a moment, it sounds like the best approach is to hash the license number using a keyed hash, and store the hash. This is similar to symmetric encryption in that you need to keep a secret, the key. However, it's one-way so attackers that obtain the secret will still need to brute-force each entry to obtain the number.
If the key is compromised, depending on the license number scheme, brute-forcing each number will vary in difficulty from easy to trivial. But, it's better than plaintext.
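A minimal sketch of the keyed-hash idea, using HMAC-SHA-256 from the Python standard library (the key here is a placeholder; in practice it would come from a secrets manager):

import hmac, hashlib

HASH_KEY = b"load-me-from-a-secrets-manager"  # illustrative placeholder

def dl_fingerprint(dl_number):
    # Keyed one-way hash: the same input always yields the same output, so
    # the stored value still supports exact-match lookups.
    return hmac.new(HASH_KEY, dl_number.encode(), hashlib.sha256).hexdigest()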
However, if you really need it as an index you have what appears to be conflicting priorities. I'll defer to someone else, I don't know much about DB indexing.
If it were me and I had time to spare, I'd set up one table with the hashes and one with the plaintext license number as an index, add 10 million rows (or some ceiling that's relevant to you) of test data, and profile a few thousand random lookups against each one.
Does using multiple algorithms make passwords more secure? (Or less?)
Just to be clear, I'm NOT talking about doing anything like this:
key = Hash(Hash(salt + password))
I'm talking about using two separate algorithms and matching both:
key1 = Hash1(user_salt1 + password)
key2 = Hash2(user_salt2 + password)
Then requiring both to match when authenticating. I've seen this suggested as a way to eliminate collision matches, but I'm wondering about unintended consequences, such as creating a 'weakest link' scenario or providing information that makes the user database easier to crack, since this method provides more data than a single key does. E.g. something like combining information from the two hashes to find the password more easily. Also, if collisions were truly eliminated, you could theoretically brute force the actual password, not just a matching password. In fact, you'd have to in order to brute force the system at all.
I'm not actually planning to implement this, but I'm curious whether or not this is actually an improvement over the standard practice of single key = Hash(user_salt + password).
EDIT:
Many good answers, so just to summarize here: this should have been obvious looking back, but you do create a weakest link by using both, because matches from the weaker of the two algorithms can be tried against the other. For example, if you used a weak (fast) MD5 and a PBKDF2, I'd brute force the MD5 first, then try any match I found against the other; so by having the MD5 (or whatever) you actually make the situation worse. Also, even if both are among the more secure set (bcrypt+PBKDF2 for example), you double your exposure to one of them breaking.
The only thing this would help with would be reducing the possibility of collisions. As you mention, there are several drawbacks (weakest link being a big one).
If the goal is to reduce the possibility of collisions, the best solution would simply be to use a single secure algorithm (e.g. bcrypt) with a larger hash.
Collisions are not a concern with modern hashing algorithms. The point isn't to ensure that every hash in the database is unique. The real point is to ensure that, in the event your database is stolen or accidentally given away, the attacker has a tough time determining a user's actual password. And the chance of a modern hashing algorithm recognizing the wrong password as the right password is effectively zero -- which may be more what you're getting at here.
To be clear, there are two big reasons you might be concerned about collisions.
A collision between the "right" password and a supplied "wrong" password could allow a user with the "wrong" password to authenticate.
A collision between two users' passwords could "reveal" user A's password if user B's password is known.
Concern 1 is addressed by using a strong/modern hashing algorithm (and avoiding terribly anti-brilliant things, like looking for user records based solely on their password hash). Concern 2 is addressed with proper salting -- a "lengthy" unique salt for each password. Let me stress, proper salting is still necessary.
But, if you add hashes to the mix, you're just giving potential attackers more information. I'm not sure there's currently any known way to "triangulate" message data (passwords) from a pair of hashes, but you're not making significant gains by including another hash. It's not worth the risk that there is a way to leverage the additional information.
To answer your question:
Having a unique salt is better than having a generic salt. H(S1 + PW1) , H(S2 + PW2)
Using multiple algorithms may be better than using a single one H1(X) , H2(Y)
(But probably not, as svidgen mentions)
However,
The spirit of this question is a bit wrong for two reasons:
You should not be coming up with your own security protocol without guidance from a security expert. I know it's not your own algorithm, but most security problems start because they were used incorrectly; the algorithms themselves are usually air-tight.
You should not be using hash(salt+password) to store passwords in a database. This is because hashing was designed to be fast - not secure. It's somewhat easy with today's hardware (especially with GPU processing) to find hash collisions in older algorithms. You can of course use a newer secure Hashing Algorithm (SHA-256 or SHA-512) where collisions are not an issue - but why take chances?
You should be looking into password-based key derivation functions (PBKDF2), which are designed to be slow to thwart this type of attack. Usually this means combining a salt and a secure hashing algorithm (e.g. SHA-256) and iterating a couple hundred thousand times.
Making the function take about a second is no problem for a user logging in, who won't notice such a slowdown. But for an attacker this is a nightmare, since they have to perform these iterations for every attempt, significantly slowing down any brute-force attack.
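For instance, tuning PBKDF2 so that a single derivation takes a noticeable fraction of a second (Python; the iteration count is an assumption and should be calibrated on your own hardware):

import hashlib, os, time

salt = os.urandom(16)
start = time.perf_counter()
key = hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 600_000)
print(f"one derivation took {time.perf_counter() - start:.2f}s")
# Unnoticeable for a single login; crippling at billions of guesses.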
Take a look at libraries supporting PBKDF2 as a better way of doing this. Jasypt is one of my favorites for Java encryption.
See this related security question: How to securely hash passwords
and this loosely related SO question
A salt is added to password hashes to prevent the use of generic pre-built hash tables. The attacker would be forced to generate new tables based on their word list combined with your random salt.
As mentioned, hashes were designed to be fast for a reason. To use them for password storage, you need to slow them down (large number of nested repetitions).
You can create your own password-specific hashing method. Essentially, nest your preferred hashes on the salt+password and recurse:
import hashlib

def my_algorithm(data, rounds=100_000):
    # "rounds" is the X below: large enough that one call takes about a second.
    temp = data.encode()
    for _ in range(rounds):
        # Nest three different hashes per round, i.e. Hash3(Hash2(Hash1(x))).
        temp = hashlib.sha512(hashlib.sha256(hashlib.md5(temp).digest()).digest()).digest()
    return temp.hex()

result = my_algorithm("salt+password")
Where "X" is a large number of repetitions, enough so that the whole thing takes at least a second on decent hardware. As mentioned elsewhere, the point of this delay is to be insignificant to the normal user (who knows the correct password and only waits once), but significant to the attacker (who must run this process for every combination). Of course, this is all for the sake of learning and probably simpler to just use proper existing APIs.
I'm currently working on an application where we receive private health information. One of the biggest concerns is with the SSN. Currently, we don't use the SSN for anything, but in the future we'd like to be able to use it to uniquely identify a patient across multiple facilities. The only way I can see to do that reliably is through the SSN. However, we (in addition to our customers) REALLY don't want to store the SSN.
So naturally, I thought of just SHA hashing it since we're just using it for identification. The problem with that is that if an attacker knows the problem domain (an SSN), then they can focus on that domain. So it's much easier to calculate the billion SSNs rather than a virtually unlimited number of passwords. I know I should use a site salt and a per-patient salt, but is there anything else I can do to prevent an attacker from revealing the SSN? Instead of SHA, I was planning on using BCrypt, since Ruby has a good library and it handles scalable complexity and salting automagically.
It's not going to be used as a password. Essentially, we get messages from many facilities, and each describes a patient. The only thing close to a globally unique identifier for a patient is the SSN number. We are going to use the hash to identify the same patient at multiple facilities.
The algorithm for generating Social Security Numbers was created before the concept of a modern hacker and as a consequence they are extremely predictable. Using a SSN for authentication is a very bad idea, it really doesn't matter what cryptographic primitive you use or how large your salt value is. At the end of the day the "secret" that you are trying to protect doesn't have much entropy.
If you never need to know the plain text then you should use SHA-256. SHA-256 is a very good function to use for passwords.
If you seriously want to hash a social security number in a secure way, do this:
1. Find out how much entropy is in an SSN (hint: there is very little. Far less than a randomly chosen 9-digit number).
2. Use any hashing algorithm.
3. Keep fewer (half?) bits than there is entropy in an SSN (a sketch follows below).
Result:
Pro: A secure hash of an SSN, because of the large number of hash collisions.
Pro: Your hashes are short and easy to store.
Con: Hash collisions.
Con: You can't use it as a unique identifier, because of the Con above.
Pro: That's good, because you really, really need to not be using SSNs as identifiers unless you are the Social Security Administration.
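A sketch of that truncation idea (Python; the 16-bit width is illustrative, and the real width should come from your entropy estimate):

import hashlib

def lossy_ssn_tag(ssn):
    # Keep only 16 bits of the digest: far fewer bits than an SSN's entropy,
    # so many SSNs share every tag and no tag can be inverted to a unique number.
    return hashlib.sha256(ssn.encode()).hexdigest()[:4]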
First, much applause and praise for storing a hash of the SSN.
It appears as if you're reserving the SSNs as a sort of 'backup username.' In this case, you need another form of authentication besides the username - a password, a driver's license number, a passport number, proof of residence, etcetera.
Additionally, if you're concerned that an attacker is going to predict the top 10,000 SSNs for a patient born in 1984 in Arizona, and attempt each of them, then you can put in an exponentially increasing rate limiter in your application.* For additional defense, build in a notification system that alerts a sys-admin when it appears that there is an unusually high number of failed login attempts.**
*Example exponentially increasing rate limiter:
After each failed request, delay the next request by (1.1^N) seconds, where N is the number of failed requests from that IP. Track IPs and failed login attempts in a DB table; this shouldn't add too much load, depending on the audience of your application (do you work for Google?). A toy sketch follows below.
**In the case where an attacker has access to multiple IPs, the notification will alert a sys-admin who can use his or her judgment to see if you have an influx of stupid users or it's a malicious attempt.
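A toy version of that limiter (Python; in-memory and blocking purely for illustration, where the real one would use the DB table mentioned above):

import time
from collections import defaultdict

failed = defaultdict(int)  # failed attempts per source IP

def on_failed_login(ip):
    # Exponential backoff: delay the next attempt by 1.1^N seconds, where N
    # is the number of failures seen from this IP so far.
    failed[ip] += 1
    time.sleep(1.1 ** failed[ip])  # a real app would schedule this, not block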
Ok, so the whole problem with hashes is that users don't enter passwords over 15 characters long. Most only use 4-8 characters making them easy for attackers to crack with a rainbow table.
Solution: use a user salt to make the hash input more complex and over 50 characters, so that they will never be able to generate a table (way too big for strings that size). Plus, they will have to create a new table for each user. Problem: if they download the DB they will get the user salt, so you are back to square one if they care enough.
Solution: use a site "pepper" plus the user salt; then even if they get the DB, they will still have to know the config file. Problem: if they can get into your DB, chances are they might also get into your filesystem and discover your site pepper.
So, with all of this known - lets assume that an attacker makes it into your site and gets everything, EVERYTHING. So what do you do now?
At this point in the discussion, most people reply with "who cares at this point?". But that is just a cheap way of saying "I don't know what to do next so it can't be that important". Sadly, everywhere else I have asked this question that has been the reply. Which shows that most programmers miss a very important point.
Let's imagine that your site is like the other 95% of sites out there and the user data (or even full server access) isn't worth squat. The attacker happens to be after one of your users, "Bob", because he knows that Bob uses the same password on your site as he does on the bank's site. He also happens to know Bob has his life savings in there. Now, if the attacker can just crack our site's hashes, the rest will be a piece of cake.
So here is my question: how do you extend the length of the password without any traceable path? Or how do you make the hashing process too complex to duplicate in a timely manner? The only thing that I have come up with is that you can re-hash a hash several thousand times and increase the time it would take to create the final rainbow table by a factor of 1,000. This is because the attacker must follow that same path when creating his tables.
Any other ideas?
"Solution: use a user salt to make the hash input more complex and over 50 characters, so that they will never be able to generate a table (way too big for strings that size). Plus, they will have to create a new table for each user. Problem: if they download the DB they will get the user salt, so you are back to square one if they care enough."
This reasoning is fallacious.
A rainbow table (which is a specific implementation of the general dictionary attack) trades space for time. However, generating a dictionary (rainbow or otherwise) takes a lot of time. It is only worthwhile when it can be used against multiple hashes. Salt prevents this. The salt does not need to be secret, it just needs to be unpredictable for a given password. This makes the chance of an attacker having a dictionary generated for that particular salt negligibly small.
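A quick demonstration of that point (Python; SHA-256 purely for illustration):

import hashlib, os

# The same password hashes to unrelated values under different salts, so a
# dictionary built for one salt is useless against every other salt.
for _ in range(2):
    salt = os.urandom(16)
    print(salt.hex(), hashlib.sha256(salt + b"password123").hexdigest())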
"The only thing that I have come up with is that you can re-hash a hash several thousand times and increase the time it would take to create the final rainbowtable by a factor of 1,000."
Isn't that exactly what the Blowfish-based BCrypt hash is about? Increasing the time it takes to compute a hash so that brute force cracking (and rainbow table creation) becomes undoable?
"We present two algorithms with adaptable cost (...)"
More about adaptable cost hashing algorithms: http://www.usenix.org/events/usenix99/provos.html
How about taking the "pepper" idea and implementing it on a separate server dedicated to hashing passwords, locked down except for this one simple and secure-as-possible service, possibly even with rate limits to prevent abuse? That gives the attacker one more hurdle to overcome: either gaining access to this server or reverse engineering the pepper, custom RNG, and cleartext extension algorithm.
Of course, if they have access to EVERYTHING, they could just eavesdrop on user activity for a little while...
uhmm... Okay, my take on this:
You can't get the original password back from a hash. If I have your hash, I may find a password that fits that hash, but I cannot log in to any other site that uses this password, assuming they all use salting. So no real issue here.
If someone gets your DB or even your site to get your config, you're screwed anyway.
For admin or other super accounts, implement a second means of verification, e.g. limit logins to certain IP ranges, use client-side SSL certificates, etc.
For normal users, you won't have much of a chance. Everything you do with their password needs to be stored in some config or database, so if I have your site, I have your magic snake oil as well.
Strong Password limitations don't always work. Some sites require passwords to have a numeric character - and as a result, most users add 1 to their usual password.
So I'm not entirely sure what you want to achieve here. Adding a salt to the front of the user's password and protecting admin accounts with a second means of authentication seems to be the best way, given that users simply don't pick proper passwords and can't be forced to either.
I was hoping that someone might have a solution, but sadly I am no better off than when I first posted the question. It seems that there is nothing to be done but to find a time-costly algorithm or re-hash thousands of times to slow down the whole process of generating rainbow tables for (or brute-forcing) a hash.