I have a list of sha224 hashes in a .txt file and I need to be able to search through them as fast as possible for a specific hash. Assuming processing power and/or storage space aren't an issue, what would be the fastest way to perform a search?
Ideally you'd be providing the hash whose value you'd need to read programmatically.
If you don't know the key ahead of time, then you probably want to hash the hashes (that is, load them into a hash table) and "search" there. It's not really a search, since you can look the value up directly.
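As a minimal sketch of that idea in Python, assuming the .txt file holds one hex digest per line (the file layout and the function names `load_hashes` and `contains` are assumptions made for illustration):

```python
def load_hashes(path):
    """Read one hex digest per line into a set, normalised to lowercase."""
    with open(path, encoding="ascii") as fh:
        # Blank lines are skipped; surrounding whitespace is stripped.
        return {line.strip().lower() for line in fh if line.strip()}


def contains(hashes, digest):
    """Membership test on the set; constant time on average."""
    return digest.strip().lower() in hashes
```

The file is read once up front; after that, each lookup is a single set membership test rather than a scan of the list.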
I am designing a microservice architecture using Node.js and MongoDB. I have a use case where I need to save a driver's license (DL) number, which will also be used to validate the user. Since the DL number is PII, I don't want to save the string as is; I want to encrypt it before saving. I can use encryption logic that generates the same encrypted string every time, so I can encrypt the DL number and do a lookup in the DB. But I am worried that hackers could decrypt and get all DL numbers if they learn the encryption logic for one. Can someone suggest the best approach for this kind of use case?
Ignoring the index for a moment, it sounds like the best approach is to hash the license number using a keyed hash, and store the hash. This is similar to symmetric encryption in that you need to keep a secret, the key. However, it's one-way so attackers that obtain the secret will still need to brute-force each entry to obtain the number.
If the key is compromised, depending on the license number scheme, brute-forcing each number will vary in difficulty from easy to trivial. But, it's better than plaintext.
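A keyed hash of this kind could look roughly like the following Python sketch, which uses HMAC-SHA256 from the standard library. The normalisation rule and the function names are assumptions for illustration, and the key would have to come from a secret store rather than from the code or the database:

```python
import hashlib
import hmac


def normalize(dl_number):
    """Canonicalise input so the same license always yields the same tag (assumed rule)."""
    return "".join(dl_number.split()).upper()


def license_tag(dl_number, key):
    """Return a hex HMAC-SHA256 tag to store in place of the plaintext number."""
    message = normalize(dl_number).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def matches(dl_number, key, stored_tag):
    """Compare a candidate number against a stored tag in constant time."""
    return hmac.compare_digest(license_tag(dl_number, key), stored_tag)
```

Because the tag is deterministic for a given key, it can be stored and queried with an ordinary equality lookup.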
However, if you really need it as an index, you have what appear to be conflicting priorities. I'll defer to someone else; I don't know much about DB indexing.
If it were me and I had time to spare, I'd set up one table with the hashes as an index and one with the plaintext license numbers as an index. Add 10 million rows (or some ceiling that's relevant to you) of test data and profile a few thousand random lookups against each one.
I'm trying to find a way to transform a large number of similar strings into unique hashes. The reason is that each string (a URL) is used to generate a file on S3 for later access.
I need to be able to rebuild the hash at a later stage for comparison purposes.
I've used MD5 up until now, but some strings are long and very similar, which gave me duplicates.
I believe that a 256- or 512-bit hash would work, but maybe there's a best practice? I'd just URL-encode the string and use it as the filename, but one requirement is that a user must not be able to work out the base file on our server from the name of the S3 file.
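For reference, the 256-bit option mentioned here might look like the sketch below, which derives an object name from the SHA-256 digest of the URL. The function name `s3_key_for` is hypothetical, and this is only one possible approach, not an established best practice:

```python
import hashlib


def s3_key_for(url):
    """Derive a fixed-length object name from a URL using SHA-256."""
    # The same URL always gives the same 64-character hex name,
    # so the name can be rebuilt later for comparison.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

Note that anyone who can guess the original URL can recompute this name; if that matters, a keyed hash (HMAC) with a server-side secret would be the usual variation.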
I am working on a small project to keep my skills from completely rusting.
I am generating a lot of hashes (in this case MD5) and I need to check whether I've seen a given hash before, so I wanted to keep them in a list.
What's the best way to store them so that I can check whether a hash exists prior to doing calculations?
The hash itself is already a key of sorts. Your best bet is a hash table. In a properly implemented hash table, you can check for the existence of a key in constant time on average. Common hash table implementations with this feature are C# Dictionaries, Python's dict type, PHP arrays (which are actually ordered maps, not arrays), Perl's hashes (%) and Ruby's Hash. If you included details of what language you're working in, an example wouldn't be too hard to look up.
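Since the question doesn't name a language, here is a minimal Python sketch of the idea. It uses a `set` (a hash table that stores only keys) and assumes the items being hashed are `bytes`; the function name `unseen_only` is made up for illustration:

```python
import hashlib


def unseen_only(items):
    """Yield each item whose MD5 digest has not been seen before."""
    seen = set()
    for item in items:
        # 16 raw bytes per digest; a hex string would work equally well.
        digest = hashlib.md5(item).digest()
        if digest in seen:
            continue  # already processed, skip the expensive work
        seen.add(digest)
        yield item
```

The `digest in seen` check is the constant-time existence test described above, so the cost per item does not grow with the number of hashes already stored.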
I would like to facilitate searching on a field that we cannot index or store in non-hashed or unencrypted form. Is there a way to tell Solr to hash (or encrypt) a specific field prior to comparing against the index?
In a nutshell, I don't think it's easy, and it depends on what level of security you need.
As a generic, simple solution, you could store the whole index in an encrypted file system, e.g. eCryptfs or TrueCrypt (see difference between block-level encryption and fs-level encryption).
Depending on how you need to search in this field, if you can get away with just hashing the values then the solution would be purely client-side, i.e. hashing the value client-side, sending it to Solr and getting back the results.
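The client-side variant could be sketched as follows in Python. The field name is hypothetical, the normalisation rule is an assumption, and the same `hashed_term` function would have to be applied to the values at index time; only exact-match queries are possible this way:

```python
import hashlib
from urllib.parse import urlencode


def hashed_term(value):
    """Hash a normalised value; must match what was done at index time."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def solr_query_string(field, value):
    """Build the URL-encoded `q` parameter for an exact match on the hashed value."""
    return urlencode({"q": f"{field}:{hashed_term(value)}"})
```

The resulting string would be appended to the select URL of whichever Solr core holds the index; Solr itself only ever sees and compares hashes.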
Some years ago there was a patch to enable field-level encryption in Lucene, but for some reason it was rejected. Still, maybe you can borrow some ideas from that patch...
Is it recommended to create a column (unique key) that is a hash?
When people view my URL, it is currently like this:
url.com/?id=2134
But, people can look over this and data-mine all the content, right?
Is it RECOMMENDED to go 1 extra step to make this through hash?
url.com?id=3fjsdFNHDNSL
Thanks!
The first and most important step is to use some form of role-based security to ensure that no user can see data they aren't supposed to see. So, for example, if a user should only see their own information, then you should check that the id belongs to the logged-in user before you display it.
As a second level of protection, it's not a bad idea to have a unique key that doesn't let you predict other keys (a hash, as you suggest, or a UUID). However, that still means that, for example, a malicious user who obtained someone else's URL (e.g. by sniffing a network, by reading a log file, by viewing the history in someone's browser) could see that user's information. You need authentication and authorization, not simply obfuscating the IDs.
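Both levels can be illustrated together in a short Python sketch. The in-memory `RECORDS` dict stands in for a database table, and the function names are assumptions for illustration:

```python
import secrets

# public_id -> {"owner": ..., "data": ...}; stand-in for a database table
RECORDS = {}


def create_record(owner, data):
    """Store data under a random, unpredictable public ID."""
    public_id = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
    RECORDS[public_id] = {"owner": owner, "data": data}
    return public_id


def fetch_record(public_id, current_user):
    """Return the data only if the logged-in user owns the record."""
    record = RECORDS.get(public_id)
    if record is None or record["owner"] != current_user:
        # Same error in both cases, so callers cannot probe which IDs exist.
        raise PermissionError("not found or not permitted")
    return record["data"]
```

The random ID only makes guessing impractical; it is the ownership check in `fetch_record` that stops someone who has obtained another user's URL.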
It sort of depends on your situation, but offhand I'd say that if you think you need to hash, you need to hash. If someone could data-mine by, say, iterating through:
...
url.com?id=2134
url.com?id=2135
url.com?id=2136
...
Then using a hash for the id is necessary to avoid this, since it will be much harder to figure out the next one. Keep in mind, though, that you don't want to make the hash so obvious that a determined attacker could easily figure it out, e.g. by just taking the MD5 of 2134 or whatever number you had.
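To see why a plain MD5 of the number is too obvious, consider this small Python sketch of what an attacker could do. The function name and the upper bound are assumptions for illustration:

```python
import hashlib


def recover_id(token, max_id=1_000_000):
    """Return the numeric ID whose unkeyed MD5 equals token, or None."""
    for candidate in range(max_id):
        guess = hashlib.md5(str(candidate).encode("ascii")).hexdigest()
        if guess == token:
            return candidate
    return None
```

Once the scheme is guessed, the attacker can also skip the recovery step and simply compute the MD5 of 2135, 2136, and so on. Mixing in a server-side secret, or using a random value unrelated to the number, removes that shortcut.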
Well, the problem here is that an actual hash is technically one-way. So if you hash the data, you won't be able to recover it on the receiving side. Without knowing what technology you are using to create your web page it's hard to make any concrete suggestions, but if you must have sensitive information in your query string, then I would recommend that you at least use a symmetric encryption algorithm on it to keep people from simply reading off the values and reverse engineering things.
Of course, if you have the option, it's probably better not to have that information in the query string at all.