Human-readable short one-way hash for obscuring strings in logfiles?

I need to do large-scale anonymisation of database log-files.
Part of this will involve obscuring various field names (strings), as well as IP addresses.
1. Field Names
For example, we might have the string BusinessLogic.Categorisation.ExternalDeals. In the anonymised version, we would want it to be something like Jerrycan.Doorway.Fodmap (or something gibberish, but still "pronounceable")
The purpose is simply to obscure the original strings - however, we still want to be able to match up occurrences of those strings across different logfiles.
The requirements of the hash are:
Repeatable - that is, the same input passed in each time always produces the same output. We need to be able to match up fields between different logfiles (all we're trying to prevent is somebody deriving the original string).
One-way - there is no way of reversing the outputs to produce the inputs.
Low chance of collision - it will mess up our analysis if two fields are mapped to the same output.
Human readable (or pronounceable) - somebody scanning through logfiles by hand should be able to make out fields, and visually match them up. Or if need be, read them over the phone.
Short strings - I do understand there's a tradeoff between this and available entropy, however, ideally a string like HumanReadable should map to something like LizzyNasbeth.
I had a look around, and I found https://github.com/zacharyvoase/humanhash (output hash is a bit longer than what I want) and https://www.rfc-editor.org/rfc/rfc1751 (not really "pronounceable" - ideally, we'd want something that looks like an English-language human word, but isn't actually one - and, once again, a bit long).
What algorithms or approaches are there to this problem? Or any libraries or implementations you could recommend?
2. IP Addresses
For the IP addresses, we need a way to mask them (i.e. not possible for an outside observer to derive the original IP address), but still have it be repeatable across different logfiles (i.e. the same input always produces the same output).
Ideally, the output would still "look" like an IP address. For example, maybe 192.168.1.55 would map to 33.41.22.44 (or we can use alphabetical codes as well, if that's easier).
Any thoughts on how to do this?
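For illustration, the sketch below is roughly what I'm imagining - a keyed hash (not true format-preserving encryption) whose output is re-formatted as a dotted quad. The key value is just a placeholder:

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"

def mask_ip(ip):
    # Keyed hash of the address; the key stops an outside observer from
    # simply hashing the whole IPv4 space to reverse the mapping.
    digest = hmac.new(SECRET_KEY, ip.encode("ascii"), hashlib.sha256).digest()
    # Reuse the first four digest bytes as fake octets.
    return ".".join(str(b) for b in digest[:4])

print(mask_ip("192.168.1.55"))  # same input -> same output on every run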

You could use codenamize :
$ codenamize BusinessLogic -j "" -c
AbsorbedUpper
You can use this from command line or as a Python library.
(Disclaimer, I wrote it).

I was discussing with a colleague, and he suggested one approach.
Take the field name - and pass it through a standard one-way hash (e.g. MD5).
Use the resulting digest as an index to map into a dictionary of English words (e.g. using mod).
That solves the issue of it always being repeatable - the same word hashed each time will always map to the same English word (assuming your dictionary list does not change).
If individual companies were worried about dictionary attacks (i.e. the field name "firstname" would always map to, say, "Paris"), then we could also use a company-specific keyfile to salt the hash. This means it would be repeatable for anonymised logfiles from that company (i.e. "firstname" might always map to "Toulouse" for them), but it would not be the same as for other companies who use other keyfiles.
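A rough sketch of what that might look like in Python (the word list and salt values here are only placeholders):

import hashlib

WORDLIST = ["jerrycan", "doorway", "fodmap", "lizzy", "nasbeth"]  # placeholder; a real list would have thousands of words
COMPANY_SALT = b"contents-of-the-per-company-keyfile"

def obscure(field_name, words_per_name=2):
    # Join a couple of dictionary words so the output stays short while the
    # collision space grows to len(WORDLIST) ** words_per_name.
    parts = []
    for i in range(words_per_name):
        digest = hashlib.md5(COMPANY_SALT + field_name.encode("utf-8") + bytes([i])).digest()
        parts.append(WORDLIST[int.from_bytes(digest, "big") % len(WORDLIST)].capitalize())
    return "".join(parts)

print(obscure("HumanReadable"))  # repeatable: same salt and word list -> same output

With a word list of 5,000 entries and two words per field name, that gives about 25 million possible outputs, which keeps the collision risk low for a typical schema.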
I'm still very keen to see what other people can suggest, or whether they might have any thoughts on the above.

Related

Turning unique filepath into unique integer

I oftentimes use filepaths to provide some sort of unique id for some software system. Is there any way to take a filepath and turn it into a unique integer in a relatively quick (computationally cheap) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
fs.statSync(filename).ino
@djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path into a big integer space, e.g. using a 128-bit murmur hash (in Java I'd use the Guava Hashing class; there are several js ports). By the birthday bound, the chance of a collision among a billion paths is still only about 1 in 2^69. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
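A sketch of that idea in Python, using blake2b from the standard library purely as a stand-in for the 128-bit murmur hash (any stable 128-bit hash gives the same collision argument):

import hashlib

def path_to_id(path):
    # 16-byte digest -> one 128-bit integer per path, stable across runs and machines.
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=16).digest()
    return int.from_bytes(digest, "big")

print(path_to_id("/var/log/app/server.log"))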
This is just my comment turned into an answer.
If you keep everything in memory, you can use one of the standard hashmaps of your language - not just for file names, but for any similar situation. Hashmaps in most programming languages resolve collisions with buckets, so the hash value together with the corresponding bucket number provides a unique id.
By the way, it is not hard to write your own hashmap, so that you have control over the underlying structure (e.g. to retrieve the bucket number, etc.).

Fuzzy String Matching

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting each of the words out of the string (removing any I define as noise words), then implementing some logic to determine which place names in my data store best match (based on the keys from whichever algorithm I choose). The advantage I see in this is that I could selectively tighten or loosen the match criteria to suit the application; however, it does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there, that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.
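To make option 1 concrete, this is roughly the shape of what I was considering - difflib from the standard library stands in here for whichever phonetic key or distance function I end up choosing, and the noise-word list is only an example:

import difflib

NOISE_WORDS = {"the", "of", "on", "upon"}  # example only

def tokens(name):
    return [w for w in name.lower().split() if w not in NOISE_WORDS]

def similarity(candidate, stored):
    # Score each input word against its best match in the stored name, then average.
    cand, kept = tokens(candidate), tokens(stored)
    if not cand or not kept:
        return 0.0
    best = [max(difflib.SequenceMatcher(None, w, k).ratio() for k in kept) for w in cand]
    return sum(best) / len(best)

# Tightening or loosening the match criteria is then just a matter of moving this threshold.
print(similarity("Newcastle upon Tyne", "Newcastle on Tyne") > 0.8)  # True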
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" craigslist posts across various cities to search for duplicates (note: I'm not a CL employee, this was just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, only character-based.

How to test mysqli's real_escape_string()?

I'm fairly new to mysqli (not to mysql!) and I'm updating a function that currently uses mysql and secures a string (or a (recursive) array of strings) for basic sanitisation.
The php.net mysqli::real_escape_string() manual has a very clear warning about that charset. I've implemented this.
How do I test this? I can't find the information I'm looking for. I guess I'm looking for specific input strings that would produce an insecure result, and ones whose result would be considered safe.
I don't mean "add slashes to ' <- those single quotes". I'm looking for some more advanced tricks or vulnerabilities.
I'm also not looking for prepared statements. Those are wonderful and I'd love to use them, but because I'm updating an old system as fast as possible, they are not an option at this point in time. I'll be adding them in the future.
Here is the code you are looking for, I believe. Just change mysql to mysqli.
Also please note that:
this function is not meant to "secure" strings but to format them. That means every string that goes into a query has to be processed, no matter whether you consider it "dangerous" or not.
this function has to be used to format SQL string literals only. It is utterly useless for all other query parts.
this function should not be used in application code directly, but only to support emulated prepared statements.
Anyway, if your database encoding is conventional UTF-8, there is no point in bothering with the encoding at all. The "clear warning" really only applies to some marginal and extremely rarely used encodings.

How do AV engines search files for known signatures so efficiently?

Data in the form of search strings continues to grow as new virus variants are released, which prompts my question: how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies whether the file is a threat or not based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. It's still relevant for easily detecting malware such as worms, which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search can all be done in O(log n) time, which is pretty fast even for large values of n (n being the number of entries). Alternatively, if plenty of memory is available, a hash table can be used, which gives O(1) insertion, deletion and search; this can be faster as n grows, provided a good hashing technique is used.
So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
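As a toy sketch of the above (the stored hash is a placeholder; a real scanner would load millions of signatures from its database and might hash only the critical sections of a file):

import hashlib

KNOWN_BAD = {
    "5d41402abc4b2a76b9719d911017c592",  # placeholder entry (it is just md5 of "hello")
}

def is_known_malware(path):
    # Whole-file hash, then an O(1) membership test against the signature set.
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest in KNOWN_BAD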
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
I don't know how a practical AV works, but I think the question is related to finding the words of a given dictionary in a long text.
For that problem, data structures like a trie (which is what the Aho-Corasick automaton above is built on) make it very fast: processing a text of length N against a dictionary of K words takes only O(N) time.
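A small sketch of that, assuming the third-party pyahocorasick package (the signature ids and patterns are made up):

import ahocorasick

SIGNATURES = {
    "sig-001": "first made-up byte pattern",
    "sig-002": "second made-up byte pattern",
}

automaton = ahocorasick.Automaton()
for sig_id, pattern in SIGNATURES.items():
    automaton.add_word(pattern, sig_id)
automaton.make_automaton()  # builds the failure links

def scan(text):
    # A single pass over the text finds every signature, however many are loaded.
    return [sig_id for _end, sig_id in automaton.iter(text)]

print(scan("...contains the second made-up byte pattern..."))  # ['sig-002']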

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer. In terms of answering whether they could be guessed or not: they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, the ranges of valid ObjectIDs for a normal data set should be fairly easy to work out, since they are all prefixed with a timestamp (unless you are manipulating this in some way).
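To illustrate how much that timestamp prefix alone gives away, here is a small sketch that recovers a document's creation time from nothing but the hex string (the ObjectID below is just an example value):

from datetime import datetime, timezone

def objectid_timestamp(oid_hex):
    # The first 4 bytes of an ObjectID are the creation time in seconds since the Unix epoch.
    return datetime.fromtimestamp(int(oid_hex[:8], 16), tz=timezone.utc)

print(objectid_timestamp("5349b4ddd2781d08c09890f3"))  # 2014-04-12 21:49:17+00:00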
Mongo's ObjectIds were never meant to be a protection against brute force attacks (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user just because that user should not know its id.
For an actual protection of your resources, employ other techniques.
If you defend against an unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny for everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.
