Generating a pseudo-natural phrase from a big integer in a reversible way

I have a large and "unique" integer (actually a SHA1 hash).
Note: While I'm talking here about SHA1 hashes, this is not a cryptography / security question! I'm not trying to break SHA1. Imagine a random 160-bit integer instead of SHA1 if that will help.
I want (for no other reason than to have fun) to find an algorithm to map that SHA1 hash to a computer-generated (pseudo-)English phrase. The mapping should be bidirectional (i.e., knowing the algorithm, one must be able to calculate the original SHA1 hash from that phrase.)
The phrase need not make sense. I would even settle for a whole paragraph of nonsense. (Though the quality, the "Englishness", of a paragraph should probably be better than that of a mere phrase.)
A better algorithm would produce shorter, more natural-looking, more unique phrases.
A variation: it is OK if I can work with only part of the hash; say, the first six hex digits would be fine.
The possible usage of the generated phrase: a human-readable version of a Git commit ID, to use as a motto for the program version built from that commit. (As I said, this is "for fun". I don't claim that this is very practical, or much more readable than the SHA1 itself.)
Possible approach: In the past I've attempted to build a probability table (of words) and generate phrases as Markov chains, seeding the generator (picking branches from the probability tree) according to the bits I read from the SHA. This was not very successful; the resulting phrases were too long and ugly. I'm not sure whether this was a bug or a general flaw in the algorithm, since I had to abandon it fairly early.
Now I'm thinking about attempting to solve the problem once again. Any advice on how to approach this? Do you think a Markov chain approach can work here? Something else?

A very simple approach would be:
Take lists of, say, 1024 nouns, 1024 verbs, and 1024 adjectives. Your phrase could then be a sentence of the form
noun[bits 1-10] verb[bits 11-20] adjective[bits 21-30] verb[bits 31-40],
noun[bits 41-50] verb[bits 51-60] adjective[bits 61-70] verb[bits 71-80],
noun[bits 81-90] verb[bits 91-100] adjective[bits 101-110] verb[bits 111-120] and
noun[bits 121-130] verb[bits 131-140] adjective[bits 141-150] verb[bits 151-160].
With a bit more linguistic thought you can probably construct slightly more complicated and thus less repetitive-looking sentences (say, a bit for singular/plural, a bit or two for different tenses, ...). Longer word lists use up a few more bits, but my guess is that you reach rather exotic words quite fast.
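A minimal Python sketch of this bit-slicing idea, with placeholder word lists (a real implementation would ship fixed, curated lists of exactly 1024 words each so the mapping stays stable):

import hashlib

# Placeholder word lists; real lists need exactly 1024 fixed entries each
# so every 10-bit slice maps to one word and back.
NOUNS = [f"noun{i}" for i in range(1024)]
VERBS = [f"verb{i}" for i in range(1024)]
ADJECTIVES = [f"adj{i}" for i in range(1024)]

# noun verb adjective verb, repeated four times: 16 slots * 10 bits = 160 bits.
TEMPLATE = [NOUNS, VERBS, ADJECTIVES, VERBS] * 4

def phrase_from_sha1(digest: bytes) -> str:
    n = int.from_bytes(digest, "big")          # the 160-bit integer
    words = []
    for i, wordlist in enumerate(TEMPLATE):
        index = (n >> (150 - 10 * i)) & 0x3FF  # take the next 10-bit slice
        words.append(wordlist[index])
    return " ".join(words)

digest = hashlib.sha1(b"example commit").digest()
print(phrase_from_sha1(digest))
# Reversal: look up each word's index in its list and re-assemble the bits.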

Well, let's see... The English language has about 1,000,000 words. That's about 20 bits per word. SHA1 is 160 bits, so you'll need 8 words. Theoretically, all you have to do is take the n-th word of the Oxford English Dictionary, where n is each successive group of 20 bits.
Now, to make it more natural, you can try to add "in/at/on/and/the..." between words, according to their type (nouns, verbs, ...), using some simple algorithm. (You should remove all these words from your base dictionary, of course.)
The algorithm is reversible: just remove all the words you've added, and convert each remaining word to its 20-bit index.
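A sketch of that round trip in Python, assuming a fixed word list of 2^20 entries (the list here is a stand-in; a real one would come from an actual dictionary, and any filler words would be stripped before decoding):

import hashlib

# Stand-in dictionary of 2**20 entries; both directions must use the same list.
WORDS = [f"word{i}" for i in range(2**20)]
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode(digest: bytes) -> str:
    n = int.from_bytes(digest, "big")                        # 160-bit integer
    chunks = [(n >> (140 - 20 * i)) & 0xFFFFF for i in range(8)]
    return " ".join(WORDS[c] for c in chunks)

def decode(phrase: str) -> bytes:
    n = 0
    for word in phrase.split():          # filler words assumed already removed
        n = (n << 20) | INDEX[word]
    return n.to_bytes(20, "big")

digest = hashlib.sha1(b"v1.2.3").digest()
assert decode(encode(digest)) == digest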
Also, try google "insult generator". Some of those generators are pretty nice. I'm not sure about the number of combinations, though.
You can buy the Oxford English Dictionary on CD-ROM with more than 500,000 words (19-bit). I'm not sure if it would be easy to extract the words and their types, however. I'm not sure if it is legal, but I think you can't claim a patent on dictionary entries...

This is an old question but entropoetry is a JavaScript (Node/frontend) library that also solves this problem. It combines Markov poetry with Huffman coding, so given the same dictionary (i.e., the same version of the library), converting poetry↔︎numbers will be bidirectional.
Example, from the Node command line:
> var Poet = require('entropoetry'); var p = new Poet();
> p.stringify(Buffer.from('deadbeef', 'hex'))
'old trick of loving you\nif you but'
> console.log(p.parse(`old trick of loving you
... if you but`))
<Buffer de ad be ef>
And as technology marches on, what seemed like a “fun only” idea in 2011 has some real uses in 2017: memorizing cryptocurrency private keys (brain wallet), Dat/IPFS links, etc.

A hash function means it is not possible (within reasonable limits) to get the data back from the hash, unless the function is broken (insecure).
The question would then amount to breaking the SHA-1 hash algorithm; look it up on Google, it's not broken to that extent. So no, you cannot recover an English phrase from a SHA-1 hash code; if you can, please write a huge paper about it; lots of papers are useless, but this would be a breakthrough :-)
Edit: if only part of the hash is enough, I suggest just brute force (plus a simple map of hash<->phrase, possibly in a file or DB); breaking the hash algorithm itself is very "strong soup" (a difficult problem).
Edit 2: guys, be more specific when asking questions, it's not my fault... I will not delete this, so that it scares off any other crypto guys around :-)

Related

Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online so I'm looking for suggestions.
Suppose I want to collect a list containing over a hundred thousand strings, whose values are sentences that users have typed. A value is added to the list as soon as a user types a new message. For example:
["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
But I also want to have a timeout for each string, so that if it is not repeated within 5 minutes it is removed. So I change it to the following format:
[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]
Now suppose a user types the following message:
Good morning, everyBODY
I want to compare this string to all the messages in the list and, if one is 70% or more similar, update that message's count; otherwise insert it as a new message. For this message, for example, the application should update the count for Good morning, everyone to 2.
Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deletion after the timeout.
What is the best way to implement this? Or are there any libraries that can help me with this?
NOTE: The strings do not need to be in an array, any data structure would work.
The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value, for example: warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes.
I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.
Update:
I plan on using Levenshtein distance for string similarity; however, the main problem is how to apply that to a list of strings in the most time-efficient way, without having to check every single string every time a new message is added.
Levenshtein distance
Unlike the other answer, I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein have a Levenshtein distance of only 1, and can thus likely be concluded to be the same message.
However, if you want to use this distance, you will have to compute the distance between the new message and every stored message, every time a new message comes in. There is likely no way around this.
Unfortunately there is no really useful pre-processing you can do for this.
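A sketch of that brute-force comparison, with a plain Levenshtein implementation and the 70% threshold from the question (the message store is a simple dict here; timeouts are omitted):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

counts = {}  # message -> count

def add_message(msg: str, threshold: float = 0.7) -> None:
    for stored in counts:
        if similarity(msg, stored) >= threshold:
            counts[stored] += 1
            return
    counts[msg] = 1

add_message("Good morning, everyone")
add_message("Good morning, everyBODY")
print(counts)  # {'Good morning, everyone': 2}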
Other possibilities
If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.
This leaves us with two problems to solve: generating the fixed-length vector, and doing the search.
Fixed-size vector representation
There are multiple ways of doing this, all with their own set of drawbacks. I'll specifically mention two, but it will depend on your architecture and data which method is best for you.
First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the words in the message, and use that as your vector. The drawbacks of this method are that it will ignore word order, and it will work less well if the words used tend to be very informal. If your messages have their own culture to them (such as for example Twitch chat) you would have to retrain these vectors instead of using pre-trained ones.
Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
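A sketch of such a bigram occurrence vector, folding bigrams into a fixed number of buckets to keep the size bounded (the bucket count of 512 is an arbitrary choice):

import zlib
from collections import Counter

def bigram_vector(text: str, buckets: int = 512) -> list[int]:
    # Count every 2-character combination, then fold the counts into a
    # fixed-size vector by hashing each bigram into one of `buckets` slots.
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    vector = [0] * buckets
    for bigram, n in counts.items():
        vector[zlib.crc32(bigram.encode()) % buckets] += n
    return vector

v1 = bigram_vector("Good morning, everyone")
v2 = bigram_vector("Good morning, everyBODY")
# Similar messages light up mostly the same buckets, and any nearest-neighbour
# method can now operate on these fixed-length vectors.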
Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.
Nearest neighbor search
Given that we have fixed length vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.
I'll choose to discuss three categories.
Approximate search: This method may seem a little silly, but it could be what you want. Specifically, Locality-sensitive hashing is essentially just making some hashing function where "similar" vectors are likely to end up in the same bucket. You could then do anything you want, such as Levenshtein, with all of the other members of the bucket, because there should not be too many of them. The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-length vectors. A downside, of course, is that it is not guaranteed to work.
Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.
Fancy data structures: This seems to me most likely to be the right solution. Unfortunately there are a lot of letter-trees to choose from. You mention B-trees, but there are also R-trees, R+-trees, R*-trees, X-trees, and that's just the direct descendants of the R-tree. At the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree (see the sketch below). It can do nearest neighbor search in logarithmic time, as well as insertion and deletion.
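A sketch of the k-d tree option, assuming SciPy is available and that messages have already been mapped to fixed-length vectors (random vectors stand in for real ones here):

import numpy as np
from scipy.spatial import cKDTree

# Stand-in data: 1000 messages already mapped to 64-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.random((1000, 64))

tree = cKDTree(vectors)

query = rng.random(64)
distance, index = tree.query(query, k=1)        # nearest stored vector
print(index, distance)

# Fixed-radius variant: indices of all stored vectors within 0.5 of the query.
nearby = tree.query_ball_point(query, r=0.5)

Note that SciPy's cKDTree is static; a hand-rolled or third-party k-d tree would be needed for the insertion and deletion mentioned above.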
You want to convert all of the words to their Soundex value.
Then you need a database of Soundex values that ranks the importance of each word in the sentence; e.g. "the" should probably get 0. The more information a word carries, the higher its value.
Then sort the words in the sentence into a list of integers.
Use the list of integers as the key to find similar sentences.
Since the key is a list of integers, a rose tree should work as the data structure.
While some may suggest measuring with something like Levenshtein distance, that presupposes that the sentences have no spelling mistakes or the like. You need something that is flexible enough to deal with human error.
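A sketch of turning a sentence into sorted Soundex keys, assuming the third-party jellyfish package for the Soundex codes and a made-up stop-word set standing in for the importance ranking:

import jellyfish  # third-party package providing a standard Soundex implementation

STOP_WORDS = {"the", "a", "an", "is", "to"}  # low-information words, weighted 0 here

def sentence_key(sentence: str) -> tuple[str, ...]:
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    codes = sorted(jellyfish.soundex(w) for w in words if w and w not in STOP_WORDS)
    return tuple(codes)

print(sentence_key("Good morning, everyone"))
print(sentence_key("Good morning, everyBODY"))  # shares most codes with the line above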
I would suggest you use Algolia, which has its own ranking algorithm that rates each matching record on several criteria (such as the number of typos or the geo-distance) and assigns each an individual integer score.
I would definitely take a look at it, since they have search-as-you-type and different ranking criteria.
https://blog.algolia.com/search-ranking-algorithm-unveiled/
I think a search engine like Solr or Elasticsearch is the best fit for your problem.
You can create a single collection in which you store the data as you described in the question; after that you just add the data to Solr and query it with your time limit.

Using "seed" based math to recreate application instances

Okay, so I was thinking today about Minecraft, a game which I'm sure so many of you are familiar with, and while my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is: is there any way a type of "seed" or string of characters can be used to recreate an instance of a program (not in the literal programming sense) by storing a code which, when re-entered into the program as a string at run-time, could recreate the data it once held, in fields, text boxes, canvases, for example, exactly as it was?
As I understand it, Minecraft takes the string of ASCII characters you enter, which are really all numbers, and performs a series of operations on it which evaluate to some type of hash or finite number... this number (again, as I understand it) is the representation of the string you entered. So it makes sense that a string, when parsed by this algorithm, will always evaluate to the same hash; 1 + 1 will always equal 2, so a seed's value must always equal that seed's value in the end. In doing so, you have the ability to replicate worlds exactly, by entering this sort of key which is evaluated the same on every machine.
Now, if we can exactly replicate worlds like this, is it possible to bring it into a more abstract concept like the following?
Say you have an application, like Microsoft Word. Word saves the data you have entered as a file on your hard drive: it holds formatting data, the strings you've entered, the format of the file... all that in a physical file. Now imagine if, when you entered your essay into Word, instead of saving it and bringing your laptop to school, you instead clicked on "parse", and instead of creating a file, you were given a hash code. Now you go to school knowing you have to print it, so you log onto the computer and open Word. Instead of "open" there is now an option called "evaluate"; you click it, enter the hash your other computer produced, and it recreates the exact essay you had written.
Is this possible, and if so, are there obvious implementations of this I simply am not thinking of, or ones that are so seemingly part of everyday life that I don't recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed, and I think this explains it well:
"The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616. Text strings entered will be "hashed" to one of the numeric seeds in the above range. The "Seed for the World Generator" window only allows 32 characters to be entered and will not show or use any more than that."
BUT looking back on it, lossless compression IS EXACTLY what I was describing, after re-reading the wiki page and remembering that (you are very correct) the seed only partakes in the generation; the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw uncompressed data in a file.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate the exact same world, forgetting the seed was only responsible for generating the code, not the saving or compression of it.
So thank you for your help guys! It's really appreciated I believe we can call this one solved!
There are several possibilities to achieve this "string" that recovers your data. However, they're not all applicable, depending on the context.
An actual seed, which initializes for example a pseudo-random number generator, then allows you to recreate the same sequence of pseudo-random numbers (see this question).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly then this would be applicable: with the same seed, the same gibberish comes out.
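A minimal illustration of the seed idea in Python: the same seed always reproduces the same pseudo-random sequence, so only the seed needs to be stored.

import random

def generate_world(seed: int, size: int = 5) -> list[int]:
    # Deterministic "world generation": the same seed yields the same terrain.
    rng = random.Random(seed)
    return [rng.randint(0, 255) for _ in range(size)]

print(generate_world(8675309))  # always the same list
print(generate_world(8675309))  # identical to the line above
print(generate_world(42))       # a different seed gives a different world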
Some key-value dictionary, or hash map. Then the values have to be accessible by both sides, and the string is the key that allows you to retrieve the value.
Think for example of storing your Word file on an online server; then your key is the URL linking to your file.
Compressing all the information that is in your data into the string. This is much harder, and there are strong limits due to the entropy of the data. See Shannon's source coding theorem for example.
You would be better off (as in, it would be easier) to just compress your file with a usual algorithm (zip or 7z or something else), rather than reimplementing it yourself, especially as soon as your document starts having fancy things (different styles, tables, pictures, unusual characters...)
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself shows in Prediction and Entropy of Printed English (Bell System Technical Journal, 30:1, January 1951, pp. 50-64) that there are about 2.14 bits of entropy per letter in English. At 8 bits per character, a 32-character seed string holds 256 bits, which covers only about 120 characters of English text (256 / 2.14 ≈ 120).
While this is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely to be impossible to encode a document in English in less than a fourth of its size. Then you'd still have to add punctuation, and all the rest of the fuss.

Most efficient algorithm for comparing a string with a group of strings

I have a scenario where a user can post a number of responses or phrases via a form field. I would like to be able to take the response and determine what they are asking for. For instance if the user types in car, train, bike, jet .... I can assume they are talking about a vehicle, and respond accordingly. I understand that I could use a switch statement or perhaps a regexp as well, however the larger the number of possible responses, the less efficient that computation will be. I'm wondering if there is an efficient algorithm for comparing a string with a group of strings. Any info would be great.
You may want to look into the Aho-Corasick algorithm. If you have a collection of strings that you want to search for, you can spend linear time doing preprocessing on those strings and from that point forward can, in O(n) time, check for all possible matches of those strings in a text corpus of length n. In other words, with a small preprocessing time to set up the algorithm once, you can extremely efficiently scan over numerous inputs again and again searching for those keywords.
Interestingly enough, the algorithm was specifically invented to build a fast index (that is, to look for a lot of different keywords in a huge body of text), and allegedly outperformed other methods by a factor of ten. I think it would work great in your application.
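A sketch using the third-party pyahocorasick package (the keyword set here is made up to match the question):

import ahocorasick  # third-party package "pyahocorasick"

keywords = ["car", "train", "bike", "jet"]

automaton = ahocorasick.Automaton()
for index, word in enumerate(keywords):
    automaton.add_word(word, (index, word))
automaton.make_automaton()

text = "I took the train and then rented a car"
for end_position, (index, word) in automaton.iter(text):
    print(word, "found ending at index", end_position)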
Hope this helps!
If you have a large number of "magic" words, I would suggest splitting the query into words, and using a hash-based lookup to check whether the words are recognized.
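A sketch of that hash-based lookup, with a made-up category table (dictionary lookups are O(1) per word on average):

CATEGORY = {
    "car": "vehicle", "train": "vehicle", "bike": "vehicle", "jet": "vehicle",
    "apple": "fruit", "banana": "fruit",
}

def categorize(query: str) -> set[str]:
    words = (w.strip(".,!?").lower() for w in query.split())
    return {CATEGORY[w] for w in words if w in CATEGORY}

print(categorize("I'd like a car, a bike or maybe a jet"))  # {'vehicle'}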
You can check out the trie data structure. I think it is one of the best solutions for your problem.

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance; you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos the users make (typo = no direct match) and build an offline correction database which contains all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from that.
Soundex or Metaphone might work.
I think what you are looking for falls into the huge field of Data Quality and Data Cleansing. I doubt you will find a ready-made Python implementation for this, as it has to be something that cleanses a considerable amount of data in a database, which tends to be of commercial value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use partial matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
The other idea would be to use a trie for speed-up, which can run DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down in the trie, you just need to check when you have fully consumed the input and then return all leaves.
Check this one: http://docs.python.org/library/difflib.html
It should help you.
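A quick illustration with difflib.get_close_matches (the cutoff value is just a starting point to tune):

import difflib

names = ["best buy", "mcdonalds", "sony", "apple"]

# get_close_matches uses SequenceMatcher ratios rather than Levenshtein
# distance, but it handles typos like "appel" out of the box.
print(difflib.get_close_matches("appel", names, n=3, cutoff=0.6))               # ['apple']
print(difflib.get_close_matches("Mc'donalds".lower(), names, n=3, cutoff=0.6))  # ['mcdonalds']
print(difflib.get_close_matches("best-buy", names, n=3, cutoff=0.6))            # ['best buy']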

Why are passwords with repeating substrings weak?

Many websites have a password strength checking tool, which tells you how strong your password is.
Let's say I have
st4cK0v3rFl0W
which is always considered super strong, but when I do
st4cK0v3rFl0Wst4cK0v3rFl0W
it is suddenly super weak. I've also heard that when a password has just a small repeating sequence, it is much weaker.
But how can the second one possibly be weaker than the first, when it is twice as long?
Sounds like the password strength checker is flawed. It's not a big issue, I suppose, but a repeated strong password is not weaker than the original password.
My guess is that it's simply trivial for someone attacking your password to check for. Trying each candidate password doubled and tripled too is only double or triple the work. However, including more possible characters in a password, such as punctuation marks, raises the complexity of brute-forcing your password much more.
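A back-of-the-envelope comparison of the search spaces involved (the 62-symbol character set is a rough assumption):

# Rough keyspace arithmetic: letters plus digits, about 62 symbols per position.
symbols = 62
single = symbols ** 13           # a 13-character password like st4cK0v3rFl0W
doubled_rule = 2 * single        # attacker also tries every candidate doubled
truly_random_26 = symbols ** 26  # 26 independent random characters

print(f"{single:.3e}")           # ~2e+23 guesses
print(f"{doubled_rule:.3e}")     # ~4e+23 guesses: only twice the work
print(f"{truly_random_26:.3e}")  # ~4e+46 guesses: astronomically more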
However, in practice, nearly every non-obvious (read: impervious to dictionary attacks [yes, that includes 1337ifying a dictionary word]) password with 8 or more characters can be considered reasonably secure. It's usually much less work to social engineer it from you in some way or just use a keylogger.
I guess it's because you need to type your password twice on the keyboard, so maybe if someone is standing in front of you they can notice it.
The algorithm is broken.
Either it uses doublet detection and immediately writes the password off as bad, or it calculates a strength that is in some way relative to the string length, in which case the repeated string is weaker than a comparable totally random string of equal length.
It might be a flaw in the password strength checker: it recognises a pattern... A pattern is not good for a password, but in this case it is a pattern built on a complex string... Another reason could be the one pointed out in the answer from Wael Dalloul: someone can see the repeated text when you type it. Any spies have two chances of seeing what you type...
The best reason that I could think of comes from the Electronic Authentication Guideline, published by NIST. It gives a general rule of thumb on how to estimate the entropy of a password.
Length is just one criterion for entropy. The password's character set is also involved, but these are not the only criteria. If you read the guideline's Shannon-entropy-based treatment of user-selected passwords closely, you'll notice that higher entropy is assigned to the initial characters and less to the later ones, since it is quite possible to infer the next characters of a password from the previous ones.
This is not to say that longer passwords are bad, just that long passwords with a poor selection of characters are just as likely to be weak as shorter passwords.
My guess is it can generate a more "obvious" hash.
For example abba -> a737y4gs, but abbaabba -> 1y3k1y3k. Granted, this is a silly example, but the idea is that repeating patterns in the key would make the hash appear less "random".
Whilst in practice the longer one is probably stronger, I think there may be potential weaknesses when you get into the nitty gritty of how encryption and ciphers work... possibly...
Other than that, I'd echo the other responses that the strength-checker you're using isn't taking all aspects into account very accurately.
Just a thought...
