Efficient implementation of BPE using priority queue - string

I think that it is not strictly BPE (byte pair encoding), but there is a similar idea applied to strings.
Suppose there are three Chinese words in the dictionary (I will use a huge dictionary like CEDICT for practical use.)
我
喜欢
水果
Then take an input like this below.
我喜欢水果 (I like fruit)
Since Chinese texts are not splitted by white spaces, it's difficult to process.
We can decompose the input string into multiple single characters.
我 喜 欢 水 果
Then lookup new symbol pair at [left, right] and combine them. If the combined word is in the dictionary, we can replace the combined word with a new symbol.
我喜
喜欢 <- in the dic
欢水
水果 <- in the dic
We found two new symbols, so the input text becomes
我 喜欢 水果
We should iterate until we cannot find any combined word in the dictionary. In this case, we cannot find a new symbol in the dictionary.
我喜欢 水果
喜欢水果
It's not difficult to implement this naively but we need to scan adjoining two words many times. Some said we can implement BPE efficiently with a priority queue. I'm not familiar with compression algorithms. I would be grateful if someone could tell me the implementation or useful documentations.
In this method, out of vocabulary words are decomposed into single characters, so we can avoid unknown words problems.
Best regards,
Reference: Neural Machine Translation of Rare Words with Subword Units He had to start with pre-tokenized words because of computational complexity.

I would suggest storing the dictionary as a trie using hash lookups at each level. This replaces your scans with hash lookups, which are O(1).

Related

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class to find the solution to this problem. It's similar to an interview question. If someone could explain the thinking process or solution to the problem. Pseudocode can be used. So far i've been thinking to use tries to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (ie: the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
You could use rolling hashes of different lengths to speed up the search.
You can try a partial pattern matcher for example aho-corasick algorithm. Basically it's a special space optimized version of a suffix tree.

Finding mnemonic sentences and acrostics

There are many websites with which one can carry out searches to return results with specific constraints from a dictionary with the purpose of solving word puzzles.
But is there a way to search a specific text document, other than a dictionary, so that one can locate meaningful mnemonic sentences from the text of one's choice?
For example, if I want to find a sentence to help me remember the order of a list of letters, M, I, S, S, C, G, which are the first letters of a list of words I need to remember: Muscular, Infrahyoid, Superior laryngeal, Sternomastoid, Cricothyroid, Glandular, how can I search a text document (or multiple documents) for sentences in which these letters appear as the first letters in the words of that sentence?, in a specified order?
This would allow me to memorize sentences from the text of my choice, which are meaningful and useful, and simultaneously memorize the order of the letters which I need to recall the list of vocabulary words.
Sure, I can generate a mnemonic of random, meaningless words, which would serve the purpose of memory recall just fine. I might end up with a funny sentence like: "May I Softly Squeeze Charlie's Girl?" to help me remember the above list of words. However, I would much prefer to find mnemonic sentences from a text of my choice which are personally meaningful to me.
Any suggestions you might have would be greatly appreciated.

Is there a faster solution to splitting string into words from dictionary

Given a string s and array a of words, split s using words from a so that the least characters were left without matching to any of words.
So given s = 'aabbac' and a = {'aabb', 'c', 'aab', 'bac'} I expect s to be splited into aab|bac not into aabb|a|c because the last option gives me an extra character.
Is there any solutions faster than O(|s|*|a|) with dynamic programming and hashes?
Seems to me that optimal processing involves scanning through s from left to right, using a trie to track matching words in a, keeping partial solutions in a vector or similar along with an indication of whether they currently are part-way through matching some part of the trie, or are waiting to start a new match, and the cummulative count of characters matched so far. You'd have to expand the vector when alternative matches/terminations are found, and you can "condense" all entries that aren't mid-match to an arbitrary selection therefrom with the (equal) highest cummulative count. I don't see any use for hashes - that implies something far slower than a trie - extracting and testing candidate words without any clear idea whether you're already "inside" a partial match.

Finding which word is occurring in given sentence

I've list of words. Number of words is around 1 million.
I've strings coming at runtime, I've to check which word from the list is present in string and return that word (need not to return all words occurring in sentence, returning first one also suffice the requirement).
One solution is checking all words one by one in string but it's inefficient.
Can someone please point out any efficient method of doing it?
Use the Knuth-Morris-Pratt algorithm. Although a million words is not all that much. You can also convert your text body into a Trie structure and then use that to check your search list against. There is a special kind of Trie called a Suffix Tree used especially for full text searching.
Put your word list in a tree or hash table.
Unless your word's list is ordered (or inserted in a efficient data structure like an ordered binary tree) to perform a binary search, the solution you are proposing is the most efficient one.

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sounds like English words? ie. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a markov-chain of a huge english text.
Afterwards you can feed words into the markov chain and check how high the probability is that the word is english.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the markov text generator. What you want is exactly the reverse of it.
In a nutshell: The markov-chain stores for each character the probabilities of which next character will follow. You can extend this idea to two or three characters if you have enough memory.
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams—pairs of adjascent letters—and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available but only did a cursory google so far), but you can calculate a bigram or trigram table from yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome prounciations. This is another argument for trigrams rather than bigrams—the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
def soundex(name, len=4):
digits = '01230120022455012623010202'
sndx = ''
fc = ''
for c in name.upper():
if c.isalpha():
if not fc: fc = c
d = digits[ord(c)-ord('A')]
if not sndx or (d != sndx[-1]):
sndx += d
sndx = fc + sndx[1:]
sndx = sndx.replace('0','')
return (sndx + (len * '0'))[:len]
real_words = load_english_dictionary()
soundex_cache = [ soundex(word) for word in real_words ]
if soundex(candidate) in soundex_cache:
print "keep"
else:
print "discard"
Obviously you'll need to provide an implementation of read_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of english words. If you're doing this on a SQL-server it should be pretty easy to setup a database containing a list of most english words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search-algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing english word, if any, and you could setup some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance-limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, english sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some dipthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be english. You'd find on closer inspection that you will probably strip out a lot of words that do sound like english as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rythm' without explicitly coding in that rythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.

Resources