Finding all legal subsets: backtracking algorithm or dynamic programming?

I am confused about the implementation of the following problem.
I have a bag of items, and a bag of rules specifying which items could be grouped together.
Items [1,2,3]
Rules [{1,2}, {3}, {1,3}, {2}, {1}, {1,2,3,4,5}]
Example output
[[{1,2},{3}], [{1,3}, {2}], [{1},{2},{3}]]
I am not sure what a decent implementation looks like. And what is this type of problem called?
This is similar to parsing with the CYK algorithm. For example, given the sentence "I love you" and a context-free grammar, the words can be grouped according to rules such as {VP --> love you, PRON --> I}. In parsing, however, order matters: you cannot group "I" with "you" because they are not adjacent. In my case that is allowed; for example, {1,3} is a valid group.
Thank you very much
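One possible reading of this problem is "exact cover": choose rules so that every item is covered exactly once. Under that assumption, a minimal backtracking sketch could look like the following. The function name legal_partitions is made up for illustration, and rules that mention items outside the bag (such as {1,2,3,4,5}) are assumed to be unusable.

```python
def legal_partitions(items, rules):
    """Enumerate every way to split `items` into disjoint groups taken from `rules`."""
    items = frozenset(items)
    # Assumption: a rule that mentions an item outside the bag can never be used.
    usable = [frozenset(r) for r in rules if frozenset(r) <= items]
    results = []

    def backtrack(remaining, chosen):
        if not remaining:
            results.append(list(chosen))
            return
        # Always cover the smallest uncovered item next, so each
        # partition is generated exactly once.
        pivot = min(remaining)
        for group in usable:
            if pivot in group and group <= remaining:
                chosen.append(group)
                backtrack(remaining - group, chosen)
                chosen.pop()

    backtrack(items, [])
    return results


if __name__ == "__main__":
    rules = [{1, 2}, {3}, {1, 3}, {2}, {1}, {1, 2, 3, 4, 5}]
    for partition in legal_partitions([1, 2, 3], rules):
        print([sorted(g) for g in partition])
```

Fixing the pivot item is what avoids producing the same grouping in several orders. For large inputs, memoizing on the set of remaining items would turn this into a dynamic program that counts solutions rather than listing them.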

Related

What's the correct implementation of "bag of n-grams"?

I'm reading François Chollet's book "Deep Learning with Python", and on page 204 it suggests that the phrase The cat sat on the mat. would yield the following 2-grams:
{"The", "The cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the", "the mat", "mat"}
However, every implementation of n-grams that I have seen (nltk, tensorflow) encodes the same phrase as follows:
[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]
Am I missing some detail? (I'm new to natural language processing, so that might be the case)
Or is the book's implementation wrong/outdated?
I want to slightly expand on the other answer given, specifically to the "clearly wrong". While I agree that it is not the standard approach (to my knowledge!), there is an important definition in the mentioned book, just before the shown excerpt, which states:
Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.
(the emphasis on "or fewer" is mine). It seems that Chollet defines n-grams slightly differently from the common interpretation (namely, that an n-gram has to consist of exactly n words/chars etc.). With that definition, the subsequent example is entirely consistent, although you will likely find varying implementations of this in the real world.
One example besides the TensorFlow/NLTK implementations mentioned is scikit-learn's TfidfVectorizer, which has the parameter ngram_range. It sits between Chollet's definition and the strict interpretation: you choose an arbitrary minimum and maximum n, and the n-grams of every size in that range are built as in the example above, so a single bag can contain both unigrams and bigrams.
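To make the two readings concrete, here is a small pure-Python sketch. The helper ngrams and its n_min/n_max parameters are made up for illustration; they only imitate the idea behind ngram_range and are not scikit-learn's implementation.

```python
def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


if __name__ == "__main__":
    tokens = "The cat sat on the mat".split()
    # Strict reading: exactly two words per n-gram.
    print(ngrams(tokens, 2, 2))
    # "N or fewer" reading: unigrams and bigrams in one bag.
    print(set(ngrams(tokens, 1, 2)))
```

With n_min = n_max = 2 this gives the five pairs that NLTK-style bigrams produce; with n_min = 1 it gives the eleven-element set from the book.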
The book's implementation is incorrect: it mixes unigrams (1-grams) with bigrams (2-grams).

Extracting <subject, predicate, object> triplet from unstructured text

I need to extract simple triplets from unstructured text. Usually they have the form noun-verb-noun, so I have tried POS tagging and then extracting nouns and verbs from the neighbourhood.
However, this leads to a lot of special cases and gives low accuracy.
Will Syntactic/semantic parsing help in this scenario?
Will ontology based information extraction be more useful?
I expect that syntactic parsing would be the best fit for your scenario. Some trivial template-matching method with POS tags might work, where you find verbs preceded and followed by a single noun, and take the former to be the subject and the latter the object. However, it sounds like you've already tried something like that -- unless your neighborhood extraction ignores word order (which would be a bit silly - you'd be guessing which noun was the subject and which was the object, and that's assuming exactly two nouns in each sentence).
Since you're looking for {s, v, o} triplets, chances are you won't need semantic or ontological information. That would be useful if you wanted more information, e.g. agent-patient relations or deeper knowledge extraction.
{s,v,o} is shallow syntactic information, and given that syntactic parsing is considerably more robust and accessible than semantic parsing, that might be your best bet. Syntactic parsing will be sensitive to simple word re-orderings, e.g. "The hamburger was eaten by John." => {John, eat, hamburger}; you'd also be able to specifically handle intransitive and ditransitive verbs, which might be issues for a more naive approach.
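As a rough illustration of why a parse helps with the passive example, here is a sketch that reads triplets off a dependency parse. The parse is written by hand here; a real parser would supply it. The Token structure, the extract_svo function and the Stanford-style labels (nsubj, dobj, nsubjpass, agent, pobj) are assumptions chosen for the example, and label names vary between parsers.

```python
from collections import namedtuple

# head is the index of the governing token; the root points at itself.
Token = namedtuple("Token", "text lemma dep head")


def extract_svo(tokens):
    """Return (subject, verb lemma, object) triplets for each root verb."""
    triples = []
    for i, tok in enumerate(tokens):
        if tok.dep != "ROOT":
            continue
        subj = obj = None
        for j, child in enumerate(tokens):
            if child.head != i or j == i:
                continue
            if child.dep == "nsubj":
                subj = child.text
            elif child.dep == "dobj":
                obj = child.text
            elif child.dep == "nsubjpass":
                # The grammatical subject of a passive is the logical object.
                obj = child.text
            elif child.dep == "agent":
                # "by John": the object of the agent phrase is the logical subject.
                for t in tokens:
                    if t.head == j and t.dep == "pobj":
                        subj = t.text
        if subj and obj:
            triples.append((subj, tok.lemma, obj))
    return triples


if __name__ == "__main__":
    passive = [
        Token("The", "the", "det", 1),
        Token("hamburger", "hamburger", "nsubjpass", 3),
        Token("was", "be", "auxpass", 3),
        Token("eaten", "eat", "ROOT", 3),
        Token("by", "by", "agent", 3),
        Token("John", "John", "pobj", 4),
    ]
    print(extract_svo(passive))
```

Intransitive verbs simply yield no triplet here; ditransitives would need an extra slot for the indirect object.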

How Do Ranks Work?

The best way for me to understand J is to emulate the interpreter. Since the language is compact and has few rules, that has been easy... with the exception of how ranks affect function evaluation.
I want to be able to look at an expression and know what J is doing to get the result, step by step.
Is there a doc, or could someone give me an algorithm, so I can work out for myself how a f " n m b is evaluated?
Thanks in advance.
For learning about Rank the most accessible text is probably chapter 6 of J for C Programmers. The section of Eric Iverson's Primer that begins with Atom and goes through Checkpoint E covers the topic more concisely. Chapter 7 of Learning J is another place Rank is covered. All are valuable.
The most in-depth examination of Rank is Roger Hui's essay Rank and Uniformity. Hui's paper will make better reading after you've studied the other texts on this topic. Should it come down to wanting the nitty-gritty of implementation, you could dive into the interpreter source code. Personally, I'd not do that last one. Were I wanting to look at implementation algorithms I'd build a little model, and check it against the results of a J interpreter to make sure that my understanding of Rank matches.
Rank, in my view, is the most important concept in J. It is quite abstract in that it applies across all the shapes that nouns can take. The associated concepts are important to learn. These include shape, frame, cell, and agreement. These are explained individually in the Primer, but they're explained in some manner every time the topic is dealt with in depth.
The better your understanding of the Rank conjunction, and the broader world of noun Rank and verb Rank in which it applies, the more useful you'll find the three sections of the Vocabulary that deal with this conjunction. (Those sections are m"n , u"n , and m"v u"v .)
If you do come to write any algorithms that help you examine things in a step-by-step fashion, other J programmers will enjoy seeing them, I'm sure. I don't know of anything along those lines other than the actual interpreter source code.
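As one possible starting point for such a "little model", here is a Python sketch of the monadic case only. It assumes regular nested lists, a non-negative rank, and results that all have the same shape (so no fill is needed); the names shape and apply_rank are made up for illustration and none of this is taken from the interpreter source.

```python
def shape(a):
    """Shape of a regular nested list; an atom has the empty shape ()."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        if not a:
            break
        a = a[0]
    return tuple(dims)


def apply_rank(f, r, y):
    """Model of monadic u"r y: apply f to each rank-r cell of y.

    The leading axes of y form the frame, the trailing r axes form the
    cells; the results are assembled back into the frame.
    """
    if len(shape(y)) <= r:
        return f(y)  # y is itself a single cell
    return [apply_rank(f, r, item) for item in y]


if __name__ == "__main__":
    table = [[1, 2, 3], [4, 5, 6]]
    # Sum each row (rank-1 cells), comparable to +/"1 in J.
    print(apply_rank(sum, 1, table))
    # Treat the whole table as one rank-2 cell and sum down the columns.
    print(apply_rank(lambda t: [sum(col) for col in zip(*t)], 2, table))
```

Checking the output of a model like this against a real J session for a handful of shapes is a quick way to find out where the model is too simple. Dyadic rank and agreement between left and right frames are not covered.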

Algorithm for Negating Sentences

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation.
For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad".
Obviously, accomplishing this with a high degree of accuracy would probably be beyond the scope of current NLP, but I'm sure there has been some work on the subject. If anybody knows of any work, care to point me to some papers?
While I'm not aware of any work that specifically looks at automatically generating negated sentences, I imagine a good place to start would be to read up on linguistics work in formal semantics and pragmatics. A good accessible introduction would be Stephen C. Levinson's book Pragmatics.
One issue that I think you'll run into is that it can be very difficult to negate all the information that is conveyed by a sentence. For example, take:
John fixed the vase that he broke.
Even if you change this to John did not fix the vase that he broke, there is a presupposition that there is a vase and that John broke it.
Similarly, simply negating the sentence John stopped using drugs as John did not stop using drugs still conveys that John, at one point, used drugs. A more thorough negation would be John never used drugs.
Some existing natural language processing (NLP) work that you might want to look at is MacCartney and Manning 2007's Natural Logic for Textual Inference. In this paper they use George Lakoff's notion of Natural Logic and Sanchez Valencia's monotonicity calculus to create software that automatically determines whether one sentence entails another. You could probably use some of their techniques for detecting non-entailment to artificially construct negated and contradicting sentences.
I'd recommend checking out WordNet. You can use it to look up antonyms for a word, so you could conceivably replace "bad" with "not good", since bad is an antonym of good. NLTK has a simple Python interface to WordNet.
The naïve way, of course, is to try to add "not" right after {am,are,is}. I have no idea how well this will work in your setting though; it will probably only work with predicate-like sentences.
For simple sentences, parse looking for adverbs or adjectives given the English grammar rules and substitute an antonym if only one meaning exists. Otherwise use the correct English negation rule to negate the verb (i.e. is -> is not).
High level algorithm:
Look up each word for its type (noun, verb, adjective, adverb, conjunction, etc...)
Infer sentence structure from word type sequences (Your sentence was: article, noun, verb, adjective/adverb; This is known to be a simple sentence.)
For simple sentences, choose one invertible word and invert it. Either by using an antonym, or negating the verb.
For more complex sentences, such as those with subordinate clauses, you will need to have more complex analysis, but for simple sentences, this shouldn't be infeasible.
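For the simple-sentence case, the two strategies above (insert "not" after a form of "to be", or swap in an antonym) can be sketched as follows. The tiny ANTONYMS table is a hypothetical stand-in for a real lexical resource such as WordNet, and negate_simple is a made-up name.

```python
# Hypothetical toy antonym table; a real system would look these up.
ANTONYMS = {"good": "bad", "bad": "good", "hot": "cold", "cold": "hot"}
COPULAS = {"am", "are", "is", "was", "were"}


def negate_simple(sentence):
    """Return candidate negations of a simple copular sentence."""
    words = sentence.rstrip(".").split()
    variants = []
    for i, word in enumerate(words):
        lower = word.lower()
        if lower in COPULAS:
            # Negate the verb: "is" -> "is not".
            variants.append(" ".join(words[:i + 1] + ["not"] + words[i + 1:]))
        if lower in ANTONYMS:
            # Invert one word by substituting its antonym.
            variants.append(" ".join(words[:i] + [ANTONYMS[lower]] + words[i + 1:]))
    return variants


if __name__ == "__main__":
    print(negate_simple("This book is good"))
```

This does nothing sensible for sentences with several clauses or with words that have more than one sense, which is where the parsing and word-type steps come in.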
There's a similar process for first-order logic. The usual algorithm is to map P to not P, and then perform valid translations to move the not somewhere convenient, e.g.:
Original: (not R(x) => exists(y) (O(y) and P(x, y)))
Negate it: not (not R(x) => exists(y) (O(y) and P(x, y)))
Rearrange: not (R(x) or exists(y) (O(y) and P(x, y)))
not R(x) and not exists(y) (O(y) and P(x, y))
not R(x) and forall(y) not (O(y) and P(x, y))
not R(x) and forall(y) (not O(y) or not P(x, y))
Performing the same on English you'd be negating "If it's not raining here, then there is some activity that is an outdoors activity and can be performed here" to "It is NOT the case that ..." and finally into "It's not raining and every possible activity is either not for outdoors or can't be performed here."
Natural language is a lot more complicated than first-order logic, of course... but if you can parse the sentence into something where the words "not", "and", "or", "exists" etc. can be identified, then you should be able to perform similar translations.
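The rewriting steps shown for the logic formula can be mechanized. Below is a minimal sketch that pushes a negation inward until it sits directly on the atoms; the tuple encoding of formulas and the function name nnf are assumptions made for this example.

```python
def nnf(f, negated=False):
    """Push negations inward. Formulas are nested tuples:
    ("atom", name), ("not", f), ("and", a, b), ("or", a, b),
    ("implies", a, b), ("exists", var, f), ("forall", var, f).
    """
    op = f[0]
    if op == "atom":
        return ("not", f) if negated else f
    if op == "not":
        return nnf(f[1], not negated)
    if op == "implies":
        # a => b is equivalent to (not a) or b
        return nnf(("or", ("not", f[1]), f[2]), negated)
    if op in ("and", "or"):
        if negated:  # De Morgan: swap the connective
            op = "or" if op == "and" else "and"
        return (op, nnf(f[1], negated), nnf(f[2], negated))
    if op in ("exists", "forall"):
        if negated:  # a negation flips the quantifier
            op = "forall" if op == "exists" else "exists"
        return (op, f[1], nnf(f[2], negated))
    raise ValueError("unknown operator: %r" % (op,))


if __name__ == "__main__":
    R, O, P = ("atom", "R(x)"), ("atom", "O(y)"), ("atom", "P(x, y)")
    original = ("implies", ("not", R), ("exists", "y", ("and", O, P)))
    print(nnf(original, negated=True))
```

Calling nnf with negated=True on the original formula gives the encoding of not R(x) and forall(y) (not O(y) or not P(x, y)), the last line of the derivation.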
For a rule-based negation approach, you can take a look at the Python module negate [1].
[1] Disclaimer: I am the author of the module.
As for some papers related to the topic, you can take a look at:
Understanding by Understanding Not: Modeling Negation in Language Models
An Analysis of Natural Language Inference Benchmarks through the Lens of Negation
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Nice demos using NLTK - http://text-processing.com/demo and a short writeup - http://text-processing.com/demo/sentiment/.

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I keep only the strings that sound like English words? i.e. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a Markov chain from a large body of English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the markov text generator. What you want is exactly the reverse of it.
In a nutshell: The markov-chain stores for each character the probabilities of which next character will follow. You can extend this idea to two or three characters if you have enough memory.
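A minimal sketch of this idea, assuming a first-order chain over characters with add-one smoothing. The names train and score, the "^" and "$" boundary markers and the smoothing constant are choices made for the example, and the word list in the usage part is far too small for real use.

```python
import math
from collections import Counter, defaultdict


def train(words):
    """Count, for each character, which characters follow it."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word.lower() + "$"  # mark start and end of word
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def score(word, counts, alphabet_size=27):
    """Average log-probability per transition; higher means more plausible."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        seen = sum(row.values())
        # Add-one smoothing so unseen transitions are unlikely, not impossible.
        total += math.log((row.get(b, 0) + 1) / (seen + alphabet_size))
    return total / (len(padded) - 1)


if __name__ == "__main__":
    corpus = ["lord", "word", "lore", "order", "road",
              "load", "old", "gold", "door", "lard"]
    model = train(corpus)
    print(score("lord", model), score("rdlo", model))
```

Averaging over the number of transitions keeps long and short strings comparable. The acceptance threshold would have to be tuned on real data.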
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams (pairs of adjacent letters) and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or for purchase (I didn't find any freely available, but I only did a cursory Google search), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram; you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
English spelling and English pronunciation are (famously!) far from one-to-one, so this technique might well generate strings that "look" English but are awkward to pronounce. This is another argument for trigrams rather than bigrams: the weirdness produced by sounds that are spelled with several letters in sequence will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
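The tally described here, together with the "simple" rejection rule, might be sketched like this. The names bigram_counts and plausible and the min_count threshold are made up for the example, and the four-word corpus only serves to show the mechanics.

```python
from collections import Counter


def bigram_counts(corpus_words):
    """Tally every pair of adjacent letters, one word at a time."""
    counts = Counter()
    for word in corpus_words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


def plausible(candidate, counts, min_count=1):
    """Reject the string if any of its bigrams is too rare in the table."""
    c = candidate.lower()
    return all(counts[c[i:i + 2]] >= min_count for i in range(len(c) - 1))


if __name__ == "__main__":
    table = bigram_counts(["lord", "keel", "real", "meal"])
    for s in ("LORD", "RDLO", "KEAL"):
        print(s, plausible(s, table))
```

Switching to trigrams only means changing the slice width from 2 to 3 in both functions.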
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
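def soundex(name, length=4):
    """Return the Soundex-style code of `name`, padded or cut to `length` characters."""
    # Digit for each letter A..Z; '0' marks letters that are dropped (vowels, H, W, Y).
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''  # first alphabetic character, kept verbatim

    for c in name.upper():
        if c.isalpha():
            if not fc:
                fc = c
            d = digits[ord(c) - ord('A')]
            # Skip a digit that repeats the previous one.
            if not sndx or d != sndx[-1]:
                sndx += d

    sndx = fc + sndx[1:]
    sndx = sndx.replace('0', '')
    return (sndx + length * '0')[:length]

real_words = load_english_dictionary()
# A set gives constant-time membership tests.
soundex_cache = {soundex(word) for word in real_words}

if soundex(candidate) in soundex_cache:
    print("keep")
else:
    print("discard")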
Obviously you'll need to provide an implementation of load_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on a SQL server, it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MSSQL Server has SOUNDEX implemented as an available search algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
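For reference, the index of coincidence is the probability that two letters drawn at random from the text (without replacement) are the same. A small sketch follows; the function name is made up. Note that the statistic is meant for longer texts and is likely to be noisy on single short strings.

```python
from collections import Counter


def index_of_coincidence(text):
    """Sum of n_i * (n_i - 1) over letter counts n_i, divided by N * (N - 1)."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    print(index_of_coincidence("AABB"))
    print(index_of_coincidence("ABCD"))
```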
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that rhythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be English words (they sound reasonable when pronounced) rather than strings that are definitely words with an English meaning.
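A rough sketch of such a rule set, assuming that "y" counts as a vowel and that every run of consonants must either be a single letter or appear in a hand-maintained list. The list ALLOWED_CLUSTERS below is a small hypothetical starting point, not a vetted inventory, and looks_pronounceable is a made-up name.

```python
import re

VOWELS = "aeiouy"
# Hypothetical starter list; extend it as good words get rejected.
ALLOWED_CLUSTERS = {"th", "tr", "rd", "st", "ch", "sh",
                    "nd", "ng", "pl", "gh", "ll", "ck"}


def looks_pronounceable(word):
    """Accept a word if it has a vowel and every consonant run is allowed."""
    w = word.lower()
    if not w.isalpha() or not re.search("[" + VOWELS + "]", w):
        return False
    consonant_runs = re.findall("[^" + VOWELS + "]+", w)
    return all(len(run) == 1 or run in ALLOWED_CLUSTERS
               for run in consonant_runs)


if __name__ == "__main__":
    for s in ("LORD", "RDLO", "KEAL", "rhythm"):
        print(s, looks_pronounceable(s))
```

With this starter list "rhythm" is rejected, which matches the point above that some real words will need explicit exceptions.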

Resources