What is the best way to classify the following words in POS tagging?

I am doing POS tagging. Given the following tokens in the training set, is it better to treat each one as two tagged words, Word1/POStag and Word2/POStag, or as a single word, Word1/Word2/POStag?
Examples (the POS tag does not need to be included):
Bard/EMS
Interstate/Johnson
Polo/Ralph
IBC/Donoghue
ISC/Bunker
Bendix/King
mystery/comedy
Jeep/Eagle
B/T
Hawaiian/Japanese
IBM/PC
Princeton/Newport
editing/electronic
Heller/Breene
Davis/Zweig
Fleet/Norstar
a/k/a
1/2
Any suggestions are appreciated.

The examples don't seem to fall into one category with respect to the use of the slash -- a/k/a is a phrase acronym, 1/2 is a number, mystery/comedy indicates something in between the two words, etc.
I feel there is no treatment of the component words that would work for all the cases in question, and therefore the better option is to handle them as unique words. At the decoding stage, when the tagger is presented with more previously unseen examples of such words, the decision can often be made based on the context rather than the word itself.

Related

Extracting <subject, predicate, object> triplet from unstructured text

I need to extract simple triplets from unstructured text. Usually it is of the form noun-verb-noun, so I have tried POS tagging and then extracting nouns and verbs from the neighbourhood.
However, this leads to a lot of cases and gives low accuracy.
Will Syntactic/semantic parsing help in this scenario?
Will ontology based information extraction be more useful?
I expect that syntactic parsing would be the best fit for your scenario. Some trivial template-matching method with POS tags might work, where you find verbs preceded and followed by a single noun, and take the former to be the subject and the latter the object. However, it sounds like you've already tried something like that, unless your neighbourhood extraction ignores word order (which would be a bit silly: you'd be guessing which noun was the subject and which was the object, and that's assuming exactly two nouns in each sentence).
Since you're looking for {s, v, o} triplets, chances are you won't need semantic or ontological information. That would be useful if you wanted more information, e.g. agent-patient relations or deeper knowledge extraction.
{s,v,o} is shallow syntactic information, and given that syntactic parsing is considerably more robust and accessible than semantic parsing, that might be your best bet. Syntactic parsing will be sensitive to simple word re-orderings, e.g. "The hamburger was eaten by John." => {John, eat, hamburger}; you'd also be able to specifically handle intransitive and ditransitive verbs, which might be issues for a more naive approach.
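To make that concrete, here is a minimal sketch of dependency-based {s, v, o} extraction. The answer doesn't name a parser, so spaCy is my own assumption here (it presumes the en_core_web_sm model is installed, and the labels are spaCy's English dependency labels):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def extract_triplets(text):
    triplets = []
    for tok in nlp(text):
        if tok.pos_ != "VERB":
            continue
        subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in tok.children if c.dep_ == "dobj"]
        if subjects and subjects[0].dep_ == "nsubjpass":
            # Passive voice: the syntactic subject is the semantic object;
            # the agent (if present) sits under a "by" preposition.
            agents = [gc for c in tok.children if c.dep_ == "agent"
                      for gc in c.children if gc.dep_ == "pobj"]
            if agents:
                triplets.append((agents[0].text, tok.lemma_, subjects[0].text))
        elif subjects and objects:
            triplets.append((subjects[0].text, tok.lemma_, objects[0].text))
    return triplets

print(extract_triplets("The hamburger was eaten by John."))
# expected output, roughly: [('John', 'eat', 'hamburger')]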

Uses/Applications of Part-of-speech-tagging (POS Tagging)

I understand the implicit value of part-of-speech tagging and have seen mentions about its use in parsing, text-to-speech conversion, etc.
Could you tell me how the output of a PoS tagger is formatted?
Also, could you explain how such an output is used by other tasks/parts of an NLP system?
One purpose of PoS tagging is to disambiguate homonyms.
For instance, take this sentence:
I fish a fish
The same sentence in French would be Je pêche un poisson.
Without tagging, fish would be translated the same way in both cases, which would lead to
an incorrect translation. However, after PoS tagging, the sentence would be
I_PRON fish_VERB a_DET fish_NOUN
From the computer's point of view, the two words are now distinct. This way, they can be processed much more effectively (in our example, fish_VERB will be translated to pêche and fish_NOUN to poisson).
Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).
Considering the format of the output, it doesn't really matter as long as you get a sequence of token/tag pairs. Some POS taggers let you specify a particular output format, others use XML or CSV/TSV, and so on.
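As one concrete illustration (NLTK is my choice of example here, not prescribed by the answer), NLTK's default tagger returns a plain list of (token, tag) pairs using Penn Treebank tags:

import nltk

# first run: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("I fish a fish")
print(nltk.pos_tag(tokens))
# roughly: [('I', 'PRP'), ('fish', 'VBP'), ('a', 'DT'), ('fish', 'NN')]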

Mapping interchangeable terms such as Weight to Mass for question-answering NLP

I've been working on a Question Answering engine in C#. I have implemented the features of most modern systems and am achieving good results. Despite the aid of WordNet, one problem I haven't been able to solve yet is changing the user input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is about the existence of some sort of resource that can aid me in this task of mapping terms to the correct ones.
Thank You
Looking at all the synsets in WordNet for both Mass and Weight I can see that there is no shared synset and thus there is no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for WordNet that give you class relationships between synsets. For example, if you type 'define mass' into my demo page you'll see:
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.
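For readers working from a standard WordNet distribution rather than my engine, the same class-relationship check can be sketched with NLTK's WordNet interface. A minimal sketch, assuming the wordnet corpus data is downloaded (exact synset names may vary by WordNet version):

from nltk.corpus import wordnet as wn

# Find the most specific hypernym shared by the first noun senses of
# 'mass' and 'weight'.
mass = wn.synsets('mass', pos=wn.NOUN)[0]
weight = wn.synsets('weight', pos=wn.NOUN)[0]
print(mass.lowest_common_hypernyms(weight))
# may print something like [Synset('physical_property.n.01')],
# mirroring the --type--> relations shown above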
The only way that I know how to do this effectively requires having a large corpus of user query sessions and a happiness measure on sessions, and then finding correlations between substituting word x for word y (possibly given some context z) that improves user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.

Search with attribute-value correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from third-party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and using this payload at the posting-list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = Singular', the 'x.1234' entries would be accepted at all stages of posting-list processing until the intersection stage, where they would be rejected because of non-matching parse numbers.
I think this is a pretty compact solution, but how hard would be incorporating it into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes verb-see -- index it as that.
saw also becomes verb-past-see -- index it as that.
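In code, that expansion might look like the following sketch (Python purely to illustrate; the parse dictionaries mirror the question's example, and nothing here is Lucene-specific):

def composite_tokens(parses):
    # Expand each parse of a surface word into the composite terms
    # that would be fed to the indexer.
    tokens = []
    for p in parses:
        tokens.append('%s-%s' % (p['pos'], p['lemma']))
        for attr in ('tense', 'number'):
            if attr in p:
                tokens.append('%s-%s-%s' % (p['pos'], p[attr], p['lemma']))
    return tokens

parses = [{'lemma': 'see', 'pos': 'verb', 'tense': 'past'},
          {'lemma': 'saw', 'pos': 'noun', 'number': 'singular'}]
print(composite_tokens(parses))
# ['verb-see', 'verb-past-see', 'noun-saw', 'noun-singular-saw']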
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sound like English words? I.e., discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a Markov chain from a huge English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.
In a nutshell: the Markov chain stores, for each character, the probabilities of which character will follow. You can extend this idea to two or three characters if you have enough memory.
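A minimal self-contained sketch of that reverse direction, using character bigram counts with word-boundary markers (the tiny training list is purely illustrative; you'd train on a real word list):

import math
from collections import defaultdict

def train(words):
    # counts[a][b] = how often character b follows character a
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        w = '^' + w.lower() + '$'  # mark word boundaries
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(word, counts, floor=1e-6):
    # Average log transition probability; low values suggest "not English"
    w = '^' + word.lower() + '$'
    total = 0.0
    for a, b in zip(w, w[1:]):
        n = sum(counts[a].values())
        p = counts[a][b] / n if n else 0.0
        total += math.log(p if p > 0 else floor)
    return total / (len(w) - 1)

counts = train(['lord', 'load', 'real', 'keel', 'deal'])  # use a real corpus
print(avg_log_prob('lord', counts) > avg_log_prob('rdlo', counts))  # True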
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams—pairs of adjacent letters—and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or for purchase (I didn't find any freely available, but have only done a cursory Google search so far), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
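That tally is only a few lines in Python; a hedged sketch, where corpus_words stands in for whatever corpus you load, together with the "simple" rejection rule described above:

from collections import Counter

corpus_words = ['lord', 'load', 'real', 'keel']  # stand-in for a real corpus
bigram_counts = Counter()
for word in corpus_words:
    word = word.lower()
    bigram_counts.update(zip(word, word[1:]))  # tally each adjacent pair

def plausible(s, min_count=1):
    # The "simple" rule: reject on any unseen or too-rare bigram
    s = s.lower()
    return all(bigram_counts[b] >= min_count for b in zip(s, s[1:]))

print(plausible('lord'), plausible('qzal'))  # True False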
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome pronunciations. This is another argument for trigrams rather than bigrams—the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train on a dictionary (so you could train it for various languages if you need to). It walks through the dictionary and collects statistics on 1-, 2-, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
def soundex(name, len=4):
    # Soundex digit codes for the letters A..Z
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''
    for c in name.upper():
        if c.isalpha():
            if not fc:
                fc = c  # remember the first letter
            d = digits[ord(c) - ord('A')]
            if not sndx or (d != sndx[-1]):  # collapse adjacent duplicate codes
                sndx += d
    sndx = fc + sndx[1:]          # keep the first letter itself
    sndx = sndx.replace('0', '')  # drop vowels and similar (coded 0)
    return (sndx + (len * '0'))[:len]  # pad/trim to a fixed length
real_words = load_english_dictionary()
soundex_cache = [soundex(word) for word in real_words]
if soundex(candidate) in soundex_cache:
    print "keep"
else:
    print "discard"
Obviously you'll need to provide an implementation of load_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
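For instance, with the jellyfish library (my example; the answers don't name an implementation):

import jellyfish

# The same first-letter sensitivity applies to all three algorithms:
print(jellyfish.soundex('KEAL'), jellyfish.soundex('KEEL'))      # match (K400)
print(jellyfish.metaphone('KEAL'), jellyfish.metaphone('REAL'))  # differ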
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words, you could do some statistical analysis on real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on SQL Server it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MS SQL Server has SOUNDEX implemented as a built-in search function.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set limits on how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest that a few simple rules plus standard pairs and triplets would work well.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that rhythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.
