Which sub-topic of Natural Language Processing will help me do this?

What I am trying to do is identify the context of the query a user might input. So if the user enters "High Proteins", I want to be able to understand that what he means by that is "protein > certain_threshold".
Example 2: User input : "Calories less than 250"
I should be able to understand that what the user means by this is calories < 250
If I am able to do this, I will be able to construct my queries accordingly. Which sub-topic of NLP will help me do this? Any leads would be greatly appreciated.

You probably do not need NLP if you do not have a rich vocabulary. You might just want to use simple dictionaries or regexes to specify your queries, as in a controlled language.
If indeed you need more than this, as you have a very rich vocabulary and complex syntactic relations between your phrases, you should probably start with part-of-speech tagging, chunking, and then maybe parsing. But I wouldn't go that way unless you specifically need to.
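For illustration, here is a minimal sketch of the dictionary/regex idea in Python. The field names and the threshold for a vague qualifier like "high" are assumptions for this sketch, not something from your schema:

    import re

    # Map surface forms to canonical field names (assumed schema).
    FIELDS = {"protein": "protein", "proteins": "protein", "calories": "calories"}
    # Domain-specific thresholds for vague qualifiers like "high" (assumed values).
    HIGH_THRESHOLDS = {"protein": 20}

    def parse_query(text):
        text = text.lower()
        # Explicit comparisons: "calories less than 250", "calories < 250"
        m = re.search(r"(\w+)\s*(?:less than|<)\s*(\d+)", text)
        if m and m.group(1) in FIELDS:
            return (FIELDS[m.group(1)], "<", int(m.group(2)))
        # Vague qualifiers: "high protein(s)" -> protein > threshold
        m = re.search(r"high\s+(\w+)", text)
        if m and m.group(1) in FIELDS:
            field = FIELDS[m.group(1)]
            return (field, ">", HIGH_THRESHOLDS.get(field, 0))
        return None

    print(parse_query("High Proteins"))           # ('protein', '>', 20)
    print(parse_query("Calories less than 250"))  # ('calories', '<', 250)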

One way to see this is as a very simple programming language.
You have a set of special keywords you want to look for, like "Calories" or "Protein" or "LDL", and some operations you want to match, like (keyword > 3000) or maybe (%RDA keyword < 2%). Usually you can do this with a simple expression-grammar parser. Depending on what kinds of things you want to connect together (do you want to say AND or OR or NOT or UNLESS, etc.?) and what programming language you want to use, one may even be available as a library, like Marpa (multiple languages), ANTLR (Java), or Nearley (JavaScript).
Some of the words you might search for include "Earley Parser" or "Context Free Parser" or "Expression Parser". You probably don't need to write your own code, just leverage what already exists.
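To make the expression-grammar idea concrete, here is a rough hand-rolled sketch in Python. A generator like ANTLR or Nearley would give you something more robust; the tiny grammar below (comparisons joined by AND/OR) is an assumption for illustration:

    import re

    # Informal grammar:
    #   expr := comp (("AND" | "OR") comp)*
    #   comp := NAME OP NUMBER
    def tokenize(s):
        return re.findall(r"[A-Za-z_]+|\d+|[<>]=?", s)

    def parse(tokens):
        pos = 0
        def peek():
            return tokens[pos] if pos < len(tokens) else None
        def take():
            nonlocal pos
            tok = tokens[pos]
            pos += 1
            return tok
        def comp():
            # NAME OP NUMBER, e.g. "calories < 250"
            name, op, value = take(), take(), int(take())
            return (op, name, value)
        node = comp()
        while peek() in ("AND", "OR"):
            node = (take(), node, comp())
        return node

    print(parse(tokenize("calories < 250 AND protein > 20")))
    # ('AND', ('<', 'calories', 250), ('>', 'protein', 20))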

Related

Extract recommendations/suggestions from text

My documents often include sentences like:
Had I known about this, I would have prevented this problem
or
If John was informed, this wouldn't happen
or
this wouldn't be a problem if Jason was smart
I'm interested in extracting this sort of information (I'm not sure what it is called, linguistically). So I would like to extract either the whole sentence or, ideally, a summary like:
(inform John) (prevent)
Most, if not all, of the examples of relation extraction and information extraction that I've come across follow a fairly standard flow:
do NER, then have relation extraction look for relations like "in" or "at", etc. (ch. 7 of the NLTK book, for example).
Do these type of sentences fall under a certain category in NLP? Are there any papers/tutorials on something like this?
When you are asking for suggestions on a topic this open-ended, give more examples. A single example with a brief explanation of what you are targeting doesn't give enough information. If your sentences follow specific patterns, then it becomes easier to extract information (in your desired format) from them. Otherwise, it becomes a broad and complex research problem!
From your example, it looks like you want to extract the head words of a sentence and the other words that modify those heads. You can use dependency parsing for this task. Look at the Stanford Neural Network Dependency Parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and the words that modify those heads. So I believe it should help you with your desired task.
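To get a quick feel for what a dependency parse gives you, here is a minimal sketch using spaCy rather than the Stanford parser (same idea, different tool; it assumes the en_core_web_sm model has been downloaded):

    import spacy

    # pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("If John was informed, this wouldn't happen")

    # Each token points at its syntactic head via a typed relation.
    for token in doc:
        print(f"{token.text:10} --{token.dep_}--> {token.head.text}")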
If you want to make it more general, then this problem aligns well with Open Information Extraction. You may consider looking into the Stanford OpenIE API.
You may also consider the Stanford Relation Extractor API for your task. But I strongly believe relation extraction through dependency parsing best suits your problem definition. You can read this paper to get some ideas and utilize them in your task.

List of English verbs and their tenses, various forms, etc.

Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?
I imagine this will be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster:(develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure exactly what you are looking for, but I think WordNet, a lexical database for the English language, would be a good place to start. Read more at http://wordnet.princeton.edu/
The link I referred you to says:
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Consider getting a dump of Wiktionary and extracting this information out of it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some base canonical form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
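If morpha is awkward to set up, NLTK's WordNet lemmatizer gives a similar effect for a quick experiment; note that it needs the right part of speech to do a good job:

    # pip install nltk; then nltk.download('wordnet') once
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("selling", pos="v"))  # sell
    print(lemmatizer.lemmatize("sold", pos="v"))     # sell
    print(lemmatizer.lemmatize("sellers", pos="n"))  # seller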

Proper approach to get words like "dentistry", "dentist" from query like "dental" (and vice versa)

I'm somewhat familiar with stemming, but the stemming library I've been given to use for a project doesn't work very well for a case where I want to find related words, like when I query for any of these:
"dental", "dentist", "dentistry"
I should get a match for the others. I've been looking into this and I'm learning about parts of speech I didn't even know existed, like pertainyms and troponyms, so I'm wondering if there isn't a library out there that has a mapping between all of these different parts of speech and could give back the sort of match I'm looking for?
I've been searching on this and haven't found a whole lot that I can make sense of. I probably don't know the right terminology, etc and I would greatly appreciate if anyone can point me in the right direction.
One approach common in IR is to stem all words in the index and in the query itself. Meaning, documents containing the word 'dentistry' are stemmed and stored in the index as 'dentist'. The query keyword 'dental' is also stemmed to 'dentist', thereby matching it in the index.
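A minimal sketch of that index-time/query-time stemming, using NLTK's Porter stemmer. One caveat: how aggressively forms are conflated depends on the stemmer; Porter happens not to collapse 'dental' and 'dentist' to one stem, so that particular family may need a more aggressive stemmer or a thesaurus. The mechanism itself looks like this:

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()

    documents = {1: "a guide to connecting networks",
                 2: "connected network devices"}

    # Index: map each stem to the set of documents containing it.
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(stemmer.stem(word), set()).add(doc_id)

    # Query: stem the query term the same way before lookup.
    query = "connection"
    print(index.get(stemmer.stem(query), set()))  # {1, 2}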
Have a look at WordNet. WordNet is an organized ontology of words and concepts with links for various types of relations between words. I'm not sure if it will have exactly the relationships you want, but it's probably a good start. There are many interfaces in various programming languages (Java and Python that I've used; presumably many more).
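Here is a small sketch of poking at those relations through NLTK's WordNet interface; the exact lemmas returned depend on the WordNet version, so treat the output as illustrative:

    # pip install nltk; then nltk.download('wordnet') once
    from nltk.corpus import wordnet as wn

    # Pertainyms link relational adjectives like "dental" to nouns they pertain to.
    for lemma in wn.lemmas("dental"):
        print(lemma, "->", lemma.pertainyms())

    # Derivationally related forms link e.g. "dentist" to "dentistry".
    for lemma in wn.lemmas("dentist"):
        print(lemma, "->", lemma.derivationally_related_forms())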

Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine. Part of my value added will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.
I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:
"denver" is a USCITY and a PLACENAME
"aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
"555 555 5555" is a USPHONENUMBER
I know that each of these cases will most likely require specific handling, however I'm not sure where to start.
Ideally I'd end up with something simple like:
queryCategory = magicCategoryFinder(query)
print(queryCategory)
# prints "SOMECATEGORY" or a list of categories
Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider: "New York City" is a place, but it's three words, two of which ("new" and "city") have other meanings.
You also have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) the stock symbol for Sun Microsystems. It's also a programming language, a place, and has a meaning associated with coffee. How do you classify it? You'd need to know the context in which it was used.
And if you can solve that problem reliably you can make yourself very wealthy.
What's all this in aid of anyway?
To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing. Whether it's suitable for a given production application may be a different issue, especially if said application requires very high-speed processing on large volumes of data -- but you have to walk before you can run!
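A first experiment with NLTK's tagger might look like this (after downloading the tokenizer and tagger models; the exact tags vary by model version):

    import nltk  # pip install nltk
    # one-time: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("Fruit flies like a banana")
    print(nltk.pos_tag(tokens))
    # e.g. [('Fruit', 'NNP'), ('flies', 'NNS'), ('like', 'IN'), ...]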
You're bumping up against one of the hardest problems in computer science today: determining semantics from English text. This is the classic text-mining problem and gets into some very advanced topics. I think I would suggest thinking more about your problem and seeing if you can a) go without categorization, or b) utilize structural information, such as document position, to give you a hint (it's either a city, or a place name, or undetermined), plus some lookup tables to help. For instance, stock symbols are easy enough to build a fairly complete lookup table for. You might consider downloading the CIA World Factbook for a lookup of cities, etc.
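A crude sketch of the lookup-table idea; the category names and seed data here are assumptions for illustration:

    import re

    US_CITIES = {"denver", "boston"}       # e.g. seeded from the CIA World Factbook
    NASDAQ_SYMBOLS = {"aapl", "msft"}      # symbols are easy to enumerate fully
    PHONE_RE = re.compile(r"^\d{3}[ .-]?\d{3}[ .-]?\d{4}$")

    def categorize(token):
        token_l = token.lower()
        categories = []
        if token_l in US_CITIES:
            categories += ["USCITY", "PLACENAME"]
        if token_l in NASDAQ_SYMBOLS:
            categories += ["NASDAQSYMBOL", "STOCKTICKERSYMBOL"]
        if PHONE_RE.match(token):
            categories.append("USPHONENUMBER")
        return categories or ["UNDETERMINED"]

    print(categorize("denver"))        # ['USCITY', 'PLACENAME']
    print(categorize("555 555 5555"))  # ['USPHONENUMBER']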
As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences:
Time flies like an arrow.
Fruit flies like a banana.
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is a preposition, but in the second it's a verb. The context doesn't make this particularly easy to sort out either: there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.
It can be done -- but it really is decidedly non-trivial.
Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).

Natural English language words

I need the most exhaustive English word list I can find for several types of language-processing operations, but I could not find anything on the internet of good enough quality.
There are 1,000,000 words in the English language including foreign and/or technical words.
Can you please suggest such a source (or close to 500k words) that can be downloaded from the internet that is maybe a bit categorized? What input do you use for your language processing applications?
Kevin's word lists are the best I know of just for plain lists of words.
WordNet is better if you want to know about things being nouns, verbs etc, synonyms, etc.
The "million word" hoax rolls along, I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
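In code, that trick is just string concatenation; a toy sketch, keeping the hyphens as written above:

    PREFIXES = ["non-", "pseudo-", "semi-"]
    SUFFIXES = ["-arific", "-geek"]

    def variants(noun):
        # Generate affixed variants of a single noun.
        return [p + noun for p in PREFIXES] + [noun + s for s in SUFFIXES]

    print(variants("word"))
    # ['non-word', 'pseudo-word', 'semi-word', 'word-arific', 'word-geek']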
I did research for Purdue on controlled/natural English and language domain knowledge processing.
I would take a look at the Attempto project (http://attempto.ifi.uzh.ch/site/description/), which aims to help build a controlled natural English.
You can download their entire word lexicon at http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip; it has ~100,000 natural English words.
You can also supply your own lexicon for domain-specific words; this is what we did in our research. They offer web services to parse and format natural English text.
Who told you there were 1,000,000 words? According to Wikipedia, the Oxford English Dictionary has only 600,000, and the OED tries to include all technical and slang terms that are used.
Try Wikipedia's extracts directly: http://dbpedia.org
There aren't too many base words (171k, according to Oxford), which is what I remember being told in my CS program in college.
But if you include all forms of the words, then the count rises considerably.
That said, why not make one yourself? Get a Wikipedia dump, parse it, and create a set of all tokens you encounter.
Expect misspellings though; like all things crowd-sourced, there will be errors.
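A minimal sketch of building that token set, assuming you have already extracted plain text from the dump (e.g. with a tool like WikiExtractor; the filenames here are hypothetical):

    import re

    tokens = set()
    with open("wikipedia_plain_text.txt", encoding="utf-8") as f:
        for line in f:
            # Keep lowercase alphabetic tokens (plus simple contractions);
            # misspellings from the source text will still slip through.
            tokens.update(re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower()))

    with open("wordlist.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(tokens)))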
