How to determine if a piece of text mentions a product - nlp

I'm new to natural language process so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how i should be doing this, but I'm not confident yet that know what to do.
I'm playing with an idea for an application and part of it is trying to find product mentions in unstructured text (e.g. tweets, facebook posts, emails, websites, etc.) in real-time. I wont go into what the products are but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe i could use a string similarity check to catch these.
Product names that are also English words or things would get caught. Like mustang the horse versus mustang the car
Needing to keep a list of alternative names for products (e.g. "coke" for "coco-cola", etc.)
I don't really know where to start with this but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really gleam how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not specific to any language at this point. Java preferably but Python and Scala are acceptable.

The answer that you chose is not really answering your question.
The best approach you can take is using Named Entity Recognizer(NER) and POS tagger (grab NNP/NNPS; Proper nouns). The database there might be missing some new brands like Lyft (Uber's rival) but without developing your own prop database, Stanford tagger will solve half of your immediate needs.
If you have time, I would build the dictionary that has every brands name and simply extract it from tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.

It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many reasons for "misspellings" is opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Related

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Possible approach to sentiment analysis (I apologize, I'm very new to NLP)

So I have an idea for classifying sentiments of sentences talking about a given brand product (in this case, pepsi). Basically, let's say I wanted to figure out how people feel about the taste of pepsi. Given this problem, I want to construct abstract sentence templates, basically possible sentence structures that would indicate an opinion about the taste of pepsi. Here's one example for a three word sentence:
[Pepsi] [tastes] [good, bad, great, horrible, etc.]
I then look through my database of sentences, and try to find ones that match this particular structure. Once I have this, I can simply extract the third component and get a sentiment regarding this particular aspect (taste) of this particular entity (pepsi).
The application for this would be looking at tweets, so this might yield a few tweets from the past year or so, but it wouldn't be enough to get an accurate read on the general sentiment, so I would create other possible structures, like:
[I] [love, hate, dislike, like, etc.] [the taste of pepsi]
[I] [love, hate, dislike, like, etc.] [the way pepsi tastes]
[I] [love, hate, dislike, like, etc.] [how pepsi tastes]
And so on and so forth.
Of course most tweets won't be this simple, there would be possible words that would mean the same as pepsi, or words in between the major components, etc - deviations that it would not be practical to account for.
What I'm looking for is just a general direction, or a subfield of sentiment analysis that discusses this particular problem. I have no problem coming up with a large list of possible structures, it's just the deviations from the structures that I'm worried about. I know this is something like a syntax tree, but most of what I've read about them has just been about generating text - in this case I'm trying to match a sentence to a structure, and pull out the entity, sentiment, and aspect components to get a basic three word answer.
This templates approach is the core idea behind my own sentiment mining work. You might find study of EBMT (example-based machine translation) interesting, as a similar (but under-studied) approach in the realm of machine translation.
Get familiar with Wordnet, for automatically generating rephrasings (there are hundreds of papers that build on WordNet, some of which will be useful to you). (The WordNet book is getting old now, but worth at least a skim read if you can find it in a library.)
I found Bing Liu's book a very useful overview of all the different aspects and approachs to sentiment mining, and a good introduction to further reading. (The Amazon UK reviews are so negative I wondered if it was a different book! The Amazon US reviews are more positive, though.)

How do you extract various meanings of a certain word

Given "violence" as input would it be possible to come up with how violence construed by a person (e.g. physical violence, a book, an album, a musical group ..) as mentioned below in Ref #1.
Assuming if the user meant an Album, what would be the best way to look for violence as an album from a set of tweets.
Is there a way to infer this via any of the NLP API(s) say OpenNLP.
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little to hard in general for this problem.
Instead we might use :
Resources like WordNet, or a semantics dictionary.
For languages other than English you can look at eurowordnet (non free) dataset.
To get more meaning (i.e. for the album sense) we process some well managed resource like Wikipedia. Wikipedia as a lot of meta information that would be very useful for this kind of processing.
The reliability of the process is achieve just by combining the maximum number of data source and processing them correctly, with specialized programs.
As a last resort you may try hand processing/annotating. Long and costly, but useful in enterprise context where you need only a small part of a language.
No free lunch here.
If you're working on English NLP in python, then you can try the wordnet API as such:
from nltk.corpus import wordnet as wn
query = 'violence'
for ss in wn.synsets(query):
print query, str(ss.offset).zfill(8)+'-'+ss.pos, ss.definition
If you're working on other human languages, maybe you can take a look at the open wordnets available from http://casta-net.jp/~kuribayashi/multi/
NOTE: the reason for str(ss.offset).zfill(8)+'-'+ss.pos, it's because it is used as the unique id for each sense of a specific word. And this id is consistent across the open wordnets for every language. the first 8 digits gives the id and the character after the dash is the Part-of-Speech of the sense.
Check this out: Twitter Filtering Demo from Idilia. It does exactly what you want by first analyzing a piece of text to discover the meaning of its words and then filtering the texts that contain the sense that you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all contexts "violence" occurs in (context can be a whole document, or a window of say 50 words), then convert them into features (using say bag of words), then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical context.
Then you need to see which cluster "violence" in the query belongs to. Either based on other words in the query which act as a context or by asking explicitly (Do you mean violence as in "...." or as in "....")
This will be incredibly difficult due to the fact that the proper noun uses of the word 'Violence' will be incredibly infrequent as a proportion of all words and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of Named Entity Disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource as Mr K suggested is probably your best bet.

How to categorize and tabularize free-form answers to a question in a survey?

I want to analyze answers to a web survey (Git User's Survey 2008 if one is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies analyzing those replies entirely by hand is out of the question (especially that there is quite a bit of free-form questions in this survey).
How can I group those replies (probably based on the key words used in response) into categories at least semi-automatically (i.e. program can ask for confirmation), and later how to tabularize (count number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that categories are orthogonal / exclusive.
What I'd like to know is at least keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
Possible solution No 1. (partial): Bayesian categorization
(added 2009-05-21)
One solution I thought about would be to use something like algorithm (and mathematical method behind it) for Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more; and categories itself would be created adaptively / interactively.
Text::Ngrams + Algorithm::Cluster
Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
You are not going to like this. But: If you do a survey and you include lots of free-form questions, you better be prepared to categorize them manually. If that is out of the question, why did you have those questions in the first place?
I've brute forced stuff like this in the past with quite large corpuses. Lingua::EN::Tagger, Lingua::Stem::En. Also the Net::Calais API is (unfortunately, as Thomposon Reuters are not exactly open source friendly) pretty useful for extracting named entities from text. Of course once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of mechanical turk cross-validation of the output would be sufficient for your needs.
Look for common words as keywords, but through away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply with a decent overlap (say three humans per reply) that would come to about $90 USD.

Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine. Part of my value added will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.
I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:
"denver" is a USCITY and a PLACENAME
"aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
"555 555 5555" is a USPHONENUMBER
I know that each of these cases will most likely require specific handling, however I'm not sure where to start.
Ideally I'd end up with something simple like:
queryCategory = magicCategoryFinder( query )
>print queryCategory
>"SOMECATEGORY or a list"
Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider, "New York City" is a place but its three words, two of which (new and city) have other meanings.
also you have to consider ambiguity, which is once again where context and implied knowledge comes in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place and has meaning associated with coffee. How do you classify it? You'd need to know the context in which it was used.
And if you can solve that problem reliably you can make yourself very wealthy.
What's all this in aid of anyway?
To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).
You're bumping up against one of the hardest problems in computer science today... determining semantics from english context. This is the classic text mining problem and get into some very advanced topics. I thiink I would suggest thinking more about you're problem and see if you can a) go without categorization or b) perhaps utilize structural info such as document position or something to give you a hint (is either a city or placename or an undetermined) and maybe some lookup tables to help. ie stock symbols are pretty easy to create a pretty full lookup for. You might consider downloading CIA world factbook for a lookup of cities... etc.
As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences:Time flies like an arrow.Fruit flies like a bananna.
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is an adverb, but in the second it's a verb. The context doesn't make this particularly easy to sort out either -- there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "bananna" are both normally nouns.
It can be done -- but it really is decidedly non-trivial.
Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).

Resources