I was wondering if there's a NLP/ML technique for this.
Suppose given a set of sentences,
I watched the movie.
Heard the movie is great, have to watch it.
Got the tickets for the movie.
I am at the movie.
If i have to assign a probability to each of these sentences, that they have "actually" watched the movie, i would assign it in decreasing order of 1,4,3,2.
Is there a way to do this automatically, using some classifier or rules? Any paper/link would help.
These are common issues in textual entailment. I'll refer you to some papers. While their motivation is for textual entailment, I believe your problem should be easier than that.
Determining Modality and Factuality for Textual Entailment
Learning to recognize features of valid textual entailments
Some of these suggestions should help you decide on some features/keywords to consider when ranking.
Except 1, none of the other statements necessarily imply that the person has watched the movie. They could have bought the tickets for somebody else (3) and might be the person who sells popcorn outside the halls (4). I don't think there is any clever system out there that will read between the lines for each sentence and return an answer that exactly agrees with your intuitions (which might be different from that of other people for the same sentence btw).
If this strangely is the only case that you care about (which is possible if you are explicitly working with movie reviews), then it might be worth your time to come with a large number of heuristics patched together that yields a function that near exactly agrees with your intuitions about this.
Otherwise look for context available in all the other sentences these sentences originate from to find relevant clues. Somebody who has actually watched the movie may comment on how they liked it, express opinions about specific scenes, characters and actors from the movie, etc. So if the text contains a lot of high sentiment sentences and refers to words and phrases from the movie, then the person has probably watched the movie. If a lot of it is in future tense, then maybe not.
If you are working with an specific domain, such as "watched the movie or not", or maybe more generally "attended to an event or not", it's basically a case of the Text Classification task.
The common approach in NLP is to use a large amount of sentences tagged as watched or didn't watch to train a machine learning based classifier. The most commonly used features are the presence/absence of keywords, bigrams (sequences of 2 words) and maybe trigrams (sequences of 3 words).
Since you talked about probability, things may get a little more complex. As adi92 noted, in 3 of your sentences the answer is not clear. A way to represent that in the training data could be that a sentence with 0.3 probability of watched appear 3 times tagged as watched and 7 as didn't watch. Most classifiers can have their output easily turned into probabilities.
Anyway, I believe that the main difficulty would be creating a training dataset for the task.
Related
I have some texts in different languages and, potentially, with some typo or other mistake, and I want to retrieve their own vocabulary. I'm not experienced with NLP in general, so maybe I use some word improperly.
With vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are are all consider think).
This is the master problem, so let's reduce it to the vocabulary retrieving of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service by any API. Yes, I accept also this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if in the text there were the word thought like both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieve a vocabulary of an English text without mistakes, and without consider the tag of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try nltk lemmatizer for Python or Standford NLP/Clear NLP for Java. Actually nltk uses WordNet, so it is really combination of 1st and 2nd approaches.
In order to cope with mistakes use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part of speech tag - unfortunately, nltk doesn't consider POS tag (and context in general), so you should provide it with the tag that can be found by nltk pos tagging. Again, it is already discussed here (and related/linked questions). I'm not sure about Stanford NLP here - I guess it should consider context, but I was sure that NLTK does so. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages - google for lemmatization models, since algorithm for most languages (at least from the same family) is almost the same, differences are in training data. Take a look here for example of German; it is a wrapper for several lemmatizers, as I can see.
However, you always can use stemmer at cost of precision, and stemmer is more easily available for different languages.
Topic Word has become an integral part of the rising debate in the present world. Some people perceive that Topic Word (Synonyms) beneficial, while opponents reject this notion by saying that it leads to numerous problems. From my point of view, Topic Word (Synonyms) has more positive impacts than negative around the globe. This essay will further elaborate on both positive and negative effects of this trend and thus will lead to a plausible conclusion.
On the one hand, there is a myriad of arguments in favour of my belief. The topic has a plethora of merits. The most prominent one is that the Topic Word (Synonyms). According to the research conducted by Western Sydney University, more than 70 percentages of the users were in favour of the benefits provided by the Topic Word (Synonyms). Secondly, Advantage of Essay topic. Thus, it can say that Topic Word (Synonyms) plays a vital role in our lives.
On the flip side, critics may point out that one of the most significant disadvantages of the Topic Word (Synonyms) is that due to Demerits relates to the topic. For instance, a survey conducted in the United States reveals that demerit. Consequently, this example explicit shows that it has various negative impacts on our existence.
As a result, after inspection upon further paragraphs, I profoundly believe that its benefits hold more water instead of drawbacks. Topic Word (Synonyms) has become a crucial part of our life. Therefore, efficient use of Topic Word (Synonyms) method should promote; however, excessive and misuse should condemn.
So I have an idea for classifying sentiments of sentences talking about a given brand product (in this case, pepsi). Basically, let's say I wanted to figure out how people feel about the taste of pepsi. Given this problem, I want to construct abstract sentence templates, basically possible sentence structures that would indicate an opinion about the taste of pepsi. Here's one example for a three word sentence:
[Pepsi] [tastes] [good, bad, great, horrible, etc.]
I then look through my database of sentences, and try to find ones that match this particular structure. Once I have this, I can simply extract the third component and get a sentiment regarding this particular aspect (taste) of this particular entity (pepsi).
The application for this would be looking at tweets, so this might yield a few tweets from the past year or so, but it wouldn't be enough to get an accurate read on the general sentiment, so I would create other possible structures, like:
[I] [love, hate, dislike, like, etc.] [the taste of pepsi]
[I] [love, hate, dislike, like, etc.] [the way pepsi tastes]
[I] [love, hate, dislike, like, etc.] [how pepsi tastes]
And so on and so forth.
Of course most tweets won't be this simple, there would be possible words that would mean the same as pepsi, or words in between the major components, etc - deviations that it would not be practical to account for.
What I'm looking for is just a general direction, or a subfield of sentiment analysis that discusses this particular problem. I have no problem coming up with a large list of possible structures, it's just the deviations from the structures that I'm worried about. I know this is something like a syntax tree, but most of what I've read about them has just been about generating text - in this case I'm trying to match a sentence to a structure, and pull out the entity, sentiment, and aspect components to get a basic three word answer.
This templates approach is the core idea behind my own sentiment mining work. You might find study of EBMT (example-based machine translation) interesting, as a similar (but under-studied) approach in the realm of machine translation.
Get familiar with Wordnet, for automatically generating rephrasings (there are hundreds of papers that build on WordNet, some of which will be useful to you). (The WordNet book is getting old now, but worth at least a skim read if you can find it in a library.)
I found Bing Liu's book a very useful overview of all the different aspects and approachs to sentiment mining, and a good introduction to further reading. (The Amazon UK reviews are so negative I wondered if it was a different book! The Amazon US reviews are more positive, though.)
Are there any known ways (above and beyond statistical analysis, but not necessarily excluding it as being part of the solution) to relate sentences or concepts to one another using Natural Language Processing. Thus far I've only worked with NLTK and Stanford-NLP to aid in my project, but I am open to alternative open source solutions.
As an example take the following George Orwell essay (http://orwell.ru/library/essays/wiw/english/e_wiw). Suppose I gave the application the sentence
"What are George Orwell's opinions on writers."
or perhaps
"George Orwell believes writers enjoy writing to express their creativity, to make a point and for their egos."
Might yield lines from the essay like
"The aesthetic motive is very feeble in a lot of writers, but even a pamphleteer or writer of textbooks will have pet words and phrases which appeal to him for non-utilitarian reasons; or he may feel strongly about typography, width of margins, etc."
or
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
I understand that this is not easy and I may not achieve much accuracy, but I was hoping for ideas on what already exists and what I could try to start off, or at least get the best results possible based on what is already known and out there.
The simplest way of doing this might be using some distance functions (such as Cosine similarity) between your query sentence and the sentence pool. It's easy to implement. Create a vocabulary from the text collection and each sentence is represented as a vector. You can use TF-IDF to represent values in the vector, and calculate the cosine similarity between sentences, and get the highest scored sentence with respect to your query sentence.
Or you can build index from your corpus and use for example Lucene and let it do the work for you.
You may also consider using LSA (Latent Semantic Analysis) where you can get the similarity between sentences.
From what I understand from your question (and also your comment) is you are more interested in understanding the meaning of individual sentence and then equate with each other in proximity. Statistical approach, in my opinion, is more for "getting a feel" of the sentence than understanding it. In my opinion I would suggest deep parsing approach.
Deep parse the sentence, understand what roles the words play in the sentence, understand what the subject-verb-object model (left to right parsing and such techniques) and then have a vocabulary that helps you categorise the nouns and verbs.
e.g.
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
Parsing this sentence, lets you understand the subject of the sentence is "serious writers" (serious being an adjective, writers basically). In the verb form it states "are" (current state) and "interested". Each verb then points to some more vocabulary including adjectives. If you arrange this vocabulary in correct manner (and keep building it) I think you should get somewhere with your problem.
I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably the realm of Natural Language Programming (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK and find there's so much domain specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet. (http://sentiwordnet.isti.cnr.it/) Which is like a dictionary that has a sentiment grade for words. It will tell you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some nltk code that looks through your sentences for the words you want to associate the sentiment with. So you would write a script to get some level of meaningful chunks of text that surround the words you were looking at, maybe sentence or clause level. Then you can have another thing that runs through the surrounding words and grab all the sentiment scores from the SentiWordNet.
I have some old code that did this and can place on github if you'd like, but you'd still need to make your own request for SentiWordNet.
I guess your problem is more on association rather than just classification. Now moving forward with this assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entrophy
Are there alternatives I have not considered?
In depth study of automata theory with respect to NLP will help you a lot, it helped me a lot in grasping the implementations like OpenNLP.
Yes, this is a NLP problem by the name of Sentiment analysis. Sentiment analysis is an active research area with different approaches and a task where a lot of other NLP-methods have to work together, so it is certainly not the easiest field to get started with in NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).
Suppose I have a set of phrases - about 10 000 - of average length - 7-20 words in which I want to find some given phrase. The phrase I am looking for could have some errors - for example miss one or two words, have some words misplaced, or some random words - for example my database contains "As I was riding my red bike, I saw Christine", and I want it to much "As I was riding my blue bike, saw Christine", or "I was riding my bike, I saw Christine and Marion". What could be some good approach to this problem? I know about Levenhstein's distance, and I also suppose that this problem may have no easy, good solution.
A good text search engine will provide capabilities such as you describe, fsh. A typical approach would be to create a query that matches if any of the words occurs and orders the results using a weight based on number of terms occurring in proximity to each other and weighted inversely to their probability of occurring, since uncommon words will be less likely to co-occur by chance. There's a whole theory of this sort of thing called information retrieval, but maybe you know about that. Furthermore you'd like to make sure that word-level fuzziness gets accounted for by normalizing case, punctuation and the like and applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
If you're interested in messing around with this stuff, try an open-source search engine, this article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good detailed introduction to the topic.