Let's say we have two sentences:
Jacob is going to watch a movie with Justin.
He will be back by 10 pm.
How does Stanford NLP identify "he" refers to Jacob and not Justin?
This is called coreference resolution and is a well-studied problem in NLP. As such, there are many possible ways to do it. Stackoverflow is not the right venue for a literature review, but here are a few links to get you started
http://www-labs.iro.umontreal.ca/~felipe/IFT6010-Hiver2015/Presentations/Abbas-Coreference.pdf
https://nlp.stanford.edu/projects/coref.shtml and links therein
Related
"I had safe journey" ,assume this is a feedback for a driver ,provided by a passenger. I need to extract theses information from this sentence..
"I had safe journey" ->
SUBJECT= "driving"
SENTIMENT= "positive"
I tried with NLP Extracting Information from Text method. But I don't know how recognized Entities from these kind of sentences.How am I supposed to do that ?
To categorize entities of a sentence or a sentence as a whole, you first need to have defined set of classes/categories/groups.
for eg: To categorize journey to travelling/driving, you should train your system/algorithm to identify specific pattern of sentences which will fall under the category of driving/journey.
This training involves concepts of machine learning, Text Categorization is what you should be searching for.
Here is a reference (to just give you an idea) and you can find many more over the web.
Good Luck!
Note: Below are some links from Coursera which offers a course on NLP
Link 1
Link 2
I created classifier to classy the class of nouns,adjectives, Named entities in given sentence. I have used large Wikipedia dataset for classification.
Like :
Where Abraham Lincoln was born?
So classifier will give this short of result - word - class
Where - question
Abraham Lincoln - Person, Movie, Book (because classifier find Abraham Lincoln in all there categories)
born - time
When Titanic was released?
when - question
Titanic - Song, movie, Vehicle, Game (Titanic
classified in all these categories)
Is there any way to identify exact context for word?
Please see :
Word sense disambiguation would not help here. Because there might not be near by word in sentence which can help
Lesk algorithm with wordnet or sysnet also does not help. Because it for suppose word Bank lesk algo will behave like this
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Here for word bank it suggested as financial institute and slopping land. While in my case I am already getting such prediction like Titanic then it can be movie or game.
I want to know is there any other approach apart from Lesk algo, baseline algo, traditional word sense disambiguation which can help me to identify which class is correct for particular keyword?
Titanic -
Thanks for using the pywsd examples. With regards to wsd, there are many other variants and i'm coding them by myself during my free time. So if you want to see it improve do join me in coding the open source tool =)
Meanwhile, you will find the following technologies more relevant to your task, such as:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually includes several sub-tasks such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an NP-complete AI system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
if you're looking for the field to look into, the field is out there but if you're looking at tools, most probably wikification tools are closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
My current understanding is that it's possible to extract entities from a text document using toolkits such as OpenNLP, Stanford NLP.
However, is there a way to find relationships between these entities?
For example consider the following text :
"As some of you may know, I spent last week at CERN, the European high-energy physics laboratory where the famous Higgs boson was discovered last July. Every time I go to CERN I feel a deep sense of reverence. Apart from quick visits over the years, I was there for three months in the late 1990s as a visiting scientist, doing work on early Universe physics, trying to figure out how to connect the Universe we see today with what may have happened in its infancy."
Entities: I (author), CERN, Higgs boson
Relationships :
- I "visited" CERN
- CERN "discovered" Higgs boson
Thanks.
Yes absolutely. This is called Relation Extraction. Stanford has developed several useful tools for working on this problem.
Here is there website: http://deepdive.stanford.edu/relation_extraction
Here is the github repository: https://github.com/philipperemy/Stanford-OpenIE-Python
In general here is how the process works.
results = entract_entity_relations("Barack Obama was born in Hawaii.")
print(results)
# [['Barack Obama','was born in', 'Hawaii']]
Of some importance is that only triples are extracted of the form (subject,predicate,object).
You can extract verbs with their dependants using Stanford Parser, for example. E.g., you might get "dependency chains" like
"I :: spent :: at :: CERN".
It is a much tougher task to recognise that "I spent at CERN" and "I visited CERN" and "CERN hosted my visit" (etc) denote the same kind of event. Going into how this can be done is beyond the scope of an SO question, but you can read up literature of paraphrases recognition (here is one overview paper). There is also a related question on SO.
Once you can cluster similar chains, you'd need to find a way to label them. You could simply choose the verb of the most common chain in a cluster.
If, however, you have a pre-defined set of relation types you want to extract and lots of texts manually annotated for these relations, then the approach could be very different, e.g., using machine learning to learn how to recognize a relation type based on annotated data.
Don't know if you're still interested but CoreNLP added a new annotator called OpenIE (Open Information Extraction), which should accomplish what you're looking for. Check it out: OpenIE
Similar to the Stanford parser, you can also use the Google Language API, where you send a string and get a dependency tree response.
You can test this API first to see if it works well with your corpus: https://cloud.google.com/natural-language/
The outcome here is a subject predicate object (SPO) triplet, where your predicate describes the relationship. You'll need to traverse the dependency graph and write a script to parse out the triplet.
There are many ways to do relation extraction. As colleagues mentioned that you have to know about NER and coreference resolution. Different techniques require different approaches. Nowadays, Distant Supervision is most common, and for detecting the relation between entities, they used FREEBASE.
I've been working on document level sentiment analysis since past 1 year. Document level sentiment analysis provides the sentiment of the complete document. For example - The text "Nokia is good but vodafone sucks big time" would have a negative polarity associated with it as it would be agnostic to the entities Nokia and Vodafone. How would it be possible to get entity level sentiment, like positive for Nokia but negative for Vodafone ? Are there any research papers providing a solution to such problems ?
You can try Aspect-level or Entity-level Sentiment Analysis. There are good efforts have been already done to find the opinions about the aspects in a sentence. You can find some of works here. You can also go further and deeper and review those papers that are related to feature (aspect) extraction. What does it mean? Let me give you an example:
"The quality of screen is great, however, the battery life is short."
Document-level sentiment analysis may not give us the real sense of this document here because we have one positive and one negative sentence in the document. However, by aspect-based (aspect-level) opinion mining, we can figure out the senses/polarities towards different entities in the document separately. By doing feature extraction, in the first step, you try to find the features (aspects) in different sentences (in here "quality of screen" or simply "quality" and "battery life"). Afterwards, when you have these aspects, you try to extract opinions related to these aspects ("great" for "quality" and "short" for "battery life"). In researches and academic papers, we also name features (aspects) as target words (those words or entities on which users comment), and the opinions as opinion words, the comments that have been stated about the target words.
By searching the keywords that I have just mentioned, you can become more familiar with these concepts.
You could look for entities and their coreferents, and have a simple heuristic like giving each entity sentiment from the closest sentiment term, perhaps closest by distance in a dependency parse tree, as opposed to linearly. Each of those steps seems to be an open research topic.
http://scholar.google.com/scholar?q=entity+identification
http://scholar.google.com/scholar?q=coreference+resolution
http://scholar.google.com/scholar?q=sentiment+phrase
http://scholar.google.com/scholar?q=dependency+parsing
This can be achieved using Google Cloud Natural Language API.
I also tried getting research articles on this but haven't found any. I would suggest you to try using the aspect based sentiment analysis algorithms. The similarity i found is there we recognize aspects of a single entity in a sentence and then find the sentiment of each aspect.Similarly we can train our model using the same algorithm which can detect the entities as it does for aspects and find the sentiment of such entities. I didn't try this but I am going to.Let me know if this worked or not.Also there are various ways to do this. The following are the links for few articles.
http://arxiv.org/pdf/1605.08900v1.pdf
https://cs224d.stanford.edu/reports/MarxElliot.pdf
I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably the realm of Natural Language Programming (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK and find there's so much domain specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet. (http://sentiwordnet.isti.cnr.it/) Which is like a dictionary that has a sentiment grade for words. It will tell you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some nltk code that looks through your sentences for the words you want to associate the sentiment with. So you would write a script to get some level of meaningful chunks of text that surround the words you were looking at, maybe sentence or clause level. Then you can have another thing that runs through the surrounding words and grab all the sentiment scores from the SentiWordNet.
I have some old code that did this and can place on github if you'd like, but you'd still need to make your own request for SentiWordNet.
I guess your problem is more on association rather than just classification. Now moving forward with this assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entrophy
Are there alternatives I have not considered?
In depth study of automata theory with respect to NLP will help you a lot, it helped me a lot in grasping the implementations like OpenNLP.
Yes, this is a NLP problem by the name of Sentiment analysis. Sentiment analysis is an active research area with different approaches and a task where a lot of other NLP-methods have to work together, so it is certainly not the easiest field to get started with in NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).