How to obtain a lexicalized dependency path between two arguments?

According to the authors of the paper Discrete-State Variational Autoencoders
for Joint Discovery and Factorization of Relations, the first field of this dataset is a lexicalized dependency path between the pair of entities in each training sentence.
What tool (preferably in python) can extract such lexicalized path from a sentence with an identified pair of entities?

You can use NLTK
NLTK has been called “a wonderful tool for teaching, and working in,
computational linguistics using Python,” and “an amazing library to
play with natural language.”
Using NLTK, you can parse a given sentence to get dependency relations between its words and their POS tags.
It doesn't provide a way to get those lexicalized dependency paths directly,
but it gives you what you need to write your own method to achieve that.
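If you want a concrete starting point in Python, here is a minimal sketch using spaCy (which later answers on this page also use). The path format below (lemmas joined by dependency labels up to the lowest common ancestor and back down) is an assumption for illustration, not the exact encoding used in the paper's dataset.
import spacy

nlp = spacy.load('en_core_web_sm')

def dependency_path(doc, tok1, tok2):
    # Chains of ancestors, starting from each token itself and going up to the root.
    ancestors1 = [tok1] + list(tok1.ancestors)
    ancestors2 = [tok2] + list(tok2.ancestors)
    # The first token shared by both chains is the lowest common ancestor.
    common = next(t for t in ancestors1 if t in ancestors2)
    up = ancestors1[:ancestors1.index(common)]     # climb from tok1 towards the LCA
    down = ancestors2[:ancestors2.index(common)]   # climb from tok2 towards the LCA
    parts = []
    for t in up:
        parts.append('{0} <-{1}-'.format(t.lemma_, t.dep_))
    parts.append(common.lemma_)
    for t in reversed(down):
        parts.append('-{0}-> {1}'.format(t.dep_, t.lemma_))
    return ' '.join(parts)

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
# Assume the two entity head tokens are "Journal" and "currencies".
print(dependency_path(doc, doc[2], doc[10]))
For this sentence the function prints something like journal <-nsubj- publish -dobj-> piece -prep-> on -pobj-> currency, which you can then map onto whatever path encoding the dataset expects.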

Related

How to use dependency parsing features for text classification?

I did dependency parsing for a sentence using spacy and obtained syntactic dependency tags.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
Output
Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
I'm unable to understand how I can use this information to generate dependency-based features for text classification. What are the possible ways to generate features from this output?
Thanks in advance.
In spaCy, there is currently no direct way to include the dependency features into the textcat component, unless you hack your way through the internals of the code.
In general, you'll have to think about what kind of features would be beneficial to give clues to your textcat algorithm. You could generate binary features for any possible "dependency path" in your data, such as "RB --advmod-- VBD" being one feature and then count how many times it occurs, but you'll very quickly have a very sparse dataset.
You may also be interested in other features like "what POS is the ROOT word" or does the sentence include patterns like "two nouns connected by a verb". But it really depends on the application.
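To make that concrete, here is a hedged sketch of the counting approach just described, assuming spaCy for parsing and scikit-learn's DictVectorizer to turn the counts into a sparse matrix; the exact feature naming is an arbitrary choice.
import spacy
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

nlp = spacy.load('en_core_web_sm')

def dep_features(text):
    doc = nlp(text)
    feats = Counter()
    for token in doc:
        # One count feature per arc, e.g. "VBD --advmod-- RB" (head tag, relation, child tag).
        feats['{0} --{1}-- {2}'.format(token.head.tag_, token.dep_, token.tag_)] += 1
    # One of the extra cues mentioned above: the POS of the ROOT word.
    root = next(t for t in doc if t.dep_ == 'ROOT')
    feats['ROOT_POS=' + root.pos_] += 1
    return feats

texts = [
    'Wall Street Journal just published an interesting piece on crypto currencies',
    'The market reacted quickly to the announcement',
]
vec = DictVectorizer()
X = vec.fit_transform([dep_features(t) for t in texts])  # sparse matrix for a classifier
print(X.shape)
As noted above, expect this matrix to be very sparse once you have more than a handful of documents.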

How do you differentiate between names, places, and things?

Here is a list of proper nouns taken from The Lord of the Rings. I was wondering if there is a good way to sort them based on whether they refer to a person, place or thing. Does there exist a natural language processing library that can do this? Is there a way to differentiate between places, names, and things?
Shire, Tookland, Bagginses, Boffins, Marches, Buckland, Fornost, Norbury, Hobbits, Took, Thain, Oldbucks, Hobbitry, Thainship, Isengrim, Michel, Delving, Midsummer, Postmaster, Shirriff, Farthing, Bounders, Bilbo, Frodo
You're talking about Named Entity Recognition (NER). It is the information extraction task of locating and classifying pieces of text into predefined categories such as person names, locations, organizations, time expressions, monetary values, etc. You can do this with unsupervised methods using a dictionary, such as the word list you have, or with supervised methods such as CRFs or neural networks, but then you need a set of sentences annotated with the entity names and their classes. In this example here, using spaCy (an NLP library), the authors applied NER to the Lord of the Rings novels. You can read more in the link.
Here is the solution:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Wikipedia link: https://en.wikipedia.org/wiki/Named-entity_recognition
Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:
Scanning news articles for the people, organizations and locations reported.
Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
Quickly retrieving geographical locations talked about in Twitter posts.
NER with spaCy
spaCy is regarded as one of the fastest NLP frameworks in Python, with optimized implementations for each of the NLP tasks it supports. It is easy to learn and use, and simple tasks can be done in a few lines of code.
Installation :
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
In the output, the first column specifies the entity, the next two columns the start and end characters within the sentence/document, and the final column specifies the category.
Further, it is interesting to note that spaCy’s NER model uses capitalization as one of the cues to identify named entities. The same example, when tested with a slight modification, produces a different result.
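For example (an illustrative guess at the kind of modification meant; the exact entities returned depend on the model and version), lowercasing the same sentence typically changes what the model finds. Continuing the code above:
# Same pipeline, but with the sentence lowercased. Because the model relies
# partly on capitalization, it will typically miss or relabel entities here.
doc_lower = nlp('apple is looking at buying u.k. startup for $1 billion')
for ent in doc_lower.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)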

How does TreeTagger get the lemma of a word?

I am using TreeTagger to get the lemmas of words in Spanish, but I have observed there are too many words which are not lemmatized as they should be. I would like to know how this operation works: is it done with techniques such as decision trees or other machine learning algorithms, or does it simply contain a list of words with their corresponding lemmas? Does someone know?
Thanks!
On basis of personal communication via email with H. Schmid, the author of TreeTagger, the answer to your question is:
The lemmatization function is based on the XTAG project, which includes a morphological analyzer. Within the XTAG project several corpora have been analyzed. For TreeTagger, the analysis of the Penn Treebank corpus is especially relevant, since this corpus is the training corpus for the English parameter file of TreeTagger. As for lemmatization, the lemmata have simply been stored in a lexicon, and TreeTagger uses this lexicon as a lookup table.
Hence, with TreeTagger you may only retrieve the lemmata that are available in the lexicon.
If you need additional lemmatization functionality beyond what TreeTagger offers, you will need a morphological analyzer and, depending on your approach, a suitable training corpus, although the latter does not seem mandatory, since several analyzers perform quite well even when applied directly to the corpus of interest.
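To illustrate the lookup-table idea in code (a toy sketch with made-up entries, not TreeTagger's actual lexicon or file format):
# Toy lookup-table lemmatizer; TreeTagger's real lexicon, derived from its
# training corpora, is of course far larger and in a different format.
LEMMA_LEXICON = {
    'fuimos': 'ser',
    'cantaba': 'cantar',
    'niños': 'niño',
}

def lookup_lemma(word):
    # Words missing from the lexicon fall through unchanged, which is why
    # out-of-lexicon forms are "not transformed" as the question observes.
    return LEMMA_LEXICON.get(word.lower(), word)

print([lookup_lemma(w) for w in ['Fuimos', 'cantaba', 'felices']])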

Calculating grammar similarity between two sentences

I'm making a program which provides some English sentences that the user has to learn.
For example:
First, I provide a sentence "I have to go school today" to user.
Then if the user wants to learn more sentences like that, I find some sentences which have high grammar similarity with that sentence.
I think the only way to provide such sentences is to calculate similarity between them.
Is there a way to calculate grammar similarity between two sentences?
or is there a better way to make that algorithm?
Any advice or suggestions would be appreciated. Thank you.
My approach to solving this problem would be to do part-of-speech tagging using a tool like NLTK and compare the tree structure of your phrase against your database.
Alternatively, if you already have a training dataset, you can use WEKA to apply a machine learning approach to relate the phrases.
You can parse your sentence as either a constituent or dependency tree and use these representations to formulate some form of query that you can use to find candidate sentences with similar structures.
You can check this available tool from Stanford NLP:
Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions"). Tregex comes with Tsurgeon, a tree transformation language. Also included from version 2.0 on is a similar package which operates on dependency graphs (class SemanticGraph), called semgrex.
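If you prefer a lightweight Python sketch over Tregex/semgrex, one simple (assumed, not canonical) way to score structural similarity is to compare the dependency arcs of the two parses, for example with spaCy:
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

def arc_profile(sentence):
    # Represent a sentence by the multiset of (head POS, dependency label, child POS) arcs.
    doc = nlp(sentence)
    return Counter((t.head.pos_, t.dep_, t.pos_) for t in doc)

def grammar_similarity(s1, s2):
    a, b = arc_profile(s1), arc_profile(s2)
    overlap = sum((a & b).values())   # arcs shared by both sentences
    total = sum((a | b).values())     # all arcs from either sentence
    return overlap / total if total else 0.0

print(grammar_similarity('I have to go school today',
                         'She has to visit work tomorrow'))
print(grammar_similarity('I have to go school today',
                         'The weather is nice'))
Structurally similar sentences should score higher than unrelated ones; you can swap in richer representations (full subtrees, POS n-grams) if this proves too coarse.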

Identifying the entity in sentiment analysis using Lingpipe

I have implemented sentiment analysis using the sentiment analysis module of Lingpipe. I know that they use a Dynamic LR model for this. It just tells me if the test string is a positive sentiment or negative sentiment. What ideas could I use to determine the object for which the sentiment has been expressed?
If the text is categorized as positive sentiment, I would like to get the object for which the sentiment has been expressed - this could be a movie name, product name or others.
Although this question is really old, I would like to answer it for others' benefit.
What you want here is concept level sentiment analysis. For a very basic version, I would recommend following these steps:
Apply sentence splitter. You can either use Lingpipe's Sentence Splitter or the OpenNLP Sentence Detector.
Apply part-of-spech tagging. Again you can either use Lingpipe's POS tagger or OpenNLP POS Tagger.
You then need to identify the token(s) tagged as nouns by the POS tagger. These token(s) have the potential of being the targeted entity in the sentence.
Then you need to find sentiment words in the sentence. The easiest way to do this is by using a dictionary of sentiment bearing words. You can find many such dictionaries online.
The next step is to find the dependency relations in the sentence. This can be achieved by using the Stanford Dependency Parser. For example, if you try out the sentence - "This phone is good." in their online demo, you can see the following 'Typed Dependencies':
det(phone-2, This-1),
nsubj(good-4, phone-2),
cop(good-4, is-3),
root(ROOT-0, good-4)
The dependency nsubj(good-4, phone-2) here indicates that phone is the nominal subject of the token good, implying that the word good is expressed for phone. I am sure that your sentiment dictionary will contain the word good and phone would have been identified as a noun by the POS tagger. Thus, you can conclude that the sentiment good was expressed for the entity phone.
This was a very basic example. You can go a step further and create rules around the dependency relations to extract more complex sentiment-entity pairs. You can also assign scores to your sentiment terms and come up with a total score for the sentence depending upon the number of occurrences of sentiment words in that sentence.
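As a compact illustration of steps like these, here is a sketch that uses spaCy in place of Lingpipe, OpenNLP and the Stanford parser (purely a substitution for brevity), with a toy sentiment word list standing in for a real lexicon:
import spacy

nlp = spacy.load('en_core_web_sm')
SENTIMENT_WORDS = {'good', 'great', 'bad', 'terrible'}  # toy stand-in for a real lexicon

def sentiment_targets(text):
    pairs = []
    for sent in nlp(text).sents:                            # sentence splitting
        for token in sent:
            if token.lemma_.lower() in SENTIMENT_WORDS:     # sentiment word found
                for child in token.children:                # inspect its dependents
                    # a nominal subject of the sentiment word is the candidate entity
                    if child.dep_ == 'nsubj' and child.pos_ in ('NOUN', 'PROPN'):
                        pairs.append((token.text, child.text))
    return pairs

print(sentiment_targets('This phone is good. The battery life is terrible.'))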
Usually, in a sentiment-bearing sentence, the main entity of the sentence is the object of that sentiment. So a basic heuristic is to run NER and take the first entity. Otherwise you should use a deep-parsing NLP toolkit and write some rules to link the sentiment to its object.
