How to use dependency parsing features for text classification? - python-3.x

I did dependency parsing for a sentence using spacy and obtained syntactic dependency tags.
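import spacy

# 'en' is the spaCy v2 shortcut; in spaCy v3+ load a packaged model such as 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
# Print each token with its POS tag, its dependency label, and its head token
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))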
Output
Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
I'm unable to understand how I can use this information to generate dependency-based features for text classification. What are the possible ways to generate features from this for text classification?
Thanks in advance.

In spaCy, there is currently no direct way to include the dependency features into the textcat component, unless you hack your way through the internals of the code.
In general, you'll have to think about what kind of features would give useful clues to your textcat algorithm. You could generate one feature for every "dependency path" that occurs in your data, such as "RB --advmod-- VBD", and either record whether it is present (binary) or count how many times it occurs, but you'll very quickly end up with a very sparse dataset.
You may also be interested in other features like "what POS is the ROOT word" or does the sentence include patterns like "two nouns connected by a verb". But it really depends on the application.
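As a minimal sketch of the idea above, the parser output from the question can be turned into count features by hand. The `PARSE` rows are copied from that output; the function name `dependency_features` and the feature-key format are illustrative choices, not part of spaCy.

```python
from collections import Counter

# (token, tag, dep, head, head_tag) rows copied from the parser output in the question
PARSE = [
    ("Wall", "NNP", "compound", "Street", "NNP"),
    ("Street", "NNP", "compound", "Journal", "NNP"),
    ("Journal", "NNP", "nsubj", "published", "VBD"),
    ("just", "RB", "advmod", "published", "VBD"),
    ("published", "VBD", "ROOT", "published", "VBD"),
    ("an", "DT", "det", "piece", "NN"),
    ("interesting", "JJ", "amod", "piece", "NN"),
    ("piece", "NN", "dobj", "published", "VBD"),
    ("on", "IN", "prep", "piece", "NN"),
    ("crypto", "JJ", "compound", "currencies", "NNS"),
    ("currencies", "NNS", "pobj", "on", "IN"),
]


def dependency_features(parse):
    """Count delexicalized 'TAG --dep-- HEAD_TAG' triples and record the POS of the ROOT."""
    feats = Counter()
    for _text, tag, dep, _head, head_tag in parse:
        if dep == "ROOT":
            # "What POS is the ROOT word" as its own feature
            feats["root_tag=" + tag] += 1
        else:
            feats["{0} --{1}-- {2}".format(tag, dep, head_tag)] += 1
    return feats


features = dependency_features(PARSE)
for name, count in sorted(features.items()):
    print(name, count)
```

One such dictionary per document can then be passed to something like scikit-learn's `DictVectorizer` to obtain a sparse matrix for a separate classifier. Using tags instead of words keeps the feature space smaller, but it will still be sparse on real data.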

Related

NLP Structure Question (best way for doing feature extraction)

I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation]
Step 2 - Feature extraction
Step 3 - Classification, using different types of classifier (LinearSVC etc.)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction ?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF],
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure of using these?
c. For text pre-processing, I am doing the processing on a text column of a df, and the last modified version of it is what I use as input to my classifier. If you do feature extraction, do you do that in the same column, or do you create a new one and send only the features from that column to the classifier?
Thanks so much in advance
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings and so on each have their own restrictions and advantages.
You need to understand the problem and the data associated with it, and then build the pipeline that makes the best use of that data.
For text-related problems specifically, you will find word embeddings very useful. TF-IDF is useful when the problem calls for emphasising words that occur in few documents, since it down-weights terms that appear everywhere. Word embeddings, on the other hand, convert text to an N-dimensional vector that can be compared with other vectors for similarity. This brings a sense of association into your data, and the model can learn useful features from it.
In simple cases, a bag-of-words representation of the tokenized texts is enough.
So you need to discover the best approach for your problem. If your problem closely resembles well-known NLP problems such as IMDB review classification or sentiment analysis on Twitter data, you can find a number of approaches on the internet.
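To make the point about TF-IDF concrete, here is a small sketch in plain Python. It uses the textbook weighting tf × log(N / df); the three toy documents and the function name `tfidf` are made up for illustration, and libraries such as scikit-learn use a smoothed variant of the formula, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for this example
docs = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great",
]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


weights = tfidf(docs)
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

In the second document, "terrible" occurs nowhere else and gets the highest weight, while "the" and "was" occur in every document and get a weight of zero.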

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
(https://arxiv.org/pdf/1607.05368.pdf)
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then only keeps a few types of part of speech (verbs, nouns, etc.) and gets rid of the rest. So what is the recommended way?
The quote from the second paper seems to me like they only split up words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way, per se, to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop-words or even apply POS tags so that you can remove tokens by grammatical category, on the assumption that some categories (such as adjectives) do not contain valuable information for, say, modelling a topic. But this depends purely on the type of analysis you are going to run after the pre-processing step.
For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine. I also suspect that, regardless of the ML algorithm used later on, your results will be better if you pre-process your data carefully. I am not sure whether POS tags contribute to the meaning of a sentence, though.
I hope this is useful feedback for your research. If not, you could provide a code sample so we can discuss the issue further.
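The three pre-processing options discussed in this thread can be compared side by side with a small sketch. The `tagged` triples are hand-annotated and hypothetical; a real pipeline would get them from a tagger such as Stanford CoreNLP. The function names and the choice of which POS prefixes to keep are assumptions made for illustration only.

```python
# Hypothetical, hand-annotated (token, lemma, POS) triples for one sentence
tagged = [
    ("The", "the", "DT"),
    ("banks", "bank", "NNS"),
    ("raised", "raise", "VBD"),
    ("their", "their", "PRP$"),
    ("rates", "rate", "NNS"),
]


def lemma_pos_tokens(tagged):
    """Option 1: replace each token with '{lemma}_{POS}'."""
    return ["{0}_{1}".format(lemma, pos) for _token, lemma, pos in tagged]


def filter_by_pos(tagged, keep_prefixes=("NN", "VB")):
    """Option 2: keep only lemmas whose tag starts with one of the given prefixes."""
    return [lemma for _token, lemma, pos in tagged if pos.startswith(keep_prefixes)]


def lowercase_only(tagged):
    """Option 3: keep every token and only lowercase it."""
    return [token.lower() for token, _lemma, _pos in tagged]


print(lemma_pos_tokens(tagged))  # ['the_DT', 'bank_NNS', 'raise_VBD', 'their_PRP$', 'rate_NNS']
print(filter_by_pos(tagged))     # ['bank', 'raise', 'rate']
print(lowercase_only(tagged))    # ['the', 'banks', 'raised', 'their', 'rates']
```

Any of the three token lists could then be fed to an embedding trainer; which one works best is an empirical question for the task at hand.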

How to obtain lexicalized dependency path between two arguments?

According to the authors of the paper "Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations", the first field of this dataset is a lexicalized dependency path between the pair of entities of the training sentences.
What tool (preferably in python) can extract such lexicalized path from a sentence with an identified pair of entities?
You can use NLTK.
NLTK has been called “a wonderful tool for teaching, and working in,
computational linguistics using Python,” and “an amazing library to
play with natural language.”
Using NLTK, you can parse a given sentence to get dependency relations between its words and their POS tags.
It doesn't provide a way to get those lexicalized dependency paths directly,
but it gives you what you need to write your own method to achieve that.
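A hand-written method of that kind might look like the sketch below. It assumes the parse is already available as parallel lists of words, head indices, and dependency labels, with the root pointing at itself. The example parse is the one shown in the first question on this page; the arrow notation in the output is an invented format and is not necessarily the one used in the dataset.

```python
# Parse of "Wall Street Journal just published an interesting piece on crypto currencies"
words = ["Wall", "Street", "Journal", "just", "published", "an",
         "interesting", "piece", "on", "crypto", "currencies"]
heads = [1, 2, 4, 4, 4, 7, 7, 4, 7, 10, 8]  # index of each token's head; root points to itself
deps = ["compound", "compound", "nsubj", "advmod", "ROOT", "det",
        "amod", "dobj", "prep", "compound", "pobj"]


def ancestors(i, heads):
    """Return the chain of indices from token i up to the root, inclusive."""
    chain = [i]
    while heads[chain[-1]] != chain[-1]:
        chain.append(heads[chain[-1]])
    return chain


def dependency_path(a, b, words, heads, deps):
    """Lexicalized path from token a to token b through their lowest common ancestor."""
    up = ancestors(a, heads)
    down = ancestors(b, heads)
    common = next(n for n in up if n in down)
    parts = []
    # Climb from a to the common ancestor
    for n in up[:up.index(common)]:
        parts.append("{0} <-{1}-".format(words[n], deps[n]))
    parts.append(words[common])
    # Descend from the common ancestor to b
    for n in reversed(down[:down.index(common)]):
        parts.append("-{0}-> {1}".format(deps[n], words[n]))
    return " ".join(parts)


path = dependency_path(2, 10, words, heads, deps)
print(path)  # Journal <-nsubj- published -dobj-> piece -prep-> on -pobj-> currencies
```

For multi-word entities you would first have to pick a single token per entity (for example its syntactic head) before calling `dependency_path`.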

Calculating grammar similarity between two sentences

I'm making a program which provides English sentences for the user to learn.
For example:
First, I provide a sentence "I have to go school today" to the user.
Then, if the user wants to learn more sentences like that, I find some sentences which have high grammar similarity with that sentence.
I think the only way to provide such sentences is to calculate similarity.
Is there a way to calculate grammar similarity between two sentences?
Or is there a better way to build that algorithm?
Any advice or suggestions would be appreciated. Thank you.
My approach to this problem would be to run part-of-speech tagging with a tool like NLTK and compare the tree structure of your phrase with those in your database.
Alternatively, if you already have a training dataset, use WEKA to take a machine learning approach to matching the phrases.
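A very rough version of the POS-based comparison can be sketched with the standard library alone, by treating each sentence as a sequence of tags and scoring the overlap with `difflib.SequenceMatcher`. The tag sequences below are assigned by hand for illustration (a tagger such as NLTK's would normally produce them), the candidate sentences are invented, and comparing flat tag sequences is a simplification of comparing full trees.

```python
from difflib import SequenceMatcher

# Hand-assigned Penn Treebank tags, for illustration only
query_tags = ["PRP", "VBP", "TO", "VB", "NN", "NN"]  # "I have to go school today"

candidates = {
    "You have to finish homework tonight": ["PRP", "VBP", "TO", "VB", "NN", "NN"],
    "She wants to visit Paris tomorrow": ["PRP", "VBZ", "TO", "VB", "NNP", "NN"],
    "The cat sat on the mat": ["DT", "NN", "VBD", "IN", "DT", "NN"],
}


def tag_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two POS tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


# Rank candidate sentences from most to least similar to the query
ranked = sorted(
    ((tag_similarity(query_tags, tags), sentence) for sentence, tags in candidates.items()),
    reverse=True,
)
for score, sentence in ranked:
    print(round(score, 2), sentence)
```

The sentence with an identical tag sequence scores 1.0 and comes first; the one with a different clause structure comes last.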
You can parse your sentence as either a constituent or dependency tree and use these representations to formulate some form of query that you can use to find candidate sentences with similar structures.
You can check this available tool from Stanford NLP:
Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions"). Tregex comes with Tsurgeon, a tree transformation language. Also included from version 2.0 on is a similar package which operates on dependency graphs (class SemanticGraph), called semgrex.

Part of speech tagging in OpenNLP vs. StanfordNLP

I'm new to part-of-speech (POS) tagging and I'm doing POS tagging on a text document. I'm considering using either OpenNLP or StanfordNLP for this. For StanfordNLP I'm using a MaxentTagger and I use english-left3words-distsim.tagger to train it. In OpenNLP I'm using POSModel and train it using en-pos-maxent.bin. How do these two taggers (MaxentTagger and POSTagger) and the training sets (english-left3words-distsim.tagger and en-pos-maxent.bin) differ, and which one usually gives better results?
Both POS taggers are based on Maximum Entropy machine learning. They differ in the parameters/features used to determine POS tags. For example, the StanfordNLP POS tagger uses: "(i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs" (read more in the paper). The features of the OpenNLP tagger are documented elsewhere; I don't currently know where.
The models are probably trained on different corpora.
In general, it is really hard to tell which NLP tool performs better in terms of quality. This depends heavily on your domain, and you need to test the tools on your own data. See the following papers for more information:
Is Part-of-Speech Tagging a Solved Task?
Large Dataset for Keyphrases Extraction
In order to address this problem practically, I'm developing a Maven plugin and an annotation tool to create domain-specific NLP models more effectively.
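Testing taggers on your own domain, as suggested above, can start with something as simple as token-level accuracy against a small hand-labelled sample. The sketch below is in Python for brevity; the gold tags and both tagger outputs are made-up placeholders, not real output from OpenNLP or StanfordNLP.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Placeholder data: one hand-labelled sentence and two hypothetical tagger outputs
gold = ["PRP", "VBP", "TO", "VB", "NN", "NN"]
tagger_a = ["PRP", "VBP", "TO", "VB", "NN", "NN"]
tagger_b = ["PRP", "VBP", "TO", "NN", "NN", "NN"]

acc_a = tagging_accuracy(gold, tagger_a)
acc_b = tagging_accuracy(gold, tagger_b)
print("tagger A:", acc_a)
print("tagger B:", round(acc_b, 3))
```

A meaningful comparison needs far more than one sentence, and both taggers must use the same tokenization and tag set as the gold sample.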
