NLP Aspect Mining approach

I'm trying to implement an aspect miner based on Amazon consumer reviews of durables (washing machines, refrigerators). The idea is to output sentiment polarity per aspect instead of for the entire sentence. For example, the review 'Food was good but service was bad' should output food as positive and service as negative. I read Richard Socher's paper on the RNTN model for fine-grained sentiment classification, but I suspect I would need to manually tag sentiment for phrases in a different domain and build my own treebank for better accuracy.
Here's an alternative approach I thought of. Could someone please validate it or guide me with feedback?
I break the approach into two subtasks: 1) identify aspects, 2) identify sentiment.
Identify aspects
Use a POS tagger to identify all nouns; this should shortlist potentially all aspects in the reviews.
Run word2vec over these nouns to find similar ones and reduce the candidate set (a quick sketch follows).
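A minimal sketch of this aspect step, assuming NLTK for the POS tagging and a small gensim word2vec model trained on the tokenised reviews (the toy reviews and all hyperparameters below are placeholders, not part of the original plan):

import nltk
from gensim.models import Word2Vec

nltk.download("punkt", quiet=True)                        # one-time NLTK data downloads
nltk.download("averaged_perceptron_tagger", quiet=True)

# A few toy reviews standing in for the real Amazon corpus.
reviews = [
    "The spin cycle is quiet and the drum is huge",
    "Delivery was late but the drum quality is great",
    "The motor died after a month and customer service was unhelpful",
]
tokenised = [nltk.word_tokenize(r.lower()) for r in reviews]

# Step 1: POS-tag each review and keep the nouns as candidate aspects.
candidate_aspects = {
    word
    for tokens in tokenised
    for word, tag in nltk.pos_tag(tokens)
    if tag.startswith("NN")
}

# Step 2: train word2vec on the tokenised reviews and inspect similar nouns,
# so near-duplicate aspects can be collapsed into a shorter inventory.
w2v = Word2Vec(tokenised, vector_size=50, window=3, min_count=1, epochs=50)
for noun in sorted(candidate_aspects):
    neighbours = [w for w, _ in w2v.wv.most_similar(noun, topn=3)]
    print(noun, "->", neighbours)

On a real corpus you would raise min_count and train on far more sentences before merging the similar-noun lists into the final aspect set.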
Identify sentiments
Train a CNN or dense network on reviews with ratings 1, 2, 4 and 5 (ignore 3, since we need data with clear polarity).
Break the test-set reviews into phrases (e.g. 'Food was good') and score them with the above model.
Find the aspects identified in the first subtask and tag them to their respective phrases (a rough sketch of the scoring step follows).
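A rough sketch of that scoring step, with the CNN swapped for a tf-idf plus logistic regression baseline purely to keep the example short (my substitution, not the asker's plan); ratings 1-2 become negative, 4-5 positive, and phrases are then scored with the trained model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (review_text, star_rating) pairs standing in for the Amazon data.
train = [
    ("terrible machine, broke within a week", 1),
    ("noisy and the door leaks", 2),
    ("washes well and is very quiet", 5),
    ("great capacity, love the quick cycle", 4),
]

# Keep ratings 1, 2, 4, 5 and map them to binary polarity; 3-star reviews are dropped as proposed.
texts = [t for t, r in train if r != 3]
labels = [1 if r >= 4 else 0 for t, r in train if r != 3]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# Score candidate phrases split out of a test review.
phrases = ["the drum is quiet", "customer service was unhelpful"]
for phrase, p_pos in zip(phrases, clf.predict_proba(phrases)[:, 1]):
    print(phrase, "-> P(positive) =", round(p_pos, 2))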

I don't know how to answer this question but have a few suggestions:
Take a look at multi-task learning in the neural-network literature and try an end-to-end network for multiple tasks.
Use pretrained word vectors such as word2vec or GloVe as inputs.
Don't rely on POS taggers when you work with internet data.
Find a way to represent named entities and out-of-vocabulary words in your design.
Don't ignore the 3-star reviews!
You should annotate some data periodically.

Related

How to get sentiment score for a word in a given dataset

I have a sentiment analysis dataset that is labeled in three categories: positive, negative, and neutral. I also have a list of words (mostly nouns), for which I want to calculate the sentiment value, to understand "how" (positively or negatively) these entities were talked about in the dataset. I have read some online resources like blogs and thought about a couple of approaches for calculating the sentiment score for a particular word X.
Count how many of the sentences containing word X have "positive", "negative" and "neutral" labels, then compute a weighted average sentiment for that word (a small sketch of this appears below).
Take a generic untrained BERT architecture, train it on the dataset, and then pass each word from the list to the trained model to get its sentiment score.
Does either of these approaches make sense? If so, can you suggest some related work that I can look at?
If these approaches don't make sense, could you please advise how I can calculate the sentiment score for a word, in a given dataset?
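For what it's worth, a quick sketch of the first approach, assuming the three labels are mapped to +1/0/-1 and the word's score is the mean label over the sentences that contain it (the mapping and toy data are mine):

# Toy labelled sentences; in practice these come from the annotated dataset.
data = [
    ("X was brilliant in the lead role", "positive"),
    ("the plot was dull but X was fine", "neutral"),
    ("X ruined an otherwise decent movie", "negative"),
]
label_value = {"positive": 1, "neutral": 0, "negative": -1}

def word_sentiment(word, data):
    """Average mapped label value over the sentences that contain the word."""
    values = [label_value[lab] for text, lab in data if word.lower() in text.lower().split()]
    return sum(values) / len(values) if values else None

print(word_sentiment("X", data))  # (1 + 0 - 1) / 3 = 0.0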
The first method will suffer from the same drawbacks as other bag-of-words models do. Consider that you have a dataset of movie reviews with their sentiment scores, and you want to find the sentiment for a particular actor called X. A label for a sample like "X's acting was the only good thing in an otherwise bad movie" will be negative, but the sentiment towards X is positive. A simple approach like the first one can't handle such cases.
The second approach also does not make much sense, as BERT models may not perform well without context. You can try using weakly supervised learning, which can help in creating token-level labels. Read section 3.3 of this paper to get an idea about this. Disclaimer: I'm one of the authors of that paper.

Text representations: how to differentiate between strings of similar topic but opposite polarities?

I have been clustering a corpus: sentences are grouped together by computing their tf-idf representations and keeping pairs whose similarity weight from the gensim model exceeds a certain threshold.
# `DocSim`, `model` (the word vectors) and `stopwords` come from the asker's own setup.
tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model, stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)
The problem is that despite putting high threshold values, sentences of similar topics but opposite polarities get clustered together as such:
For example, the similarity weight obtained between "don't like it" and "i like it" is high enough for them to be clustered together.
Are there any other methods, libraries or alternative models that can differentiate the polarities effectively by assigning them very low similarities or opposite vectors?
This is so that the outputs "i like it" and "dont like it" are in separate clusters.
PS: Pardon me if there are any conceptual errors as I am rather new to NLP. Thank you in advance!
The problem is in how you represent your documents. Tf-idf is good for representing long documents where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards the polarity because negative particles like "no" or "not" will appear in most documents and they will always receive a low weight.
I would recommend trying some neural embeddings that might capture the polarity. If you want to keep using Gensim, you can try doc2vec but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.
Even averaging word embeddings might work (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.
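As a minimal sketch of the averaged-embedding idea (the particular gensim-data model name is my choice, and the download is large); note that plain averaging may still rate the two example sentences as fairly similar, so treat it as a starting point rather than a guaranteed fix:

import numpy as np
import gensim.downloader as api

# Any pre-trained KeyedVectors will do; this is one of the models available via gensim-data.
wv = api.load("fasttext-wiki-news-subwords-300")

def sentence_vector(text):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_vector("i like it"), sentence_vector("dont like it")))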
Unfortunately, simple text representations based merely on sets of words don't distinguish such grammar-driven reversals of meaning very well.
The method needs to be sensitive to meaningful phrases, and the hierarchical, grammar-driven inter-word dependencies, to model that.
Deeper neural networks using convolutional/recurrent techniques do better, or methods which tree-model sentence-structure.
For ideas see for example...
"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
...or a more recent summary presentation...
"Representations for Language: From Word Embeddings to Sentence Meanings"

Unsupervised sentiment analysis using doc2vec

Folks,
I have searched Google for different types of papers/blogs/tutorials etc. but haven't found anything helpful. I would appreciate it if anyone can help me. Please note that I am not asking for step-by-step code, but rather an idea, blog, paper or tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying the positive or negative tone of a sentence, I want to find whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that sums up the number of forward-looking words/phrases such as "going forward", "in near future" or "in 5 years from now". I am not sure whether word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following (a rough sketch of steps 1-4 appears after this list):
Create a list of temporal terms [days, years, months, now, later]
Pick only the sentences containing these key terms
Use those sentences in a doc2vec model
Infer a vector for each new sentence and use a distance metric
GMM clustering + a limit
Distance from the average
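Here is a rough sketch of steps 1-4 of that first methodology using gensim's Doc2Vec (the toy sentences and parameters are my own):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

temporal_terms = {"days", "years", "months", "now", "later"}

sentences = [
    "Revenue grew five percent last quarter",
    "We expect margins to improve in the coming years",
    "The dividend is now payable to shareholders of record",
]

# Steps 1-2: keep only sentences containing a temporal term.
kept = [s for s in sentences if temporal_terms & set(s.lower().split())]

# Step 3: train a small doc2vec model on the kept sentences.
docs = [TaggedDocument(s.lower().split(), [i]) for i, s in enumerate(kept)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# Step 4: infer a vector for a new sentence and compare it to the training docs.
new_tokens = "in five years from now we plan to double capacity".split()
vec = model.infer_vector(new_tokens)
print(model.dv.most_similar([vec], topn=len(kept)))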
Another methodology could be (also sketched below):
Create a list of temporal terms [days, years, months, now, later]
Do bigram and trigram collocation extraction
Keep the collocations that contain temporal terms
Use those collocations in a kind of bag-of-collocations approach
Match binary feature vectors for the relevant collocations
Train a classifier to recognise the higher-level text
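And a small sketch of the collocation-filtering step of the second methodology with NLTK's collocation finder (toy text, only bigrams shown):

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

temporal_terms = {"days", "years", "months", "now", "later"}

text = ("we expect strong growth in the coming years . "
        "five years from now the market will look very different .")
tokens = text.split()

# Extract scored bigram collocations and keep only those containing a temporal term.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bigram_measures.pmi)
temporal_collocations = [bigram for bigram, score in scored if temporal_terms & set(bigram)]
print(temporal_collocations)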
This sounds like a good case for a bootstrapping approach if you have large amounts of text.
Both methodologies are really semi-supervised, since there is some need to find initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.

Multiclass text classification with python and nltk

I am given the task of classifying news text into one of the following 5 categories: Business, Sports, Entertainment, Tech and Politics.
About the data I am using:
It consists of text data labeled with one of the 5 types of news statement (BBC news data).
I am currently using NLP with the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (excluding stopwords).
Then I classify new data by calculating the sum of the weights of all its words with respect to each of the 5 categories; the class with the highest weight is returned as the output.
Here's the actual code.
This algorithm does predict new data accurately, but I am interested in other simple algorithms I could implement to achieve better results. I have used the Naive Bayes algorithm to classify data into two classes (e.g. spam or not spam) and would like to know how to implement it for multiclass classification, if that is feasible.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good-quality features lead to better systems more quickly than tuning or swapping algorithms and parameters.
In your case you can either go with word embeddings, as already said, or design your own custom features that you think will help discriminate the classes (whatever the number of classes is). For instance, how do you think a spam e-mail typically looks? Lots of mistakes, syntactic inversions, bad translations, odd punctuation, slang words... plenty of possibilities! Try to think the same way about your case with sport, business, news, etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at weighting methods other than raw term frequencies, such as tf-idf.
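As an illustration of both points (tf-idf weighting plus an off-the-shelf algorithm), here is a minimal scikit-learn sketch; Multinomial Naive Bayes is natively multiclass, which also answers the question about extending Naive Bayes beyond two classes (the toy data is mine):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One toy example per class; the real BBC-style dataset has many more.
train_texts = [
    "shares fell sharply after the quarterly earnings report",
    "the striker scored twice in the cup final",
    "the film topped the box office this weekend",
    "the new smartphone ships with a faster chip",
    "parliament passed the budget bill after a long debate",
]
train_labels = ["Business", "Sports", "Entertainment", "Tech", "Politics"]

# Multinomial Naive Bayes handles all 5 news categories directly,
# so no one-vs-rest wrapping is needed.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["lawmakers debated the new budget bill"]))  # likely 'Politics' on this toy data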
Since you're dealing with words, I would propose word embeddings, which give more insight into the relationships/meanings of words with respect to your dataset and thus much better classification.
If you are looking for other implementations of classification, you can check my sample code here; these models from scikit-learn can easily handle multiple classes, take a look at the scikit-learn documentation here.
If you want an easy-to-use framework around these classifiers, you can check out my rasa-nlu example; it uses the spacy_sklearn model, and sample implementation code is here. All you have to do is prepare the dataset in the given format and train the model.
If you want more intelligence, you can check out my Keras implementation here; it uses a CNN for text classification.
Hope this helps.

Multitask learning

Can anybody please explain multi-task learning in a simple and intuitive way? Maybe a real-world problem would be useful. These days I see many people using it for natural language processing tasks.
Let's say you've built a sentiment classifier for a few different domains. Say, movies, music DVDs, and electronics. These are easy to build high quality classifiers for, because there is tons of training data that you've scraped from Amazon. Along with each classifier, you also build a similarity detector that will tell you for a given piece of text, how similar it was to the dataset each of the classifiers was trained on.
Now you want to find the sentiment of some text from an unknown domain or one in which there isn't such a great dataset to train on. Well, how about we take a similarity weighted combination of the classifications from the three high quality classifiers we already have. If we are trying to classify a dish washer review (there is no giant corpus of dish washer reviews, unfortunately), it's probably most similar to electronics, and so the electronics classifier will be given the most weight. On the other hand, if we are trying to classify a review of a TV show, probably the movies classifier will do the best job.
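A tiny numeric sketch of that similarity-weighted combination (all numbers invented for illustration):

# P(positive) from each pre-trained domain classifier for the same dish-washer review.
domain_scores = {"movies": 0.40, "music": 0.50, "electronics": 0.80}

# Similarity of the review to each classifier's training domain.
similarity = {"movies": 0.10, "music": 0.20, "electronics": 0.70}

total = sum(similarity.values())
combined = sum(domain_scores[d] * similarity[d] / total for d in domain_scores)
print(combined)  # about 0.70: the electronics classifier dominates the weighted vote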

Resources