How to handle difficult sentences in NLP sentiment analysis

Such as the following sentence,
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
The "Bag of Words" (BOW) technique:
Here we have two categories of words: "positive" words with a polarity of 1 and "negative" words with a polarity of 0.
In this case, the word "good" falls into the positive category, so the result is correct, but only by accident.
Thus, this technique is eliminated.
Still using the BOW technique (a sort of "word embedding"):
But now take the surrounding words into consideration. In this case, the word "no" precedes "good", so the phrase is "no good", not the adjective "good" alone. However, "no good" is not what the author intended, given the context of the entire sentence.
Hence this question. Thanks in advance.

Word embeddings are one possible way to take into account the complexity that comes from the sequence of terms in your example. Using models pre-trained on general English, such as BERT, should give you interesting results for your sentiment analysis problem. You can leverage several implementations provided by the Hugging Face library.
Another approach, which doesn't rely on compute-intensive techniques (such as word embeddings), would be to use n-grams, which capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation for n-grams in its CountVectorizer class.

Related

How to get sentiment score for a word in a given dataset

I have a sentiment analysis dataset that is labeled in three categories: positive, negative, and neutral. I also have a list of words (mostly nouns), for which I want to calculate the sentiment value, to understand "how" (positively or negatively) these entities were talked about in the dataset. I have read some online resources like blogs and thought about a couple of approaches for calculating the sentiment score for a particular word X.
Count how many of the data instances (sentences) that contain the word X have "positive", "negative", and "neutral" labels. Then calculate the weighted average sentiment for that word.
Take a generic untrained BERT architecture, and then train it using the dataset. Then, pass each word from the list to that trained model to get the sentiment scores for the word.
Do either of these approaches make sense? If so, can you suggest some related work that I can look at?
If these approaches don't make sense, could you please advise how I can calculate the sentiment score for a word, in a given dataset?
The first method will suffer from the same drawbacks as other bag-of-words models do. Consider that you have a dataset of movie reviews with their sentiment scores, and you want to find the sentiment for a particular actor called X. A label for a sample like "X's acting was the only good thing in an otherwise bad movie" will be negative, but the sentiment towards X is positive. A simple approach like the first one can't handle such cases.
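To make the drawback concrete, here is a rough sketch of the first approach. The numeric label values, the word_sentiment helper and the two-sentence dataset are all assumptions made up for illustration; they are not from the question.

```python
import re
from collections import Counter

# Hypothetical numeric values for the three labels
LABEL_VALUES = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def word_sentiment(word, dataset):
    """Average label value over all sentences that contain `word`."""
    counts = Counter(
        label
        for text, label in dataset
        if word.lower() in re.findall(r"\w+", text.lower())
    )
    total = sum(counts.values())
    if total == 0:
        return None  # the word never occurs in the dataset
    return sum(LABEL_VALUES[label] * n for label, n in counts.items()) / total

dataset = [
    ("X's acting was the only good thing in an otherwise bad movie", "negative"),
    ("A wonderful film", "positive"),
]

# X inherits the label of the whole review, although the review praises X
print(word_sentiment("X", dataset))
```

The score for X comes out negative because the only review mentioning X is labeled negative as a whole, which is exactly the failure described above.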
The second approach does not make much sense either, as BERT models may not perform well on single words without context. You can try weakly supervised learning, which can help in creating token-level labels. Read section 3.3 of this paper to get an idea about this. Disclaimer: I'm one of the authors of this paper.

Text representations : How to differentiate between strings of similar topic but opposite polarities?

I have been clustering a certain corpus, grouping sentences together by computing their tf-idf vectors and checking whether the similarity weights from the gensim model exceed a certain threshold value.
tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model,stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)
The problem is that despite putting high threshold values, sentences of similar topics but opposite polarities get clustered together as such:
Here is an example of the similarity weights obtained between "don't like it" & "i like it"
Are there any other methods, libraries or alternative models that can differentiate the polarities effectively by assigning them very low similarities or opposite vectors?
This is so that the outputs "i like it" and "don't like it" end up in separate clusters.
PS: Pardon me if there are any conceptual errors as I am rather new to NLP. Thank you in advance!
The problem is in how you represent your documents. Tf-idf is good for representing long documents where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards the polarity because negative particles like "no" or "not" will appear in most documents and they will always receive a low weight.
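A small sketch with scikit-learn (not the DocSim class from the question) shows how the shared words drive the similarity. The token_pattern is an assumption chosen so that "don't" and "i" survive as tokens; with only two documents the idf values are not representative of a real corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["don't like it", "i like it"]

# Custom pattern: keep apostrophes and one-letter words as tokens
vectorizer = TfidfVectorizer(token_pattern=r"[\w']+")
X = vectorizer.fit_transform(docs)

similarity = cosine_similarity(X[0], X[1])[0, 0]
print(round(similarity, 3))
```

The two sentences share "like" and "it", so they stay fairly similar even though their polarity is opposite. In a larger corpus, where negation words occur in most documents, their idf weight shrinks and the gap becomes even smaller.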
I would recommend trying some neural embeddings that might capture the polarity. If you want to keep using Gensim, you can try doc2vec but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.
Even averaging word embeddings might be enough (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.
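Averaging word embeddings could look roughly like the sketch below. The average_vector helper and the toy three-dimensional vectors are invented for illustration; in practice `vectors` would be a pre-trained model, for example a Gensim KeyedVectors object, which supports the same `in` check and lookup by key.

```python
import numpy as np

def average_vector(tokens, vectors):
    """Mean of the vectors of all tokens found in `vectors`; None if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    return np.mean(found, axis=0)

# Toy 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "i": np.array([0.1, 0.0, 0.2]),
    "like": np.array([0.9, 0.1, 0.0]),
    "it": np.array([0.0, 0.2, 0.1]),
}

sentence_vector = average_vector("i like it".split(), toy_vectors)
print(sentence_vector)
```

Keep in mind that an average ignores word order, so a negation only affects the result through the vector of the negation word itself.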
Unfortunately, simple text representations based merely on the sets-of-words don't distinguish such grammar-driven reversals-of-meaning very well.
The method needs to be sensitive to meaningful phrases, and the hierarchical, grammar-driven inter-word dependencies, to model that.
Deeper neural networks using convolutional/recurrent techniques do better, as do methods that model sentence structure as a tree.
For ideas see for example...
"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
...or a more recent summary presentation...
"Representations for Language: From Word Embeddings to Sentence Meanings"

Unsupervised sentiment Analysis using doc2vec

Folks,
I have searched Google for different types of papers/blogs/tutorials etc. but haven't found anything helpful. I would appreciate it if anyone could help me. Please note that I am not asking for step-by-step code but rather for an idea/blog/paper or some tutorial.
Here's my problem statement:
Just as sentiment analysis is used for identifying the positive or negative tone of a sentence, I want to find out whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use bag of words approach to sum up the number of forward-looking words/phrases such as "going forward", "in near future" or "In 5 years from now" etc. I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
This sounds like a good case for a Bootstrapping approach if you have large amounts of texts.
Both are really semi-supervised, since there is some need for finding the initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.

Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf?

I am working on a text classification problem. The problem is explained below:
I have a dataset of events which contains three columns - name of the event, description of the event, category of the event. There are about 32 categories in the dataset, such as, travel, sport, education, business etc. I have to classify each event to a category depending on its name and description.
What I understood is this particular task of classification is highly dependent on keywords, rather than, semantics. I am giving you two examples:
If the word 'football' is found either in the name or description or in both, it is highly likely that the event is about sport.
If the word 'trekking' is found either in the name or description or in both, it is highly likely that the event is about travel.
We are not considering multiple categories for an event (however, that's a plan for the future!).
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem. My question is:
Should I do stop word removal and stemming before applying tf-idf or should I apply tf-idf just on raw text? Here text means entries in name of event and description columns.
The question is too generic: you are not providing samples of the dataset or code, and you are not even indicating the language you are using. I will therefore presume that you are using English, since the two words that you give as examples are "football" and "trekking". The answer will, however, necessarily be generic.
Should I do stop word removal
Yes. Have a look at this to see the most frequent words in the English language. As you can see, they carry little semantic meaning, and thus would not contribute to solving the classification task that you have proposed. If stopwords is a list containing stop words, passing the parameter stop_words=stopwords to the CountVectorizer or TfidfVectorizer constructor will automatically exclude them when invoking the .fit_transform() method.
Should I do stemming
It depends. Languages other than English whose grammar allows a large number of possible prefixes and suffixes normally require stemming for classification tasks in order to reach any useful result. English, however, has comparatively little inflection, and thus you can often get away without stemming/lemmatization. You should first check the results obtained against the desired accuracy and, if it is insufficient, try adding a stemming/lemmatization step to the preprocessing of your data. Stemming is a computationally expensive process for large corpora, and I personally use it only for languages that require it.
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem
Careful with this. While tf-idf works in practice with Naive Bayes classifiers, this is not the way that specific classifier is meant to be used. From the documentation:
"The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
It is in your best interest to tackle the classification task with CountVectorizer first and score it. Once you have a baseline accuracy, check whether the results of the TfidfVectorizer are better or worse than those of the CountVectorizer.
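A rough sketch of that comparison, assuming scikit-learn and a made-up four-event training set (the real comparison should of course use your dataset and a proper score such as cross-validated accuracy):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: event text and its category
train_texts = [
    "football match tonight",
    "local football league final",
    "trekking in the mountains",
    "weekend trekking trip",
]
train_labels = ["sport", "sport", "travel", "travel"]

# Baseline with raw counts, then the same classifier on tf-idf weights
count_model = make_pipeline(CountVectorizer(), MultinomialNB())
tfidf_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

for name, model in [("counts", count_model), ("tf-idf", tfidf_model)]:
    model.fit(train_texts, train_labels)
    print(name, model.predict(["football tournament"]))
```

Keeping both variants as pipelines makes it easy to swap the vectorizer and score them the same way.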
If you post some code and a sample of your dataset we can help you with that, otherwise this should be enough.

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on what you want to solve with this, that is, on the task.
Essentially, lemmatization shrinks the vocabulary, so each remaining word form is seen more often, which can help if you don't have enough training data.
But since Word2Vec is fairly big, if you have big enough training data, lemmatization shouldn't gain you much.
Something more interesting is how to tokenize with respect to the existing dictionary of word vectors inside W2V (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
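One way to deal with this is to re-join known multi-word phrases after tokenizing. The merge_phrases helper below is a hypothetical pure-Python sketch; the phrase list is an assumption and would in practice come from the multi-word entries in the embedding vocabulary.

```python
def merge_phrases(tokens, phrases):
    """Greedily join known multi-word phrases into single tokens."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["Good", "muffins", "cost", "$", "3.88", "in", "New", "York", "."]
print(merge_phrases(tokens, ["New York"]))
```

Each merged token can then be looked up in the word-vector dictionary as a single entry.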
The current project I am working on involves identifying gene names in biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, two main problems arise:
The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
As noted above, without lemmatization each "meaning" has more representatives in the vocabulary, and some of these meanings might get split among their representatives. Let me clarify with an example.
We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word which has different forms (badly, worst, ...). Since Word2vec builds its vectors based on the probability of the context (the surrounding words), when you don't lemmatize some of these forms, you might end up losing the relationship between some of these words. This way, in the BAD case, you might end up with a word that is closer to gene names than to adjectives in the vector space.
