I have run text through a sentiment analysis model, and it gives me 3 probabilities for if the text is Negative, Neutral or Positive e.g.
[Negative = 0.38]
[Neutral = 0.42]
[Positive = 0.20]
I am wanting to convert this to a score so that each text string has a score with higher scores meaning the text is more positive. I was thinking of just doing [Positive - Negative] and ignoring the neutral case to generate a score, but wondered if there were any better ideas as this seems overly simple.
I want to generate a score so I can run regressions on these scores for predictive purposes.
Doing that sum on my dataset gets me a distribution that looks like this: https://i.imgur.com/DewzBRM.jpeg
(Sorry for not embedding, I don't have the 'reputation' required to do so).
This seems pretty decent, but I was curious if anyone had other alternative ideas.
Thanks in advance
I have a sentiment analysis dataset that is labeled in three categories: positive, negative, and neutral. I also have a list of words (mostly nouns), for which I want to calculate the sentiment value, to understand "how" (positively or negatively) these entities were talked about in the dataset. I have read some online resources like blogs and thought about a couple of approaches for calculating the sentiment score for a particular word X.
Calculate how many data instances (sentences) which have the word X in those, have "positive" labels, have "negative" labels, and "neutral" labels. Then, calculate the weighted average sentiment for that word.
Take a generic untrained BERT architecture, and then train it using the dataset. Then, pass each word from the list to that trained model to get the sentiment scores for the word.
Does any of these approaches make sense? If so, can you suggest some related works that I can look at?
If these approaches don't make sense, could you please advise how I can calculate the sentiment score for a word, in a given dataset?
The first method will suffer from the same drawbacks as other bag-of-words models do. Consider that you have a dataset of movie reviews with their sentiment scores, and you want to find the sentiment for a particular actor called X. A label for a sample like "X's acting was the only good thing in an otherwise bad movie" will be negative, but the sentiment towards X is positive. A simple approach like the first one can't handle such cases.
The second approach also does not make much sense, as the BERT models may not perform well without context. You can try using weakly supervised learning which can help in creating token-level labels. Read section 3.3 for this paper to get an idea about this. Disclaimer: I'm one of the authors of this paper.
So i am doing a project regarding twitter sentiment analysis, wherein i happen to need to use TFIDF on the collected tweets.
So i converted the list of tweets into a single string and fed that to the object, the problem is that i am getting identical values for most of the words with some different values but they are also very frequent. Why is this happening ? Is it cause i am using a single string as input?
here is the code https://trinket.io/python/9c2daed912
Here is the screenshot, as you can see many have same TFIDF values
The frequency can be the same number for different words based on how many times it occurs. My code for project gutenberg results in the following.
from sklearn.feature_extraction.text import TfidfVectorizer
tfvect = TfidfVectorizer(stop_words='english')
### The corpus is from Project Gutenberg after all the text cleanup.
karlmarx_freq = tfvect.fit_transform(gutenberg_KarlMarx_Corpus)
tftermFreq = pd.DataFrame(karlmarx_freq.toarray(),columns=tfvect.get_feature_names())
tfsumdf = tftermFreq.sum(axis=0)
pd.DataFrame({'Vocab': tfsumdf.index, 'Frequency': tfsumdf.values}).sort_values(by='Frequency', ascending=False)
The results are:
1412 production 0.177513
345 conditions 0.174032
1151 modern 0.142706
1704 social 0.128784
1000 labour 0.128784
641 existence 0.111381
1705 socialism 0.104419
1117 means 0.100939
923 industry 0.100939
For details regarding, how this calculation is made, please refer to scikit-learn documentation.
Tf-Idf = Term frequency * Inverse Document frequency
I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times the given word appeared in all documents, and document-frequency is a measure that counts how many documents the given word appeared.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (afters stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?
A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) = P(D|C) * P(D) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times does a term occur in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor P(D), you can estimate it by counting how many labeled documents are given from each class, may be you have more sports articles than finance ones, which makes you believe that there is a higher likelihood of an unseen document to be classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plugin probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of Naive Bayes classifier can be found in the Stanford NLP package, Weka and Scipy among many others.
It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurence of several tokens is taken into account.
Bag of words + grammatic features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use embedding as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have special structure: representations of documents of one class should be close to each other in sense of Euclidean distance metric. This sometimes can be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning of these transformations is something people call metric learning, and in general it is an problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. manhattan distance), or different approaches, not based on vector representations of texts. For example, you can try to calculate Levenstein distance between texts, based on count of number of operations needed to transform one text to another. Or you can calculate "word mover distance" - the sum of distances of word pairs with closest embeddings.
For clustering, basic options are K-means and DBScan. Both these models require your feature space have this Euclidean property.
For anomaly detection you can use density estimations, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. mixture of gaussian models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating probabilitiy of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets) - this is called language models, and it is usually more efficient than bag-of-word assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have resourses or willingness to do multiple experiments, I would recommend to choose one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop out stop-words. It is even probable that whether you include stop-words or not would not make such a difference: the Inverse Document Frequency measure automatically downweighs stop-words since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
When I search on df-idf I find nothing.
tf-idf with cosine similarity is very accepted and common practice
Filtering out stop words does not break it. For common words idf gives them low weight anyway.
tf-idf is used by Lucene.
Don't get why you want to reinvent the wheel here.
Don't get why you think the sum of df idf is a similarity measure.
For classification do you have some predefined classes and sample documents to learn from? If so can use Naive Bayes. With tf-idf.
If you don't have predefined classes you can use k means clustering. With tf-idf.
It depend a lot on your knowledge of the corpus and classification objective. In like litigation support documents produced to you, you have and no knowledge of. In Enron they used names of raptors for a lot of the bad stuff and no way you would know that up front. k means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits it makes that very complex and the stem will not be the length of the word.
Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.
While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings, and many different uses cases. One text can talk about the same things while using fews to none matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following api can help. It is doable to create something similar with some dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there comparing arrays will give much more accuracy, much less false positive.
A sufficient solution, in a possibly similar task:
Use of a binary bag-of-word (BOW) approach for the vector representation (frequent words aren't higher weighted than seldom words), rather than a real TF approach
The embedding "word2vec" approach, is sensitive to sequence and distances effects. It might make - depending on your hyper-parameters - a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter' ... so you have to decide, if this means adding noise to your task - or, alternatively, to use it as an averaged vector only, over all of your text
Extract high within-sentence-correlation words ( e.g., by using variables- mean-normalized- cosine-similaritities )
Second Step: Use this list of high-correlated words, as a positive list, i.e. as new vocab for an new binary vectorizer
This isolated meaningful words for the 2nd step cosine comparisons - in my case, even for rather small amounts of training texts
I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).
import nltk.collocations
import nltk.corpus
import collections
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)
# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
prefix_keys[key[0]].append((key[1], scores))
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
prefix_keys['baseball']
With the following output:
[('game', 32.11075451975229),
('cap', 27.81891372457088),
('park', 23.509042621473505),
('games', 23.10503351305401),
("player's", 16.22787286342467),
('rightfully', 16.22787286342467),
[...]
Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from
"Scores ngrams using likelihood ratios as in Manning and Schutze
5.3.4."
Referring to this article, which states on pg. 22:
One advantage of likelihood ratios is that they have a clear intuitive
interpretation. For example, the bigram powerful computers is
e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that
computers is more likely to follow powerful than its base rate of
occurrence would suggest. This number is easier to interpret than the
scores of the t test or the 2 test which we have to look up in a
table.
What I'm confused about is what would be the "base rate of occurence" in the event that I'm using the nltk code noted above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in the average use of the standard English language? Or is it that "game" is more likely to appear next to "baseball" than other words appearing next to "baseball" within the same set of data?
Any help/guidance towards a clearer interpretation or example is much appreciated!
nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.
using the corpus it does have available, the likelihood is calculated as the posterior probability of ‘baseball’ given the word before being ‘game’.
nltk.corpus.brown
is a built in corpus, or set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
nltk.collocations.BigramAssocMeasures().raw_freq
models raw frequency with t tests, not well suited to sparse data such as bigrams, thus the provision of the likelihood ratio.
The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency.
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Section 5.3.4 describes their calculations in detail on how the calculation is done.
The likelihood can be infinitely large.
This chart may be helpful:
The likelihood is calculated as the leftmost column.