I am currently working on a feature that counts the words for different emotions present in a text, using a lexicon in Python. Is there a possible way to create an emotion word density from the word counts?
It's relatively simple to do standard word counts with Python, or more easily using a package such as spaCy or sklearn.
In standard python, it would look like this:
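import string
from collections import Counter

sentence = 'This is stackoverflow, the word stackoverflow should appear twice.'
# Strip punctuation first so that "stackoverflow," and "stackoverflow" match,
# then split on whitespace.
wordlist = sentence.translate(str.maketrans('', '', string.punctuation)).split()
# Count how often each distinct word occurs.
freq = Counter(wordlist)
print("Pairs\n" + str(list(freq.items())))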
A more advanced approach would be to use TF-IDF vectors, which down-weight words that are frequent but less important. Sklearn has an implementation of this as well as of word-frequency counts.
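To make the TF-IDF idea concrete, here is a minimal hand-rolled sketch in plain Python. The three toy documents are made up for illustration, and this uses the textbook formula (term frequency times log of inverse document frequency). Sklearn's TfidfVectorizer applies a smoothed, normalised variant by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
```

A word such as "the", which occurs in every document, gets a weight of zero, while a word that is rare across documents (here "mat") is weighted higher than one shared by several documents.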
Without knowing more about your dataset and its structure, my sense is that you're really looking at sentiment analysis. In that case, hopefully you have labels for your dataset and can follow steps similar to those described in NLTK's documentation as a start.
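As for the density itself, one simple reading of "emotion word density" is the number of words matched for an emotion divided by the total number of words in the text. The sketch below assumes that definition; the tiny lexicon and the function name are hypothetical stand-ins for whatever lexicon you already use.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to emotions (assumption for illustration).
LEXICON = {"happy": "joy", "glad": "joy", "afraid": "fear", "angry": "anger"}

def emotion_density(text, lexicon=LEXICON):
    """Return {emotion: matched_words / total_words} for the given text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    return {emotion: n / len(tokens) for emotion, n in counts.items()}

density = emotion_density("I am happy and glad, but a little afraid.")
```

Dividing by the text length makes texts of different sizes comparable, which raw counts alone do not.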
I don't have the points to comment, hence the general answer and not a comment sorry!
Related
I am trying to find words that are similar to two different words. I know that I can find the most similar word with FastText, but I was wondering if there is a way to find a keyword that is similar to two keywords. For example, "apple" is similar to "orange" and also similar to "kiwi". So, if I have two words, "orange" and "kiwi", I would like to get a suggestion of the keyword "apple" or any other fruit. Is there a way to do this?
I don't think there is an out-of-the-box function for this.
In any case, you can try this simple approach:
Load a pretrained embedding (available here)
Get a decent number of nearest neighbors for each word of interest
Search for intersections in the nearest neighbors of the two words
A small note: this is a crude approach. If necessary, more sophisticated operations can be performed using the cosine similarity.
Code example:
import fasttext
# load the pretrained model
# (in the example I use the Italian model)
model=fasttext.load_model('./ml_models/cc.it.300.bin')
# get nearest neighbors for the words of interest (100 neighbors)
arancia_nn=model.get_nearest_neighbors('arancia', k=100)
kiwi_nn=model.get_nearest_neighbors('kiwi', k=100)
# keep only the words as sets (discard the cosine similarity)
arancia_nn_words=set([el[1] for el in arancia_nn])
kiwi_nn_words=set([el[1] for el in kiwi_nn])
# compute the intersection
common_similar_words=arancia_nn_words.intersection(kiwi_nn_words)
Example output (in Italian):
{'agrume',
'agrumi',
'ananas',
'arance',
'arancie',
'arancio',
'avocado',
'banana',
'ciliegia',
'fragola',
'frutta',
'lime',
'limone',
'limoni',
'mandarino',
'mela',
'mele',
'melograno',
'melone',
'papaia',
'papaya',
'pera',
'pompelmi',
'pompelmo',
'renetta',
'succo'}
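The "more sophisticated" variant mentioned above could rank candidates directly by cosine similarity instead of intersecting neighbor lists. The following is a minimal sketch under stated assumptions: the 3-dimensional vectors are invented toy values (real FastText vectors have 300 dimensions), and ranking by the lower of the two similarities is just one possible choice.

```python
import math

# Hypothetical toy vectors; in practice they would come from the loaded model.
VECTORS = {
    "orange": [0.9, 0.8, 0.1],
    "kiwi": [0.8, 0.9, 0.2],
    "apple": [0.85, 0.85, 0.15],
    "car": [0.1, 0.2, 0.9],
    "juice": [0.6, 0.3, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_to_both(w1, w2, vectors=VECTORS, top_n=3):
    """Rank other words by their lower similarity to w1 and w2."""
    scores = {
        word: min(cosine(vec, vectors[w1]), cosine(vec, vectors[w2]))
        for word, vec in vectors.items()
        if word not in (w1, w2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ranking = similar_to_both("orange", "kiwi")
```

Taking the minimum means a candidate only scores well if it is close to both query words, which matches the intent of the intersection approach while giving a ranked result.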
I have used the Gensim W2V implementation for such computations for years now, but Gensim also has a FastText implementation: https://radimrehurek.com/gensim/models/fasttext.html
I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).
import nltk.collocations
import nltk.corpus
import collections
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)
# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])
prefix_keys['baseball']
With the following output:
[('game', 32.11075451975229),
('cap', 27.81891372457088),
('park', 23.509042621473505),
('games', 23.10503351305401),
("player's", 16.22787286342467),
('rightfully', 16.22787286342467),
[...]
Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from
"Scores ngrams using likelihood ratios as in Manning and Schutze
5.3.4."
Referring to this article, which states on pg. 22:
One advantage of likelihood ratios is that they have a clear intuitive
interpretation. For example, the bigram powerful computers is
e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that
computers is more likely to follow powerful than its base rate of
occurrence would suggest. This number is easier to interpret than the
scores of the t test or the χ² test which we have to look up in a
table.
What I'm confused about is what the "base rate of occurrence" would be when I'm using the nltk code above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in the average use of the standard English language? Or is it that "game" is more likely to appear next to "baseball" than other words appearing next to "baseball" within the same set of data?
Any help/guidance towards a clearer interpretation or example is much appreciated!
NLTK does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.
Using the corpus it does have available, the score compares how often 'game' follows 'baseball' with how often 'game' follows any other word in that same corpus.
nltk.corpus.brown is a built-in corpus, or set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
nltk.collocations.BigramAssocMeasures().raw_freq scores bigrams by raw frequency, and student_t applies a t test; neither is well suited to sparse data such as bigrams, hence the provision of the likelihood ratio.
The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency.
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Section 5.3.4 describes in detail how the calculation is done.
The likelihood ratio score has no upper bound.
This chart may be helpful:
The likelihood is calculated as the leftmost column.
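For intuition about the "base rate", the Manning and Schutze calculation can be sketched in plain Python from four counts. This is a minimal sketch of the Section 5.3.4 formula, not NLTK's actual code, and NLTK's numerical details may differ; the function names and the counts in the usage lines are made up for illustration. The base rate is p = c2 / n, that is, how often the second word occurs anywhere in the same corpus.

```python
import math

def log_l(k, n, x):
    """Log of x**k * (1 - x)**(n - k), treating 0 * log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(x)
    if n - k > 0:
        result += (n - k) * math.log(1 - x)
    return result

def likelihood_ratio(c1, c2, c12, n):
    """Return -2 * log(lambda) for a bigram (w1, w2).

    c1  = count of w1, c2 = count of w2,
    c12 = count of the bigram "w1 w2", n = total number of tokens.
    """
    p = c2 / n                    # base rate of w2 in the whole corpus
    p1 = c12 / c1                 # rate of w2 directly after w1
    p2 = (c2 - c12) / (n - c1)    # rate of w2 after any other word
    log_lambda = (
        log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
        - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda

# Hypothetical counts: w2 follows w1 exactly as often as its base rate predicts.
independent = likelihood_ratio(100, 100, 1, 10000)
# Hypothetical counts: w2 follows w1 far more often than its base rate predicts.
associated = likelihood_ratio(100, 100, 50, 10000)
```

Under this reading, a score s means the observed counts are e^(0.5 * s) times more likely if the second word depends on the first than if it occurs at its corpus-wide rate. So 32.11 is not a "32 times more likely" figure, and every rate involved comes from the same dataset rather than from English in general.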
I have 50,000 subject lines from emails and I want to classify the words in them based on synonyms, or words that can be used in place of others.
For example:
Top sales!
Best sales
I want them to be in the same group.
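I built the following function with nltk's wordnet but it doesn't work well.
from nltk.corpus import wordnet

def synonyms(w, group, guide):
    try:
        # Compare the first WordNet sense of each word for the given
        # part of speech (guide), e.g. 'top.n.01'.
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        similarity = w1.wup_similarity(w2)
        # wup_similarity can return None when the synsets share no ancestor.
        return similarity is not None and similarity >= 0.7
    except Exception:
        # The lookup fails when a word has no such synset.
        return False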
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is spaCy. It is like NLTK, but much faster as it is written in Cython (20x faster for tokenization and 400x faster for tagging). Its developers also provide Sense2Vec as a companion package (sense2vec), and spaCy's own word vectors are enough for a basic similarity task.
It's as simple as:
import spacy
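# Requires a model that includes word vectors, installed beforehand with:
#   python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
apples, and_, oranges = nlp('apples and oranges')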
apples.similarity(oranges)
It's free and has a liberal license!
One idea is to solve this with embeddings such as word2vec. The outcome is a mapping from words to vectors that are "near" each other when the words have similar meanings: for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are close enough to mean the same thing. As I said, this is just the basic idea behind word2vec.
The computation behind what Nick said is to calculate the distance (cosine distance) between two phrase vectors.
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
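As a rough illustration of the phrase-vector idea, the sketch below averages the word vectors of each subject line and compares the results with cosine similarity against a threshold. The 3-dimensional vectors, the 0.9 threshold, and the function names are all assumptions made up for this example; real vectors would come from a trained Word2Vec model.

```python
import math
import re

# Hypothetical toy vectors standing in for a trained embedding model.
VECTORS = {
    "top": [0.9, 0.1, 0.2],
    "best": [0.8, 0.2, 0.2],
    "sales": [0.1, 0.9, 0.3],
    "meeting": [0.1, 0.2, 0.9],
    "tomorrow": [0.2, 0.1, 0.8],
}

def phrase_vector(phrase):
    """Average the vectors of the known words in a phrase."""
    words = [w for w in re.findall(r"[a-z]+", phrase.lower()) if w in VECTORS]
    if not words:
        raise ValueError("no known words in phrase: %r" % phrase)
    dims = len(VECTORS[words[0]])
    return [sum(VECTORS[w][i] for w in words) / len(words) for i in range(dims)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def same_group(a, b, threshold=0.9):
    """Treat two phrases as one group if their vectors are close enough."""
    return cosine_similarity(phrase_vector(a), phrase_vector(b)) >= threshold

grouped = same_group("Top sales!", "Best sales")
separate = same_group("Top sales!", "Meeting tomorrow")
```

The threshold has to be tuned on real data; too low a value merges unrelated subject lines, too high a value splits near-synonyms.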
For instance, I type a sentence with some semantic meaning as input and, as output, get a list of the closest words (in cosine distance), mostly single words.
But I want to understand which cluster my sentence belongs to, compute how far each word is from it, and eliminate non-meaningful words from the sentence.
For example:
"I want to buy a pizza";
"pizza": 0.99123
"buy": 0.7834
"want": 0.1443
How can this be achieved out of the box, without any C coding?
Maybe I need to compute the cosine distance equation for this?
Thank you!
It seems like you need topic modeling instead of word2vec. Word2vec captures local context information, so it is not a good idea to use it directly to classify or cluster words or sentences.
Another aspect is stop-word removal, since you mention non-meaningful words. By the way, they are not really non-meaningful; they are simply not aligned with any topic, which is why you think of them as non-meaningful.
I believe you should use the LDA topic modeling approach, and you don't need to implement anything yourself since there are many LDA implementations out there.
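The stop-word step on its own needs very little code. The sketch below uses a small hypothetical stop-word list written out by hand for illustration; libraries such as NLTK ship fuller lists, and the function name is an assumption.

```python
import re

# Small hand-written stop-word list (assumption for illustration).
STOP_WORDS = {"i", "to", "a", "an", "the", "and", "of"}

def content_words(sentence):
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

remaining = content_words("I want to buy a pizza")
```

The remaining words are what a topic model such as LDA would then be trained on.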
I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for Python (it seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (let's say several hundred tweets on the subject of the UK election from their web service)
I would like to break this up into sentences (or pieces no longer than 100 or so characters) (I guess I can just do this in Python?)
Then search through all the sentences for specific instances within each sentence, e.g. "David Cameron"
Then check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm.
Here are the troubles I am having:
All the data sets I can find, e.g. the movie review corpus that comes with NLTK, aren't in web service format. It looks like they have had some processing done already. As far as I can see, the processing (by Stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative, e.g. the polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (To organise the sentences by sentiment, is it definitely WEKA, or something else?)
I am not sure I understand why WEKA and NLTK would be used together. They seem to do much the same thing. If I'm processing the data with WEKA first to find sentiment, why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all of them use the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences, rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
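The surrounding steps from the question (split into sentences, keep those mentioning a name, count sentiment) can be sketched in plain Python. In this minimal sketch the two word lists are hypothetical placeholders for a classifier trained on your own labelled tweets, the sentence splitter is deliberately naive, and all function names are assumptions.

```python
import re

# Hypothetical word lists standing in for a trained classifier.
POSITIVE = {"good", "great", "win", "strong"}
NEGATIVE = {"bad", "poor", "lose", "weak"}

def split_sentences(text):
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def classify(sentence):
    """Placeholder classifier: compare positive and negative word counts."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"

def count_sentiment(text, target):
    """Count sentiment labels over the sentences that mention the target."""
    counts = {"pos": 0, "neg": 0, "neutral": 0}
    for sentence in split_sentences(text):
        if target.lower() in sentence.lower():
            counts[classify(sentence)] += 1
    return counts

tweets = (
    "David Cameron gave a great speech. The weather was bad today. "
    "Polls look weak for David Cameron! David Cameron spoke in London."
)
result = count_sentiment(tweets, "David Cameron")
```

Swapping classify for a trained Naive Bayes classifier leaves the rest of the pipeline unchanged.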
Why don't you use word sense disambiguation (WSD)? Use a disambiguation tool to find the senses, then map polarity to the senses instead of to the words. You should get somewhat more accurate results than with word-level polarity.