How to label the tokens as positive or negative - nlp

I am trying to analyze customer reviews. I have split each review into a list of tokens. How can I label which tokens are positive and which are negative? Is there a library for this?
I want to build a word cloud of the positive words and negative words.

I think you can try several things here (though people usually classify the review as a whole, not the individual words):
Try Brown clustering to group your words; if you have review labels, you can then better assess the quality of the word clusters.
Label each word with the label of the review it appears in (positive or negative). This may not be accurate, though, because negativity is sometimes expressed by a combination of words (e.g. "not like").
You can also use your review labels to derive negative and positive words from their frequencies in negative and positive documents.
There are plenty of libraries for sentiment classification: scikit-learn, TensorFlow, etc.
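As a minimal sketch of the frequency-based suggestion above, assuming each review is already tokenized and carries a positive/negative label; the example reviews, the smoothing, and the log-ratio scoring are only illustrative.

    # Sketch: derive word polarity from review labels via smoothed frequency log-ratios.
    # The `reviews` data below is made up for illustration.
    import math
    from collections import Counter

    reviews = [
        (["great", "battery", "life"], "positive"),
        (["screen", "broke", "terrible"], "negative"),
    ]

    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in reviews:
        (pos_counts if label == "positive" else neg_counts).update(tokens)

    def polarity(word):
        # Smoothed log-ratio of the word's frequency in positive vs negative reviews.
        p = pos_counts[word] + 1
        n = neg_counts[word] + 1
        return math.log(p / n)

    vocab = set(pos_counts) | set(neg_counts)
    positive_words = {w: polarity(w) for w in vocab if polarity(w) > 0}
    negative_words = {w: -polarity(w) for w in vocab if polarity(w) < 0}

    # These dictionaries can then be fed to a word-cloud library, e.g.
    # wordcloud.WordCloud().generate_from_frequencies(positive_words)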

Related

What do negative vectors mean on word2vec?

I am doing research on travel reviews and used word2vec to analyze the reviews. However, when I showed my output to my adviser, he said that I have a lot of words with negative vector values and that only words with positive values are considered logical.
What could these negative values mean? Is there a way to ensure that all the vector values in my analysis will be positive?
While some other word-modeling algorithms do in fact model words into spaces where dimensions are 0 or positive, and the individual positive dimensions might be clearly meaningful to humans, that is not the case with the original, canonical 'word2vec' algorithm.
The positive/negativeness of any word2vec word-vector – in a particular dimension, or in net magnitude – has no strong meaning. Meaningful words will be spread out in every direction from the origin point. Directions or neighborhoods in this space that loosely correlate to recognizable categories may appear anywhere, and skew with respect to any of the dimensional axes.
(Here's a related algorithm that does use non-negative constraints – https://www.cs.cmu.edu/~bmurphy/NNSE/. But most references to 'word2vec' mean the classic approach where dimensions usefully range over all reals.)
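A quick way to see this for yourself, assuming the gensim library (v4 API) is available; the toy corpus below is made up, and a real corpus will behave the same way.

    # Train a tiny word2vec model and inspect the signs of the components.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "hotel", "was", "clean", "and", "quiet"],
        ["the", "beach", "was", "crowded", "but", "beautiful"],
    ]
    model = Word2Vec(sentences, vector_size=20, min_count=1, epochs=50, seed=1)

    vec = model.wv["hotel"]
    print(vec)  # components are both positive and negative
    print((vec < 0).sum(), "of", len(vec), "dimensions are negative")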

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated by random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words on punctuation and digits, so "s8" is tokenized into "s" and "8"). Then I checked each token in the user's text against my vocabulary, allowing a Damerau-Levenshtein distance <= 1 to cover spelling variation. Once I detected a sequence of tokens that exist in the vocabulary, I searched for the product matching that query, again checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of in-vocabulary tokens does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates get detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. The results were okay, but not great.
None of the above achieved reliable performance. I tried regular expressions, but with so many different combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results with more training data, but suppose there isn't enough training data: what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not strictly an NLP task, and there are guides on how to implement spell checking).
2) Once you have done this pre-processing (spell checking), apply your NER algorithm.
This should improve your accuracy.
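A rough sketch of step 1, using the standard library's difflib for the fuzzy lookup; the product vocabulary and cutoff are illustrative, and a Damerau-Levenshtein library could be dropped in instead.

    # Spell-correct each token against a small product vocabulary.
    import difflib

    product_vocab = {"samsung", "galaxy", "s8", "iphone", "pixel"}

    def correct_token(token, cutoff=0.8):
        # Return the closest vocabulary token, or the original if nothing is close enough.
        matches = difflib.get_close_matches(token.lower(), product_vocab, n=1, cutoff=cutoff)
        return matches[0] if matches else token

    text = "i am interested in galxy s8"
    corrected = [correct_token(t) for t in text.split()]
    print(corrected)  # ['i', 'am', 'interested', 'in', 'galaxy', 's8']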

How is polarity calculated for a sentence? (in sentiment analysis)

How is the polarity of the words in a statement calculated? For example, in
"i am successful in accomplishing the task, but in vain"
how is each word scored (e.g. successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)?
How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis, I have a few things I'd like to clear up, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
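A minimal VADER call through NLTK looks roughly like this (the vader_lexicon data needs to be downloaded once):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("I am successful in accomplishing the task, but in vain"))
    # -> a dict with 'neg', 'neu', 'pos' and an overall 'compound' score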
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT or my AFINN. They have been scored by individual experts, by students, or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning on annotated texts, or estimated from word ontologies or co-occurrence patterns.
As for aggregating the individual words, there are various ways. One way is to sum all the individual scores (valences); another is to take the maximum valence among the words; a third is to normalize (divide) by the number of words or by the number of scored words (i.e., take a mean score), or to divide by the square root of that number. The results may differ a bit.
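As a toy illustration of these aggregation choices; the valences below are invented in the AFINN spirit, not the actual AFINN scores.

    # Compare sum, strongest-valence, and mean aggregation over a scored sentence.
    valences = {"successful": 2, "accomplishing": 2, "but": 0, "vain": -3}

    tokens = "i am successful in accomplishing the task but in vain".split()
    scores = [valences[t] for t in tokens if t in valences]

    total = sum(scores)              # sum of valences: 1
    strongest = max(scores, key=abs) # valence with the largest magnitude: -3
    mean = total / len(scores)       # mean over scored words: 0.25
    print(total, strongest, mean)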
I did some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach uses recursive models such as Richard Socher's. There, the sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.

String-matching algorithm for noisy text

I have used OCR (optical character recognition) to extract text from images. The images contain book covers. Because the images are so noisy, some characters are misrecognised, and some noise is recognised as characters.
Examples:
"w COMPUTER Nnwonxs i I "(Compuer Networks)
"s.ll NEURAL NETWORKS C "(Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My "(Introduction to Probability)
I built a dictionary of words, but I want to somehow match the recognised text against the dictionary. I tried LCS (longest common subsequence), but it's not very effective.
What is the best string-matching algorithm for this kind of problem? (Part of the string is just noise, but the important part of the string can also have some misrecognised characters.)
That's really a big question. The following is what I know about it; for more details, you can read some related papers.
For a single word, use the Hamming distance to calculate the similarity between the word recognised by OCR and the words in your dictionary;
this step corrects words that OCR has produced but that do not exist in the dictionary.
E.g.:
If the OCR result is INFRODUCIION, which doesn't exist in your dictionary, you can find that the Hamming distance to the word 'INTRODUCTION' is 2, so 'INTRODUCTION' was probably mis-recognised as 'INFRODUCIION'.
However, an OCR output may match several dictionary words at the same Hamming distance.
E.g.: if the OCR result is CAY, you may find that CAR and CAT both have a Hamming distance of 1, which is ambiguous.
In this case, several things can be used for disambiguation:
Still at the single-word level, the visual difference between CAT and CAY is smaller than between CAR and CAY, so CAT is the right word with greater probability.
Then use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then, since people usually drive a CAR and not a CAT, we have a better chance of reading CAY as CAR rather than CAT.
For the frequency of words used in similar articles, use TF-IDF.
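A small sketch of the Hamming-distance lookup described above; the dictionary here is illustrative.

    def hamming(a, b):
        # Hamming distance is only defined for strings of equal length.
        return sum(x != y for x, y in zip(a, b))

    dictionary = ["INTRODUCTION", "PROBABILITY", "NETWORKS", "NEURAL"]

    def closest_word(token, max_distance=2):
        same_length = [w for w in dictionary if len(w) == len(token)]
        scored = [(w, hamming(token, w)) for w in same_length]
        scored = [(w, d) for w, d in scored if d <= max_distance]
        return min(scored, key=lambda wd: wd[1], default=(None, None))

    print(closest_word("INFRODUCIION"))  # ('INTRODUCTION', 2)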
Are you saying you have a dictionary that defines all words that are acceptable?
If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.
I would experiment with the Soundex and Metaphone algorithms or the Levenshtein Distance algorithm.
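As a sketch of the closest-match-with-threshold idea using a plain Levenshtein distance; the dictionary and the threshold below are illustrative.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    dictionary = ["COMPUTER", "NETWORKS", "NEURAL", "INTRODUCTION", "PROBABILITY"]

    def match(token, max_dist=2):
        word, dist = min(((w, levenshtein(token, w)) for w in dictionary),
                         key=lambda wd: wd[1])
        return word if dist <= max_dist else None  # discard matches over the threshold

    print(match("NEVRAL"))  # 'NEURAL'
    print(match("XXXX"))    # None: no dictionary word within the threshold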

Financial news headers classification to positive/negative classes

I'm doing a small research project in which I try to classify financial news article headers into positive and negative classes. For classification I'm using an SVM approach. The main problem I see right now is that not many features can be produced for ML: news headers contain a lot of named entities and other "garbage" elements (from my point of view, of course).
Could you please suggest features that could be used for training? Current results are precision = 0.6, recall = 0.8.
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set, that is, a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF-IDF representation and then train a linear SVM to separate the two classes. Depending on the quality and size of your training set, you can achieve something decent, though I'm not sure about a 0.7 break-even point.
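A bare-bones version of that pipeline in scikit-learn might look like the following sketch; the two headlines and their labels are made up, and a real training set is obviously needed for meaningful results.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    headlines = ["Shares surge after record quarterly profits",
                 "Company files for bankruptcy amid falling sales"]
    labels = ["positive", "negative"]

    # TF-IDF over unigrams and bigrams, followed by a linear SVM.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(headlines, labels)
    print(model.predict(["Profits fall sharply"]))  # toy prediction only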
Then, to get better results, you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using a sentiment DB such as SentiWordNet.
There is an excellent overview of sentiment analysis by Bo Pang and Lillian Lee that you should read.
How about these features?
1) Length of the article header in words
2) Average word length
3) Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4) Ratio of words in that dictionary to the total words in the header
5) Similar to 3, but counting words in a "good" dictionary, e.g. dictionary = {boon, booming, employment, ...}
6) Similar to 4, but using the "good"-word dictionary
7) Time of the article's publication
8) Date of the article's publication
9) The medium through which it was published (you'll have to do some subjective classification)
10) A count of certain punctuation marks, such as the exclamation mark
If you're allowed access to the actual article, you could use surface features from it, such as its total length and perhaps even the number of responses or the level of opposition to the article. You could also look at other dictionaries online, such as Ogden's Basic English list of 850 words, and see whether bad/good articles are likely to draw many words from them. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
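A few of the dictionary and surface features above could be computed with a sketch like the following; the "good"/"bad" word sets here are only illustrative, not exhaustive.

    import string

    bad_words = {"terrible", "horrible", "downturn", "bankruptcy"}
    good_words = {"boon", "booming", "employment"}

    def headline_features(headline):
        # Simple whitespace tokenization with punctuation stripped.
        tokens = [t.strip(string.punctuation).lower() for t in headline.split()]
        tokens = [t for t in tokens if t]
        n = len(tokens)
        n_bad = sum(t in bad_words for t in tokens)
        n_good = sum(t in good_words for t in tokens)
        return {
            "n_words": n,
            "avg_word_len": sum(map(len, tokens)) / n if n else 0.0,
            "n_bad": n_bad,
            "bad_ratio": n_bad / n if n else 0.0,
            "n_good": n_good,
            "good_ratio": n_good / n if n else 0.0,
            "n_exclamations": headline.count("!"),
        }

    print(headline_features("Bankruptcy fears trigger market downturn!"))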
iliasfl is right, this is not a straightforward task.
I would use a bag-of-words approach, but run a POS tagger first to tag each word in the headline. Then you could remove all of the named entities, which, as you rightly point out, don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further, if you still aren't close, would be to select only the adjectives and verbs from the tagged data, as these are the words that tend to convey the emotion or mood.
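A rough sketch of that POS-filtering step with NLTK; the data-package names may differ slightly between NLTK versions, and the headline is made up.

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    headline = "Struggling retailer posts disappointing quarterly results"
    tagged = nltk.pos_tag(nltk.word_tokenize(headline))

    # Keep only adjectives (JJ*) and verbs (VB*), which tend to carry the sentiment.
    kept = [word for word, tag in tagged if tag.startswith(("JJ", "VB"))]
    print(kept)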
I wouldn't be too disheartened by your precision and recall figures, though; an F-score of 0.8 or above is actually quite good.
