NLP: Words and Polarity

Is anyone aware of any repository that has words and their polarities as scores?
Example
Word | Polarity
bad | -1
worst | -3
better | 1
best | 3
Thanks

What you are looking for is a sentiment lexicon. A sentiment lexicon is a dictionary of words in which each word has a corresponding sentiment score (ranging from very negative to very positive). There are several sentiment lexicons you could use, such as SentiWordNet, SentiStrength, and AFINN, just to name a few. The easiest to use among these is AFINN, which I recommend you start with; later you can upgrade to a more suitable one based on your application. You can find information about AFINN here and download it from here.
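As a minimal sketch of how such a lexicon can be used in Python, assuming you have downloaded the tab-separated AFINN-111.txt file (the file path and the simple whitespace tokenisation are placeholders; multi-word AFINN entries would need phrase matching):

```python
# Load an AFINN-style lexicon ("word<TAB>score" per line) and score text.
def load_afinn(path="AFINN-111.txt"):
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, score = line.rstrip("\n").rsplit("\t", 1)
            lexicon[word] = int(score)
    return lexicon

def score_text(text, lexicon):
    # Sum the polarity of every token found in the lexicon.
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)

lexicon = load_afinn()
print(lexicon.get("bad"), lexicon.get("best"))  # negative and positive scores
print(score_text("the service was bad but the food was great", lexicon))
```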
While Alex Nevidomsky is generally correct in his comment, in sentiment analysis problems there are many ways to circumvent such limitations, for example by learning the context of a word. Let me know if you have any further questions.

Related

Unsupervised sentiment analysis using doc2vec

Folks,
I have searched Google for different types of papers/blogs/tutorials, etc., but haven't found anything helpful. I would appreciate it if anyone can help me. Please note that I am not asking for step-by-step code but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying the positive or negative tone of a sentence, I want to find whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that just sums up the number of forward-looking words/phrases such as "going forward", "in the near future" or "in 5 years from now". I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following (a rough sketch follows the list):
1. Create a list of temporal terms [days, years, months, now, later]
2. Pick only sentences with key terms
3. Use those sentences in a doc2vec model
4. Infer a vector and use a distance metric for a new sentence
5. GMM cluster + limit
6. Distance from average
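A rough sketch of this first methodology with gensim's Doc2Vec follows. The temporal term list, the tiny corpus and the use of cosine distance against the average vector are illustrative assumptions; the GMM clustering step is omitted:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

temporal_terms = {"days", "years", "months", "now", "later"}
corpus = [
    "we expect revenue to grow in the coming years",
    "the patient was discharged yesterday",
    "five years from now the market will look different",
]

# Steps 1-2: keep only sentences containing a temporal key term.
kept = [s.lower().split() for s in corpus
        if temporal_terms & set(s.lower().split())]

# Step 3: train a doc2vec model on the kept sentences.
docs = [TaggedDocument(words, [i]) for i, words in enumerate(kept)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Steps 4 and 6: infer a vector for a new sentence and compare it to the
# average vector of the temporal sentences.
new_vec = model.infer_vector("in five years we plan to expand".split())
avg_vec = sum(model.dv[i] for i in range(len(kept))) / len(kept)
print("distance from average:", cosine(new_vec, avg_vec))
```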
Another methodology could be (again, a sketch follows the list):
1. Create a list of temporal terms [days, years, months, now, later]
2. Do bigram and trigram collocation extraction
3. Keep relevant collocations containing temporal terms
4. Use relevant collocations in a kind of bag-of-collocations approach
5. Match binary feature vectors for relevant collocations
6. Train a classifier to recognise higher-level text
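A rough sketch of the collocation-extraction step with NLTK's collocation finders (the toy token stream and the PMI ranking are placeholders; the downstream binary features and classifier are not shown):

```python
from nltk.collocations import (BigramCollocationFinder, BigramAssocMeasures,
                               TrigramCollocationFinder, TrigramAssocMeasures)

temporal_terms = {"days", "years", "months", "now", "later"}
tokens = ("in five years from now we expect growth going forward "
          "over the coming months revenue will rise").split()

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(tokens)
trigram_finder = TrigramCollocationFinder.from_words(tokens)

bigrams = bigram_finder.nbest(bigram_measures.pmi, 20)
trigrams = trigram_finder.nbest(trigram_measures.pmi, 20)

# Keep only collocations that contain at least one temporal term; these
# would become binary features in a bag-of-collocations representation.
relevant = [c for c in bigrams + trigrams if temporal_terms & set(c)]
print(relevant)
```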
This sounds like a good case for a bootstrapping approach if you have large amounts of text.
Both methodologies are semi-supervised really, since there is some need for finding initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.

To check if a string of words is a sentence

I have a text file from which I have to eliminate all the statements that do not convey any meaning; in other words, I have to check whether each statement is a proper sentence or not.
For example:
1. John is a heart patient.
2. Dr. Green, Rob is the referring doctor for the patient.
3. Jacob Thomas, M.D. is the ordering provider
4. Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;
Sentences 1, 2, and 3 convey some meaning, but sentence 4 does not, so I want to eliminate it.
May I know how it could be done?
This task seems very difficult; however, assuming you have the training data, you could likely use XGBoost, which uses boosted decision trees (and can also build random forests). You would train it to answer positive or negative (yes, it makes sense, or no).
You would then need to come up with features. You could use features derived from NLTK part-of-speech (POS) tags: the number of occurrences of each type of tag in the sentence would be a good first model. That can set your benchmark for how good an "easy" solution is.
You also may be able to look into the utility of a (word/sentence)-to-vector model such as gensim for creating features for your model.
First I would see what happens with just the number of occurrences of each POS tag and XGBoost. Train and test a model and see how well it does. Then look at adding other features such as position, or using doc2vec vectors as your input to XGBoost.
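A minimal sketch of that POS-tag-count baseline follows. The tiny labelled set is made up for illustration, and the xgboost package plus NLTK's tokenizer and tagger resources (via nltk.download) are assumed to be installed:

```python
from collections import Counter
import numpy as np
import nltk
from xgboost import XGBClassifier

def pos_counts(sentence, tagset):
    # Count how often each POS tag occurs in the sentence.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    counts = Counter(tags)
    return [counts.get(t, 0) for t in tagset]

labelled = [
    ("John is a heart patient.", 1),
    ("Jacob Thomas, M.D. is the ordering provider.", 1),
    ("Xray Shoulder PA, Oblique, TRUE Lateral, FOSSA LAT LT; Status: Complete;", 0),
]

# Fix a tag vocabulary from the training data so feature vectors line up.
tagset = sorted({tag for s, _ in labelled
                 for _, tag in nltk.pos_tag(nltk.word_tokenize(s))})

X = np.array([pos_counts(s, tagset) for s, _ in labelled])
y = np.array([label for _, label in labelled])

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)
print(clf.predict(np.array([pos_counts("Dr. Green is the referring doctor.", tagset)])))
```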
A last resort would be a neural network (which would only be recommended if the prior ideas fail and you have lots and lots of data). If you did use a neural net, I would think an LSTM would likely be useful.
You would have to experiment, and the amount of data matters, but you can start simple and then test and add to your model iteratively.
It's very hard to be 100% confident, but let's try.
You could use Amazon Comprehend (Natural Language Processing and Text Analytics) and create your own metrics over the sentences. For example:
John is a heart patient.
Amazon will give you: "." Punctuation, "a" Determiner, "heart" Noun, "is" Verb, "John" Proper Noun, "patient" Noun.
That is 1 Punctuation, 1 Determiner, 2 Nouns, 1 Verb, 1 Proper Noun. Probably you will need a noun and a verb to have a valid sentence.
In your last sentence we have:
3 Punctuation, 1 Numeral, 11 Proper Nouns. There is no action (verb), so this sentence probably isn't valid.
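A hedged sketch of that idea using boto3 and Comprehend's DetectSyntax API (AWS credentials/region are assumed to be configured, and the exact noun+verb threshold is just the crude heuristic described above):

```python
from collections import Counter
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def looks_like_sentence(text):
    response = comprehend.detect_syntax(Text=text, LanguageCode="en")
    tags = Counter(token["PartOfSpeech"]["Tag"]
                   for token in response["SyntaxTokens"])
    # Crude heuristic: a valid sentence should contain at least one verb
    # (copulas may come back as AUX) and at least one (proper) noun.
    return (tags["VERB"] + tags["AUX"]) > 0 and (tags["NOUN"] + tags["PROPN"]) > 0

print(looks_like_sentence("John is a heart patient."))
print(looks_like_sentence("Xray Shoulder PA, Oblique, TRUE Lateral; Status: Complete;"))
```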

The use of N-Gram analysis in Sentiment Analysis

How do I use N-Gram analysis for sentiment analysis?
Once I split a sentence into uni-grams, bi-grams, tri-grams, etc., how do I go forward from there?
Sentiment analysis is often approached with machine learning, so one possible way is to train a machine learning algorithm whose attributes are the grams.
Still, you can definitely collect some sentiment-bearing phrases/words as happy/sad tokens (depending on whether you are using uni-grams or bi-grams...) and simply count each sentence's number of occurrences of those tokens.
Vectorize the n-grams using bag of words or any other technique and then apply a classification algorithm: MaxEnt/SVM/Random Forest. Higher-order n-grams don't usually improve the results; in fact, using more than 2-grams may even decrease your precision/recall.
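For example, a sketch of that pipeline with scikit-learn, using bag-of-words counts over uni-grams and bi-grams and logistic regression as a stand-in for MaxEnt (the tiny labelled corpus is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the movie was great", "what a waste of time",
         "not bad at all", "absolutely terrible acting"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # uni-grams and bi-grams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["the acting was great", "what a terrible movie"]))
```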

Finding the polarity of a particular word using SentiWordNet

I am working on an opinion mining algorithm in which I am trying to find the polarity of a particular word.
The algorithm states: search for any other POS categories like Noun, Adjective, Adverb and accumulate their polarity values using SentiWordNet.
I integrated SentiWordNet into my current system and it's working perfectly for determining the polarity of a sentence, but I want the polarity of a particular word.
I found one method, senti_classifier.synsets_score(), which seems to be useful, but I am unable to find any documentation related to it.
Can anyone describe the usage of the above method or point me to its documentation?
Is there any other way by which I can find the polarity of a particular word?
Thanks in advance
You can use the example code by Petter Törnberg provided on the SentiWordNet site. It calculates the sentiment score of each word in the thesaurus as a weighted average of the scores of its synsets.
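If you are working in Python, a rough alternative sketch uses NLTK's SentiWordNet corpus reader rather than the Java demo code; the 1/rank sense weighting below stands in for the weighted-average idea and is an assumption, not the Törnberg implementation itself:

```python
from nltk.corpus import sentiwordnet as swn  # needs nltk.download('sentiwordnet') and 'wordnet'

def word_polarity(word, pos=None):
    # Average (positive - negative) score over the word's synsets,
    # weighting earlier (more frequent) senses more heavily.
    synsets = list(swn.senti_synsets(word, pos))
    if not synsets:
        return 0.0
    weights = [1.0 / (rank + 1) for rank in range(len(synsets))]
    scores = [s.pos_score() - s.neg_score() for s in synsets]
    return sum(w * sc for w, sc in zip(weights, scores)) / sum(weights)

print(word_polarity("good", "a"))   # positive
print(word_polarity("bad", "a"))    # negative
```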

Paraphrase recognition using sentence level similarity

I'm a new entrant to NLP (Natural Language Processing). As a start-up project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the SimMetrics package developed by the University of Sheffield, which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code is only at character level, whereas I require code at the sentence level (i.e. considering a single word as a unit instead of character-wise). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or could someone provide me with the code) at the sentence level for the above-mentioned measures?
Thanks a lot in advance for your time and effort helping me.
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as WordNet and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
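As a small illustration of point (3), synset-level similarity between two candidate words can be computed with NLTK's WordNet interface; the brute-force max over synset pairs below sidesteps the word-alignment problem rather than solving it:

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

def max_synset_similarity(word1, word2):
    # Take the best Wu-Palmer similarity over all synset pairs of the two words.
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.wup_similarity(s2)   # in [0, 1], or None if incomparable
            if sim is not None and sim > best:
                best = sim
    return best

print(max_synset_similarity("car", "automobile"))  # close to 1.0
print(max_synset_similarity("car", "banana"))      # much lower
```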
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start off with something simpler (if relatively boring), such as chunking.
Have a look at the docs and books for the Python NLTK library - there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there; the state of the art isn't good enough for a simple yes/no or even a probability.
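For the word-level Levenshtein distance the question asks about, a minimal dynamic-programming sketch that treats whitespace-separated tokens as the edit units:

```python
def word_levenshtein(sentence1, sentence2):
    a, b = sentence1.split(), sentence2.split()
    # dp[i][j] = edit distance between the first i tokens of a and first j of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a token
                           dp[i][j - 1] + 1,         # insert a token
                           dp[i - 1][j - 1] + cost)  # substitute a token
    return dp[len(a)][len(b)]

print(word_levenshtein("the cat sat on the mat", "the cat is on the mat"))  # 1
```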

Resources