Financial news headers classification to positive/negative classes - nlp

I'm doing a small research project in which I need to split financial news article headers into positive and negative classes. For classification I'm using an SVM. The main problem I see now is that not many features can be produced for ML. News articles contain a lot of named entities and other "garbage" elements (from my point of view, of course).
Could you please suggest features that can be used for ML training? Current results are: precision = 0.6, recall = 0.8.
Thanks

The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF-IDF representation and then train a linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent, though I'm not sure you would reach a 0.7 break-even point.
Then, to get better results, you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using a sentiment database such as SentiWordNet.
There is an excellent overview of sentiment analysis by Bo Pang and Lillian Lee that you should read.

How about these features?
1. Length of article header in words
2. Average word length
3. Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4. Ratio of words in that dictionary to total words in sentence
5. Similar to 3, but number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
6. Similar to 4, but use the "good"-word dictionary
7. Time of the article's publication
8. Date of the article's publication
9. The medium through which it was published (you'll have to do some subjective classification)
10. A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from the article itself, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at other word lists available online, such as Ogden's Basic English list of 850 words, and see whether bad/good articles are likely to draw many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
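A few of the surface features listed above can be computed with nothing but the standard library. The following is only a rough sketch: the two word sets are tiny placeholders built from the examples in this answer, and the function name and feature names are made up for illustration.

```python
import re

# Placeholder dictionaries for illustration only; a real project would need
# much larger, domain-specific word lists.
BAD_WORDS = {"terrible", "horrible", "downturn", "bankruptcy"}
GOOD_WORDS = {"boon", "booming", "employment"}

def header_features(header):
    """Return a dict of simple surface features for one news header."""
    words = re.findall(r"[a-z']+", header.lower())
    n_words = len(words)
    n_bad = sum(w in BAD_WORDS for w in words)
    n_good = sum(w in GOOD_WORDS for w in words)
    return {
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_bad": n_bad,
        "bad_ratio": n_bad / n_words if n_words else 0.0,
        "n_good": n_good,
        "good_ratio": n_good / n_words if n_words else 0.0,
        "n_exclaim": header.count("!"),
    }

features = header_features("Retailer files for bankruptcy amid downturn!")
print(features)
```

The resulting dict can be turned into a numeric vector (in a fixed key order) and fed to whatever classifier is being trained.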

iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further, if you still aren't close, would be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey emotion or mood.
I wouldn't be too disheartened by your precision and recall figures, though. An F-score of 0.8 and above is actually quite good.
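For reference, the F-score is usually taken to be the harmonic mean of precision and recall (F1). Assuming that definition, a small helper shows where the figures from the question land; the function name is just illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Figures reported in the question: precision = 0.6, recall = 0.8
print(round(f_score(0.6, 0.8), 3))
```

With precision 0.6 and recall 0.8 this comes out at roughly 0.69, which gives a concrete number to compare against the 0.8 mentioned above.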

Related

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have, for each word, a graph that shows the polarity distribution of sentiments for that word across different sentences. What can I use to infer the sentiment of new words? I would also be happy to hear about any other uses you have in mind.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of any context, there's not much you can do. (Maybe you could try to find extra texts containing those new words, such as via dictionaries or the web, and then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels, and average them together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
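The neighbor-counting idea can be sketched in a few lines. This is only an illustration under some assumptions: whitespace tokenization, a hypothetical `known_scores` dict of hand-labeled words, and a small variation on the description above in that unlabeled neighbors are dropped before the top few are picked.

```python
from collections import Counter

def neighbor_score(texts, unknown, known_scores, window=1, top_k=5):
    """Estimate a sentiment score for `unknown` from labeled words near it."""
    neighbors = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok != unknown:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbors[tokens[j]] += 1
    # Keep only neighbors that already have a label, most frequent first.
    labeled = [(w, c) for w, c in neighbors.most_common() if w in known_scores]
    labeled = labeled[:top_k]
    total = sum(c for _, c in labeled)
    if total == 0:
        return None  # no labeled neighbors: nothing to average
    # Average of the known scores, weighted by co-occurrence count.
    return sum(known_scores[w] * c for w, c in labeled) / total

# Toy data, invented for this example.
texts = ["a truly dreadful slump", "dreadful slump again", "great rally today"]
known = {"dreadful": -1.0, "great": 1.0}
print(neighbor_score(texts, "slump", known))
```

Returning `None` when no labeled neighbor is found keeps "no evidence" distinct from a genuinely neutral score of 0.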
Alternatively, you could train a word2vec set of word-vectors on all of your texts, including the unknown and known words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity) to get a number for the previously unknown word.
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have a specific sentiment is somewhat weak, given that in actual language their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc., then you should discard your labels of individual words and acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques, as long as those techniques have plenty of labeled training data.

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated by random users. I am trying to detect products mentioned in the text while taking spelling variation into account. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words on punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token in the user's text to see whether it is in my vocabulary with a Damerau-Levenshtein distance <= 1, to allow for variation in spelling. Once I had detected a sequence of tokens that exist in the vocabulary, I searched for the product that matches the query, again checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exist in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. The results were okay, but not great.
None of the above approaches achieved reliable performance. I tried regular expressions, but since there are so many different combinations to consider, it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I trained on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not an NLP task, and there should be guides on how to implement a spell check).
2) Once you have done pre-processing (spell checking), use your NER algorithm
It should improve your accuracy.
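As a rough sketch of the spell-check step, tokens can be snapped to the nearest product-vocabulary word before any NER is run. The code below uses the optimal string alignment variant of Damerau-Levenshtein distance (the distance mentioned in the question); the vocabulary, the function names, and the `min_len` guard are assumptions made for this example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def correct_text(text, vocabulary, max_dist=1, min_len=4):
    """Replace near-miss tokens with the first close vocabulary word."""
    corrected = []
    for tok in text.lower().split():
        best = tok
        # Very short tokens are left alone: at distance 1 almost anything
        # matches them (e.g. "s7" would be "corrected" to "s8").
        if tok not in vocabulary and len(tok) >= min_len:
            for cand in vocabulary:
                if osa_distance(tok, cand) <= max_dist:
                    best = cand
                    break
        corrected.append(best)
    return " ".join(corrected)

VOCAB = ["samsung", "galaxy", "s8"]  # hypothetical product vocabulary
print(correct_text("i am interested in galxy s8", VOCAB))
```

The corrected text can then be passed to the NER model or matched against the product list directly. Comparing every token against every vocabulary word is fine for small lists, but a large catalogue would probably need an index.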

How is polarity calculated for a sentence? (in sentiment analysis)

How is the polarity of words in a statement calculated? For example:
"i am successful in accomplishing the task,but in vain"
How is each word scored? (e.g. successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)
How is this calculated? How is each word given a value or score? What is going on behind the scenes? I am doing sentiment analysis and have a few things I need to clear up, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT or my AFINN. They have been scored by individual experts, by students or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning with annotated texts, or they can be estimated from word ontologies or co-occurrence patterns.
As for aggregating the scores of individual words, there are various ways. One way is to sum all the individual scores (valences), another is to take the max valence among the words, and a third is to normalize (divide) by the number of words or by the number of scored words (i.e., getting a mean score), or to divide by the square root of that number. The results may differ a bit.
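The aggregation options just listed can be compared side by side. This is only a sketch: the lexicon reuses the made-up example scores from the question rather than values from any real word list, and the function and key names are illustrative.

```python
import math
import re

# Example scores taken from the question; not from a real lexicon.
LEXICON = {"successful": 0.7, "accomplishing": 0.8, "but": -0.5, "vain": -0.8}

def aggregate(sentence, lexicon):
    """Combine per-word valences into sentence-level scores in several ways."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return None  # no scored words at all
    total = sum(scores)
    return {
        "sum": total,
        "max": max(scores),
        "mean_all_words": total / len(words),
        "mean_scored_words": total / len(scores),
        "sqrt_normalized": total / math.sqrt(len(scores)),
    }

result = aggregate("i am successful in accomplishing the task,but in vain", LEXICON)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

On this sentence every sum-based variant comes out slightly positive even though the sentence ends on "but in vain", which illustrates why plain word-by-word aggregation is only a rough approximation.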
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach is to use recursive models like Richard Socher's. There, the sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.

Applied NLP: how to score a document against a lexicon of multi-word terms?

This is probably a fairly basic NLP question but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that could be 1-, 2-, 3- etc N-word long. N is bounded by some "reasonable" number but the distribution of various terms in the dictionary for various values of n = 1, ..., N might be fairly uniform. This lexicon can, for example, contain a list of devices of certain type and I want to see if a given document is likely about any of these devices. So I would want to score a document high(er) if it has one or more occurrences of any of the lexicon entries.
What is a standard NLP technique to do the scoring while accounting for various forms of the words that may appear in the lexicon? What sort of preprocessing would be required for both the input documents and the lexicon to be able to perform the scoring? What sort of open-source tools exist for both the preprocessing and the scoring?
I studied LSI and topic modeling almost a year ago, so what I say should be taken as merely a pointer to give you a general idea of where to look.
There are many different ways to do this with varying degrees of success. This is a hard problem in the realm of information retrieval. You can search for topic modeling to learn about different options and state of the art.
You definitely need some preprocessing and normalization if the words could appear in different forms. How about NLTK and one of its stemmers:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('applied')
'apply'
>>> st.stem('applies')
'apply'
You have a lexicon of terms that I am going to call terms and also a bunch of documents. I am going to explore a very basic technique to rank documents with regards to the terms. There are a gazillion more sophisticated ways you can read about, but I think this might be enough if you are not looking for something too sophisticated and rigorous.
This is called a vector space IR model. Terms and documents are both converted to vectors in a k-dimensional space. For that we have to construct a term-by-document matrix. This is a sample matrix in which the numbers represent frequencies of the terms in documents:
So far we have a 3x4 matrix in which each document can be expressed as a 3-dimensional array (one column per document). But as the number of terms increases, these arrays become too large and increasingly sparse. Also, there are many words, such as I or and, that occur in most of the documents without adding much semantic content, so you might want to disregard these types of words. For the problem of largeness and sparseness, you can use a mathematical technique called SVD that scales down the matrix while preserving most of the information it contains.
Also, the numbers we used in the above chart were raw counts. Another technique would be to use Boolean values: 1 for the presence and 0 for the absence of a term in a document. But both schemes assume that words have equal semantic weights. In reality, rarer words carry more weight than common ones. So, a good way to edit the initial matrix would be to use a weighting function like tf-idf to assign relative weights to each term. If by now we have applied SVD to our weighted term-by-document matrix, we can construct the k-dimensional query vectors, which are simply arrays of the term weights. If our query contained multiple instances of the same term, the product of the frequency and the term weight would be used.
What we need to do from there is somewhat straightforward. We compare the query vectors with document vectors by analyzing their cosine similarities and that would be the basis for the ranking of the documents relative to the queries.
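The ranking step can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions: the documents and query terms are invented, tokens are split on whitespace (so multi-word lexicon entries and stemming are not handled), one common tf-idf variant is used, and the SVD reduction is skipped entirely.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per document, plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # log(N / df) + 1: rarer terms get larger weights.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(docs, query_terms):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query weight = term frequency in the query times the term's idf.
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(query_terms).items()}
    return sorted(((cosine(query, v), i) for i, v in enumerate(vectors)),
                  reverse=True)

DOCS = [
    "the router drops the wifi signal",
    "the printer is out of ink",
    "new router and modem installed",
]
for score, idx in rank_documents(DOCS, ["router", "modem"]):
    print(f"{score:.3f}  {DOCS[idx]}")
```

Documents sharing no terms with the query score exactly zero here, which is one of the limitations that the SVD step described above is meant to soften.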

How to represent text documents as feature vectors for text classification?

I have around 10,000 text documents.
How to represent them as feature vectors, so that I can use them for text classification?
Is there any tool which does the feature vector representation automatically?
The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.
You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.
To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j and v[i,j] = 0 otherwise.
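A minimal sketch of this binary bag-of-words representation follows, using only the standard library. The stop-word set, the sample documents and the function names are placeholders, and the vocabulary here is simply every remaining word rather than a hand-picked set of 10,000.

```python
import re

STOP_WORDS = {"and", "or", "the"}  # tiny illustrative stop-word list

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

def binary_vectors(docs):
    """Return (vocabulary, vectors) with v[i][j] = 1 if doc i contains word j."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*token_sets)) if token_sets else []
    vectors = [[1 if word in toks else 0 for word in vocabulary]
               for toks in token_sets]
    return vocabulary, vectors

vocab, vectors = binary_vectors(["Stocks rally, and markets cheer!",
                                 "Markets fall."])
print(vocab)
for row in vectors:
    print(row)
```

Each row can be passed directly to a classifier as the feature vector for the corresponding document.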
To give a really good answer to the question, it would be helpful to know what kind of classification you are interested in: based on genre, author, sentiment, etc. For stylistic classification, for example, the function words are important; for a classification based on content they are just noise and are usually filtered out using a stop word list.
If you are interested in a classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency, in order to give more weight to words which are typical for a document and comparatively rare in the whole text collection. This assumes a vector space model of your texts, which is a bag-of-words representation of the text. (See Wikipedia on Vector Space Model and tf/idf.) Usually tf/idf will yield better results than a binary classification schema, which only contains the information whether a term exists in a document.
This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert the text collection into a matrix using tf/idf as a weighting scheme.

Resources