Variable importance in classification - statistics

For example: I have 100 books with 1000 words each. They belong to different classes (comedy,drama,...). Each class consist of 15 different books.
When I do tfidf on my data, I get the importance for every word in a book in context of all books.
I see that the books belonging to the same class have similar tfidf values for each variable.
Let's say drama and comedy are pretty similar.
How can I tell what words make a difference in between those two classes?
What words do I have to change in book that belongs to comedy so the book now belongs to drama now?
I can check one by one; but I have 2000 books, 17500 words each; 950 classes. It would take a decade :)

As a first draft, compute the average vector for each classes, normalize them to unit length, and compute the absolute differences.
These should give you a rough indication of which words distinguish the two classes.

I would definitely run pairwise tests, i.e. one for each of the 475*949 pairs of classes you have as "important variables" can differ very much from case to case. Then run some standard feature selection algorithm, such as chi-square or information gain. See http://www.jmlr.org/papers/volume3/forman03a/forman03a.pdf for an extensive study.

Related

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.

Statistical method to compare urban vs rural matched siblings

I am writing a study protocol for my masters thesis. The study seeks to compare the rates of Non Communicable Diseases and risk factors and determine the effects of rural to urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently on going in the area. The other sibling should have left the area and reported moving to a city.Data will collected by completing a questionnaire on demographics, family history,medical history, diet,alcohol consumption, smoking ,physical activity.This will be done for both the rural and urban sibling, with data on the amount of time spent in urban areas fur
The outcomes which are binary (whether one has a condition or not) are : 1.diabetic, 2.hypertensive, 3.obese
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between amount spent in urban residence and the outcomes?
Given that your main aim is to compare quantities of two nominal distributions, a chi-square test seems to be the method of choice with regard to your first question. However, it should be mentioned that a chi-square test is somehow "the smallest" test for answering differences in samples. If you are studying medicine (or related) a chi-square test is fine because it is also frequently applied by researchers of this field. If you are studying psychology or sociology (or related) I'd advise to highlight limitations of the test in the discussions section since it mostly tests your distributions against randomly expected distributions.
Regarding your second question, a logistic regression would be applicable since it allows binomial distributed variables both for independent variables (predictors) and dependent variables. However, if you have other interval scaled variables (e.g. age, weight etc.) you could also use t-tests or ANOVAs to investigate differences between these variables with respect to the existence of specific diseases (i.e. is diabetic or not).
Overall, this matter strongly depends on what you mean by "association". Classically, "association" refers to correlations or linear regression (for which you need interval scaled variables on "both sides") but given your data structure, the aforementioned methods possess a better fit.
How you actually calculate these tests depends on the statistics software used.

How is polarity calculated for a sentence ??? (in sentiment analysis)

How is polarity of words in a statement are calculated....like
"i am successful in accomplishing the task,but in vain"
how each word is scored? (like - successful- 0.7 accomplishing- 0.8 but - -0.5
vain - - 0.8)
how is it calculated ? how is each word given a value or score?? what is the thing that's going behind ? As i am doing sentiment analysis I have few thing to be clear so .that would be great if someone helps.thanks in advance
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
The scores from individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT or my AFINN. Either individual experts have scored them or students or Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning with annotated texts, or word scores can be estimated from word ontologies or co-occurence patterns.
As for aggregation of individual words, there are various ways. One way would be to sum all the individual scores (valences), another to take the max valence among the words, a third to normalize (divide) by the number of words or by the number of scored words (i.e., getting a mean score), - or divide the square root of that number. The results may differ a bit.
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach is with recursive models like Richard Socher's models. The sentiment values of the individual words are aggregated in a tree-like structure and should find that the "but in vain"-part of your example should carry the most weight.

How to know one system is siginficantly better than another one?

I am studying lexical semantics. I have 65 pairs of synonyms with their sense relatedness. The dataset is derived from the paper:
Rubenstein, Herbert, and John B. Goodenough. "Contextual correlates of synonymy." Communications of the ACM 8.10 (1965): 627-633.
I extract sentences containing those synonyms, transfer the neighbouring words appearing in those sentences to vectors, calculate the cosine distance between different vectors, and finally get the Pearson correlation between the distances we calculate and the sense relatedness given by Rubenstein and Goodenough
I get the Pearson correlation for Method 1 is 0.79, and for Method 2 is 0.78, for example. How do I measure Method 1 is significantly better than Method 2 or not?
Well Strictly not a programming question, but since this question is unanswered in others stackexchange sites, i'll tell the approach i would take.
I would say there are other benchmarks to check your approaches on similar tasks. You can check how your method performs on those benchmarks and analyze the results. Some methods may capture similarity more while others relatedness and some both.
This is the link WordVec Demo which automatically scores your vectors and provides you the results.

Financial news headers classification to positive/negative classes

I'm doing a small research project where I should try to split financial news articles headers to positive and negative classes.For classification I'm using SVM approach.The main problem which I see now it that not a lot of features can be produced for ML. News articles contains a lot of Named Entities and other "garbage" elements (from my point of view of course).
Could you please suggest ML features which can be used for ML training? Current results are: precision =0.6, recall=0.8
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set to a TF/IDF representation and then you train a Linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent - not sure for 0.7 break even point.
Then, to get better results you need to go for NLP approaches. Try use a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview on Sentiment Analysis by Bo Pang and Lillian Lee you should read:
How about these features?
Length of article header in words
Average word length
Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
Ratio of words in that dictionary to total words in sentence
Similar to 3, but number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
Similar to 5, but use the "good"-word dictionary
Time of the article's publication
Date of the article's publication
The medium through which it was published (you'll have to do some subjective classification)
A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from the actual article, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at many other dictionaries online such as Ogden's 850 basic english dictionary, and see if bad/good articles would be likely to extract many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further along, if you still aren't close could be to only select the adjectives and verbs from the tagged data as they are the words that tend to convey the emotion or mood.
I wouldn't be too disheartened in your precision and recall figures though, an F number of 0.8 and above is actually quite good.

Resources