Mutual information for feature selection in text classification

I use a Naive Bayes classifier for text classification. How can I improve its accuracy by using mutual information as a feature selection measure?

There are two kinds of improvement you can make in text classification. First, you can improve the preprocessing, for example by using n-grams. Second, you can apply feature weighting and selection techniques such as TF-IDF, mutual information, or chi-square, possibly combined with optimization algorithms such as a genetic algorithm, the bat algorithm, artificial bee colony (ABC), or ant colony optimization. TF-IDF is very popular in information retrieval. Naive Bayes is very sensitive to the feature selection method, so you can combine preprocessing techniques, a feature selection method, and a classification method to optimize the classification result.
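To make the mutual information step concrete, here is a minimal sketch in plain Python, assuming features are "term present in document" indicators and that the toy documents and labels below (which are invented for illustration) stand in for your corpus. The function names `mutual_information` and `select_top_terms` are hypothetical helpers, not part of any library. Each term is scored by I(T; C) = sum over t, c of P(t, c) * log2(P(t, c) / (P(t) * P(c))), and only the highest scoring terms are kept as features for the classifier.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {term: MI in bits between term presence and class label}."""
    n = len(docs)
    class_counts = Counter(labels)
    term_df = Counter()        # documents containing the term
    term_class_df = Counter()  # documents of a given class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[(term, label)] += 1

    scores = {}
    for term, n_t in term_df.items():
        mi = 0.0
        for label, n_c in class_counts.items():
            n_tc = term_class_df[(term, label)]
            # Two cells per class: term present, term absent.
            for joint, marginal in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if joint > 0:  # empty cells contribute nothing
                    mi += (joint / n) * math.log2(joint * n / (marginal * n_c))
        scores[term] = mi
    return scores


def select_top_terms(scores, k):
    """Keep the k terms with the highest mutual information."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Toy corpus, invented for illustration only.
docs = [
    ["cheap", "pills", "now"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = mutual_information(docs, labels)
vocabulary = select_top_terms(scores, k=2)
```

The selected vocabulary would then be used to filter tokens before training Naive Bayes. If you use scikit-learn, `SelectKBest` with `mutual_info_classif` covers the same idea; the number of terms to keep is something to tune on held-out data.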


How to use feature selection and dimensionality reduction in Unsupervised learning?

I've been working on classifying emails from two authors. I got this working with supervised learning, using TF-IDF vectorization of the text, PCA, and SelectPercentile feature selection, all with the scikit-learn package.
Now I want to try the same task with unsupervised learning, using the KMeans algorithm to cluster the emails into two groups. I have created a dataset in which each data point is a single line in a Python list. Since I am new to unsupervised learning, I wanted to ask whether I can apply the same dimensionality reduction tools that I used in the supervised case (TF-IDF, PCA, and SelectPercentile). If not, what are their counterparts? I am using scikit-learn to code it up.
I looked around on Stack Overflow but couldn't find a satisfactory answer.
I am really stuck at this point.
Please help!
The following dimensionality reduction techniques can be applied in unsupervised learning:
PCA: principal component analysis
  Exact PCA
  Incremental PCA
  Approximate PCA
  Kernel PCA
  SparsePCA and MiniBatchSparsePCA
Random projections
  Gaussian random projection
  Sparse random projection
Feature agglomeration
Standard Scaler (a feature scaling step that is often applied before the methods above; it does not reduce dimensionality by itself)
These are some of the approaches that can be used to reduce the dimensionality of large datasets in unsupervised learning.
You can read more about the details here.
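As a rough illustration of how these pieces could fit together for the email clustering case, here is a minimal scikit-learn sketch. The `emails` list is invented toy data, and the parameter values (`n_components=2`, `n_init=10`) are assumptions chosen only to make the example run. TF-IDF needs no labels, so it carries over unchanged. `TruncatedSVD` is used in place of PCA because it works directly on the sparse TF-IDF matrix. SelectPercentile with its usual scoring functions needs class labels, so it has no direct counterpart here; label-free filtering such as the `min_df` and `max_df` options of `TfidfVectorizer` is one possible substitute.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in for the list of emails (one string per email).
emails = [
    "please review the quarterly budget and send the forecast",
    "the budget forecast for next quarter needs review",
    "send me the updated budget spreadsheet today",
    "are we still on for dinner and a movie tonight",
    "dinner tonight sounds great, which movie do you want",
    "let us grab dinner after the movie this weekend",
]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),          # unsupervised, no labels needed
    TruncatedSVD(n_components=2, random_state=0),   # PCA-like reduction for sparse input
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# One cluster id (0 or 1) per email; the ids carry no author meaning by themselves.
cluster_ids = pipeline.fit_predict(emails)
```

Since you already have author labels from the supervised experiment, you can compare them with `cluster_ids` afterwards to see how well the clusters line up with the two authors.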

Text classification using SVM

I read the article "A hybrid classification method of k nearest neighbor, Bayesian methods and genetic algorithm".
It proposes using a genetic algorithm to improve text classification.
I want to replace the genetic algorithm with an SVM, but I don't know whether that would work.
I mean, I do not know whether the new idea would give better results than the article's.
I read somewhere that GA is better than SVM, but I don't know if that's right.
SVM and genetic algorithms are in fact completely different methods. SVM is basically a classification tool, while genetic algorithms are a meta-optimization heuristic. Unfortunately I do not have access to the cited paper, but I can hardly imagine how putting an SVM in the place of the GA could work.
Regarding the claim that GA is better than SVM: no, it is not true. These methods are not comparable, as they are completely different tools.
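To illustrate the difference in roles, here is a small plain-Python sketch of a genetic algorithm used as an optimizer over feature subsets. It is not the method of the cited paper, which I have not seen. The `fitness` function is a hypothetical stand-in: in a real setup it would train some classifier (an SVM, for instance) on the selected features and return its validation accuracy. The GA only searches; the classifier only classifies.

```python
import random


def fitness(mask):
    """Hypothetical stand-in for the validation accuracy of a classifier
    trained on the features where mask[i] == 1. Here we simply pretend that
    features 0, 3 and 4 are useful and that all others add noise."""
    useful = {0, 3, 4}
    hits = sum(1 for i in useful if mask[i])
    noise = sum(mask) - hits
    return hits - 0.5 * noise


def evolve(n_features=8, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]  # one-point crossover
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best_mask = evolve()
```

Swapping the SVM in for the GA would leave nothing to do the search, which is why the two cannot replace each other.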

Large scale naïve Bayes classifier with top-k output

I need a library for large-scale naïve Bayes, with millions of training examples and more than 100k binary features. It must be an online version (updatable after training). I also need top-k output, that is, multiple classifications for a single instance. Accuracy is not very important.
The purpose is an automatic text categorization application.
Any suggestions for a good library are much appreciated.
EDIT: The library should preferably be in Java.
If a learning algorithm other than naïve Bayes is also acceptable, then check out Vowpal Wabbit (C++), which has a reputation as one of the best scalable text classification tools (online stochastic gradient descent + LDA). I'm not sure if it does top-k output.
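For reference, the two requirements in the question (online updates and top-k output) are simple to express for naïve Bayes itself, because the model is just a set of counts. The following is a minimal Python sketch, not a recommendation of a particular library and not in Java as the question prefers. The class name `OnlineNaiveBayes` and the toy features are invented for illustration, and as a simplifying assumption only the features present in an instance contribute to the score (absent features are ignored).

```python
import heapq
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Count-based naive Bayes over binary features, updatable one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant
        self.class_counts = Counter()               # examples seen per class
        self.feature_counts = defaultdict(Counter)  # per class: examples containing feature
        self.total = 0

    def update(self, features, label):
        """Incorporate one labeled example; no retraining is needed."""
        self.class_counts[label] += 1
        self.total += 1
        self.feature_counts[label].update(set(features))

    def top_k(self, features, k=3):
        """Return the k best (label, log score) pairs, best first."""
        active = set(features)
        scores = {}
        for label, n_c in self.class_counts.items():
            counts = self.feature_counts[label]
            score = math.log(n_c / self.total)  # log prior
            for f in active:
                # Smoothed estimate of P(feature present | class).
                score += math.log((counts[f] + self.alpha) / (n_c + 2 * self.alpha))
            scores[label] = score
        return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


model = OnlineNaiveBayes()
model.update({"goal", "match", "team"}, "sports")
model.update({"election", "vote"}, "politics")
model.update({"stock", "market", "vote"}, "finance")
model.update({"team", "coach"}, "sports")

ranked = model.top_k({"team", "match"}, k=2)
```

Memory grows with the number of distinct (class, feature) pairs actually observed, so sparse binary features keep the tables manageable; a production system would still need persistence and probably feature hashing.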

Which classifier to choose in NLTK

I want to classify text messages into several categories such as "relation building", "coordination", "information sharing", "knowledge sharing" and "conflict resolution". I am using the NLTK library to process the data. I would like to know which classifier in NLTK is best suited to this particular multi-class classification problem.
I am planning to use Naive Bayes classification. Is that advisable?
Naive Bayes is the simplest and easiest classifier to understand, and for that reason it's nice to use. Decision trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend to be more complex, and SVM requires some tuning to get right.
Most important is the choice of features + the amount/quality of data you provide!
With your problem, I would focus first on ensuring you have a good training/testing dataset and on choosing good features. Since you are asking this question, I assume you haven't had much experience with machine learning for NLP, so I'd say start off easy with Naive Bayes, as it doesn't need complex features: you can just tokenize and count word occurrences.
The question "How do you find the subject of a sentence?" and my answer are also worth looking at.
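A minimal sketch of that "tokenize and count" starting point with NLTK's `NaiveBayesClassifier` could look like the following. The training messages are invented placeholders, the helper `word_counts` is hypothetical, and whitespace splitting is assumed as the tokenizer to keep the example free of extra downloads. Note that NLTK treats each feature value as a category, so a count of 1 and a count of 2 are simply two different values of the same feature.

```python
from nltk.classify import NaiveBayesClassifier


def word_counts(message):
    """Bag-of-words features: lowercase token -> number of occurrences."""
    counts = {}
    for token in message.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts


# Invented examples; a real training set needs many messages per category.
training_messages = [
    ("can we meet at ten to plan the release", "coordination"),
    ("i will take the first task and you take the second", "coordination"),
    ("the server logs are in the shared folder", "information sharing"),
    ("the report is attached to this message", "information sharing"),
]

train_set = [(word_counts(text), label) for text, label in training_messages]
classifier = NaiveBayesClassifier.train(train_set)

features = word_counts("the logs are attached")
predicted = classifier.classify(features)          # single best label
distribution = classifier.prob_classify(features)  # probability for every label
```

`distribution.prob(label)` gives the estimated probability for a given category, which is useful for inspecting how confident the classifier is before trusting its top choice.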
Yes. Training a Naive Bayes classifier for each category and then assigning each message to the class whose classifier gives the highest score is a standard first approach to problems like this. There are more sophisticated single-class classifier algorithms that you could substitute for Naive Bayes if you find performance inadequate, such as a Support Vector Machine (which I believe is available in NLTK via a Weka plug-in, but I'm not positive). Unless you can think of anything specific in this problem domain that would make Naive Bayes especially unsuitable, it's often the go-to "first try" for a lot of projects.
The other NLTK classifier I would consider trying is MaxEnt, as I believe it natively handles multiclass classification (though the multiple binary classifier approach is very standard and common as well). In any case, the most important thing is to collect a very large corpus of properly tagged text messages.
If by "text messages" you mean actual cell phone text messages, these tend to be very short, and the language is very informal and varied. In that case I think feature selection may end up being a larger factor in determining accuracy than classifier choice. For example, using a stemmer or lemmatizer that understands common abbreviations and idioms, tagging parts of speech or chunking, entity extraction, and extracting probable relationships between terms may provide more bang than using more complex classifiers.
This paper talks about classifying Facebook status messages based on sentiment, which has some of the same issues and may provide some insights. The link is to a Google cache because I'm having problems with the original site:

Simple Sentiment Analysis

It appears that the simplest, naivest way to do basic sentiment analysis is with a Bayesian classifier (confirmed by what I'm finding here on SO). Any counter-arguments or other suggestions?
A Bayesian classifier with a bag-of-words representation is the simplest statistical method. You can get significantly better results by moving to more advanced classifiers and feature representations, at the cost of more complexity.
Statistical methods aren't the only game in town. Rule-based methods that have more understanding of the structure of the text are the other main option. From what I have seen, these don't actually perform as well as statistical methods.
I recommend Manning and Schütze's Foundations of Statistical Natural Language Processing chapter 16, Text Categorization.
I can't think of a simpler, more naive way to do sentiment analysis, but you might consider using a Support Vector Machine instead of Naive Bayes (in some machine learning toolkits, this can be a drop-in replacement). Have a look at "Thumbs up? Sentiment Classification using Machine Learning Techniques" by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, which was one of the earliest papers on these techniques. It gives a good table of accuracy results on a family of related techniques, none of which are any more complicated (from a client perspective) than any of the others.
Building upon the answer provided by Ken above, there is another paper, "Sentiment analysis using support vector machines with diverse information sources" by Tony Mullen and Nigel Collier, which looks at using more features than just the bag of words used by Pang and Lee. They leverage WordNet to determine the semantic differentiation of adjectives, and the proximity of the sentiment to the topic in the text, as additional features for the SVM. They show better results than previous attempts to classify text based on sentiment.
