Feature Selection in Text Classification

I'm currently studying text classification, focusing on feature selection. Can anyone suggest software that I can use for text classification and that provides feature selection functions (particularly Information Gain, Chi-Square, Odds Ratio, Mutual Information, etc.)?
Thanks and best regards =)

Related

Formulae for calculating the shape of feature maps after convolutions

I know that PyTorch's documentation provides this, but I have difficulty understanding its notation.
Is there any more accessible explanation (maybe also with graphical illustrations)?
I think you are looking for receptive field arithmetic.
This webpage provides a detailed explanation of the various factors affecting the size of the receptive field, and the shape of the resulting feature maps.
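For a quick numeric check, the output-size rule can be written as a small function. This is a minimal sketch of the formula as it is stated in PyTorch's Conv2d documentation, applied to one spatial dimension at a time; the function names are made up for illustration, and it assumes the same kernel size, stride, padding, and dilation along height and width.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution.

    out = floor((n + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    """
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_output_shape(height, width, kernel, stride=1, padding=0, dilation=1):
    """Apply the one-dimensional rule to height and width (same settings for both)."""
    return (
        conv_output_size(height, kernel, stride, padding, dilation),
        conv_output_size(width, kernel, stride, padding, dilation),
    )


# A 3x3 kernel with padding 1 and stride 1 keeps the spatial size.
same = conv2d_output_shape(32, 32, kernel=3, stride=1, padding=1)

# A 7x7 kernel with stride 2 and padding 3 halves a 224x224 input.
halved = conv2d_output_shape(224, 224, kernel=7, stride=2, padding=3)

# Without padding, a 5x5 kernel trims 4 pixels from each dimension.
trimmed = conv2d_output_shape(28, 28, kernel=5)

print(same, halved, trimmed)
```

Stacking layers is then just a matter of feeding each output shape into the next call.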

Correlation and graph layout in widyr and ggraph when tidy text mining

I'm using a tutorial (https://www.tidytextmining.com/nasa.html?q=correlation%20ne#networks-of-keywords) to learn about tidy text mining. I am hoping someone might be able to help with two questions:
In this tutorial, the correlation threshold used to make the graph is 0.15. Is this best practice? I can't find any literature to help choose a cutoff.
In the graph from the tutorial, how is the centrality of the clusters chosen? Are more important words closer to the centre?
Thanks very much
I am not aware of any literature on a correlation threshold to use for this kind of network analysis; this will (I believe) depend on your particular dataset and how language is used in your context. This is a heuristic decision. Given what a correlation coefficient measures, I would expect 0.15 to be on the low side of what you might use.
The graph is represented visually in a two-dimensional plot via the layout argument of ggraph. You can read more about that here, but the very high-level takeaways are that there are a lot of options, they have a big impact on what your graph looks like, and it's often not clear what the best choice is.
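To get a feel for how the cutoff changes the network, it can help to compute the pairwise correlations on a tiny example and count the edges that survive at different thresholds. The sketch below is plain Python, not widyr's implementation; it assumes the correlation in question is the phi coefficient between binary "word appears in document" indicators, and the keyword sets are invented for illustration.

```python
import math
from itertools import combinations


def phi(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 co-occurrence table."""
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


def pairwise_phi(docs):
    """Phi coefficient for every pair of words across a list of word sets."""
    docs = [set(d) for d in docs]
    vocab = sorted(set().union(*docs))
    total = len(docs)
    result = {}
    for a, b in combinations(vocab, 2):
        n11 = sum(1 for d in docs if a in d and b in d)
        n10 = sum(1 for d in docs if a in d and b not in d)
        n01 = sum(1 for d in docs if a not in d and b in d)
        n00 = total - n11 - n10 - n01
        result[(a, b)] = phi(n11, n10, n01, n00)
    return result


def edges_above(correlations, threshold):
    """Word pairs that would become graph edges at a given cutoff."""
    return sorted(pair for pair, r in correlations.items() if r > threshold)


# Hypothetical keyword sets, one per document.
docs = [
    {"ocean", "temperature"},
    {"ocean", "temperature", "salinity"},
    {"ocean", "salinity"},
    {"mars", "rover"},
    {"mars", "rover", "temperature"},
    {"mars"},
]

correlations = pairwise_phi(docs)
for cutoff in (0.15, 0.3, 0.5):
    print(cutoff, len(edges_above(correlations, cutoff)))
```

Running the same loop on your own data shows how quickly the graph thins out as the cutoff rises, which is one practical way to pick a value for a particular dataset.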

How Information Gain Works in Text Classification

I need to learn information gain for feature selection right now,
but I don't have a clear understanding of it. I am a newbie, and I'm confused about it.
How do I use IG in feature selection (manual calculation)?
This is the only clue I have. Can anyone help me with how to use the formula:
and this is the example:
How to use information gain in feature selection?
Information gain (InfoGain(t)) measures the number of bits of information obtained for prediction of a class (c) by knowing the presence or absence of a term (t) in a document.
Concisely, the information gain is a measure of the reduction in entropy of the class variable after the value for the feature is observed. In other words, information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes.
In text classification, the features are the terms that appear in the documents (the corpus). Consider two terms in the corpus, term1 and term2. If term1 reduces the entropy of the class variable by a larger amount than term2, then term1 is more useful than term2 for document classification in this example.
Example in the context of sentiment classification
A word that occurs primarily in positive movie reviews and rarely in negative reviews carries a lot of information. For example, the presence of the word “magnificent” in a movie review is a strong indicator that the review is positive. That makes “magnificent” a highly informative word.
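The manual calculation can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: it treats each document as a set of words, computes the entropy of the class labels, and subtracts the weighted entropy after splitting on the presence or absence of a term. The eight tiny "reviews" and their labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """H(class) minus H(class | term present or absent)."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        len(with_term) / n * entropy(with_term)
        + len(without_term) / n * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is a set of words.
docs = [
    {"a", "magnificent", "film"},
    {"magnificent", "acting"},
    {"the", "film", "was", "magnificent"},
    {"the", "plot", "was", "fine"},
    {"the", "film", "was", "dull"},
    {"dull", "plot"},
    {"the", "acting", "was", "bad"},
    {"a", "bad", "film"},
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

ig_magnificent = information_gain(docs, labels, "magnificent")
ig_film = information_gain(docs, labels, "film")
print(f"magnificent: {ig_magnificent:.3f} bits, film: {ig_film:.3f} bits")
```

Here "magnificent" appears only in positive reviews, so knowing whether it is present removes roughly half a bit of uncertainty about the class, while "film" is spread evenly over both classes and gains nothing. Ranking all terms by this score and keeping the top ones is the feature selection step.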
Compute entropy and information gain in python
Measuring Entropy and Information Gain
The formula comes from mutual information. In this case, you can think of mutual information as how much information the presence or absence of the term t gives us for guessing the class.
Check: https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
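The mutual information formula on that page works from four document counts: N11 (contains the term, in the class), N10 (contains the term, not in the class), N01 (lacks the term, in the class), and N00 (lacks the term, not in the class). Below is a minimal Python sketch of that calculation. The counts in the usage example are meant to match the worked example on the linked page (term "export", class "poultry"); if they differ from the page, treat them as illustrative only.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership.

    The first index is the term indicator, the second the class indicator,
    so n10 counts documents that contain the term but are not in the class.
    """
    n = n11 + n10 + n01 + n00
    # (joint count, term marginal, class marginal) for each of the four cells
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            mi += joint / n * math.log2(n * joint / (term_marginal * class_marginal))
    return mi


mi_export = mutual_information(n11=49, n10=27652, n01=141, n00=774106)
print(f"{mi_export:.7f}")

# A term distributed independently of the class carries no information.
mi_independent = mutual_information(10, 10, 10, 10)
```

The value is small in absolute terms because the class is rare; what matters for feature selection is the ranking of terms by this score within a class.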

How to find Information gain in text classification?

I am working on text classification using a decision tree, which uses information gain as the main criterion for categorising text documents. I have extracted a few features by TF*IDF value, but I am not able to figure out how exactly information gain should be calculated. There are some articles about this, but none of them are very clear on how to apply it to text files.
You can use Weka for calculating information gain. In Weka, the InfoGainAttributeEval class (InfoGainAttributeEval.java)
will calculate the IG of a word with respect to the document class. Check this answer; it may help you.

Image Categorization Using Gist Descriptors

I created a multi-class SVM model using libSVM for categorizing images. I optimized for the C and G parameters using grid search and used the RBF kernel.
The classes are 1) animal 2) floral 3) landscape 4) portrait.
My training set is 100 images from each category, and for each image, I extracted a 920-length vector using Lear's Gist Descriptor C code: http://lear.inrialpes.fr/software.
Upon testing my model on 50 images/category, I achieved ~50% accuracy, which is twice as good as random (25% since there are four classes).
I'm relatively new to computer vision, but familiar with machine learning techniques. Any suggestions on how to improve accuracy effectively?
Thanks so much and I look forward to your responses!
This is a very, very, very open research challenge, and there isn't necessarily a single answer that is theoretically guaranteed to be better.
Given your categories, it's not a bad start, but keep in mind that Gist was originally designed as a global descriptor for scene classification (although it has empirically proven useful for other image categories). On the representation side, I recommend trying color-based features like patch-based histograms as well as popular low-level gradient features like SIFT. If you're just beginning to learn about computer vision, then I would say SVM is plenty for what you're doing, depending on the variability in your image set, e.g. illumination, view angle, focus, etc.
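As a rough idea of what a patch-based color histogram could look like, here is a minimal sketch in plain Python. It is only one of many possible designs: it assumes the image is already loaded as a list of rows of (r, g, b) tuples with values 0 to 255, splits it into a grid of patches, and builds a normalized per-channel histogram for each patch. The function name and the grid and bin defaults are made up for illustration.

```python
def patch_color_histogram(image, grid=2, bins=4):
    """Concatenate per-patch, per-channel color histograms into one feature vector.

    image: list of rows, each row a list of (r, g, b) tuples with values 0-255.
    Returns a list of length grid * grid * 3 * bins.
    """
    height, width = len(image), len(image[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            # Pixel bounds of this patch
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            hist = [0] * (3 * bins)
            count = 0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    for channel, value in enumerate(image[y][x]):
                        hist[channel * bins + value * bins // 256] += 1
                    count += 1
            # Normalize so that each channel's histogram sums to 1
            features.extend(h / count if count else 0.0 for h in hist)
    return features


# Toy 4x4 image: top half pure red, bottom half pure blue.
red, blue = (255, 0, 0), (0, 0, 255)
image = [[red] * 4, [red] * 4, [blue] * 4, [blue] * 4]

features = patch_color_histogram(image)
print(len(features))
```

The resulting vector could be appended to the Gist vector before training the SVM, ideally after scaling both parts to a comparable range, since the RBF kernel is sensitive to feature scale.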

Resources