I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data comes from a multichannel EEG setup, so in essence I have a 63 x 116 x 50 matrix (channels x time points x trials; there are two trial types with 50 trials each), which I have reshaped into one long vector per trial.
What I would like to do, after the classification, is to see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? E.g. to say that the classification was driven mainly by N features, and that these are features x to z. So I could, for instance, say that channel 10 at time points 90-95 was significant or important for the classification.
So is this possible, or am I asking the wrong question?
Any comments or paper references are much appreciated.
Scikit-learn includes quite a few methods for feature ranking, among them:
Univariate feature selection (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)
Recursive feature elimination (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
Randomized Logistic Regression/stability selection (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html)
(see more at http://scikit-learn.org/stable/modules/feature_selection.html)
Among those, I definitely recommend giving Randomized Logistic Regression a shot. In my experience, it consistently outperforms other methods and is very stable.
Paper on this: http://arxiv.org/pdf/0809.2932v2.pdf
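To make this concrete, here is a minimal sketch of the first two approaches using current scikit-learn classes, with synthetic data standing in for your flattened trials; the shapes, parameter values, and the channel/time-point decoding (which assumes the 63 x 116 trial matrix was flattened row by row, channel-major) are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the flattened EEG data:
# 100 trials x (63 channels * 116 time points) features
X, y = make_classification(n_samples=100, n_features=63 * 116,
                           n_informative=20, random_state=0)

# Univariate selection: score each feature independently (ANOVA F-test)
univariate = SelectKBest(f_classif, k=100).fit(X, y)
top_univariate = np.argsort(univariate.scores_)[::-1][:10]

# Recursive feature elimination wrapped around a regularized logistic regression
rfe = RFE(LogisticRegression(penalty="l2", max_iter=1000),
          n_features_to_select=100, step=0.1).fit(X, y)
top_rfe = np.where(rfe.support_)[0][:10]

# Map a flat feature index back to (channel, time point),
# assuming channel-major flattening of the 63 x 116 trial matrix
channel, timepoint = np.divmod(top_univariate, 116)
print(list(zip(channel, timepoint)))
```

RFE over several thousand features can be slow; using a larger step, or pre-filtering with the univariate scores first, keeps it manageable.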
Edit:
I have written a series of blog posts on different feature selection methods and their pros and cons, which are probably useful for answering this question in more detail:
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/
I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12,000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with its pos_weight parameter, and I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… For both minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18, even though that is much better than no weighting at all, since in that case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample the minority classes), but they don't seem to work.
I would suggest one of these two strategies:
Focal Loss
A very interesting approach for dealing with unbalanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross-entropy loss in a way that decreases the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
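As a rough illustration (not the authors' reference implementation), a focal term can be layered on top of your BCEWithLogitsLoss-style training like this; gamma=2 follows the paper's default, and the function name is just for the sketch:

```python
import torch
import torch.nn.functional as F

def focal_bce_with_logits(logits, targets, gamma=2.0, pos_weight=None):
    """Binary cross-entropy with focal down-weighting of easy examples."""
    # Per-element BCE with no reduction, so each term can be reweighted
    bce = F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t = model probability assigned to the true label of each element
    p_t = probs * targets + (1 - probs) * (1 - targets)
    # (1 - p_t)^gamma shrinks the contribution of well-classified elements
    return ((1 - p_t) ** gamma * bce).mean()
```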
Hard Negative Mining
Another popular approach is "hard negative mining"; that is, propagating gradients only for a subset of the training examples - the "hard" ones. See, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
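A bare-bones version of the same idea for a multilabel BCE setup might look like the following; the keep_fraction value and the function name are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def hard_example_bce(logits, targets, keep_fraction=0.25):
    """Back-propagate only through the highest-loss (hardest) label entries."""
    losses = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    flat = losses.reshape(-1)
    k = max(1, int(keep_fraction * flat.numel()))
    # Keep the k largest per-element losses; easy examples contribute no gradient
    hard_losses, _ = torch.topk(flat, k)
    return hard_losses.mean()
```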
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional, more traditional machine learning options: over-sampling and under-sampling.
The main idea of both is to produce a more balanced dataset by resampling before you start training. Note that you will probably run into problems such as losing data diversity (under-sampling) or overfitting the training data (over-sampling), but they can be a good starting point.
See the Wikipedia article on oversampling and undersampling for more information.
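For a single binary label the idea is just this (a toy sketch with made-up data; with 7,900 interdependent labels you would have to resample per label or per label group, which is considerably messier):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.array([1] * 100 + [0] * 900)       # 10% minority class

X_min, X_maj = X[y == 1], X[y == 0]

# Over-sampling: draw minority rows with replacement up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# Under-sampling would instead draw len(X_min) majority rows without replacement
```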
I am using sklearn's random forest module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features seem to make it less accurate.
I suspect that splitting might only be done across one dimension, which means that features that are actually more important get less attention when building the trees. Could this be the reason?
Yes. The additional features you added might not have good predictive power, and since a random forest builds each tree on a random subset of features, the original 50 features may simply have been missed. To test this hypothesis, you can plot the variable importances using sklearn.
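A quick way to check this (synthetic data below is only a placeholder for your 150-dimensional dataset; I'm assuming a classification setup, swap in RandomForestRegressor if you are predicting continuous values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 150 features; with shuffle=False the informative ones come first
X, y = make_classification(n_samples=2000, n_features=150, n_informative=30,
                           shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances; sklearn's permutation_importance is a less biased alternative
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:15]:
    print(f"feature {i:3d}: importance {rf.feature_importances_[i]:.4f}")
```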
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
[Linked figure: decision boundaries learned for a binary classification task - https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c]
There are plenty of illustrations of overfitting, but for instance, the linked 2D plot shows the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns spurious data patterns that don't generalize properly.
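One quick way to confirm this diagnosis on your own data is to compare training accuracy with cross-validated accuracy; a large gap between the two is the classic symptom (synthetic placeholder data below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=150, n_informative=30,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# A near-perfect training score with a much lower CV score indicates overfitting
print("train accuracy:", rf.score(X, y))
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```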
I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily, so it may turn out that such features, although good in principle, do not improve classification over the raw data.
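As a sketch of both options (the array names and the speed feature are only illustrative; whether such engineered features help depends entirely on your data):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 50, 2))   # (n_samples, n_timepoints, d): e.g. 2D swipe traces

# Option 1: flatten the raw series into one row per sample for svm.SVC
X_flat = X_raw.reshape(X_raw.shape[0], -1)          # (200, 100)

# Option 2: add an engineered feature, e.g. instantaneous speed along the time axis
velocity = np.diff(X_raw, axis=1)                   # (200, 49, 2)
speed = np.linalg.norm(velocity, axis=2)            # (200, 49)
X_with_speed = np.hstack([X_flat, speed])           # raw features + speed profile
```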
If you have lots and lots of data (say, an internet full of good examples), deep learning neural networks can learn features automatically to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.
I have a multi-class text classification/categorization problem. I have a set of ground truth data with K different mutually exclusive classes. This is an unbalanced problem in two respects. First, some classes are a lot more frequent than others. Second, some classes are of more interest to us than others (those generally positively correlate with their relative frequency, although there are some classes of interest that are fairly rare).
My goal is to develop a single classifier or a collection of them to be able to classify the k << K classes of interest with high precision (at least 80%) while maintaining reasonable recall (what's "reasonable" is a bit vague).
The features I use are mostly typical unigram-/bigram-based ones, plus some binary features coming from metadata of the incoming documents being classified (e.g. whether they were submitted via email or through a web form).
Because of the unbalanced data, I am leaning toward developing binary classifiers for each of the important classes, instead of a single one like a multi-class SVM.
What ML learning algorithms (binary or not) implemented in scikit-learn allow for training tuned to precision (versus for example recall or F1) and what options do I need to set for that?
What data analysis tools in scikit-learn can be used for feature selection to narrow down the features that might be the most relevant to the precision-oriented classification of a particular class?
This is not really a "big data" problem: K is about 100, k is about 15, the total number of samples available to me for training and testing is about 100,000.
Thx
Given that k is small, I would just do this manually. For each class of interest, train an individual one-vs-rest classifier, look at its precision-recall curve, and then choose the threshold that gives the desired precision.
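A minimal sketch of that recipe with scikit-learn, using synthetic data in place of your documents (the 80% target and the logistic regression choice are just placeholders; any classifier with predict_proba or decision_function works the same way):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in for one "class of interest vs. the rest" problem
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, scores)

# Among operating points with >= 80% precision, keep the one with the best recall
target = 0.80
candidates = [(t, r) for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
              if p >= target]
if candidates:
    threshold, achieved_recall = max(candidates, key=lambda c: c[1])
    print(f"threshold={threshold:.3f}, recall at >={target:.0%} precision: {achieved_recall:.3f}")
else:
    print("No threshold reaches the target precision on this validation split.")
```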