scikit-learn SVM with a lot of samples / mini batch possible? - scikit-learn

At http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html I read:
"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."
I currently have 350,000 samples and 4,500 classes, and this will grow further to 1-2 million samples and 10k+ classes.
My problem is that I am running out of memory. Everything works as it should when I use just 200,000 samples with fewer than 1,000 classes.
Is there a built-in way to use something like mini-batches with SVM? I saw that MiniBatchKMeans exists, but I don't think it is meant for SVM.
Any input welcome!

I mentioned this problem in my answer to this question.
You can split your large dataset into batches that can be safely consumed by an SVM algorithm, then find support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
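A rough sketch of that idea (the batch size, the SVC parameters, and the names X and y are illustrative assumptions, not part of the original answer):
import numpy as np
from sklearn.svm import SVC

def fit_svm_in_batches(X, y, batch_size=10_000, **svc_params):
    sv_X, sv_y = [], []
    # Fit an SVM on each batch and keep only its support vectors
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        clf = SVC(**svc_params).fit(Xb, yb)
        sv_X.append(Xb[clf.support_])
        sv_y.append(yb[clf.support_])
    # Final model trained only on the pooled support vectors
    return SVC(**svc_params).fit(np.vstack(sv_X), np.concatenate(sv_y))
Each batch must contain examples from at least two classes for the per-batch fit to work, so you may want to shuffle or stratify the batches.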
Also, if there is no need for kernels in your case, you can use sklearn's SGDClassifier, which implements stochastic gradient descent. It fits a linear SVM by default.
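For that linear case, a minimal mini-batch sketch with SGDClassifier.partial_fit (the batch size and the names X and y are assumptions):
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")   # hinge loss corresponds to a linear SVM
classes = np.unique(y)              # partial_fit needs the full set of classes up front
for start in range(0, len(X), 10_000):
    clf.partial_fit(X[start:start + 10_000], y[start:start + 10_000], classes=classes)
With partial_fit the whole dataset never has to sit in memory at once; batches can be streamed from disk.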

Related

How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes I have only ~900 examples, which is around 1%. For "overrepresented" classes I have ~12,000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with its pos_weight parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
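A minimal sketch of that calculation (assuming labels is a multi-hot tensor of shape (num_examples, num_classes) and that model and inputs already exist):
import torch.nn as nn

pos_counts = labels.sum(dim=0)                     # positives per class
neg_counts = labels.shape[0] - pos_counts          # negatives per class
pos_weight = neg_counts / pos_counts.clamp(min=1)  # negatives / positives, as in the docs
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(model(inputs), labels.float())    # raw logits vs. multi-hot targets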
As a result, my model overestimates almost every class. For both minor and major classes I get almost twice as many predictions as true labels, and my AUPRC is just 0.18. Even so, it is much better than no weighting at all, since in that case the model predicts everything as zero.
So my question is: how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample the minority classes), but they don't seem to work.
I would suggest one of these strategies:
Focal Loss
A very interesting approach to dealing with unbalanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross-entropy loss in a way that decreases the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
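A hedged sketch of a binary focal loss in the spirit of that paper (gamma and alpha are common illustrative defaults, not values tuned for this problem):
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Per-element BCE, then down-weight well-classified elements by (1 - p_t)**gamma
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability assigned to the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()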
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick, Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016).
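A toy sketch of the idea applied online within a batch (the choice of k is purely illustrative):
import torch
import torch.nn.functional as F

def hard_mined_loss(logits, targets, k=32):
    # Loss per example, averaged over the classes
    per_example = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").mean(dim=1)
    # Backpropagate only through the k hardest examples in the batch
    hard = torch.topk(per_example, min(k, per_example.numel())).values
    return hard.mean()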
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional, more traditional machine learning options: over-sampling and under-sampling.
The main idea of both is to produce a more balanced dataset by resampling before you start training. Note that you will probably face some problems, such as losing data diversity (under-sampling) or overfitting the training data (over-sampling), but they can be a good starting point.
See the wiki link for more information.
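As a single-label illustration of random over-sampling (multilabel data needs a more careful strategy; the function and variable names here are assumptions):
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, random_state=0):
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                     # grow every class to the majority size
    X_parts, y_parts = [], []
    for c in classes:
        Xc, yc = X[y == c], y[y == c]
        Xr, yr = resample(Xc, yc, replace=True, n_samples=target,
                          random_state=random_state)
        X_parts.append(Xr)
        y_parts.append(yr)
    return np.vstack(X_parts), np.concatenate(y_parts)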

sklearn HistGradientBoostingClassifier with large unbalanced data

I've been using sklearn's HistGradientBoostingClassifier to classify some data. My experiment is multi-class classification with single-label predictions (20 classes).
My experiments cover two cases. The first measures the accuracy of the algorithm without data augmentation (around 3,000 unbalanced samples). The second measures accuracy with data augmentation (around 12,000 unbalanced samples). I am using default parameters.
In the first case, the HistGradientBoostingClassifier shows an accuracy of around 86.0%. However, with data augmentation, results show weak accuracy, around 23%.
I am wondering whether this drop in accuracy comes from the unbalanced dataset, but since HistGradientBoostingClassifier in the sklearn library has no feature for handling unbalanced datasets, I cannot verify that.
Has anyone had the same kind of problem with a large dataset and HistGradientBoostingClassifier?
Edit: I tried other algorithms on the same data split, and the results seem normal (accuracy around 5% higher with data augmentation). I am wondering why I only get this with HistGradientBoostingClassifier.
Accuracy is a poor metric when dealing with imbalanced data. Suppose I have a 90:10 split between class 0 and class 1: a DummyClassifier that only predicts class 0 will achieve 90% accuracy.
You'll have to look at precision, recall, F1, and the confusion matrix, not accuracy alone.
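For instance (assuming X_train/X_test, y_train/y_test, and a fitted model clf already exist):
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))  # can look high on imbalanced data
print(classification_report(y_test, clf.predict(X_test)))    # precision/recall/F1 per class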
I have found something that could be the reason for the lack of accuracy when using the HistGradientBoostingClassifier algorithm with default parameters on the augmented dataset of roughly 12,000 samples.
I compared the HistGradientBoostingClassifier and LightGBM algorithms on the same data split (HistGradientBoostingClassifier from sklearn is inspired by Microsoft's LightGBM). HistGradientBoostingClassifier shows a weak accuracy of 24.7% and LightGBM a strong one of 87.5%.
As I read in sklearn's and Microsoft's docs, HistGradientBoostingClassifier "cannot handle properly" unbalanced datasets, while LightGBM can. The latter has this parameter: class_weight (dict, 'balanced' or None, optional (default=None)) (found on that page).
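A hedged sketch of that comparison (only class_weight is set; everything else stays at its default, and the train/test variables are assumptions):
from lightgbm import LGBMClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

lgbm = LGBMClassifier(class_weight="balanced").fit(X_train, y_train)
hgb = HistGradientBoostingClassifier().fit(X_train, y_train)
print("LightGBM accuracy:", lgbm.score(X_test, y_test))
print("HistGradientBoosting accuracy:", hgb.score(X_test, y_test))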
My hypothesis is that the dataset becomes more unbalanced with augmentation and, without any feature in HistGradientBoostingClassifier for handling unbalanced data, the algorithm is misled.
Also, as mentioned by Hanafi Haffidz in the comments, the algorithm could tend to overfit with default parameters.

Fitting a random forest model on a large dataset - a few million rows and a few thousand columns

I am trying to build a random forest on a fairly large dataset - half a million rows and 20K columns (dense matrix).
I have tried modifying hyperparameters such as:
n_jobs = -1, and iterating over max_depth. However, it either gets stopped by a memory issue (I have a 320 GB server) or the accuracy is very low (when I use a lower max_depth).
Is there a way I can still use all the features and build the model without running into memory issues or losing accuracy?
In my opinion (I don't know your exact case and dataset), you should focus on extracting information from your dataset, especially since you have 20k columns. I assume some of them will not add much variance or will be redundant, so you can make your dataset somewhat smaller and more robust to potential overfitting.
Also, you should try some dimensionality reduction methods, which will allow you to make your dataset smaller while retaining most of the variance.
PCA, for example (I did not mean to offend you if you already know these methods); see the PCA sample code, gist, and Wikipedia page for details.
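A minimal sketch of that suggestion (n_components, n_estimators, and max_depth are illustrative; on a dataset this size IncrementalPCA or TruncatedSVD may be needed to keep memory down):
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    PCA(n_components=500),   # keep the directions carrying most of the variance
    RandomForestClassifier(n_estimators=100, max_depth=20, n_jobs=-1),
)
model.fit(X_train, y_train)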

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension, which means that features that are actually more important get less attention when building the trees. Could this be the reason?
Yes, the additional features you added might not have good predictive power, and since a random forest takes a random subset of features to build each individual tree, the original 50 features might get missed out. To test this hypothesis, you can plot the variable importances using sklearn.
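For example (assuming a fitted forest rf and a list feature_names; both names are placeholders):
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_
order = np.argsort(importances)[::-1][:30]   # the 30 most important features
plt.bar(range(len(order)), importances[order])
plt.xticks(range(len(order)), np.array(feature_names)[order], rotation=90)
plt.tight_layout()
plt.show()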
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance this 2D plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong data patterns that don't generalize properly.

Comparing parallel k-means batch vs mini-batch speed

I am trying to cluster 250k vectors with 1,000 dimensions each using k-means. The machine that I am working on has 80 dual-core processors.
Just confirming: has anyone compared the run-time of the default (batch) parallel version of k-means against the mini-batch version? The example comparison page in the sklearn documentation doesn't provide much info, since the dataset used there is quite small.
I would much appreciate your help.
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for datasets with more than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch unless you want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same way for 1,000-dimensional vectors, but since you are constructing the example yourself and it only takes a second to switch between k-means and mini-batch k-means, you should just do a scaling study with your 1,000-dimensional vectors at 5k, 10k, 15k, and 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means because of vector dimensionality, and we know it does better for larger sample sizes, so off the cuff I would go with mini-batch, i.e. bias for action over research.
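A rough timing sketch for such a scaling study (the sample count, feature count, and cluster count are placeholders, not the exact setup above):
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, n_features=1000, centers=50, cluster_std=0.7)

for Model in (KMeans, MiniBatchKMeans):
    start = time.time()
    Model(n_clusters=50, n_init=3).fit(X)
    print(Model.__name__, f"{time.time() - start:.1f}s")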
