Comparing parallel k-means batch vs mini-batch speed - scikit-learn

I am trying to cluster 250k vectors of 1000 dimensions using k-means. The machine I am working on has 80 dual-core processors.
Just confirming: has anyone compared the run time of the default (batch) parallel k-means against the mini-batch version? The example comparison page in the sklearn documentation doesn't provide much information, as the dataset used there is quite small.
I'd much appreciate your help.
Regards,

Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for datasets of more than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5,000, 10,000, or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same way for 1000-dimensional vectors, but since you are constructing the example yourself and it only takes a second to switch between k-means and mini-batch k-means, you should just run a scaling study on your 1000-dimensional vectors at 5k, 10k, 15k, and 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means because of vector dimensionality, and we know it does better for larger sample sizes, so off the cuff I would go with mini-batch (bias for action over research).
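If you want a quick scaling study along those lines, a sketch like the following should work. The 50 centers, cluster_std, and n_init values are arbitrary choices for illustration, not settings from the original example:

from time import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans

for n_samples in (5000, 10000, 15000, 20000):
    # synthetic 1000-dimensional blobs of increasing size
    X, _ = make_blobs(n_samples=n_samples, n_features=1000, centers=50,
                      cluster_std=0.7, random_state=0)
    for Model in (KMeans, MiniBatchKMeans):
        t0 = time()
        Model(n_clusters=50, n_init=3, random_state=0).fit(X)
        print(Model.__name__, n_samples, round(time() - t0, 1), "seconds")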

Related

sklearn HistGradientBoostingClassifier with large unbalanced data

I've been using Sklearn HistGradientBoostingClassifier to classify some data. My experiment is multi-class classification with single label predictions (20 labels).
I have two cases. The first measures the accuracy of the algorithm without data augmentation (around 3,000 unbalanced samples). The second measures accuracy with data augmentation (around 12,000 unbalanced samples). I am using default parameters.
In the first case, the HistGradientBoostingClassifier shows an accuracy of around 86.0%. However, with data augmentation, results show weak accuracy, around 23%.
I am wondering whether this accuracy comes from the unbalanced dataset, but since there is no option for handling unbalanced datasets in the HistGradientBoostingClassifier algorithm within the sklearn library, I cannot verify that.
Has anyone had the same kind of problem with a large dataset and HistGradientBoostingClassifier?
Edit: I tried other algorithms on the same data split, and the results seem normal (accuracy around 5% higher with data augmentation). I am wondering why I only get this with HistGradientBoostingClassifier.
Accuracy is a poor metric when dealing with imbalanced data. Suppose I have a 90:10 split of class 0 and class 1. A DummyClassifier that only predicts class 0 will achieve 90% accuracy.
You'll have to look at precision, recall, F1, and the confusion matrix, not just accuracy alone.
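To illustrate that point, here is a minimal sketch on a made-up 90:10 split (the data is synthetic, not the questioner's):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

y = np.array([0] * 900 + [1] * 100)           # made-up 90:10 imbalance
X = np.zeros((1000, 1))                       # features are irrelevant here

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)
print(accuracy_score(y, pred))                # 0.9, despite learning nothing
print(classification_report(y, pred, zero_division=0))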
I have found something that could be the reason for the low accuracy when using the HistGradientBoostingClassifier algorithm with default parameters on the augmented dataset of roughly 12,000 samples.
I compared HistGradientBoostingClassifier and LightGBM on the same data split (sklearn's HistGradientBoostingClassifier is inspired by Microsoft's LightGBM). HistGradientBoostingClassifier shows a weak accuracy of 24.7%, while LightGBM shows a strong 87.5%.
From what I can read in sklearn's and Microsoft's docs, HistGradientBoostingClassifier "cannot handle properly" an unbalanced dataset while LightGBM can. The latter has this parameter: class_weight (dict, 'balanced' or None, optional (default=None)) (found on that page).
My hypothesis is that the dataset becomes more unbalanced with augmentation and, without any option for the HistGradientBoostingClassifier algorithm to handle unbalanced data, the algorithm is misled.
Also, as mentioned by Hanafi Haffidz in the comments, the algorithm could tend to overfit with default parameters.
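For what it's worth, a rough sketch of that comparison on synthetic imbalanced data (the dataset, class weights, and split below are illustrative assumptions, not the questioner's augmented data) could look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# 20-class problem with one dominant class, roughly mimicking the setup above
X, y = make_classification(n_samples=12000, n_classes=20, n_informative=15,
                           weights=[0.3] + [0.7 / 19] * 19, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

hgb = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
lgbm = LGBMClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

print("HistGradientBoosting:", hgb.score(X_te, y_te))
print("LightGBM (balanced): ", lgbm.score(X_te, y_te))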

Which algorithm would be suitable for clustering a billion datapoints?

I am running a K-means algorithm (using the sklearn implementation) on an aggregated dataset of ~350k datapoints in a 6-dimensional space (using 6 features).
I would like to do the same on the "non-aggregated" version of my dataset, which is ~1b datapoints with the same 6 features.
I know this is a very heavy task for K-means; the number of datapoints is just too big, even though the dimensionality is quite small.
Are there any suggestions for other algorithms that would help with this task, apart from mini-batch K-means?

Fitting a random forest model on a large dataset - a few million rows and a few thousand columns

I am trying to build a random forest on a fairly large dataset - half a million rows and 20K columns (dense matrix).
I have tried modifying hyperparameters such as:
n_jobs = -1 and iterating over max_depth. However, the run either stops because of a memory issue (I have a 320GB server) or the accuracy is very low (when I use a lower max_depth).
Is there a way I can still use all the features and build the model without running into memory issues or losing accuracy?
In my opinion (I don't know your exact case and dataset), you should focus on extracting information from your dataset, especially since you have 20k columns. I assume some of them will not provide much variance or will be redundant, so you can make your dataset somewhat smaller and more robust to potential overfitting.
Also, you should try some dimensionality reduction method, which will allow you to make your dataset smaller while retaining most of the variance.
sample code for pca
pca gist
PCA, for example (I did not mean to offend you if you already know these methods)
pca wiki
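As a rough PCA sketch (the array below is a small random stand-in, not your actual half-million by 20K matrix):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))              # stand-in for the real feature matrix

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())

You would then fit the random forest on X_reduced instead of the full matrix.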

How to choose the right neural network in a binary classification problem for a unbalanced data?

I am using a Keras sequential model for binary classification, but my data is unbalanced. I have 2 feature columns and 1 output column (1/0). I have 10,000 data points; among them only 20 have output 1, all others are 0. Then I extended the data size to 40,000; still only 20 have output 1, all others are 0. Since the data is unbalanced (0 dominates 1), which neural network would be better for correct prediction?
First of all, two features is a very small number. Neural networks are highly non-linear models with a very large number of degrees of freedom, so if you try to train a network with more than just a couple of neurons it will overfit even with balanced classes. You can find models better suited to such low dimensionality, like Support Vector Machines, in the scikit-learn library.
Now, about unbalanced data: the most common techniques are undersampling and oversampling. Undersampling basically means training your model several times on a fraction of the dataset that contains the non-dominant class and a random sample of the dominant class, so that the ratio is acceptable, whereas oversampling consists of generating artificial data to balance the classes. In most cases undersampling works better.
Also, when working with unbalanced data it's quite important to choose the right metric based on what matters more for the problem (is minimizing false positives more important than false negatives, etc.).
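A minimal undersampling sketch with scikit-learn (the random features and the 1:5 ratio are illustrative assumptions, not a recommendation for your exact data):

import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(40000, 2))
y = np.zeros(40000, dtype=int)
y[:20] = 1                                    # 20 positives, as in the question

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# keep every minority sample and draw a random majority subset at a 1:5 ratio
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min) * 5, random_state=0)

X_bal = np.vstack([X_min, X_maj_down])
y_bal = np.concatenate([y_min, y_maj_down])
print(np.bincount(y_bal))                     # 100 majority vs 20 minority samples

You would repeat this with different random majority subsets and train once per subset.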

scikit-learn SVM with a lot of samples / mini batch possible?

According to http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html I read:
"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."
I currently have 350,000 samples and 4,500 classes, and these numbers will grow further to 1-2 million samples and 10k+ classes.
My problem is that I am running out of memory. Everything works as it should when I use just 200,000 samples with fewer than 1,000 classes.
Is there a built-in way to use something like mini-batches with SVM? I saw that MiniBatchKMeans exists, but I don't think it's for SVM?
Any input welcome!
I mentioned this problem in my answer to this question.
You can split your large dataset into batches that can be safely consumed by an SVM algorithm, then find support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
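A rough sketch of that idea with scikit-learn (synthetic data and an arbitrary batch count, not a drop-in for your 350k-sample problem):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

sv_X, sv_y = [], []
for batch in np.array_split(np.arange(len(X)), 10):
    svm = SVC(kernel="rbf").fit(X[batch], y[batch])
    sv_X.append(svm.support_vectors_)          # keep only this batch's support vectors
    sv_y.append(y[batch][svm.support_])

# final model trained on the union of all support vectors found in the batches
final_svm = SVC(kernel="rbf").fit(np.vstack(sv_X), np.concatenate(sv_y))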
Also, if there is no need for kernels in your case, you can use sklearn's SGDClassifier, which implements stochastic gradient descent. It fits a linear SVM by default.
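And a minimal sketch of training it in mini-batches with partial_fit (in-memory chunks of synthetic data here; with data that does not fit in memory you would stream chunks from disk instead):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=50000, n_features=100, n_informative=30,
                           n_classes=10, random_state=0)

clf = SGDClassifier(loss="hinge")              # hinge loss = linear SVM
classes = np.unique(y)                         # partial_fit needs every label up front

for batch in np.array_split(np.arange(len(X)), 20):
    clf.partial_fit(X[batch], y[batch], classes=classes)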
