Which algorithm would be suitable for clustering a billion datapoints? - python-3.x

I am running a K-means algorithm (using the sklearn implementation) on an aggregated dataset of ~350k datapoints on a 6 dimension hyper-plane (using 6 features).
I would like to do the same but in the "non-aggregated" version of my dataset, which is ~1b datapoints using the same 6 features
I know this is a very heavy task for K-means, the number of datapoints is just too big, even though the dimensions' size is pretty small.
Are there any suggestions of other algorithms that would help me on this task, apart from mini batch K-means ?

Related

Perform clustering on high dimensional data

Recently I trained a BYOL model on a set of images to learn an embedding space where similar vectors are close by. The performance was fantastic when I performed approximate K-nearest neighbours search.
Now the next task, where I am facing a problem is to find a clustering algorithm that uncovers a set of clusters using the embedding vectors generated by the BYOL trained feature extractor [dimension of the vector is 1024 & there are 1 million vectors]. I have no information apriori about the number of classes i.e. clusters in my dataset & thus cannot use Kmeans. Is there any scalable clustering algorithm that can help me uncover such clusters. I tried to use FISHDBC but the repository does not have good documentation.

Fitting a random forest model on a large dataset - few million rows and few thousands columns

I am trying to build a random forest on a slightly large data set - half million rows and 20K columns (dense matrix).
I have tried modifying the hyperparameters such as:
n_jobs = -1 or iterating over max depth. However it's either getting stopped because of a memory issue (I have a 320GB server) or the accuracy is very low (when i use a lower max_depth)
Is there a way where I can still use all the features and build the model without any memory issue or not loosing on accuracy?
In my opinion (don't know exactly your case and dataset) you should focus on extract information from your dataset, especially if you have 20k of columns. I assume some of them will not give much variance or will be redundant, so you can make you dataset slightly smaller and more robust to potential overfit.
Also, you should try to use some dimensionality reduction methods which will allows you to make your dataset smaller retaining the most of the variance.
sample code for pca
pca gist
PCA for example (did not mean to offend you if you already know this methods)
pca wiki

scikit-learn SVM with a lot of samples / mini batch possible?

According to http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html I read:
"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."
I have currently 350,000 samples and 4,500 classes and this number will grow further to 1-2 million samples and 10k + classes.
My problem is that I am running out of memory. All is working as it should when I use just 200,000 samples with less than 1000 classes.
Is there a way to build-in or use something like minibatches with SVM? I saw there exists MiniBatchKMeans but I dont think its for SVM?
Any input welcome!
I mentioned this problem in my answer to this question.
You can split your large dataset into batches that can be safely consumed by an SVM algorithm, then find support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
Also if there is no need in using kernels in your case, then you can use sklearn's SGDClassifier, which implements stochastic gradient descent. It fits linear SVM by default.

Comparing parallel k-means batch vs mini-batch speed

I am trying to cluster 1000 dimension, 250k vectors using k-means. The machine that I am working on has 80 dual-cores.
Just confirming, if anyone has compared the run-time of k-means default batch parallel version against k-means mini-batch version? The example comparison page on sklean documents doesn't provide much info as the dataset is quite small.
Much appreciate your help.
Regards,
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for greater than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same for 1000 dimensional vectors, but since you are constructing the example and are using either k-means or mini batch k-means and it only takes a second to switch between them... You should just do a scaling study for your 1000 dimensional vectors for 5k, 10k, 15k, 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means due to vector dimensionality and we know that it does better for larger sample sizes, so I would go with mini batch off the cuff e.g. bias for action over research.

Clustering data after dimension reduction with PCA

Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Using PCA on the Iris dataset(with the data in the csv ordered such that all of the first class are listed, then the second, then the third) yields the following plot:-
It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:-
Above, it is not clear how many clusters/classes are contained in the data set. In this case(the more real world case), how would one identify the number of classes, would a clustering algorithm such as K-Means be effective?
Would there be innacuracies due to the discarding of lower order Principal Components?
EDIT:- To be clear, I am asking if a dataset can be clustered after running PCA, and if so, what the most accurate method would be.
Say we have a dataset of a large dimension, which we have reduced to a lower
dimension using PCA, would it be wise/accurate to then use a clustering
algorithm on said data? Assuming that we do not know how many clusters to
expect.
Your data might well separate in a low-variance dimension. I would not recommend running PCA prior to clustering.
Above, it is not clear how many clusters/classes are contained in the data
set. In this case(the more real world case), how would one identify the
number of classes, would a clustering algorithm such as K-Means be effective?
There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
Try sorting the dataset after PCA, then plotting it.
The iris data set is much to simple to draw any valid conclusions about the behaviour of high-dimensional data, and the benefits of PCA.
Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.

Resources