K-Means Clustering Performance Benchmarking - scikit-learn

I have 157-dimensional data with 688 data points that I would like to cluster.
Since K-Means is the simplest algorithm, I decided to begin with this method.
Here is the scikit-learn call, wrapped in the benchmarking helper I use (in the style of the scikit-learn clustering demo), which prints the metrics below:
bench_k_means(KMeans(init='k-means++', n_clusters=4, n_init=10), name="k-means++", data=sales)
Here are some output metrics:
init time inertia homo compl v-meas ARI AMI num_clusters
k-means++ 0.06s 38967 0.262 0.816 0.397 0.297 0.250 4
k-means++ 0.05s 29825 0.321 0.847 0.466 0.338 0.306 6
k-means++ 0.07s 23131 0.411 0.836 0.551 0.430 0.393 8
k-means++ 0.09s 20566 0.636 0.817 0.715 0.788 0.621 10
k-means++ 0.09s 18695 0.534 0.794 0.638 0.568 0.513 12
k-means++ 0.11s 16805 0.773 0.852 0.810 0.916 0.760 14
k-means++ 0.11s 15297 0.822 0.775 0.798 0.811 0.761 16
Can someone please help me interpret them?
I know that a low inertia and a high homogeneity score are good, but I do not know what a good threshold for these is.
For example, 15297 is the lowest inertia I have obtained, but it occurs when the number of clusters is set to 16. Is this good or bad?
Available abbreviations:
homo = homogeneity score;
compl = completeness score;
v_meas = v-measure score;
ARI = adjusted Rand score;
AMI = adjusted mutual info.

The more centroids you have, the lower the inertia will be.
Having more centroids (num_clusters = number of centroids) gives the inputs more centers to be assigned to, which lowers the overall inertia in a multi-dimensional space. However, more centroids also make it harder for the algorithm to reach convergence within the defined max_iter of each n_init (by default, max_iter is 300). So, for each random initialisation of the centroids (each of the n_init starts), your machine runs at most 300 iterations of the K-Means update, trying to reach a state where no input would be reassigned to another cluster. If it converges earlier, it proceeds to the next n_init; if it does not converge within the allowed iterations (300 in your case), it likewise moves on to the next random placement of the centroids. After 10 initialisations, the run with the lowest inertia is kept. You can try increasing both max_iter and num_clusters and observe that it takes longer to find a solution.
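As a minimal sketch of that effect (the random matrix below is only a stand-in for the 688 x 157 data in the question), you can watch inertia fall monotonically as n_clusters grows, which is why inertia values are only comparable for a fixed k:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
sales = rng.rand(688, 157)  # placeholder for the real 688 x 157 data

for k in range(4, 18, 2):
    km = KMeans(init='k-means++', n_clusters=k, n_init=10, max_iter=300, random_state=0)
    km.fit(sales)
    # inertia_ is the best (lowest) inertia over the 10 initialisations;
    # n_iter_ is the number of iterations the best run needed to converge
    print(k, round(km.inertia_, 1), km.n_iter_)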
There are no universal thresholds for homo or inertia, because they depend on the dataset. The number of centroids should be chosen empirically, judging from the structure of the data and how many clusters the inputs should plausibly form.
compl is the completeness metric; it reaches its upper bound (1.0) when all inputs of a given class are assigned to the same cluster. Since its range is [0.0, 1.0], you can read it as a proportion. homo is the homogeneity metric, defined on the same range; it reaches 1.0 when each cluster contains inputs of a single class only. v_meas is simply the harmonic mean of those two metrics.
ARI is the adjusted Rand score and AMI is the adjusted mutual information; you can read more about both in the scikit-learn clustering-evaluation documentation.
More general information about the completeness score and the homogeneity measure can be found there as well.
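For reference, all of these scores are plain functions in sklearn.metrics. A small sketch (the labels_true argument is hypothetical; these metrics are only defined when ground-truth class labels are available):

from sklearn import metrics

def report(labels_true, labels_pred):
    # each score compares the predicted cluster assignment against the true classes
    print("homo  ", metrics.homogeneity_score(labels_true, labels_pred))
    print("compl ", metrics.completeness_score(labels_true, labels_pred))
    print("v-meas", metrics.v_measure_score(labels_true, labels_pred))
    print("ARI   ", metrics.adjusted_rand_score(labels_true, labels_pred))
    print("AMI   ", metrics.adjusted_mutual_info_score(labels_true, labels_pred))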
Also, consider reducing the dimensionality with PCA first, because running K-Means on high-dimensional data tends to give less satisfying results (distances become less informative as the number of dimensions grows).
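A possible sketch of that preprocessing step (the choice of 50 components and the scaling step are assumptions; in practice you would pick the number of components from the explained-variance ratio):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),                       # K-Means is distance-based, so scaling usually helps
    PCA(n_components=50, random_state=0),   # 157 -> 50 dimensions (arbitrary choice)
    KMeans(init='k-means++', n_clusters=10, n_init=10, random_state=0),
)
labels_pred = pipeline.fit_predict(sales)   # `sales` as in the question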

Related

On batch size, epochs, and learning rate of DistributedDataParallel

I have read these threads [1] [2] [3] [4], and this article.
I think I got how batch size and epochs works with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.
Now I am in a multi-node multi-gpu scenario, with 2 nodes and 4 GPUs (so world size is 8).
I understand that I need to pass batches of 8 / 8 = 1, because each update will aggregate the gradients from the 8 GPUs. In each worker, the data loader will still load 100 batches, but each of 1 sample, so the whole dataset is parsed exactly once per epoch.
I checked, and everything seems to work that way.
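For reference, a small sketch of that setup (the tensor shapes are placeholders; num_replicas and rank are passed explicitly here only so the sketch runs without init_process_group):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy stand-in for the 100 * 8 images in the question
dataset = TensorDataset(torch.randn(800, 3, 32, 32), torch.randint(0, 10, (800,)))

world_size = 8                    # 2 nodes x 4 GPUs
per_gpu_batch = 8 // world_size   # = 1, as reasoned above

for rank in range(world_size):
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
    assert len(loader) == 100     # each rank sees 100 batches of 1 sample per epoch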
But what about the learning rate?
According to the official doc
When a model is trained on M nodes with batch=N, the gradient will be M times smaller when compared to the same model trained on a single node with batch=M*N if the loss is summed (NOT averaged as usual) across instances in a batch (because the gradients between different nodes are averaged). [...] But in most cases, you can just treat a DistributedDataParallel wrapped model, a DataParallel wrapped model and an ordinary model on a single GPU as the same (e.g. using the same learning rate for equivalent batch size).
I understand that the gradients are averaged, so if the loss is averaged over samples nothing changes, while if it is summed we need to account for that. But does 'nodes' refer to the total number of GPUs across all cluster nodes (the world size) or just to the cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
Please refer to the following discussion:
https://github.com/PyTorchLightning/pytorch-lightning/discussions/3706
"As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approx. constant.
Since DDP averages the gradients from all the devices, I think the LR should be scaled in proportion to the effective batch size, namely, batch_size * num_accumulated_batches * num_gpus * num_nodes
In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_noeds=1 the effective batch size is 1024, thus the LR should be scaled by sqrt(2), compared to a single gpus with effective batch size 512."
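A sketch of how such scaling might look in code (the linear vs. square-root choice and all the variable names are assumptions rather than anything the docs prescribe; note that in the question's setup the effective batch size equals the original one, so the LR would be unchanged):

import torch.distributed as dist

base_lr = 0.1        # LR tuned for a single GPU with batch size 8 (assumed)
base_batch = 8
per_gpu_batch = 1

world_size = dist.get_world_size()            # 8 in the question's setup
effective_batch = per_gpu_batch * world_size  # 8, same as the non-distributed run
scaled_lr = base_lr * effective_batch / base_batch              # linear scaling rule
# scaled_lr = base_lr * (effective_batch / base_batch) ** 0.5   # sqrt scaling, as in the quote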

PyTorch: purpose of addmm function

What is the purpose of the following PyTorch function (doc):
torch.addmm(beta=1, mat, alpha=1, mat1, mat2, out=None)
More specifically, is there any reason to prefer this function instead of just using
beta * mat + alpha * (mat1 @ mat2)
The addmm function is an optimized version of the expression beta * mat + alpha * (mat1 @ mat2). I ran some tests and timed their execution.
If beta=1 and alpha=1, then both statements (addmm and the manual expression) take roughly the same time (addmm is just a little faster), regardless of the matrix size.
If beta and alpha are not 1, then addmm is about twice as fast as the manual expression for smaller matrices (on the order of 10^5 total elements). But if the matrices are large (on the order of 10^6 elements), the speedup seems negligible (39 ms vs. 41 ms).
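A rough timing sketch of that comparison (the matrix sizes, the beta/alpha values and the use of time.perf_counter are my own choices):

import time
import torch

mat  = torch.randn(1000, 1000)
mat1 = torch.randn(1000, 1000)
mat2 = torch.randn(1000, 1000)
beta, alpha = 0.5, 2.0

t0 = time.perf_counter()
out_fused = torch.addmm(mat, mat1, mat2, beta=beta, alpha=alpha)   # fused version
t1 = time.perf_counter()
out_manual = beta * mat + alpha * (mat1 @ mat2)                    # manual expression
t2 = time.perf_counter()

print("addmm :", t1 - t0)
print("manual:", t2 - t1)
print("same result:", torch.allclose(out_fused, out_manual, atol=1e-4))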

The 95% of non-normally distributed points around a mean/median

I asked users to tap a location repeatedly. To calculate the size of a target in that location such that 95% of users will hit it successfully, I usually measure 2 standard deviations of the tap offsets from the centroid. That works if the tap offsets are normally distributed, but this time my data is not normally distributed. How can I figure out the equivalent of 2 standard deviations around the mean/median?
If you're only measuring in one dimension, the region within +/- 2 standard deviations of a Normal distribution corresponds fairly well to the central 95% of the distribution. It may be worth working with quantiles instead: take the interval between the 2.5th and 97.5th percentiles. This is robust to skew or any other departure from normality.
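A small sketch of that, assuming the tap offsets are one-dimensional and stored in a NumPy array (the skewed gamma sample is only a placeholder for the real data):

import numpy as np

rng = np.random.default_rng(0)
offsets = rng.gamma(shape=2.0, scale=3.0, size=500)   # placeholder for skewed tap offsets

lo, hi = np.percentile(offsets, [2.5, 97.5])          # distribution-free central 95% interval
print("central 95%% interval: [%.2f, %.2f]" % (lo, hi))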

What's the difference between "samples_per_epoch" and "steps_per_epoch" in fit_generator

I was confused by this problem for several days...
My question is why the training time differs so massively between setting the batch_size to 1 and to 20 for my generator.
If I set the batch_size to be 1, the training time of 1 epoch is approximately 180 ~ 200 sec.
If I set the batch_size to be 20, the training time of 1 epoch is approximately 3000 ~ 3200 sec.
However, this huge difference between the training times seems abnormal, since I would expect the reverse:
batch_size = 1, training time -> 3000 ~ 3200 sec.
batch_size = 20, training time -> 180 ~ 200 sec.
The input to my generator is not a file path but NumPy arrays that are already loaded into memory via np.load(), so I don't think an I/O trade-off is the issue.
I'm using Keras-2.0.3 and my backend is tensorflow-gpu 1.0.1
I have seen the update in this merged PR, but it seems that this change doesn't affect anything (the usage is the same as the original one).
The linked gist contains my self-defined generator and the relevant part of my fit_generator call.
When you use fit_generator, the number of samples processed in each epoch is batch_size * steps_per_epoch. From the Keras documentation for fit_generator: https://keras.io/models/sequential/
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
This is different from the behaviour of fit, where increasing batch_size typically speeds things up.
In conclusion, when you increase batch_size with fit_generator, you should decrease steps_per_epoch by the same factor if you want the training time to stay the same or lower.
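A tiny sketch of that adjustment (the dataset size of 2000 is an assumption); the computed steps_per_epoch is what you would pass to fit_generator:

num_samples = 2000                            # assumed dataset size

for batch_size in (1, 20):
    steps_per_epoch = num_samples // batch_size
    print(batch_size, steps_per_epoch, batch_size * steps_per_epoch)
# batch_size=1  -> 2000 steps per epoch; batch_size=20 -> 100 steps per epoch;
# either way, the same 2000 samples are processed per epoch.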
Let's make it clear:
Assume you have a dataset with 8000 samples (rows of data) and you choose batch_size = 32 and epochs = 25.
This means that the dataset will be divided into 8000 / 32 = 250 batches, with 32 samples/rows in each batch. The model weights are updated after each batch.
One epoch therefore trains on 250 batches, i.e. 250 updates to the model.
Here, steps_per_epoch = number of batches = 250.
With 25 epochs, the model will pass through the whole dataset 25 times.
Ref - https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
You should also take into account the following function parameters when working with fit_generator:
max_queue_size, use_multiprocessing and workers
max_queue_size - might cause more data to be prefetched than you actually expect; depending on your generator code, this may do something unexpected or unnecessary and slow down your execution times.
use_multiprocessing together with workers - might spin up additional processes, which adds the cost of serialization and interprocess communication. Your data is first serialized with pickle, then sent to the worker processes, the processing happens inside those processes, and then the whole communication procedure is repeated in reverse: the results are pickled and sent back to the main process. In most cases this is fast, but if you're processing dozens of gigabytes of data, or your generator is implemented in a sub-optimal fashion, you might get the slowdown you describe.
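A self-contained sketch of how those knobs appear in the call (the toy model and generator below are placeholders, not the question's code; parameter names follow later Keras 2.x releases):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def train_gen(batch_size=20):
    # toy infinite generator standing in for the question's np.load()-based one
    while True:
        x = np.random.rand(batch_size, 10)
        y = np.random.randint(0, 2, size=(batch_size, 1))
        yield x, y

model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit_generator(
    train_gen(batch_size=20),
    steps_per_epoch=100,          # 100 batches of 20 = 2000 samples per epoch
    epochs=1,
    max_queue_size=10,            # how many batches are prefetched into the queue
    workers=1,                    # processes/threads feeding the queue
    use_multiprocessing=False,    # True adds pickling / IPC overhead
)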
The bottom line is:
fit() works faster than fit_generator() because it can access the data directly in memory.
fit() takes NumPy arrays that are already in memory, while fit_generator() pulls data from a sequence generator such as keras.utils.Sequence, which is slower.

Random Forest: Running out of memory

I'm using scikit-learn's Random Forest to fit training data (~30 MB), and my laptop keeps crashing because it runs out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air, 2 GHz, 8 GB memory.
What are some ways to deal with this?
rf = RandomForestClassifier(n_estimators = 100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
Your best choice is to tune the arguments.
n_jobs=4
This makes the computer run four train-test cycles simultaneously. Each Python job runs in a separate process, so the full dataset is copied for each of them. Try reducing n_jobs to 2 or 1 to save memory: n_jobs==4 uses four times the memory that n_jobs==1 uses.
cv=20
This splits the data into 20 pieces and the code does 20 train-test iterations. This means that the training data is the size of 19 pieces of the original data. You can quite safely reduce it to 10, however your accuracy estimate might get worse. It won't save much memory, but makes runtime faster.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (a roughly 2-fold increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (a roughly 2-fold saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (another factor of two in processing speed).
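Putting those suggestions together, a sketch (it reuses the X_train_a / y_train names from the question's snippet and switches to the current cross_val_score import, so treat it as illustrative rather than drop-in):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score   # replaces the old cross_validation module

rf = RandomForestClassifier(n_estimators=50, n_jobs=1)
scores = cross_val_score(rf, X_train_a, y_train, cv=10, scoring='roc_auc')
print("10 Fold CV Score:", np.mean(scores))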
I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters was ~50 MB (so more than 10 times the size of the data). By setting max_depth = 6, the memory consumption decreased 66-fold, and the performance of the shallow Random Forest on my dataset actually improved!
I wrote this experiment up in a blog post.
From my experience, in the case of regression tasks the memory usage can grow even more, so it is important to control the tree depth. The depth can be controlled directly with max_depth or by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features and max_leaf_nodes.
The memory footprint of the Random Forest can, of course, also be controlled through the number of trees in the ensemble.
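A small sketch of the depth-limited forest (the synthetic dataset and the pickle-based size check are my own additions; max_depth=6 follows the experiment above):

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for depth in (None, 6):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    rf.fit(X, y)
    size_mb = len(pickle.dumps(rf)) / 1e6        # rough in-memory footprint via pickled size
    print("max_depth=%s: ~%.1f MB" % (depth, size_mb))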
