I want to use use scikit kmean in production deployment and would want to use the default setting for kmean.init = k-means++. The question I have is that what are the chances that kmeans will fall into a local optima when it initializes cluster centroids?.
Notes says that "‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details".
Is there a data on the probability of getting a local optima ?.
If so, should I iterate to get the minimal cost function ?.
Probability of getting trapped into local optima, majorly depends on the nature of your data. If its explicitly grouped, then initial cluster might not have much of an impact on the final clusters results. May be of use for you.
Inspite of above point, for high dimensional dataset, it is preferable to try 10 or more iterations with different initial clusters and choose the one with best performance (one of performance metric could be silhouette-coefficient)
Related
I want to cluster big data set (more than 1M records).
I want to use dbscan or hdbscan algorithms for this clustering task.
When I try to use one of those algorithms, I'm getting memory error.
Is there a way to fit big data set in parts ? (go with for loop and refit every 1000 records) ?
If no, is there a better way to cluster big data set, without upgrading the machine memory ?
If the number of features in your dataset is not too much (below 20-25), you can consider using BIRCH. It's an iterative method that can be used for large datasets. In each iteration it builds a tree with only a small sample of data and put each instance into clusters.
I am building a K means algorithm and have multiple variables to feed into it. As of this I am using PCA to transform the data to two dimensions. When I display the PCA biplot I don't understand what similarities the data has to be grouped into a specific cluster. I am using a customer segmentation dataset. I.E: I want to be able to know that a specific cluster is a cluster as a customer has a low income but spends a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place, if it had your analysis... k-means is as good as PCA at coping with several dozen variables.
I have built a Machine Learning model based on clustering, & now just want to update it with new data periodically (on daily basis). I am using PySpark Mlib, & not able to find any method in Spark for this need.
Note, required method 'partial_fit' available in scikit-learn, but not in Spark.
I am not in favor of appending new data & then re-built the model on daily basis, as it will increase data size & will be computationally expensive.
Please suggest me an effective way for model update or on-line learning using Spark Mllib?
You cannot update arbitrary models.
On a few select models this works. On some it works if you accept some loss in a accuracy. But on other models, the only way is to rebuild it completely.
For example support vector machines. The model only stores the support vectors. When updating, you would also need all the non-support vectors in order to find the optimal model.
That is why it is fairly common to build new models every night, for example.
Streaming is quite overrated. In particular k-means. Total nonsense to do online k-meand with "big" (lol) data. Because the new point have next to zero effect, you may just as well do a batch every night. These are just academic toys with no relevancy.
Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, spark + EMR is a great way to go.
However, I would also like to use tensorflow instead of spark's ml lib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of pyspark, elephas and EMR to achieve this. Alternatively there is BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only"
Any help to point me in the right direction would be greatly appreciated.
In TensorFlow, you can use tf.data.Dataset.from_generator so you can generate your dataset at runtime without any storage hassles.
See link for example https://www.codespeedy.com/what-is-tf-data-dataset-from_generator-in-tensorflow/
As you mention "fitting massive dataset to memory", I understand that you are trying to load all data to memory at once and start training. Hence, I give the reply based on this assumption.
General mentality is that if you cannot fit the data to your resources, divide data into smaller chunks and train in an iterative way.
1- Load data one by one instead of trying to load all at once. If you create an execution workflow as "Load Data -> Train -> Release Data (This can be done automatically by garbage collectors) -> Restart" , you can understand how much resource is needed to train single data.
2- Use mini-batches. As soon as you get the resource information from #1, you can make an easy calculation to estimate the mini-batch size. For example, if training single data consumes 1.5 GB of RAM, and your GPU has 8 GB of RAM, theoretically you may train mini-batches with size 5 at once.
3- If the resources are not enough to train even 1-sized single batch, in this case, you may think about increasing your PC capacity or decreasing your model capacity / layers / features. Alternatively, you can go for cloud computing solutions.
I'm new to Spark (and to cluster computing framework) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters right? And that they have to combine (ex: summing) the intermediate calculations (ex: the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (ex: 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or other cluster computing framework) useful only for big data applications for which we could not afford training more than one model and for which training time would be too much long on a single machine?
You correct about the general principle. Typical MLlib algorithm is a an iterative procedure with local phase and data exchange.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
def train_model(iter):
items = np.array(list(iter))
model = ...
return model
rdd.mapPartitions(train_model)
There are projects which already do that (https://github.com/databricks/spark-sklearn)