In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the documentation:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two; they seem equivalent to me.
In order to use predict() you must call fit() first, so calling fit() and then predict() is indeed equivalent to calling fit_predict(). However, calling fit() on its own is useful when you want to inspect the fitted model itself (for example its learned parameters), whereas fit_predict() only gives you the labels that result from running the model on the data.
fit_predict is typically used with unsupervised, transductive estimators.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
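A minimal sketch of this with KMeans (the toy data and parameters below are illustrative only): fit(X).predict(X) and fit_predict(X) give the same labels, while fit() alone still exposes the fitted attributes.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])

# fit(X).predict(X) and fit_predict(X) return the same label array
labels_a = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X).predict(X)
labels_b = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels_a, labels_b)

# fit() alone gives you the fitted model, e.g. its centroids
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(model.cluster_centers_)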
This might be a very late answer, but it may benefit someone in the future.
The reason, as I understand it, for having predict() in KMeans but only fit_predict() in DBSCAN is:
In KMeans you get centroids based on the number of clusters you choose, so once you have trained on your data points using fit(), you can use predict() to assign a new, single data point to a specific cluster.
In DBSCAN there are no centroids; clusters are formed based on the min_samples and eps (the maximum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the data points it was fitted on, which explains why there is no predict() method for assigning a single new data point. The difference between fit() and fit_predict() was already explained in another answer.
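A minimal sketch of this difference (the data and parameters are illustrative only):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.2, 8.1]])

# KMeans keeps centroids, so it can assign a brand-new point with predict()
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(km.predict(np.array([[8.1, 8.0]])))

# DBSCAN only labels the points it was fitted on; there is no predict() method
db = DBSCAN(eps=0.5, min_samples=2)
print(db.fit_predict(X))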
Another spatial clustering algorithm, hdbscan, gives us an option to predict on new points using approximate_predict(). It is worth exploring.
Again, this is my understanding based on the source code I explored; experts are welcome to point out any differences.
I am using the Spark MLlib ALS function to build a recommendation system. The function accepts as input an RDD comprising rows of the form (user_id, item_id, rating).
I would like to know what happens when the function sees two tuples with the same user_id and item_id. Does the function overwrite the values or average them?
I went through the official documentation but did not find any clue.
Many thanks
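For reference, a minimal sketch of the input format in question, using the pyspark.mllib ALS API (the rank and iteration values are illustrative only, and this does not answer how duplicate (user_id, item_id) pairs are handled):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext.getOrCreate()
ratings = sc.parallelize([
    Rating(1, 10, 4.0),
    Rating(1, 10, 2.0),   # same (user_id, item_id) pair appearing twice
    Rating(2, 10, 5.0),
])
model = ALS.train(ratings, rank=5, iterations=10)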
To measure the "goodness" of the clustering that k-means has found, I need to calculate the ratio of the between sum of squares (BSS) to the total sum of squares (TSS), which should approach 1 if the clustering has the properties of internal cohesion and external separation. I was wondering whether Spark has internal functions to compute BSS/TSS for me, similar to R's kmeans clustering package, in order to leverage the parallelism of the Spark cluster.
Or is there a cost-effective way of computing the BSS/TSS ratio by other means?
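As far as I know MLlib does not expose BSS/TSS directly, but the ratio can be computed with plain RDD operations from the decomposition TSS = WSS + BSS. A hedged sketch (the data, k, and variable names are illustrative only):

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext.getOrCreate()
data = sc.parallelize([np.array(p, dtype=float) for p in
                       [[1, 1], [1.5, 2], [8, 8], [8.5, 9]]])

model = KMeans.train(data, k=2, maxIterations=20)
centers = model.clusterCenters  # local list of numpy arrays

# Total sum of squares: squared distance of every point to the global mean
n = data.count()
global_mean = data.reduce(lambda a, b: a + b) / n
tss = data.map(lambda p: float(np.sum((p - global_mean) ** 2))).sum()

# Within-cluster sum of squares: squared distance to the nearest (assigned) centroid
wss = data.map(lambda p: min(float(np.sum((p - c) ** 2)) for c in centers)).sum()

# Between-cluster sum of squares follows from TSS = WSS + BSS
bss = tss - wss
print("BSS/TSS:", bss / tss)   # approaches 1 for cohesive, well-separated clusters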
I am using Hierarchical Agglomerative Clustering in scikit-learn to cluster texts. How can I get the text labels for each cluster?
clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
Is there any parameter to get this, or do we have to write our own logic for that?
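One relevant detail is that after fit(), the labels_ attribute is aligned with the input order, so the texts in each cluster can be recovered by grouping on it yourself. A minimal sketch (the texts, vectorizer, and n_clusters are illustrative only):

from collections import defaultdict
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cats purr", "dogs bark", "cats meow", "dogs howl"]
X = TfidfVectorizer().fit_transform(texts).toarray()

clustering = AgglomerativeClustering(linkage="ward", n_clusters=2).fit(X)

# labels_ is parallel to the input texts, so group texts by their cluster label
clusters = defaultdict(list)
for text, label in zip(texts, clustering.labels_):
    clusters[label].append(text)
print(dict(clusters))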
Apache Spark supports sparse data.
For example, we can use MLUtils.loadLibSVMFile(...) to load data into an RDD.
I was wondering how Spark deals with those missing values.
Spark creates an RDD of LabeledPoints, and each LabeledPoint has a label and a vector of features. Note that this is a Spark Vector, which does support sparse data (currently, a sparse vector is represented by an array of the non-null indices and a second array of doubles holding each of the non-null values).
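A minimal sketch of what this looks like, assuming a LibSVM line such as "1.0 1:3.0 4:5.5" (1-based indices in the file, converted to 0-based on load):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Roughly what that line becomes: label 1.0, a size-4 sparse vector with two entries
lp = LabeledPoint(1.0, Vectors.sparse(4, [0, 3], [3.0, 5.5]))
print(lp.features)            # (4,[0,3],[3.0,5.5]) -- only indices and values are stored
print(lp.features.toArray())  # [3.0, 0.0, 0.0, 5.5] -- absent entries are treated as 0.0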