I'm new to scikit.
I can't find an example of using a precomputed distance matrix with scikit-learn's KMeans.
Could anybody shed some light on this, ideally with an example?
Scikit-learn does not allow you to pass in a custom (precomputed) distance matrix. KMeans can precompute a Euclidean distance matrix internally to speed things up, but there is no way to use your own without hacking the source.
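To make that concrete, here is a minimal sketch of the only supported call pattern (the toy data is made up for illustration): fit takes a plain feature matrix, and the constructor exposes no metric or precomputed-distance option.

```python
import numpy as np
from sklearn.cluster import KMeans

# KMeans only accepts raw feature vectors of shape (n_samples, n_features);
# there is no parameter for supplying a precomputed distance matrix.
X = np.random.rand(100, 5)  # made-up feature matrix
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```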
I have a set of multivariate (2D) Gaussian distributions (represented by mean and variance) and would like to perform clustering on these distributions in a way that maintains the probabilistic Gaussian information (perhaps using the overlap of variances?).
I have done some research into clustering methods and found that DBSCAN clustering is more appropriate than k-means, as I don't know how many clusters I expect to find. However, DBSCAN uses a Euclidean distance epsilon value to find clusters instead of using the variances of each distribution. I have also looked into Gaussian mixture model methods, but they fit a set of points to a set of K Gaussian clusters, rather than fitting clusters to a set of Gaussian distributions.
Does anyone know of any additional clustering methods that might be appropriate to my needs?
Thanks!
DBSCAN can be used with arbitrary distances. It is not limited to Euclidean distance. You could employ a divergence measure, e.g. how much your Gaussians overlap.
However, I would suggest hierarchical clustering or Gaussian Mixture Modeling (EM).
DBSCAN is designed to allow banana-shaped clusters, which are not well approximated by Gaussians. Your objective appears to be to merge similar Gaussians. That is better achieved by hierarchical clustering.
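As a rough sketch of both suggestions (not part of the original answer): build a pairwise divergence matrix between your Gaussians, e.g. the Bhattacharyya distance, which uses both the means and the covariances, and hand it to a clusterer that accepts precomputed distances. The toy means and covariances below are made up, and in older scikit-learn versions AgglomerativeClustering takes affinity='precomputed' instead of metric='precomputed'.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians (uses means and covariances)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov)
                         / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# made-up 2D Gaussians: means of shape (n, 2), covariances of shape (n, 2, 2)
rng = np.random.default_rng(0)
means = rng.normal(scale=5.0, size=(20, 2))
covs = np.stack([np.eye(2) * rng.uniform(0.5, 2.0) for _ in range(20)])

n = len(means)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = bhattacharyya(means[i], covs[i], means[j], covs[j])

# hierarchical clustering on the precomputed divergence matrix ...
agglo_labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(D)

# ... or DBSCAN, if you prefer not to fix the number of clusters (eps is problem-specific)
dbscan_labels = DBSCAN(eps=1.0, min_samples=3, metric="precomputed").fit_predict(D)
```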
Now I'm using k-means for clustering, following this tutorial and API.
But I want to use a custom formula to calculate the distances. So how can I pass a custom distance function to k-means in PySpark?
In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distance.
See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation.
Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute that Scala code. Therefore, providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.
Please note that since Spark 2.4 there are two built-in measures that can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the distanceMeasure Param):
euclidean for Euclidean distance.
cosine for cosine distance.
Use at your own risk.
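For completeness, a small sketch of selecting the built-in measure; the toy DataFrame and the "features" column are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# made-up two-dimensional points in a "features" vector column
data = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.1]),), (Vectors.dense([1.2, 0.05]),),
     (Vectors.dense([0.1, 1.0]),), (Vectors.dense([0.05, 1.2]),)],
    ["features"],
)

# Spark >= 2.4: distanceMeasure accepts "euclidean" (the default) or "cosine"
kmeans = KMeans(k=2, seed=1, distanceMeasure="cosine")
model = kmeans.fit(data)
print(model.clusterCenters())
```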
I am using the package scikit-learn to compute a logistic regression on a moderately large data set (300k rows, 2k cols; that's pretty large to me!).
Now, since scikit-learn does not produce confidence intervals, I am calculating them myself. To do so, I need to compute and invert the Hessian matrix of the logistic function evaluated at the minimum. Since scikit-learn already computes the Hessian while optimizing, it'd be efficient if I could retrieve it.
In sklearn.linear_model.LogisticRegression, is there any way to retrieve the Hessian evaluated at the minimum?
Note: This is an intermediate step, and I actually only need the diagonal entries of the inverse of the Hessian. If anyone has a more straightforward way to get there, I'd love to learn it.
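To make that intermediate step concrete: for binary logistic regression the Hessian of the (unpenalized) log-loss at the fitted coefficients is X^T diag(p(1-p)) X, which can be rebuilt from the model after fitting. A rough sketch with made-up data standing in for the real matrix; the L2 penalty term implied by C is omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data standing in for the real 300k x 2k matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# append an intercept column so the Hessian covers intercept_ as well
Xd = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.concatenate([clf.coef_.ravel(), clf.intercept_])

p = 1.0 / (1.0 + np.exp(-Xd @ w))   # fitted probabilities
W = p * (1.0 - p)                   # diagonal weights
H = Xd.T @ (Xd * W[:, None])        # Hessian of the log-loss: X^T diag(p(1-p)) X

# diagonal of the inverse Hessian, i.e. the quantity the note above actually needs
diag_inv_H = np.diag(np.linalg.inv(H))
```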
Generally, CBIR works with Euclidean distance for comparing the feature vectors of a query image and a database image.
However, on MathWorks I found source code in which the comparison is done with an SVM instead of Euclidean distance, i.e. content-based image retrieval using two techniques:
Using KNN for image retrieval;
Using SVM for image retrieval.
How does it work?
There is some literature in this area:
Content Based Image Retrieval Using SVM Algorithm
An Approach for Image Retrieval Using SVM
Image Retrieval with Structured Object Queries Using Latent Ranking SVM
As far as I know, the simple approach is to have a feature extraction phase (e.g. using PCA) and then do a one-class SVM classification.
KNN usually uses Euclidean distance anyway, so the SVM approach offers you a more consistent decision boundary, with a feature extraction phase on top of that. You can see an example here.
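As an illustration of that simple approach (not taken from the cited papers, and with made-up feature vectors): extract features with PCA, fit a one-class SVM on the images marked as relevant, and rank the database by the decision function.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
database = rng.random((500, 128))  # made-up descriptors, one row per database image
relevant_idx = [3, 17, 42, 99, 120, 130, 200, 250, 300, 410]  # hypothetical relevant examples

# feature extraction phase: PCA fitted on the whole database
pca = PCA(n_components=16).fit(database)
db_feats = pca.transform(database)

# one-class SVM fitted only on the relevant examples
svm = OneClassSVM(kernel="rbf", gamma="scale").fit(db_feats[relevant_idx])

# rank the database by decision-function score (higher = more similar to the relevant region)
scores = svm.decision_function(db_feats)
top_hits = np.argsort(scores)[::-1][:20]
```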
I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that generally KNN does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coords. A simple way to achieve this is by multiplying coord #2 by its weight. So you put into the tree not the original coords, but the coords multiplied by their respective weights.
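A rough sketch of that weighting trick (the weights and data here are made up): scale each column by its weight before fitting, so that a unit difference in feature two counts ten times as much in the distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))   # made-up data with 20 features
y = (X[:, 1] > 0).astype(int)

# hypothetical weights: feature two (index 1) counts 10x as much as the others
w = np.ones(X.shape[1])
w[1] = 10.0

# multiply the coords by their weights before the tree is built
knn = KNeighborsClassifier(n_neighbors=5).fit(X * w, y)

# remember to apply the same weights to query points as well
preds = knn.predict(X[:10] * w)
```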
In case your features are combinations of the coords, you might need to apply an appropriate matrix transform to your coords before applying weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question 2 is called "metric learning", and it is currently not implemented in scikit-learn. Using the popular Mahalanobis distance (in its diagonal form) amounts to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
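A minimal sketch of that rescaling, with made-up data: Euclidean kNN on standardized features is equivalent to a diagonal Mahalanobis distance on the raw features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) * rng.uniform(0.1, 10.0, size=20)  # features on very different scales
y = (X[:, 0] > 0).astype(int)

# standardize each feature, then run ordinary Euclidean kNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
preds = model.predict(X[:10])
```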