How to scale input to DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression clustering works better with this scaling. However, this scaling "Standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed in a square area - let's say 100x100 - I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. Doesn't this deteriorate the clustering, or am I misunderstanding something?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?

It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances, while a uniform (non-distorting) scaling is equivalent to just using a different epsilon value.
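For instance, a minimal sketch of the geographic case (the coordinates are made up; sklearn's haversine metric works on a unit sphere with inputs in radians, so an epsilon given in meters has to be converted):
import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([[52.5200, 13.4050],   # two nearby points ...
                       [52.5205, 13.4049],
                       [48.8566,  2.3522]])  # ... and one far away

earth_radius_m = 6371000.0
eps_m = 100.0  # neighbourhood radius of 100 meters

db = DBSCAN(eps=eps_m / earth_radius_m,      # eps expressed in radians
            min_samples=2,
            metric="haversine",
            algorithm="ball_tree").fit(np.radians(coords_deg))
print(db.labels_)  # the two nearby points share a cluster, the third is noise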
Note that in the first example, apparently a similarity and not a distance matrix is processed. S = 1 - (D / np.max(D)) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum distance observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, and not a distance matrix. IMHO that is an ugly hack, to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
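Recent sklearn versions make the distinction explicit through the metric parameter; a small sketch with toy two-blob data (not from the examples above) showing the two equivalent ways of calling it:
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
X = np.r_[rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))]

# Raw coordinates in: DBSCAN computes Euclidean distances internally.
labels_raw = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

# Precomputed pairwise distance matrix in, with eps in the same units.
D = distance.squareform(distance.pdist(X))
labels_pre = DBSCAN(eps=0.3, min_samples=10, metric="precomputed").fit_predict(D)

# Both calls describe the same clustering problem and should agree.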
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually you don't "fit" a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what the distance is (e.g. geographic data; doing a standardization on this obviously does not make sense, nor does Euclidean distance!).

Related

How to remove duplicates from a dataframe and create a new one with the weight for each sample?

I'm working on a classification problem where I know the labels. I'm comparing two different algorithms, K-Means and DBSCAN. However, the latter has the famous memory problem when computing the pairwise distances. But if my dataset contains a lot of duplicated samples, can I delete them, count their occurrences, and then use these counts as weights in the algorithm? All of this to save memory.
I do not know how to do it. This is my code:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

df = dimensionality_reduction(dataframe=df_balanced_train)
train = np.array(df.iloc[:, 1:])

### DBSCAN
# DBSCAN has no centroids
y_dbscan, centroidi = Cluster(data=train, algo="DBSCAN")
err, colori = error_Cluster(y_dbscan, df)

# These are the functions:
# DBSCAN algorithm
# nbrs = NearestNeighbors(n_neighbors=1500).fit(data)
# distances, indices = nbrs.kneighbors(data)
# print("The mean distance is about: " + str(np.mean(distances)))
# np.median(distances)
dbscan = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean", n_jobs=1)
y_result = dbscan.fit_predict(data)
centroidi = "In DBSCAN there are no centroids"
For a sample of 30k elements everything is fine, but for 800k I always run into memory problems. Could deleting duplicates and counting their occurrences solve my problem?
DBSCAN should take only O(n) memory - just as k-means.
But apparently the sklearn implementation does a version that first computes all neighbors, and thus uses O(n²) memory, which makes it less scalable. I'd consider this a bug in sklearn, but apparently they are well aware of this limitation, which seems to be mostly a problem when you choose bad parameters. To guarantee O(n) memory, it may be enough to just implement the standard DBSCAN yourself.
Merging duplicates is certainly an option, but A) it usually means you are using data (or a distance) that is inappropriate for these algorithms, and B) you'll need an implementation that supports weights, because you then need to use weight sums instead of point counts in DBSCAN.
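That said, sklearn's DBSCAN does expose a sample_weight argument in fit/fit_predict in reasonably recent versions, so here is a hedged sketch of the merging idea (the toy train array is only a stand-in for yours):
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in: rounding produces many exact duplicates, as assumed in the question.
rng = np.random.RandomState(0)
blob = np.round(rng.normal(0, 1, size=(50000, 2)), 1)
train = np.vstack([blob, blob + 10.0])

# Collapse exact duplicate rows and remember how often each one occurred.
unique_rows, inverse, counts = np.unique(train, axis=0,
                                         return_inverse=True,
                                         return_counts=True)

# A row that occurred k times now counts k towards min_samples.
db = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean")
labels_unique = db.fit_predict(unique_rows, sample_weight=counts)

# Map the labels of the unique rows back to every original sample.
labels_full = labels_unique[inverse]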
Last but not least: if you have labels and a classification problem, these seem to be the wrong choice. They are clustering, not classification. Their job is not to recreate the labels you have, but to find new labels from the data.

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from the same group D and get the top K most similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf, and, once I have a vector for each document (which is extremely sparse), to use a simple nearest-neighbours algorithm based on cosine similarity.
Then, at query time, to just use my static nearest-neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of size ~200,000, which means I now have a very sparse table (which can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, building the nearest-neighbours model took me more than one day and still hadn't finished.
I tried to lower the vector dimension by applying HashingTF, which uses the hashing trick, so I could set the dimension to a constant (in my case I used 2^13 for the hashing), but I still get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and sklearn's NearestNeighbors on the collected data.
Is there any more efficient way to achieve that goal?
Thanks in advance.
Edit:
I had an idea to try an LSH-based approximate similarity algorithm like those implemented in Spark as described here, but I could not find one that supports the 'cosine' similarity metric.
There are some requirements on the relation between the number of training instances and the dimensionality of your vectors, but you can try DIMSUM.
You can find the paper here.
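A hedged PySpark sketch of what the DIMSUM call looks like. Note that columnSimilarities compares columns, so the documents have to be laid out as columns (one row per tf-idf term); tfidf_by_term below is an assumed RDD of such rows:
from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(tfidf_by_term)  # rows = terms, columns = documents

# With a positive threshold, DIMSUM samples and drops similarities expected
# to fall below it, which is what makes the computation tractable.
sims = mat.columnSimilarities(threshold=0.1)

# 'sims' is an upper-triangular CoordinateMatrix of (doc_i, doc_j, cosine)
# entries; mirror them and keep the top 10 neighbours per document.
pairs = sims.entries.flatMap(
    lambda e: [(e.i, (e.j, e.value)), (e.j, (e.i, e.value))])
top_k = pairs.groupByKey().mapValues(
    lambda vs: sorted(vs, key=lambda t: -t[1])[:10])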

Feature scaling and its effect on various algorithms

Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to feature scaling while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or with respect to these 4 algorithms?
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
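A small sketch of that effect with made-up data, where the first feature has a huge scale and the second carries the actual cluster structure:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = np.vstack([
    np.c_[rng.normal(0, 1000, 100), rng.normal(0, 1, 100)],
    np.c_[rng.normal(0, 1000, 100), rng.normal(10, 1, 100)],
])

# Unscaled: the huge first feature dominates the L2 distance, so the
# clusters mostly ignore the informative second feature.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Scaled: both features contribute comparably to the distance, and the
# two groups are recovered.
X_std = StandardScaler().fit_transform(X)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)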
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
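A quick sketch of that invariance with invented data; with an intercept and no regularization, the coefficients simply absorb any rescaling, so the fitted values do not change:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 3) * [1.0, 100.0, 10000.0]      # wildly different feature scales
y = X @ np.array([3.0, 0.05, 0.0002]) + rng.normal(0, 0.1, 200)

pred_raw = LinearRegression().fit(X, y).predict(X)

X_std = StandardScaler().fit_transform(X)
pred_std = LinearRegression().fit(X_std, y).predict(X_std)

print(np.allclose(pred_raw, pred_std))  # True: predictions are unchanged by the scaling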
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on the dimension's scale, since there is no continuous distance metric, we again have invariance.
Somewhat different reasons for each, but I hope that helps.

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially I use the number of sides and the angles between the sides, and generate aggregate percentages relative to an ideal value for each shape. This helps me create some fuzziness in the result.
The problem is how do I represent the degree of error or confidence in the classification? For example: a small rectangle that looks very much like a square would yield high membership values for both categories, but how can I represent the degree of error?
Thanks
Your confidence depends on the model used. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals based on your empirical results, e.g. you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such a value under your Gaussian distribution, and you have your confidence (which would be quite well justified under the assumption that your "observation" errors have a normal distribution).
Distance based, simple approach
A less statistical approach would be to take your model's decision factor directly and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0,inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
The hyperbolic tangent will "squash" values into the [0,1] interval, and the only remaining thing to do is to select some scaling factor (as tanh grows quite quickly).
Such an approach is less well founded in mathematical terms, but is similar to the approach taken in neural networks.
Relative approach
A more probabilistic approach can also be defined using your distance metric. If you have distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape:
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
inv( d_i )
conf(object, class_i) = -------------------
sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
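A small sketch of this relative scheme (the distance values are made up):
import numpy as np

def relative_confidences(distances):
    """conf(class_i) = inv(d_i) / sum_j inv(d_j), with inv(d_i) = max_j d_j - d_i."""
    d = np.asarray(distances, dtype=float)
    inv = d.max() - d
    if inv.sum() == 0:                    # all distances equal: spread uniformly
        return np.full_like(d, 1.0 / len(d))
    return inv / inv.sum()

# Distances of a small, square-ish rectangle to the perfect square,
# rectangle and triangle (invented numbers):
print(relative_confidences([0.20, 0.25, 0.90]))   # ~[0.52, 0.48, 0.00]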
Conclusions
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in a confidence of around 0.5 for both rectangle and square, while with the first two it would be something closer to 0.01 (depending on how many such small objects you have in the "training" set). This shows the difference: the first two approaches express your confidence in the classification as a particular shape in itself, while the third one expresses relative confidence (so it can be low only if it is high for some other class, whereas the first two can simply answer "no classification is confident").
Building slightly on what lejlot has put forward, my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the object whose classification you want the confidence of. You can then use something along the lines of the following as your confidence measure:
1-tanh( M(V, p) )
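A hedged sketch of that idea with invented feature values; in practice the distribution V would come from your known-good examples of a class:
import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
# Stand-in feature vectors (e.g. [side-ratio aggregate, angle aggregate])
# for "perfect" squares, and a new object p to score.
perfect_squares = rng.normal(loc=[4.0, 90.0], scale=[0.1, 2.0], size=(50, 2))
p = np.array([4.05, 88.0])

mu = perfect_squares.mean(axis=0)
VI = np.linalg.inv(np.cov(perfect_squares, rowvar=False))  # inverse covariance

m = distance.mahalanobis(p, mu, VI)
confidence = 1.0 - np.tanh(m)   # squash the distance into (0, 1]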

k-nearest neighbour classifier but using a distribution?

I am building a classifier for some 2D data.
I have some training data for which I know the classes and have plotted these on a graph to see the clustering.
To the observer, there are obvious, separate clusters, but unfortunately they are spread out along lines rather than in tight clusters. One line-spread goes up at about an 80 degree angle, another at 45 degrees and another at about 10 degrees from horizontal, but all three seem to point back to the origin.
I want to perform a nearest-neighbour classification on some test data, and from the looks of things, if the test data is very similar to the training data a 3-nearest-neighbour classifier would work fine, except when the data is close to the origin of the graph, in which case the three clusters are quite close together and there might be a few errors.
Should I be coming up with some estimated gaussian distributions for my clusters? If so, I'm not sure how I can combine this with a nearest neighbour classifier?
I'd be grateful for any input.
Cheers
Transform all your points to [r, angle], and scale r down to the range 0 to 90 too, before running nearest-neighbor.
Why? NN uses the Euclidean distance between points and centres (in most implementations),
but you want distance( point, centre ) to be more like
sqrt( (point.r - centre.r)^2 + (point.angle - centre.angle)^2 )
than sqrt( (point.x - centre.x)^2 + (point.y - centre.y)^2 ).
Scaling r down to 30? 10? would weight angle more than r, which seems to be what you want.
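A sketch of that transform with toy data mimicking the three line-shaped clusters described in the question:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

def make_ray(angle_deg, n=50):
    # Points spread along a line through the origin at roughly the given angle.
    r = rng.uniform(1, 100, n)
    a = np.radians(angle_deg + rng.normal(0, 2, n))
    return np.c_[r * np.cos(a), r * np.sin(a)]

X = np.vstack([make_ray(80), make_ray(45), make_ray(10)])
y = np.repeat([0, 1, 2], 50)

# Transform to [scaled r, angle in degrees].
r = np.hypot(X[:, 0], X[:, 1])
angle = np.degrees(np.arctan2(X[:, 1], X[:, 0]))
scale = 90.0 / r.max()          # bring r into roughly the same 0..90 range as the angles
P = np.c_[r * scale, angle]     # shrink 'scale' further to weight angle even more

knn = KNeighborsClassifier(n_neighbors=3).fit(P, y)
# New points must be transformed with the same 'scale' before calling knn.predict.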
Why use k-NN for that purpose? Any linear classifier would do the trick. Try solving it with an SVM and you'll get much better results.
If you insist on using kNN, you clearly have to scale the features and transform them into polar coordinates, as mentioned here.
