k-nearest neighbour classifier but using a distribution? - statistics

I am building a classifier for some 2D data.
I have some training data for which I know the classes and have plotted these on a graph to see the clustering.
To the observer, there are obvious, separate clusters, but unfortunately they are spread out along lines rather than in tight clusters. One line-spread goes up at about an 80 degree angle, another at 45 degrees and another at about 10 degrees from horizontal, but all three seem to point back to the origin.
I want to perform a nearest-neighbour classification on some test data, and from the looks of things, if the test data is very similar to the training data a 3-nearest-neighbour classifier would work fine, except when the data is close to the origin of the graph, in which case the three clusters are quite close together and there might be a few errors.
Should I be coming up with some estimated Gaussian distributions for my clusters? If so, I'm not sure how I can combine this with a nearest neighbour classifier.
I'd be grateful for any input.

Transform all your points to [r, angle], and scale r down to the range 0 to 90 too, before running nearest-neighbor.
Why? NN uses Euclidean distance between points and centres (in most implementations),
but you want distance( point, centre ) to be more like
sqrt( (point.r - centre.r)^2 + (point.angle - centre.angle)^2 )
than sqrt( (point.x - centre.x)^2 + (point.y - centre.y)^2 ).
Scaling r down further, to a range of 0 to 30 or 0 to 10, would weight angle more than r, which seems to be what you want.
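To make the suggestion above concrete, here is a minimal NumPy sketch. The synthetic training set (three noise-free rays at 80, 45 and 10 degrees), the helper names to_scaled_polar, knn_predict and point, and the choice r_range=30 are all assumptions made for illustration; none of them come from the question. Angles are taken from arctan2, so data that straddles the +/-180 degree wrap-around would need extra care.

```python
import numpy as np

def to_scaled_polar(xy, r_max, r_range=30.0):
    """Map rows of (x, y) to (r rescaled to [0, r_range], angle in degrees)."""
    xy = np.asarray(xy, dtype=float)
    r = np.hypot(xy[:, 0], xy[:, 1])
    angle = np.degrees(np.arctan2(xy[:, 1], xy[:, 0]))
    return np.column_stack([r / r_max * r_range, angle])

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Plain Euclidean k-NN with a majority vote over integer labels."""
    diffs = query_feats[:, None, :] - train_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(train_labels[row]).argmax() for row in nearest])

def point(r, deg):
    """Hypothetical helper: Cartesian coordinates of a point given r and angle."""
    a = np.radians(deg)
    return [r * np.cos(a), r * np.sin(a)]

# Synthetic stand-in for the three line-shaped clusters (assumed, noise-free).
radii = np.arange(1.0, 11.0)
train_xy = np.array([point(r, deg) for deg in (80.0, 45.0, 10.0) for r in radii])
train_labels = np.repeat(np.arange(3), radii.size)

# Use the SAME r_max for training and query data so both share one scale.
r_max = np.hypot(train_xy[:, 0], train_xy[:, 1]).max()

query_xy = np.array([point(2.5, 78.0), point(9.3, 47.0), point(0.5, 12.0)])
pred = knn_predict(to_scaled_polar(train_xy, r_max),
                   train_labels,
                   to_scaled_polar(query_xy, r_max))
print(pred)
```

The third query point sits very close to the origin, where the clusters nearly touch in (x, y); in (r, angle) space it is still separated from the other classes by its angle.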

Why use k-NN for that purpose? Any linear classifier would do the trick. Try solving it with an SVM and you'll get much better results.
If you insist on using kNN, you clearly have to scale the features and transform them into polar ones, as mentioned here.


Improving accuracy of nearest neighbours algorithm - unsupervised learning problem

I have a situation where I am trying to find the 3 nearest neighbours for a given ID in my dataframe. I am using the NN algorithm (not KNN) to achieve this. The code below gives me the three nearest neighbours. For the top node the results are fine, but for the middle and bottom ones only 1 of the 3 neighbours is correct, whereas I am aiming to have at least 2 of the 3 neighbours correct for every ID. My dataset has 47 features and 5000 points.
import numpy as np
from sklearn.neighbors import KDTree

def findsuccess(sso_id):
    # nbrs (a fitted neighbours model), X (the feature matrix) and i
    # are defined elsewhere in my code
    neighbors_f_sso_id = np.where(nbrs.kneighbors_graph([X[i]]))[0]
    print('Neighbors of id', neighbors_f_sso_id)

kdt = KDTree(X, leaf_size=40, metric='euclidean')
kdt.query(X, k=4, return_distance=False)
The above code returns the ID itself and the 3 nearest neighbours, hence k=4.
I have read that, due to the curse of dimensionality, this NN algorithm might not work well since there are about 47 features in my dataset, but I think this is the only option I have for a data frame without a target variable. There is one article available here that says the KD tree is not the best of the algorithms that can be used.
What would be the best way to achieve the maximum accuracy, meaning the minimum distance?
Do I need to perform scaling before passing the data into the KD tree algorithm? Is there anything else that I need to take care of?

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
Without more info about your geometry setup it is hard to answer. Anyway, for rotations I would:
1. create 3 unit vectors (one per axis), apply the rotation to them and keep the outputs; this is just applying matrix(i) with its position set to (0,0,0)
2. do this for all the measurements you have
3. average all the vectors
4. correct the vector values: make each of the X, Y, Z vectors a unit vector again and take the axis that is closest to the rotation axis as the main axis. It stays as is; recompute the remaining two axes as cross products of the main axis and the other vector to ensure orthogonality. Beware of the multiplication order (the wrong order of operands will negate the output).
5. construct the averaged transform matrix: see transform matrix anatomy; as the origin you can use the averaged origin of the measurement matrices
Moakher wrote a paper that explains there are basically two ways to take an average of rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass. That one is a closer notion to the geometric mean, and it's more complicated to compute.
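The first option (weighted average, then projection onto SO(3) with the SVD) is short enough to sketch in NumPy. The function names mean_rotation, angular_deviations_deg and rot_z below are made up for this illustration, and the test data (three rotations about z) is synthetic. As a measure of spread, the sketch uses the geodesic angle between each rotation and the mean; that choice is one reasonable option, not something prescribed by the answers above.

```python
import numpy as np

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (used only for the demo)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def mean_rotation(rotations, weights=None):
    """Weighted arithmetic mean of 3x3 rotations, projected back onto SO(3)."""
    R = np.asarray(rotations, dtype=float)
    w = np.ones(len(R)) if weights is None else np.asarray(weights, dtype=float)
    M = np.tensordot(w / w.sum(), R, axes=1)       # element-wise weighted mean
    U, _, Vt = np.linalg.svd(M)
    # Force det = +1 so the result is a proper rotation, not a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def angular_deviations_deg(rotations, R_mean):
    """Geodesic angle in degrees between each rotation and R_mean."""
    R = np.asarray(rotations, dtype=float)
    rel = np.einsum('ji,njk->nik', R_mean, R)      # R_mean^T @ R_n for every n
    cos = (np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# With 4x4 pose matrices, the rotation part would be pose[:3, :3].
rotations = np.array([rot_z(-2.0), rot_z(0.0), rot_z(2.0)])
R_mean = mean_rotation(rotations)
deviations = angular_deviations_deg(rotations, R_mean)
rms_spread_deg = np.sqrt(np.mean(deviations ** 2))
print(np.round(R_mean, 6), deviations, rms_spread_deg)
```

The weights argument covers the case where some captures are trusted more than others; per-axis weighting is not handled by this sketch.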

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially, I use the number of sides and the angles between the sides, and generate aggregate percentages relative to an ideal value for each shape. This helps me create some fuzziness in the result.
The problem is: how do I represent the degree of error or confidence in the classification? For example, a small rectangle that looks very much like a square would yield high membership values for the two categories, but can I represent the degree of error?
Your confidence depends on the model used. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence interval based on your empirical results. E.g. you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such a value under your Gaussian distribution and take that as your confidence (which would be quite well justified under the assumption that your "observation" errors are normally distributed).
Distance based, simple approach
A less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0,inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
For non-negative distances, the hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do is to select some scaling factor (as it grows quite quickly).
Such an approach would be less valid in mathematical terms, but would be similar to the approach taken in neural networks.
Relative approach
A more probabilistic approach could also be defined using your distance metric. If you have the distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
conf(object, class_i) = inv( d_i ) / sum_j inv( d_j )
where
inv( d_i ) = max_j( d_j ) - d_i
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in a confidence of around 0.5 for both rectangle and square, while the first approach would give something closer to 0.01 (depending on how many such small objects you have in the "training" set). This shows the difference: the first two approaches express your confidence in classifying as a particular shape in itself, while the third one expresses relative confidence (so it can be low iff it is high for some other class, while the first two can simply answer "no classification is confident").
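As a rough illustration of the distance-based and relative approaches, here is a small NumPy sketch. The function names, the scale parameter and the example distances are assumptions for illustration only. Note one property of inv( d_i ) = max_j( d_j ) - d_i: the farthest class always gets confidence 0, and when all distances are equal the formula is 0/0, so the sketch falls back to a uniform split in that case (that fallback is my assumption, not part of the answer).

```python
import numpy as np

def squashed_confidence(distance, scale=1.0):
    """Distance-based confidence: 1 - tanh(distance / scale), in (0, 1]."""
    return 1.0 - np.tanh(np.asarray(distance, dtype=float) / scale)

def relative_confidence(distances):
    """Relative confidence with inv(d_i) = max_j(d_j) - d_i, normalized to sum to 1."""
    d = np.asarray(distances, dtype=float)
    inv = d.max() - d
    total = inv.sum()
    if total == 0.0:
        # All distances equal: the formula is undefined, so split evenly (assumption).
        return np.full(d.shape, 1.0 / d.size)
    return inv / total

# Hypothetical distances from one object to three "perfect shapes".
distances = [1.0, 2.0, 5.0]
print(squashed_confidence(distances, scale=2.0))
print(relative_confidence(distances))
```

The scale argument plays the role of the scaling factor mentioned above: larger values make the confidence fall off more slowly with distance.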
Building slightly on what lejlot has put forward, my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the object whose classification you want the confidence of. You can then use something along the lines of the following as your confidence value.
1-tanh( M(V, p) )
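A minimal NumPy sketch of this idea follows. The distribution V is summarized by the sample mean and covariance of the "perfect" examples; the function names and the four two-feature samples are invented for illustration, and the covariance must be non-singular for the solve to work.

```python
import numpy as np

def mahalanobis(samples, p):
    """Mahalanobis distance from point p to the distribution estimated from samples."""
    X = np.asarray(samples, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)          # sample covariance (ddof=1)
    diff = np.asarray(p, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def mahalanobis_confidence(samples, p):
    """Confidence as 1 - tanh(M(V, p))."""
    return 1.0 - np.tanh(mahalanobis(samples, p))

# Hypothetical feature vectors of "perfect" examples of one class.
perfect_examples = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(mahalanobis_confidence(perfect_examples, [1.0, 1.0]))   # at the class mean
print(mahalanobis_confidence(perfect_examples, [3.0, 1.0]))   # further away
```

Because the distance is measured in units of the class's own spread, a separate scaling factor for tanh is less critical here than with a raw distance.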

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression that clustering works better with this scaling. However, this scaling "Standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed over a square area, let's say 100x100, I see no problem with the scaling. However, if they are distributed over a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. This deteriorates the clustering, doesn't it? Or am I misunderstanding something?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances. A uniform (non-distorting) scaling, on the other hand, is equivalent to just using a different epsilon value!
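Both statements can be checked numerically without running DBSCAN itself, since DBSCAN only looks at which points lie within epsilon of each other. The sketch below uses plain NumPy; the four points, the epsilon value and the 800x200 extent are made-up illustration values.

```python
import numpy as np

def pairwise_dist(X):
    """Full Euclidean distance matrix between the rows of X."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Hypothetical points in an 800x200 area.
X = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0], [400.0, 100.0]])
eps = 35.0

# Uniform scaling: dividing the data AND eps by 10 keeps every
# "within eps" relation, so the clustering would be identical.
raw_graph = pairwise_dist(X) <= eps
uniform_graph = pairwise_dist(X / 10.0) <= eps / 10.0
same_graph = np.array_equal(raw_graph, uniform_graph)

# Per-axis scaling (here: divide by the extent of each axis) changes
# relative distances: two pairs that were equally far apart no longer are.
D_raw = pairwise_dist(X)
D_scaled = pairwise_dist(X / np.array([800.0, 200.0]))
ratio_raw = D_raw[0, 2] / D_raw[0, 1]
ratio_scaled = D_scaled[0, 2] / D_scaled[0, 1]
print(same_graph, ratio_raw, ratio_scaled)
```

After the per-axis scaling, the vertical 30-unit gap counts four times as much as the horizontal one, which is the "squeezing" described in the question.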
Note that in the first example, a similarity matrix and not a distance matrix is apparently processed. S = (1 - D / np.max(D)) is a heuristic to convert a distance (dissimilarity) matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum dissimilarity observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, and not a distance matrix. IMHO that is an ugly hack, to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't fit a clustering, you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance is (e.g. geographic data; standardizing this obviously does not make sense, nor does Euclidean distance!).

'Probability' of a K-nearest neighbor like classification

I have a small set of data points (around 10) in a 2D space, and each of them has a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' of belonging to any particular label class.
Is it appropriate to label the new point based on the label of its nearest neighbor (like K-nearest neighbor with K=1)? To get the probability, I wish to permute all the labels, calculate the minimum distance between the unknown point and the rest each time, and find the fraction of cases where the minimum distance is less than or equal to the distance that was used to label it.
The nearest neighbour method already uses Bayes' theorem to estimate the probability, using the points in a ball containing your chosen K points. There is no need to transform anything, as the number of points in that ball belonging to each label, divided by the total number of points in the ball, is already an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained by applying Bayes' rule to density estimates computed from a subset of the data. In particular, using:
V P(z) = K/N (this gives you the probability of a point falling in a ball of volume V)
P(z) = K/(N V) (from above)
P(z|label) = K(label)/(N(label) V) (where K(label) and N(label) are the number of points of the given class in the ball and in the whole sample, respectively)
P(label) = N(label)/N
Substituting, P(z|label)P(label) / P(z) = [K(label)/(N(label) V)] [N(label)/N] / [K/(N V)] = K(label)/K.
Therefore, just pick a K, calculate the distances, take the K nearest points and count how many of them carry each label; those fractions are your probabilities.
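The counting recipe above is only a few lines of NumPy. In this sketch the function name knn_posterior, the five labelled points and the query point are all invented for illustration.

```python
import numpy as np

def knn_posterior(train_X, train_y, z, k, n_classes):
    """Estimate P(label | z) as K(label) / K over the k nearest training points."""
    X = np.asarray(train_X, dtype=float)
    y = np.asarray(train_y)
    dists = np.linalg.norm(X - np.asarray(z, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    counts = np.bincount(y[nearest], minlength=n_classes)
    return counts / k

# Hypothetical labelled points: three of class 0, two of class 1.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1]

posterior = knn_posterior(train_X, train_y, z=[3.5, 3.5], k=3, n_classes=2)
print(posterior)
```

With only around 10 points the estimate can take just K+1 distinct values per class, so it is coarse; with K=1 it is always 0 or 1.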
Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Components Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2), with p_ii = 0,
where d_ij is the Euclidean distance between points i and j (in the paper, measured after a learned linear transformation).
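As a sketch of this soft assignment, assuming the softmax-over-negative-squared-distance form used in the NCA paper and no learned transformation (distances are taken directly in the input space), the neighbour probabilities and the resulting per-class probabilities can be computed as follows. The function names and the five points are illustrative only.

```python
import numpy as np

def soft_neighbor_probs(X):
    """p[i, j]: probability that point i picks point j as its neighbour."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    w = np.exp(-d2)
    np.fill_diagonal(w, 0.0)          # a point never picks itself
    return w / w.sum(axis=1, keepdims=True)

def soft_class_probs(X, y, n_classes):
    """For each point, the summed neighbour probability mass per class."""
    onehot = np.eye(n_classes)[np.asarray(y)]
    return soft_neighbor_probs(X) @ onehot

# Hypothetical 2D points with two classes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y = [0, 0, 0, 1, 1]

P = soft_neighbor_probs(X)
class_probs = soft_class_probs(X, y, n_classes=2)
print(np.round(class_probs, 3))
```

If the points are very far apart relative to unit scale, exp(-d^2) can underflow to zero for a whole row; rescaling the data first avoids that.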
There are no probabilities for such a K-nearest classification method, because it is a discriminative classifier, just like an SVM. A post-processing step should be used to learn probabilities on unseen data, with a probabilistic model such as logistic regression.
1. Learn the K-nearest classifier.
2. Train a logistic regression on the distance and the average distance to the K nearest neighbours, using validation data.
See the LibSVM article for details.
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances to a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p make the hills around the centres steeper or flatter.)
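For what it's worth, that single-number score is a one-liner. The sketch below just evaluates it on the two example distance patterns listed above (truncated to the distances actually shown); the function name is made up.

```python
import numpy as np

def nearest_vs_average_score(distances, p=1.0):
    """1 - (nearest / average) ** p for a list of distances to the centres."""
    d = np.asarray(distances, dtype=float)
    return 1.0 - (d.min() / d.mean()) ** p

one_near = nearest_vs_average_score([1.0, 5.0, 6.0])              # one near, others far
three_near = nearest_vs_average_score([1.0, 1.0, 1.0, 5.0, 6.0])  # 3 near, others far
steeper = nearest_vs_average_score([1.0, 5.0, 6.0], p=2.0)
print(one_near, three_near, steeper)
```

The score is higher when a single centre is clearly nearest, but as noted it discards which centres are near and how many.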
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking on stats.stackexchange.com.
The answer is: it depends.
Imagine your labels are the surname of a person, and the X,Y coordinates represent some essential characteristics of the person's DNA sequence. Clearly, a closer DNA description increases the probability of having the same surname.
Now suppose X,Y is the lat/long of that person's work office. Working closer together isn't related to sharing a label (surname).
So, it depends on the semantics of your tags and axes.
