'Probability' of a K-nearest-neighbor-like classification - statistics

I have a small set of data points (around 10) in a 2D space, and each of them has a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' of it belonging to any particular label class.
Is it appropriate to label the new point with the label of its nearest neighbor (like a K-nearest neighbor with K=1)? To get the probability, I was thinking of permuting all the labels, computing the minimum distance between the unknown point and the rest for each permutation, and taking the fraction of cases where that minimum distance is less than or equal to the distance that was used to label it.
Thanks

The Nearest Neighbour method already uses Bayes' theorem to estimate the probability, using the points in a ball around your query that contains the K chosen points. There is no need for any extra transformation: the number of points in that ball belonging to each label, divided by the total number of points in the ball (K), is already an approximation of the posterior probability of that label. In other words:
P(label|x) = P(x|label)P(label) / P(x) = K(label)/K
This follows from applying Bayes' rule to densities estimated from the subset of data in the ball. In particular, using:
V·P(x) ≈ K/N (the probability mass of a ball of volume V that contains K of the N points)
P(x) ≈ K/(N·V) (from above)
P(x|label) ≈ K(label)/(N(label)·V) (where K(label) and N(label) are the number of points of that label in the ball and in the whole sample, respectively)
and
P(label) = N(label)/N,
substituting into Bayes' rule gives P(label|x) = K(label)/K. Therefore, just pick a K, compute the distances, take the K nearest points, and count how many of them carry each label; those fractions are your probabilities.
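For the tiny 2-D example in the question, this counting rule is only a few lines of Python; a rough sketch, assuming scikit-learn is available (the helper name knn_posterior and the sample data are just for illustration):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_posterior(X, y, query, k=3):
    # Estimate P(label | query) as K(label)/K from the k nearest neighbours
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors([query])              # indices of the k nearest points
    labels, counts = np.unique(y[idx[0]], return_counts=True)
    return dict(zip(labels, counts / k))         # fraction of neighbours per label

# Example with ~10 labelled 2-D points
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 4], [5, 5], [6, 5], [5, 6], [4, 2], [3, 3]])
y = np.array(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
print(knn_posterior(X, y, query=[4, 4], k=3))    # fractions of labels among the 3 nearest points
Note that with K=1, as in the question, the fractions are always 0 or 1, so a somewhat larger K is needed for the estimate to be informative.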

Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Component Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2), with p_ii = 0,
where d_ij is the Euclidean distance between points i and j.
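A minimal sketch of these soft-neighbour probabilities on raw Euclidean distances (illustration only, not the full NCA procedure, which additionally learns a linear transform of the inputs):
import numpy as np
from scipy.spatial.distance import cdist

def soft_neighbour_probs(X):
    # p[i, j]: probability that point i picks point j as its neighbour
    d = cdist(X, X)                          # pairwise Euclidean distances d_ij
    w = np.exp(-d ** 2)
    np.fill_diagonal(w, 0.0)                 # a point never picks itself (p_ii = 0)
    return w / w.sum(axis=1, keepdims=True)

X = np.random.rand(10, 2)
p = soft_neighbour_probs(X)
print(p.sum(axis=1))                         # each row sums to 1
The class probability for point i is then the total p_ij mass on the neighbours j that carry that class label.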

There are no probabilities built into such a K-nearest-neighbour classifier, because it is a discriminative method, just like an SVM. Probabilities have to be learned in a post-processing step on held-out data, for example with logistic regression:
1. Train the K-nearest-neighbour classifier.
2. Train a logistic regression on the distance and the average distance to the K nearest neighbours, using validation data.
See the LibSVM article for details.
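A rough sketch of that two-step recipe with scikit-learn (the data, split and feature choice here are placeholders, not a definitive calibration procedure):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

# Placeholder data: 200 training and 100 validation points
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, random_state=0)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_train)

def knn_features(X):
    # distance to the nearest neighbour and average distance to the k nearest
    dist, _ = nn.kneighbors(X)
    return np.column_stack([dist[:, 0], dist.mean(axis=1)])

# Step 2: learn P(label | distance features) on the validation data
calibrator = LogisticRegression().fit(knn_features(X_val), y_val)
print(calibrator.predict_proba(knn_features(X_val))[:5])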

Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances into a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p make the hills around the centres steeper or flatter.)
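For what it's worth, a quick sketch of that one-number mapping (the power p is a free knob; this is just an illustration):
import numpy as np

def nearness_score(distances, p=2):
    # map the vector of distances to the centres onto a single number in [0, 1)
    d = np.sort(np.asarray(distances, dtype=float))
    return 1 - (d[0] / d.mean()) ** p

print(nearness_score([1, 5, 6, 7, 8]))        # one centre near, others far -> close to 1
print(nearness_score([5, 5, 5, 5, 5]))        # all centres equally far     -> 0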
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking on stats.stackexchange.com.

The answer is: it depends.
Imagine your labels are people's surnames, and the X,Y coordinates represent some essential characteristics of a person's DNA sequence. Clearly, a closer DNA description increases the probability of sharing the same surname.
Now suppose X,Y are the lat/long of that person's office. Working close to someone isn't related to sharing a label (surname).
So, it depends on the semantics of your labels and axes.
HTH!

Related

Converting intensities to probabilities in ppp

Apologies for the overlap with existing questions; mine is at a more basic skill level. I am working with very sparse occurrences spanning very large areas, so I would like to calculate probability at pixels using the density.ppp function (as opposed to relrisk.ppp, where specifying presences+absences would be computationally intractable). Is there a straightforward way to convert density (intensity) to probabilities at each point?
library(spatstat)
Maxdist=50
dtruncauchy=function(x,L=60) L/(diff(atan(c(-1,1)*Maxdist/L)) * (L^2 + x^2))
dispersfun=function(x,y) dtruncauchy(sqrt(x^2+y^2))
n=1e3; PPP=ppp(1:n,1:n, c(1,n),c(1,n), marks=rep(1,n));
density.ppp(PPP,cutoff=Maxdist,kernel=dispersfun,at="points",leaveoneout=FALSE) #convert to probabilities?
Thank you!!
I think there is a misunderstanding about fundamentals. The spatstat package is designed mainly for analysing "mapped point patterns", datasets which record the locations where events occurred or things were located. It is designed for "presence-only" data, not "presence/absence" data (with some exceptions).
The relrisk function expects input data about the presence of two different types of events, such as the mapped locations of trees belonging to two different species, and then estimates the spatially-varying probability that a tree will belong to each species.
If you have 'presence-only' data stored in a point pattern object X of class "ppp", then density(X, ....) will produce a pixel image of the spatially-varying intensity (expected number of points per unit area). For example if the spatial coordinates were expressed in metres, then the intensity values are "points per square metre". If you want to calculate the probability of presence in each pixel (i.e. for each pixel, the probability that there is at least one presence point in the pixel), you just need to multiply the intensity value by the area of one pixel, which gives the expected number of points in the pixel. If pixels are small (the usual case) then the presence probability is just equal to this value. For physically larger pixels the probability is 1 - exp(-m) where m is the expected number of points.
Example:
library(spatstat)
X <- redwood                      # a point pattern of class "ppp"
D <- density(X, 0.2)              # intensity image (points per unit area), sigma = 0.2
pixarea <- with(D, xstep * ystep) # area of one pixel
M <- pixarea * D                  # expected number of points in each pixel
p <- 1 - exp(-M)                  # probability of at least one point in each pixel
then M and p are images which should be almost equal, and can both be interpreted as probability of presence.
For more information see Chapter 6 of the spatstat book.
If, instead, you had a pixel image of presence/absence data, with pixel values equal to 1 or 0 for presence or absence respectively, then you can just use the function blur in the spatstat package to perform kernel smoothing of the image, and the resulting pixel values are presence probabilities.

Improving accuracy of nearest neighbours algorithm - unsupervised learning problem

I have a situation where I am trying to find the 3 nearest neighbours for a given ID in my dataframe. I am using an NN algorithm (not KNN) to achieve this. The code below gives me the three nearest neighbours; for the top node the results are fine, but for the middle and bottom ones only 1/3 of the neighbours are correct, whereas I am aiming for at least 2/3 correct neighbours for every ID. My dataset has 47 features and 5000 points.
import numpy as np
from sklearn.neighbors import KDTree

# X: array of shape (5000, 47) with one row of features per ID
kdt = KDTree(X, leaf_size=40, metric='euclidean')

def findsuccess(sso_id):
    # query the tree with the feature vector of this ID
    neighbors_f_sso_id = kdt.query([X[sso_id]], k=4, return_distance=False)[0]
    print('Neighbors of id', sso_id, ':', neighbors_f_sso_id)
The above code will return the ID itself and the 3 nearest neighbours, hence k=4.
I have read that, due to the curse of dimensionality, this NN algorithm might not work well since there are about 47 features in my dataset, but it is the only option I think I have for a data frame without a target variable. There is one article available here that says the KD Tree is not the best of the algorithms that can be used.
What would be the best way to achieve maximum accuracy, meaning minimum distance?
Do I need to perform scaling before passing the data into the KD Tree algorithm? Are there any other things I need to take care of?
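On the scaling question: with 47 features on different scales, standardising before building the tree usually matters, because otherwise the largest-valued features dominate the Euclidean distance. A minimal sketch, assuming scikit-learn and using random data as a stand-in for your 5000 x 47 feature matrix:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KDTree

X = np.random.rand(5000, 47)                   # placeholder for the real feature matrix

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
kdt = KDTree(X_scaled, leaf_size=40, metric='euclidean')

# k=4: the query point itself plus its 3 nearest neighbours
neighbours = kdt.query(X_scaled[:1], k=4, return_distance=False)
print(neighbours)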

What does N((1,0)^T, I) mean in relation to the Gaussian distribution

Hi everyone, I am reading the book "The Elements of Statistical Learning" and came across the paragraph below, which I don't understand (it explains how the training data was generated):
We generated 10 means mk from a bivariate Gaussian distribution N((1,0)^T, I) and labeled this class blue. Similarly, 10 more were drawn from N((0,1)^T, I) and labeled class orange. Then for each class we generated 100 observations as follows: for each observation, we picked an mk at random with probability 1/10, and then generated a N(mk, I/5), thus leading to a mixture of Gaussian clusters for each class.
I would appreciate it if you could explain the above paragraph, and especially N((0,1)^T, I).
By the way, (0,1) to the power of T stands for transpose.
Is this notation mathematically common, or is it related to a specific computer language?
In the paragraph N stands for the Normal distribution; more specifically, in this case it stands for the Multivariate normal distribution. It is not specific to any programming languages. It comes from statistics and probability theory, but due to numerous appealing properties and important applications of this probability distribution it is also widely used in programming, so you should be able to perform the described procedure in any language.
The part (0,1)^T is a vector of means. That is, we have in mind a random vector of length two, where the first element on average is 0, and the second one on average is 1.
"I" stands for the 2x2 identity matrix whose role is the variance-covariance matrix. That is, the variance of both random vector components is 1 (i.e., the diagonal terms), while off-diagonal points are 0 and correspond to the covariance between the two random variables.

Isomap "nonlinear dimensionality reduction" numbre of points

I have a question about 'Isomap' nonlinear dimensionality reduction. In normal cases, when I pass in a 100 x 100 distance matrix
and apply Isomap (http://isomap.stanford.edu/), I get the coordinates of 100 points. In other cases I do not understand why, with a 150 x 150 matrix, I obtain only 35 or 50 points.
The first step of Isomap is usually to create a "nearest neighbor matrix" so that every point is connected to its 4 or 6 or 8 or something nearest neighbors.
So, although you may start with a distance matrix that is 100 x 100, in which every point has a distance to the other 99 points, after this first step the distances to anything but the (4 or 6 or 8) closest points are set to infinity.
Then Isomap computes a shortest path distance, hopping between nearby points to get to farther away points.
In your case, when you create a matrix of 150 points, I think that once you keep only the nearby points in the first step, the points become disconnected, and there is no path between distant points. The default behavior of many Isomap codes is to return the Isomap embedding of the largest collection of connected points.
How can you fix this?
1. You can increase the number of nearest neighbors that you use until you get all the points included.
Caveat: In many natural cases, if you include most or all neighbors, this ends up in the case where the shortest-path part of the procedure does nothing, and this reduces to a problem called "multi-dimensional scaling" which gives a linear embedding.
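One way to find the smallest neighbourhood size that keeps everything connected is to build the k-nearest-neighbour graph for increasing k and check its connected components; a sketch with scikit-learn and SciPy (random coordinates stand in for your 150 points):
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X = np.random.rand(150, 3)     # placeholder for your 150 points

for k in range(4, 20):
    G = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    n_components, _ = connected_components(G, directed=False)
    if n_components == 1:
        print("smallest k with a fully connected neighbour graph:", k)
        break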

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially, I use the number of sides and the angles between the sides and generate aggregate percentages against an ideal value for each shape. This helps me create some fuzziness in the result.
The problem is: how do I represent the degree of error or confidence in the classification? For example, a small rectangle that looks very much like a square would yield very similar membership values for the two categories, but how can I represent the degree of error?
Thanks
Your confidence depends on the model you use. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence interval based on your empirical results, e.g. you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such a value under your Gaussian distribution and use that as your confidence (which would be quite well justified under the assumption that your "observation" errors have a normal distribution).
Distance based, simple approach
A less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0,inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
The hyperbolic tangent will "squash" values into the [0,1] interval, and the only remaining thing to do is to select some scaling factor (as tanh grows quite quickly).
Such an approach is less well justified in mathematical terms, but is similar to the approach taken in neural networks.
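A tiny sketch of that squashing idea (the feature vectors and scale factor are made up for illustration):
import numpy as np

def distance_confidence(obj_features, perfect_shape, scale=1.0):
    # squash a distance in [0, inf) to a confidence in (0, 1]
    d = np.linalg.norm(np.asarray(obj_features) - np.asarray(perfect_shape))
    return 1 - np.tanh(scale * d)

print(distance_confidence([4, 90], [4, 90]))       # identical to the ideal -> 1.0
print(distance_confidence([4, 80], [4, 90], 0.1))  # farther away -> lower confidence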
Relative approach
A more probabilistic approach can also be defined using your distance metric. If you have distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
conf(object, class_i) = inv( d_i ) / sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
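A quick sketch of this relative confidence (the distances are made up; note it assumes not all distances are equal, otherwise the denominator is zero):
import numpy as np

def relative_confidences(distances):
    # conf_i = inv(d_i) / sum_j inv(d_j), with inv(d_i) = max(d_j) - d_i
    d = np.asarray(distances, dtype=float)
    inv = d.max() - d
    return inv / inv.sum()

# distances to the "perfect" square, rectangle and circle
print(relative_confidences([0.1, 0.12, 2.0]))   # square and rectangle ~0.5 each, circle ~0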
Conclusions
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in a confidence of around 0.5 for both the rectangle and the square, while in the first approach it would be something closer to 0.01 (depending on how many such small objects you have in the "training" set), which shows the difference - the first two approaches express your confidence in the classification as a particular shape in itself, while the third one expresses relative confidence (so it can only be low if it is high for some other class, while the first two can simply answer "no classification is confident").
Building slightly on what lejlot has put forward, my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the point whose classification you want a confidence for. You can then use something along the lines of the following as your confidence measure:
1-tanh( M(V, p) )
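A small sketch of that idea, estimating V from a handful of "perfect" examples and using SciPy's Mahalanobis distance (the example numbers are made up, and you need enough examples for an invertible covariance matrix):
import numpy as np
from scipy.spatial.distance import mahalanobis

def mahalanobis_confidence(perfect_examples, p):
    # 1 - tanh of the Mahalanobis distance from point p to the class distribution V
    V = np.asarray(perfect_examples, dtype=float)
    mean = V.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(V, rowvar=False))
    return 1 - np.tanh(mahalanobis(p, mean, cov_inv))

# hypothetical "perfect" squares described by (number of sides, corner angle)
squares = np.array([[4.0, 90.0], [4.1, 88.5], [3.9, 91.0], [4.2, 90.5], [3.8, 89.0]])
print(mahalanobis_confidence(squares, [4.0, 90.2]))   # close to the squares -> high confidence
print(mahalanobis_confidence(squares, [4.0, 60.0]))   # far away -> confidence near 0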
