How does probability come into play in a kNN algorithm? - scikit-learn

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy plane). When you want to classify a new data point, you put it into the same feature space, find the k nearest neighbors, and see what their labels are, ultimately taking the label(s) with the most votes.
So where does probability come into play here? All I am doing is calculating the distance between two points and taking the label(s) of the closest neighbors.

For a new test sample, you look at the K nearest neighbors and at their labels.
You count how many of those K samples fall in each class, and divide the counts by K (the number of neighbors).
For example, let's say that you have 2 classes in your classifier and you use K=3 nearest neighbors, and the labels of those 3 nearest samples are (0,1,1): the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
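To make the counting concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name `knn_class_probabilities` and the toy data are made up for illustration. As far as I know, scikit-learn's `KNeighborsClassifier.predict_proba` returns the same neighbor-vote fractions when the weights are uniform.

```python
from collections import Counter
from math import dist


def knn_class_probabilities(train_points, train_labels, query, k):
    """Return {class: fraction of the k nearest neighbors with that class}."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    neighbor_labels = [train_labels[i] for i in order[:k]]
    counts = Counter(neighbor_labels)
    # Divide each class count by k, the number of neighbors.
    return {c: counts[c] / k for c in sorted(set(train_labels))}


# Toy data (hypothetical): the 3 nearest neighbors of the query have labels 1, 1, 0.
train_points = [(0.0, 0.0), (1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (6.0, 5.5)]
train_labels = [0, 1, 1, 0, 1]
probs = knn_class_probabilities(train_points, train_labels, query=(0.9, 0.9), k=3)
print(probs)
```

With these toy points the three nearest neighbors carry the labels (1, 1, 0), so the result matches the 1/3 and 2/3 from the example above.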

Related

90% Confidence ellipsoid of 3 dimensional data

I got to know confidence ellipses during university (but that was some semesters ago).
In my current project, I'd like to calculate a 3 dimensional confidence ellipse/ellipsoid in which I can set the probability of success to e.g. 90%. The center of the data is shifted from zero.
At the moment I am calculating the variance-covariance matrix of the dataset and, from it, its eigenvalues and eigenvectors, which I then represent as an ellipsoid.
Here, however, I am missing the information on the probability of success, which I cannot specify.
What is the correct way to calculate a confidence ellipsoid with e.g. 90% probability of success?

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation) and when should I use standardisation (the z-score) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and the z-score brings it down to -3 to 3, but I am unsure when to use which of the two techniques for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score can span an interval much larger than [-3, 3] if your data follow a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but will not help you to find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
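A small sketch of the difference, using only the Python standard library. The helper names, the toy data and the threshold of 2.5 are assumptions chosen for illustration, not a recommendation. Min-max scaling only squeezes the values into [0, 1] and keeps their relative spacing, while a z-score threshold actually flags a point.

```python
import statistics


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def flag_by_z(values, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Toy data (hypothetical) with one obviously extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
scaled = min_max_scale(data)       # bounded to [0, 1], but nothing is flagged
outliers = flag_by_z(data, threshold=2.5)
print(scaled)
print(outliers)
```

Note that in a small sample the largest possible z-score is limited by the sample size, which is why the usual cut-off of 3 would not flag anything in a sample of ten values.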

Unsupervised Outlier detection

I have 6 points in each row and around 20k such rows. The points in each row are points on a curve, and the nature of the curve is the same for every row (say a sigmoidal curve or a straight line, etc.). These 6 points may have different x-values in each row. I also know a point (a,b) for each row which that curve should pass through. How should I go about finding the rows which may be anomalous or behave differently from the other rows? I was thinking of curve fitting, but then I only have 6 points for each curve. All I know is that the majority of the rows have the same nature of curve, so I can perhaps fit a general curve for all the rows and use a distance threshold for outlier detection.
What happens if you just treat the 6 points as a 12 dimensional vector and run any of the usual outlier detection methods such as LOF and LoOP?
It's trivial to see the relationship between Euclidean distance on the 12 dimensional vector, and the 6 Euclidean distances of the 6 points each. So this will compare the similarities of these curves.
You can of course also define a complex distance function for LOF.
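The relationship mentioned above can be checked with a few lines of Python: the squared Euclidean distance between two flattened 12-dimensional vectors equals the sum of the six squared point-to-point distances. The two curves and the helper name `flatten` are invented for this sketch; it only shows the distance, not LOF or LoOP themselves.

```python
from math import dist


def flatten(curve):
    """Turn [(x1, y1), ..., (x6, y6)] into [x1, y1, ..., x6, y6]."""
    return [coordinate for point in curve for coordinate in point]


# Two hypothetical rows, each with 6 (x, y) points.
curve_a = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 5.0)]
curve_b = [(0.0, 0.0), (1.1, 1.0), (2.0, 2.0), (3.2, 3.1), (4.0, 4.0), (5.0, 7.5)]

# Distance between the rows treated as 12-dimensional vectors.
d12 = dist(flatten(curve_a), flatten(curve_b))

# The 6 individual point-to-point distances and the sum of their squares.
per_point = [dist(p, q) for p, q in zip(curve_a, curve_b)]
sum_sq = sum(d * d for d in per_point)

print(d12 ** 2, sum_sq)
```

The flattened vectors could then be passed to whatever outlier detector you prefer.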

Why Principal Components of Covariance matrix capture maximum variance of the variables?

I am trying to understand PCA and have gone through several tutorials. So far I understand that the eigenvectors of a matrix give the directions in which vectors are rotated and scaled when multiplied by that matrix, in proportion to the eigenvalues. Hence the eigenvector associated with the maximum eigenvalue defines the direction of maximum rotation. I understand that along the principal component, the variations are maximum and reconstruction errors are minimum. What I do not understand is:
why does finding the eigenvectors of the covariance matrix correspond to the axes along which the original variables are better defined?
In addition to tutorials, I reviewed other answers here including this and this. But still I do not understand it.
Your premise is incorrect. PCA (and eigenvectors of a covariance matrix) certainly don't represent the original data "better".
Briefly, the goal of PCA is to find some lower dimensional representation of your data (X, which is in n dimensions) such that as much of the variation is retained as possible. The upshot is that this lower dimensional representation is an orthogonal subspace and it's the best k dimensional representation (where k < n) of your data. We must find that subspace.
Another way to think about this: given a data matrix X find a matrix Y such that Y is a k-dimensional projection of X. To find the best projection, we can just minimize the difference between X and Y, which in matrix-speak means minimizing ||X - Y||^2.
Since Y is just a projection of X onto lower dimensions, we can write the lower dimensional coordinates as X*v and the reconstruction as Y = X*v*v^T, where v*v^T is a lower rank projection. Google rank if this doesn't make sense. We know X*v has a lower dimension than X, but we don't know what direction it points.
To do that, we find the v (with unit length) such that ||X - X*v*v^T||^2 is minimized. This is equivalent to maximizing ||X*v||^2 = v^T*X^T*X*v, and for mean-centered X, X^T*X is the sample covariance matrix of your data up to a constant factor. This is mathematically why we care about the covariance of the data. Also, it turns out that the v that does this best is an eigenvector of that matrix, namely the one with the largest eigenvalue. There is one eigenvector for each dimension in the lower dimensional projection/approximation. These eigenvectors are also orthogonal.
Remember, if they are orthogonal, then the covariance between the projections of the data onto any two of them is 0. Now think of a matrix with non-zero diagonals and zeros in the off-diagonals. This is the covariance matrix of the data expressed in the basis of the eigenvectors.
Hopefully that helps bridge the connection between covariance matrix and how it helps to yield the best lower dimensional subspace.
Again, eigenvectors don't better define our original variables. The axes determined by applying PCA to a dataset are linear combinations of our original variables that tend to exhibit maximum variance and produce the closest possible approximation to our original data (as measured by the l2 norm).
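For a 2-dimensional toy case the claim can be checked by hand. The sketch below (plain Python, made-up data, hypothetical helper names) computes the sample covariance C, uses the closed-form angle of the leading eigenvector of a symmetric 2x2 matrix, and compares the variance v^T*C*v along that direction with the variance along every other whole-degree direction.

```python
import math


def covariance_2d(points):
    """Return (sxx, sxy, syy), the entries of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy


def variance_along(cov, angle):
    """Variance of the data projected on the unit vector (cos angle, sin angle)."""
    sxx, sxy, syy = cov
    c, s = math.cos(angle), math.sin(angle)
    return sxx * c * c + 2 * sxy * c * s + syy * s * s


def principal_angle(cov):
    """Angle of the leading eigenvector of a symmetric 2x2 matrix."""
    sxx, sxy, syy = cov
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


# Hypothetical 2-D data set.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
cov = covariance_2d(points)
theta = principal_angle(cov)
best = variance_along(cov, theta)  # also the largest eigenvalue of C
others = [variance_along(cov, math.radians(a)) for a in range(180)]
print(best, max(others))
```

No other direction in the scan gives a larger projected variance than the eigenvector direction, which is the statement that the first principal component captures the maximum variance.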

Excel formula to calculate the distance between multiple points using lat/lon coordinates

I'm currently drawing up a mock database schema with two tables: Booking and Waypoint.
Booking stores the taxi booking information.
Waypoint stores the pickup and drop off points during the journey, along with the lat lon position. Each sequence is a stop in the journey.
How would I calculate the distance between the different stops in each journey (using the lat/lon data) in Excel?
Is there a way to programmatically define this in Excel, i.e. so that a formula can be placed into the mileage column (Booking table), lookup the matching sequence (via bookingId) for that journey in the Waypoint table and return a result?
Example 1:
A journey with 2 stops:
1 1 1 MK4 4FL, 2, Levens Hall Drive, Westcroft, Milton Keynes 52.002529 -0.797623
2 1 2 MK2 2RD, 55, Westfield Road, Bletchley, Milton Keynes 51.992571 -0.72753
4.1 miles according to Google, entry made in mileage column in Booking table where id = 1
Example 2:
A journey with 3 stops:
6 3 1 MK7 7DT, 2, Spearmint Close, Walnut Tree, Milton Keynes 52.017486 -0.690113
7 3 2 MK18 1JL, H S B C, Market Hill, Buckingham 52.000674 -0.987062
8 3 1 MK17 0FE, 1, Maids Close, Mursley, Milton Keynes 52.040622 -0.759417
27.7 miles according to Google, entry made in mileage column in Booking table where id = 3
If you want to find the distance between two points, just use this formula. You will get the result in km; convert to miles if needed.
Point A: Lat1, Long1
Point B: Lat2, Long2
ACOS(COS(RADIANS(90-Lat1)) *COS(RADIANS(90-Lat2)) +SIN(RADIANS(90-Lat1)) *SIN(RADIANS(90-Lat2)) *COS(RADIANS(Long1-Long2)))*6371
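For anyone who wants to check the numbers outside Excel, the same formula can be sketched in Python and summed over the consecutive stops of a journey. The function names are made up for this example, the coordinates are the ones from the question, and the clamping step is an extra safeguard that is not part of the Excel formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean radius as in the Excel formula


def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines on co-latitudes, mirroring the Excel formula."""
    colat1 = math.radians(90.0 - lat1)
    colat2 = math.radians(90.0 - lat2)
    dlon = math.radians(lon1 - lon2)
    cos_angle = (math.cos(colat1) * math.cos(colat2)
                 + math.sin(colat1) * math.sin(colat2) * math.cos(dlon))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


def journey_km(stops):
    """Sum the leg distances between consecutive (lat, lon) stops."""
    return sum(great_circle_km(*a, *b) for a, b in zip(stops, stops[1:]))


# Waypoints from the two examples in the question, in sequence order.
journey_1 = [(52.002529, -0.797623), (51.992571, -0.72753)]
journey_3 = [(52.017486, -0.690113), (52.000674, -0.987062),
             (52.040622, -0.759417)]
km_1 = journey_km(journey_1)
km_3 = journey_km(journey_3)
print(km_1, km_3)
```

These are straight-line distances over the sphere, so they should come out shorter than the road mileage reported by Google.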
Until quite recently, accurate maps were constructed by triangulation, which in essence is the application of Pythagoras’s Theorem. For the distance between any pair of co-ordinates take the square root of the sum of the square of the difference in x co-ordinates and the square of the difference in y co-ordinates. The x and y co-ordinates must however be in the same units (e.g. miles), which involves factoring the latitude and longitude values. This can be complicated because the factor for longitude depends upon latitude (walking all round the North Pole is less far than walking around the Equator) but in your case a factor for 52° North should serve. From this the results (which might be checked here) are around 20% different from the examples you give (in the second case, with pairing IDs 6 and 7 and adding that result to the result from pairing IDs 7 and 8).
Since you say accuracy is not important, and assuming distances are small (say less than 1000 miles) you can use the loxodromic distance.
For this, compute the difference of latitudes (dlat) and the difference of longitudes (dlon), both in degrees. If there were any chance (unlikely) that you're crossing meridian 180°, take modulo 360° to ensure the difference of longitudes is between -180° and 180°. Also compute the average latitude (alat).
Then compute:
distance = 60*sqrt(dlat^2 + (dlon*cos(alat))^2)
This distance is in nautical miles. Apply conversions as needed.
EXPLANATION: This takes advantage of the fact that one nautical mile corresponds, by its original definition, to one arc minute of latitude, so one degree of latitude is 60 nautical miles. The cosine corresponds to the fact that meridians get closer to each other as they approach the poles. The rest is just application of Pythagoras's theorem -- which requires that the relevant portion of the globe be flat, which is of course only a good approximation for small distances.
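A minimal Python sketch of this approximation, assuming latitudes and longitudes are given in decimal degrees. The function name is hypothetical and the test coordinates are the first journey from the question.

```python
import math

NM_TO_KM = 1.852                 # one nautical mile in kilometres
NM_TO_MILES = 1.852 / 1.609344   # one nautical mile in statute miles


def flat_earth_distance_nm(lat1, lon1, lat2, lon2):
    """Approximate distance in nautical miles, inputs in decimal degrees."""
    dlat = lat2 - lat1
    # Wrap the longitude difference into [-180, 180) in case of a 180° crossing.
    dlon = (lon2 - lon1 + 180.0) % 360.0 - 180.0
    # The cosine of the average latitude shrinks the east-west component.
    alat = math.radians((lat1 + lat2) / 2.0)
    return 60.0 * math.sqrt(dlat ** 2 + (dlon * math.cos(alat)) ** 2)


d_nm = flat_earth_distance_nm(52.002529, -0.797623, 51.992571, -0.72753)
print(d_nm, d_nm * NM_TO_KM, d_nm * NM_TO_MILES)
```

For a multi-stop journey, add up the results for each pair of consecutive stops.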
It all depends on what the distance is and what accuracy you require. Calculations based on "Earth locally flat" model will not provide great results for long distances but for short distance they may be ok. Models assuming Earth is a perfect sphere (e.g. Haversine formula) give better accuracy but they still do not produce geodesic grade results.
See Geodesics on an ellipsoid for more details.
One of the high accuracy (fraction of a mm) solutions is known as Vincenty's formulae. For my Excel VBA implementation look here https://github.com/tdjastrzebski/Vincenty-Excel
