Calculate the standard deviation of a cluster of datapoints - python-3.x

So, I have a list of data points that all belong to one cluster (each item is a numpy array with 3 features, representing a point). I compute their centroid (the mean of the points). I want to calculate the standard deviation of a point from the centroid. More precisely, I want to find out how many standard deviations away a point is from the centroid of the cluster. Please help me code it.
My list of data points looks something like this:
([-5.75204079 8.78545302 8.00800119], ....)

Assuming the data points in a cluster are stored in a list called data, the following code will calculate the standard deviation of that set of data.
import numpy as np

# Calculate the mean (the centroid, since the items are points)
mean = sum(data) / len(data)

# Sum of squared differences of the data points from the mean
dev = 0
for rec in data:
    dev += pow(rec - mean, 2)

# Calculate the variance
var = dev / len(data)

# Calculate the standard deviation (per feature here, because each
# point is a numpy array; math.sqrt would fail on an array)
std_dev = np.sqrt(var)
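The question actually asks how many standard deviations away a point is from the centroid. A minimal numpy sketch of one common reading of that (the z-score of each point's Euclidean distance from the centroid); the sample data here is made up for illustration:

import numpy as np

# Hypothetical cluster: each row is one 3-feature point
data = np.array([[-5.75, 8.79, 8.01],
                 [-5.20, 8.10, 7.65],
                 [-6.10, 9.02, 8.30],
                 [-5.60, 8.55, 7.90]])

centroid = data.mean(axis=0)

# Euclidean distance of every point from the centroid
distances = np.linalg.norm(data - centroid, axis=1)

# How many standard deviations each point's distance lies from typical
z_scores = distances / distances.std()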


Question(s) regarding computational intensity, prediction of time required to produce a result

Introduction
I have written code to give me a set of numbers in '36 by q' format (1 <= q <= 36), subject to the following conditions:
Each row must use numbers from 1 to 36.
No number must repeat itself in a column.
Method
The first row is generated randomly. Each number in the subsequent rows is checked against the above conditions. If a number fails to satisfy one of the given conditions, it doesn't get picked again for that specific place in that specific row. If the algorithm runs out of acceptable values, it starts over again.
Problem
Low q values compute quickly (q = 15, say, takes less than a second), but the main objective is q = 36. It has been more than 24 hours since it started running for q = 36 on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of input. Refer to big O notation.
If I understood your question correctly, you shouldn't spend hours computing a 36x36 matrix satisfying your conditions. Most probably you are stuck in an infinite loop or something similar. It would be clearer if you could share a code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described and it works in O(q) (assuming the number of rows is constant).
import random

def rotate(arr):
    # Move the last element to the front
    return arr[-1:] + arr[:-1]

y = set(range(1, 37))
n = 36
q = 36
res = []
i = 0
while i < n:
    # Build one row from the remaining numbers
    x = []
    for j in range(q):
        if y:
            el = random.choice(list(y))
            y.remove(el)
            x.append(el)
    res.append(x)
    # Each rotation of a valid row keeps every column repeat-free
    for j in range(q - 1):
        x = rotate(x)
        res.append(x)
        i += 1
    i += 1
Basically, I choose random numbers from the set {1..36} for one row, then rotate that row q-1 times and assign the rotated copies to the next q-1 rows.
This guarantees both conditions you have mentioned.
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
If you cannot calculate the computation time in terms of the input (the code is too complex), then fitting a curve seems to be the right approach.
Or you could create an ML model with iterations as data and time for each iteration as label and perform linear regression. But that seems to be overkill in your example.
Graph q vs. time.
Fit a curve.
Extrapolate to q = 36.
You might want to also graph q vs log(time) as that may give an easier fitted curve.
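A minimal sketch of that fit with numpy, using a straight line through q vs. log(time) as suggested above; the timing numbers are made up for illustration:

import numpy as np

# Hypothetical timings (in seconds) measured at lower q values
qs = np.array([5, 8, 10, 12, 15])
times = np.array([0.001, 0.008, 0.05, 0.3, 0.9])

# Fit a line to q vs. log(time): log t ~ b*q + a
b, a = np.polyfit(qs, np.log(times), 1)

# Extrapolate to the target size
predicted = np.exp(b * 36 + a)
print(f"Predicted time for q = 36: {predicted:.0f} s")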

get n smallest values of multidimensional xarray.DataArray

I'm currently working with some weather data that I have as netcdf files, which I can easily read with Python's xarray library.
I would now like to get the n smallest values of my DataArray, which has 3 dimensions (longitude, latitude and time).
When I have a DataArray dr, I can just do dr.min(), maybe specify an axis, and I get the minimum. But when I also want the second smallest value, or even a variable number of smallest values, it doesn't seem to be as simple.
What I currently do is:
with xr.open_dataset(path) as ds:
    dr = ds[selection]
    dr = dr.values.reshape(dr.values.size)
    dr.sort()
    n_smallest = dr[0:n]
which seems a bit complicated compared to the simple .min() I have to type for the smallest value.
I actually want to get the times of the respective smallest values, which I do for the smallest one with:
dr.where(dr[selection] == dr[selection].min(), drop=True)[time].values
So is there a better way of getting the n smallest values? Or maybe even a simple way to get the times for the n smallest values?
Maybe there is a way to reduce the 3D DataArray along the longitude and latitude axes to the respective smallest values?
I just figured out there really is a reduce function for DataArray that lets me reduce along longitude and latitude. Since I don't reduce the time dimension, I can then use the sortby function and get the DataArray of minimum values for each day, with their respective times:
with xr.open_dataset(path) as ds:
    dr = ds[selection]
    dr = dr.reduce(np.min, dim=[longitude, latitude])
    dr = dr.sortby(dr)  # sortby returns a new DataArray
which is obviously not shorter than my original code, but perfectly satisfies my demands
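If you want the n smallest values (and their times) without a full sort, here is a sketch using numpy's argpartition; it assumes dr is the 3-D DataArray with its dimensions ordered (longitude, latitude, time):

import numpy as np

n = 5
flat = dr.values.ravel()

# argpartition puts the n smallest values first without fully sorting
idx = np.argpartition(flat, n)[:n]
idx = idx[np.argsort(flat[idx])]  # order those n ascending

# Map the flat indexes back to (longitude, latitude, time) positions
lon_i, lat_i, time_i = np.unravel_index(idx, dr.shape)
times_of_smallest = dr["time"].values[time_i]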

How to get N numbers of data points which are nearest from a cluster's center?

I want to get the N nearest data points to the center (based on Euclidean distance) in each cluster after running the K-means algorithm. I am able to get the indices of the data points in one cluster using
np.where(km.labels_ == 0)
You can use the transform method of the KMeans class, which calculates the distance of each data point to each of the cluster centers.
Then, assuming you want the top N points from the cluster at index 0, you can just do:
cluster = 0
N = 2
np.sort(kmeans.transform(X)[:,cluster])[:N]
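Note that this gives the N smallest distances themselves; if you want the row indexes of those points instead, np.argsort(kmeans.transform(X)[:, cluster])[:N] returns them in the same order.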
A simple four-step process:
Compute the mean.
Compute the distances from the mean.
Select the k smallest with np.argsort.
Map the subset indexes back to dataset indexes by indexing into the return value of np.where (see the sketch below).
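A minimal sketch of those four steps, assuming X is the data matrix and km a fitted scikit-learn KMeans (variable names are illustrative):

import numpy as np

cluster = 0
N = 5

members = np.where(km.labels_ == cluster)[0]  # dataset indexes in this cluster
center = km.cluster_centers_[cluster]         # the cluster mean

# Euclidean distances of the cluster's points from its center
dists = np.linalg.norm(X[members] - center, axis=1)

# Positions (within the subset) of the N smallest distances
nearest_subset = np.argsort(dists)[:N]

# Map the subset indexes back to indexes into the full dataset
nearest_idx = members[nearest_subset]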

Fast algorithms to approximate distance between two strings

I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string is 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each character pair, e.g. weight(a,a) = weight(a,b) = 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time to find the distance between one pair, so the overall time complexity becomes O(n^2*m). Are there any algorithms that can do better than this using some pre-processing? It's essentially the same problem as autocorrect.
Are there algorithms that store all the strings in a data structure, so that we can query the approximate distance between two strings from it? Constructing the data structure can take O(n^2), but query processing should be done in less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the weight matrix above and calculate the Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series, and I have converted them to a SAX representation with around 10,000 points each. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings efficiently.
Note: all strings are of the same length, and the alphabet size is 5.
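Since all strings have the same length, one pairwise distance is a single O(m) pass. A minimal sketch; the weight matrix below is a random placeholder for the real one, and encoding the strings once up front is the kind of preprocessing that keeps each query cheap:

import numpy as np

alphabet = "abcde"  # SAX alphabet of size 5
char_idx = {c: i for i, c in enumerate(alphabet)}

# Placeholder weights; substitute the real character-pair weight matrix
rng = np.random.default_rng(0)
weight = rng.integers(0, 10, size=(5, 5)).astype(float)

def encode(s):
    # One-time preprocessing: map a string to an integer array
    return np.fromiter((char_idx[c] for c in s), dtype=np.intp, count=len(s))

def distance(e1, e2):
    # O(m) vectorized lookup of the aligned character-pair weights
    return np.sqrt(np.sum(weight[e1, e2] ** 2))

print(distance(encode("abcca"), encode("bdbbe")))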
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114

NORMDIST function is not giving the correct output

I'm trying to use the NORMDIST function in Excel to create a bell curve, but the output is strange.
My mean is 0,0000583 and my standard deviation is 0,0100323, so when I plug these into the function NORMDIST(0,0000583; 0,0000583; 0,0100323; FALSE) I expect to get something close to 0,5: since I'm using the same value as the mean, the probability of this value should be 50%. But the function gives an output of 39,77, which is clearly not correct.
Why is it like this?
A probability cannot have values greater than 1, but a density can.
The integral of a density function over its entire range equals 1, but the function can take values greater than one on specific intervals. For example, a uniform distribution on the interval [0, ½] has probability density f(x) = 2 for 0 ≤ x ≤ ½ and f(x) = 0 elsewhere. See below:
[figure: the uniform density on [0, ½], constant at f(x) = 2]
=NORMDIST(x, mean, dev, FALSE) returns the density function. Densities are probabilities per unit: roughly, the probability of landing in a tiny interval around a point, divided by the width of that interval (the derivative of the cumulative distribution at that point).
shg's answer here explains how to get a probability on a given interval with NORMDIST, and also in which cases it can return a density greater than 1.
For a continuous variable, the probability of any particular value is zero, because there are an infinite number of values.
If you want to know the probability that a continuous random variable with a normal distribution falls in the range of a to b, use:
=NORMDIST(b, mean, dev, TRUE) - NORMDIST(a, mean, dev, TRUE)
The peak value of the density function occurs at the mean (i.e., =NORMDIST(mean, mean, dev, FALSE)), and that value is:
=1/(SQRT(2*PI())*dev)
The peak value will exceed 1 when the deviation is less than 1/SQRT(2*PI()) ≈ 0.399, which was your case.
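A quick check of that formula against the numbers in the question (plain Python, with decimal points in place of the locale's commas):

import math

dev = 0.0100323  # the standard deviation from the question

# Peak of the normal density, reached at the mean
peak = 1 / (math.sqrt(2 * math.pi) * dev)
print(round(peak, 2))  # 39.77, matching the NORMDIST output observed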
This is an amazing answer on Cross Validated (the statistics Stack Exchange) from a moderator (whuber) that addresses this issue very thoughtfully.
It is returning the probability density function, whereas I think you want the cumulative distribution function, so try TRUE in place of FALSE.
