Measure the uniformity of distribution of points in a 2D square - statistics

I am currently running into this problem: I have a 2D square containing a set of points, say 1000 of them. I need a way to check whether the points are spread out (more or less uniformly distributed) inside the square, or whether they tend to gather together in some areas.
I need a mathematical/statistical (not a programming) way to determine this. I googled and found things like goodness-of-fit tests and Kolmogorov-Smirnov, and I wonder whether there are other approaches. I need this for a class paper.
So: Inputs: a 2D square, and 1000 points.
Output: yes/no (yes = evenly spread out, no = gathering together in some spots).
Any idea would be appreciated.
Thanks

If your points are independent in the two dimensions, you can check the distribution of each dimension individually. The Kolmogorov-Smirnov test (a measure of the distance between two distributions) is a good test for this. First let's generate and plot some Gaussian-distributed points, so you can see how the KS test (statistic) detects a nonuniform distribution.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.random.randn(1000, 2)  # 1000 2-D points, normally distributed
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> X = scaler.fit_transform(X)  # rescale each dimension to the uniform dist's default range 0-1
>>> X
array([[ 0.46169481,  0.7444449 ],
       [ 0.49408692,  0.5809512 ],
       ...,
       [ 0.60877526,  0.59758908]])
>>> plt.scatter(X[:, 0], X[:, 1])
>>> from scipy import stats
>>> from sklearn.preprocessing import StandardScaler
>>> stats.kstest(X[:, 0], 'uniform')  # X is already scaled to [0, 1]
KstestResult(statistic=0.24738043186386116, pvalue=0.0)
The low p-value and high KS statistic (distance from the uniform distribution) say that these values almost certainly did not come from a uniform distribution between 0 and 1.
>>> stats.kstest(StandardScaler().fit_transform(X[:,0]), 'norm')
KstestResult(statistic=0.028970945967462303, pvalue=0.36613946547024456)
But they very likely did come from a normal distribution with mean 0 and standard deviation 1, given the high p-value and low KS distance.
Then you'd just repeat the KS test for the second dimension (Y).
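A minimal sketch of that second test (the statistic and p-value will of course depend on your data):
>>> stats.kstest(X[:, 1], 'uniform')  # Y coordinates, already min-max scaled to [0, 1]
If both marginal tests fail to reject uniformity (and the two coordinates are independent), answering "yes, evenly spread out" is reasonable; a single clear rejection is enough to answer "no".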

Related

Sklearn TruncatedSVD not showing explained variance ratio in descending order, or does the first number mean something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

digits = datasets.load_digits()
X = digits.data
X = X - X.mean()  # "centering" the data (this subtracts the overall mean, not the per-feature means)

#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ratio_)

#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
Is there a bug in the TruncatedSVD implementation, or why does the first explained variance (0.02...) behave like this? What does it mean?
Summary:
That is because TruncatedSVD and PCA use different SVD functions!
Note: Your case is due to Reason 2 below; I have included another reason for future readers.
Details:
Reason 1: The solver set by the user in each algorithm is different:
PCA internally uses scipy.linalg.svd, which sorts the singular values, hence explained_variance_ratio_ comes out sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
(Screenshot of the scipy.linalg.svd documentation omitted; it confirms that the singular values are returned sorted in non-increasing order.)
On the other hand, TruncatedSVD with algorithm='arpack' uses scipy.sparse.linalg.svds, which relies on the ARPACK solver for the decomposition.
Reason 2: TruncatedSVD operates differently from PCA:
In your case you used the randomized solver (the default) in both algorithms, yet you obtained different results with regard to the order of the variances.
That is because in PCA the variance is obtained from the actual singular values (called Sigma or S in the scikit-learn implementation), which are already sorted.
In TruncatedSVD, on the other hand, the variance is obtained from X_transformed, which results from multiplying the data matrix by the components. That order is not necessarily preserved, because the data are not centered; centering is not the point of TruncatedSVD, which is used in the first place for sparse matrices.
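Based on my reading of the scikit-learn source (so treat the exact expressions as an assumption), the difference can be checked directly against the fitted attributes:
import numpy as np
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD

X = datasets.load_digits().data
svd = TruncatedSVD(n_components=5, random_state=0)
X_transformed = svd.fit_transform(X)

# TruncatedSVD measures variance on the transformed columns, in whatever
# order the solver returned the components (no sorting step involved)
print(np.allclose(svd.explained_variance_, np.var(X_transformed, axis=0)))
print(np.allclose(svd.explained_variance_ratio_,
                  svd.explained_variance_ / np.var(X, axis=0).sum()))
# if the reading above is right, both checks print True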
Now, if you center your data, you will get them sorted. (Note that your snippet did not center the data the way scikit-learn expects: X - X.mean() subtracts the overall mean rather than the per-feature means. StandardScaler below centers each feature and additionally scales it to unit variance.)
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
Important: Further reading.

How to connect the points from a contour of a binary image

I have a segmentation result stored in a binary image, from which I want to extract the contours. To do so, I compute the difference between the mask and the eroded mask. This way I can extract the pixels that lie on the boundaries of my segmentation result. Here is a code snippet:
import numpy as np
from skimage.morphology import binary_erosion
from matplotlib import pyplot as plt
# mask is a 2D boolean np.array containing the segmentation result
contour_raw=np.logical_xor(mask,binary_erosion(mask))
contour_y,contour_x=np.where(contour_raw)
fig=plt.figure()
plt.imshow(mask)
plt.plot(contour_x,contour_y,'.r')
I end up with a collection of dots on the contours of the mask:
The trouble starts when I want to connect the dots. A naive plot of the contours gives a disappointing result, of course, because contour_x and contour_y are not sorted the way I would like:
plt.plot(contour_x,contour_y,'--r')
And here is the result, with a focus on an arbitrary part of the figure to highlight the connection between the dots:
How can I sort the contour coordinates contour_x and contour_y so that they are correctly ordered when I connect the dots? Furthermore, if my mask contains several independent connected components, I would like to obtain as many contours as there are connected components.
Thanks for your help!
Best,
I think combining clustering and a convex hull works in your case. For this example, I am generating three synthetic segments using the make_blobs function and showing each in a different color:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull, convex_hull_plot_2d
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0, cluster_std=0.3)
plt.scatter(X[:,0], X[:,1], c=y)
Then, since the segments are distributed over a two-dimensional map, we can run a density-based clustering method to cluster them, and then, by finding a convex hull around each cluster, we obtain the points surrounding those clusters in order:
# Fitting Clustering
c_alg = DBSCAN()
c_alg.fit(X)
labels = c_alg.labels_

for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    segment = X[ind, :][0]
    hull = ConvexHull(segment)
    plt.plot(segment[:, 0], segment[:, 1], 'o')
    for simplex in hull.simplices:
        plt.plot(segment[simplex, 0], segment[simplex, 1], 'k-')
However, in your case a concave hull should work, not a convex hull. There is a package alphashape in Python that claims to find concave hulls of two-dimensional point sets. The tricky part is to find the best alpha, but in this example we can fit concave hulls using:
import alphashape
from descartes import PolygonPatch

fig, ax = plt.subplots()
for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    points = X[ind, :][0, :, :]
    alpha_shape = alphashape.alphashape(points, 5.0)
    ax.scatter(*zip(*points))
    ax.add_patch(PolygonPatch(alpha_shape, alpha=0.5))
plt.show()
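Since choosing alpha by hand is the tricky part, one option is to let the package search for it (assuming your alphashape version provides the optimizealpha helper; it can be slow on large clusters):
import alphashape

# optimizealpha looks for the largest alpha that still yields a polygon
# containing all of the cluster's points
best_alpha = alphashape.optimizealpha(points)   # points: one cluster's coordinates, as above
alpha_shape = alphashape.alphashape(points, best_alpha)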

Why does kmeans give exactly the same results every time?

I have re-run kmeans 4 times and get exactly the same results each time.
From other answers, I got that
Every time K-Means initializes the centroids, they are generated randomly.
Could you please explain why the results are exactly the same each time?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);
plt.show()
They are not the same; they are similar. K-means is an algorithm that moves centroids iteratively so that they become better and better at splitting the data. While this process is deterministic, you have to pick initial values for those centroids, and that is usually done at random. A random start does not mean the final centroids will be random: they converge to something relatively good, and often similar.
Have a look at your code with this simple modification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        cc.append(kmeans.cluster_centers_)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);
plt.show()
cc
If you have a look at the exact values of those centroids, they will look like this:
[array([[ 4.97975722,  4.93316461],
        [ 5.21715504, -0.18757547],
        [ 0.31141141,  0.06726803],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162],
        [ 4.97975722,  4.93316461]]),
 array([[ 0.30066361,  0.06804847],
        [ 4.97975722,  4.93316461],
        [ 5.21017831, -0.18735444],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 4.97975722,  4.93316461],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162]])]
Similar, but different sets of values.
Also:
Have a look at default arguments to KMeans. There is one called n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
By default it is equal to 10, which means that every time you run k-means it actually runs 10 times and picks the best result. Those best results will be even more similar than the results of single runs of k-means.
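A minimal sketch of that effect on synthetic data (the data and variable names here are purely illustrative):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # any 2-D data works for the illustration

# one random initialization per run: centroids vary noticeably between runs
single = [KMeans(n_clusters=4, n_init=1).fit(X).cluster_centers_ for _ in range(3)]

# ten initializations per run (the long-time default): each run keeps the best
# of its 10 attempts by inertia, so runs agree much more closely with one another
best_of_10 = [KMeans(n_clusters=4, n_init=10).fit(X).cluster_centers_ for _ in range(3)]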
I post @AEF's comment to remove this question from the unanswered list:
Random initialization does not necessarily mean a random result. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
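A quick sketch of that k=1 case (synthetic data, purely illustrative):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)
center = KMeans(n_clusters=1, n_init=1).fit(X).cluster_centers_

# whatever the random initialisation, the single centroid ends up at the mean
print(np.allclose(center, X.mean(axis=0)))  # expected: True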
Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.
The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state's value may be None, an integer, or a numpy.random.RandomState instance.
For reference:
https://scikit-learn.org/stable/glossary.html#term-random_state
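For completeness, a minimal sketch of fixing random_state, reusing the don dataframe loaded above:
from sklearn.cluster import KMeans

# with a fixed random_state the initialization, and hence the fitted centroids,
# are exactly reproducible from run to run
km1 = KMeans(n_clusters=4, random_state=0).fit(don)
km2 = KMeans(n_clusters=4, random_state=0).fit(don)
# km1.cluster_centers_ and km2.cluster_centers_ are now identical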

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate the motion of charged particles. We are supposed to randomly generate their positions and charges, but we have to have positively charged particles in one region and negatively charged ones everywhere else. Right now, as a proof of concept, I am trying to do only 10 particles, but the final project will have at least 1000.
My thought process is to create a dataframe whose first column contains the randomly generated charges, and then run a loop that checks each value and fills the next three columns with the generated positions.
I have tried a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible, so that scaling up the number of particles doesn't slow it down too much.
I also want to vectorize the calculation of each particle's motion, since it depends on the positions of every other particle, which with ordinary loops would take a lot of computational time.
Any vectorization optimization or offloading to the GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

# In[2]:
num_points=10
df_position = pd.DataFrame(pd,np.empty((num_points,4)),columns=['Charge','X','Y','Z'])

# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)

# In[4]:
def positive():
    return np.random.uniform(low=0, high=5)

def negative():
    return np.random.uniform(low=5, high=10)

# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
    if(getattr(row,"Charge")==-1):
        df_position.iloc[row,1]=positive()
        df_position.iloc[row,2]=positive()
        df_position.iloc[row,3]=positive()
    else:
        df_position.iloc[row,1]=negative()
        #this is where I would get the IndexingError and would like to optimize this portion
        df_position.iloc[row,2]=negative()
        df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)

# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this:
Charge    X    Y    Z
  -1
   1
  -1
  -1
   1
with the initial coordinates of each charge listed in their respective columns. It will effectively be a 3D dataframe, since I will need to keep track of all the new positions after each time step so that I can animate the motion. Each layer will have exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd

num_points = 1_000

# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negative_point = num_points - positive_point

positive_df = pd.DataFrame(
    np.random.uniform(0.0, 5.0, size=[positive_point, 3]), index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
    np.random.uniform(5.0, 10.0, size=[negative_point, 3]), index=[-1] * negative_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: with my first answer I totally missed a big part of the question. This new one should fit better.
Second edit: I now use a better distribution for the number of positive points than a uniform distribution over integers.
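If you prefer to keep the single-dataframe layout from the question (a Charge column plus X, Y, Z), here is a sketch of a fully vectorized alternative with no Python loop; it follows the same region convention as above (positive charges in [0, 5), negative in [5, 10)):
import numpy as np
import pandas as pd

num_points = 1_000
rng = np.random.default_rng()

# +1 or -1 charge for every particle, chosen at random
charge = np.where(rng.integers(0, 2, size=num_points) == 0, -1, 1)

# lower bound of each particle's region, picked per row from the charge sign
low = np.where(charge == 1, 0.0, 5.0)
positions = low[:, None] + 5.0 * rng.random((num_points, 3))

df = pd.DataFrame(np.column_stack([charge, positions]),
                  columns=['Charge', 'X', 'Y', 'Z'])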

Using python and networkx to find the probability density function

I'm struggling to draw a power-law graph for Facebook data that I found online. I'm using NetworkX and I've found how to draw a degree histogram and a degree rank plot. The problem I'm having is that I want the y axis to be a probability, so I'm assuming I need to sum up each y value and divide by the total number of nodes? Can anyone please help me do this? Once I've got this, I'd like to draw a log-log graph to see if I can obtain a straight line. I'd really appreciate it if anyone could help! Here's my code:
import collections
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community
import math

g = nx.read_edgelist("/Users/Michael/Desktop/anaconda3/facebook_combined.txt")
nx.info(g)
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)
degreeCount = collections.Counter(degree_sequence)
deg, cnt = zip(*degreeCount.items())
fig, ax = plt.subplots()
plt.bar(deg, cnt, width=0.80, color='b')
plt.title("Degree Histogram for Facebook Data")
plt.ylabel("Count")
plt.xlabel("Degree")
ax.set_xticks([d + 0.4 for d in deg])
ax.set_xticklabels(deg)
plt.show()
plt.loglog(degree_sequence, 'b-', marker='o')
plt.title("Degree rank plot")
plt.ylabel("Degree")
plt.xlabel("Rank")
plt.show()
You seem to be on the right track, but some simplifications will likely help you. The code below uses only two libraries.
Without access to your graph, we can use some graph generators instead. I've chosen two qualitatively different types here, and deliberately chosen different sizes so that normalizing the histogram is actually needed.
import networkx as nx
import matplotlib.pyplot as plt

g1 = nx.scale_free_graph(1000)
g2 = nx.watts_strogatz_graph(2000, 6, p=0.8)

# we don't need to sort the values since the histogram will handle it for us
deg_g1 = [d for _, d in g1.degree()]
deg_g2 = [d for _, d in g2.degree()]

# there are smarter ways to choose bin locations, but since
# degrees must be discrete, we can be lazy...
max_degree = max(deg_g1 + deg_g2)

# plot different styles to see both
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(deg_g1, bins=range(0, max_degree), density=True, histtype='bar', rwidth=0.8, label='scale-free')
ax.hist(deg_g2, bins=range(0, max_degree), density=True, histtype='step', lw=3, label='small-world')

# setup the axes to be log/log scaled
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('degree')
ax.set_ylabel('relative density')
ax.legend()
plt.show()
plt.show()
This produces an output plot like this (both g1 and g2 are randomised, so the plots won't be identical):
Here we can see that g1 has an approximately straight-line decay in its degree distribution on the log-log axes, as expected for scale-free distributions. Conversely, g2 does not have a scale-free degree distribution.
To say anything more formal, you could look at the toolboxes from Aaron Clauset: http://tuvalu.santafe.edu/~aaronc/powerlaws/ which implement model fitting and statistical testing of power-law distributions.
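One Python implementation of those methods is the powerlaw package (an assumption that it is installed, e.g. via pip install powerlaw); a minimal sketch using the degree_sequence computed in the question's code:
import powerlaw

# fit a discrete power law to the node degrees
fit = powerlaw.Fit(degree_sequence, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)

# likelihood-ratio test: positive R favours the power law over the alternative
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)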
