Correlation and regression analysis - statistics

How should I analyze the correlation between an ordinal variable with four levels (0, 1, 2, 3) and a continuous variable covering various ranges? The scatter plot looks like four parallel horizontal rows of dots.

You could run a Spearman rank correlation test. Using R,
require(pspearman)
x <- c(rep("a", 5), rep("b", 5), rep("c", 5), rep("d", 5))
x <- factor(x, levels=c("a", "b", "c", "d"), ordered=T)
y <- 1:20
spearman.test(x, y)
Spearman's rank correlation rho
data: x and y
S = 40.6203, p-value = 6.566e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9694584
Warning message:
In spearman.test(x, y) : Cannot compute exact p-values with ties
For comparison, an example of a non-significant correlation:
set.seed(123)
y2 <- rnorm(20)
spearman.test(x, y2)
Spearman's rank correlation rho
data: x and y2
S = 1144.329, p-value = 0.5558
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.139602
Warning message:
In spearman.test(x, y2) : Cannot compute exact p-values with ties
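Since the rest of this page uses Python, here is a rough equivalent of the first R example, sketched with scipy.stats.spearmanr. The coding of the four levels as 0..3 is my assumption (the R code uses an ordered factor a < b < c < d). Ties are ranked by their average rank, so rho should agree with the R output, but the p-value may differ because the two packages use different approximations.

```python
import numpy as np
from scipy.stats import spearmanr

# Ordinal variable coded 0..3, five observations per level (mirrors the R example)
x = np.repeat([0, 1, 2, 3], 5)
# Continuous variable that increases with the level
y = np.arange(1, 21)

# Tied values of x receive their average rank before the correlation is computed
rho, p_value = spearmanr(x, y)
print(f"rho = {rho:.7f}, p-value = {p_value:.3g}")
```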

Related

How to calculate the common volume/intersection between 2, 2D kde plots in python?

I have 2 sets of datapoints:
import random
import pandas as pd
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
For each of these datasets I can produce a joint plot like this:
import seaborn as sns
sns.jointplot(x=A["x"], y=A["y"], kind='kde')
sns.jointplot(x=B["x"], y=B["y"], kind='kde')
Is there a way to calculate the "common area" between these 2 joint plots?
By common area I mean the total area of intersection if you put one joint plot "inside" the other. So if you imagine these 2 joint plots as mountains and put one mountain inside the other, how much of one falls inside the other?
EDIT
To make my question more clear:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
def plot_2d_kde(df):
    # Extract x and y
    x = df['x']
    y = df['y']
    # Define the borders
    deltaX = (max(x) - min(x))/10
    deltaY = (max(y) - min(y))/10
    xmin = min(x) - deltaX
    xmax = max(x) + deltaX
    ymin = min(y) - deltaY
    ymax = max(y) + deltaY
    # Create meshgrid
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    # We will fit a gaussian kernel using the scipy’s gaussian_kde method
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)
    fig = plt.figure(figsize=(13, 7))
    ax = plt.axes(projection='3d')
    surf = ax.plot_surface(xx, yy, f, rstride=1, cstride=1, cmap='coolwarm', edgecolor='none')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('PDF')
    ax.set_title('Surface plot of Gaussian 2D KDE')
    fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
    ax.view_init(60, 35)
I am interested in finding the intersection/common volume (just the number) of these 2 KDE plots:
plot_2d_kde(A)
plot_2d_kde(B)
Credits: The code for the kde plots is from here
I believe this is what you're looking for. I'm basically calculating the space (integration) of the intersection (overlay) of the two KDE distributions.
import itertools
import random
import numpy as np
import pandas as pd
import scipy.stats
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
# KDE for both A and B
kde_a = scipy.stats.gaussian_kde([A.x, A.y])
kde_b = scipy.stats.gaussian_kde([B.x, B.y])
min_x = min(A.x.min(), B.x.min())
min_y = min(A.y.min(), B.y.min())
max_x = max(A.x.max(), B.x.max())
max_y = max(A.y.max(), B.y.max())
print(f"x is from {min_x} to {max_x}")
print(f"y is from {min_y} to {max_y}")
x = [a[0] for a in itertools.product(np.arange(min_x, max_x, 0.01), np.arange(min_y, max_y, 0.01))]
y = [a[1] for a in itertools.product(np.arange(min_x, max_x, 0.01), np.arange(min_y, max_y, 0.01))]
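# evaluate both KDEs on the grid (about 100x100 points)
a_dist = kde_a([x, y])
b_dist = kde_b([x, y])
# the mean density times the area of the grid approximates the integral
area = (max_x - min_x) * (max_y - min_y)
print(a_dist.sum() / len(x) * area) # integral of A
print(b_dist.sum() / len(x) * area) # integral of B
print(np.minimum(a_dist, b_dist).sum() / len(x) * area) # integral of the intersection between A and B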
The following code compares calculating the volume of the intersection either via scipy's dblquad or via taking the average value over a grid.
Remarks:
For the 2D case (and with only 100 sample points), it seems the deltas need to be considerably larger than 10%. The code below uses 25%. With a delta of 10%, the calculated values for f1 and f2 are about 0.90, while in theory they should be 1.0. With a delta of 25%, these values are around 0.994.
To approximate the volume the simple way, the average needs to be multiplied by the area (here (xmax - xmin)*(ymax - ymin)). Also, the more grid points are considered, the better the approximation. The code below uses 1000x1000 grid points.
Scipy has some special functions to calculate the integral, such as scipy.integrate.dblquad. This is much slower than the 'simple' method, but a bit more precise. The default precision didn't work, so the code below reduces that precision considerably. (dblquad outputs two numbers: the approximate integral and an indication of the error. To only get the integral, dblquad()[0] is used in the code.)
The same approach can be used for more dimensions. For the 'simple' method, create a higher-dimensional grid (xx, yy, zz = np.mgrid[xmin:xmax:100j, ymin:ymax:100j, zmin:zmax:100j]). Note that a subdivision by 1000 in each dimension would create a grid that's too large to work with.
When using scipy.integrate, dblquad needs to be replaced by tplquad for 3 dimensions or nquad for N dimensions. This probably will also be rather slow, so the accuracy needs to be reduced further.
import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.integrate import dblquad
df1 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
df2 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
# Extract x and y
x1 = df1['x']
y1 = df1['y']
x2 = df2['x']
y2 = df2['y']
# Define the borders
deltaX = (np.max([x1, x2]) - np.min([x1, x2])) / 4
deltaY = (np.max([y1, y2]) - np.min([y1, y2])) / 4
xmin = np.min([x1, x2]) - deltaX
xmax = np.max([x1, x2]) + deltaX
ymin = np.min([y1, y2]) - deltaY
ymax = np.max([y1, y2]) + deltaY
# fit a gaussian kernel using scipy’s gaussian_kde method
kernel1 = st.gaussian_kde(np.vstack([x1, y1]))
kernel2 = st.gaussian_kde(np.vstack([x2, y2]))
print('volumes via scipy`s dblquad (volume):')
print(' volume_f1 =', dblquad(lambda y, x: kernel1((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print(' volume_f2 =', dblquad(lambda y, x: kernel2((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print(' volume_intersection =',
dblquad(lambda y, x: np.minimum(kernel1((x, y)), kernel2((x, y))), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
Alternatively, one can calculate the mean value over a grid of points, and multiply the result by the area of the grid. Note that np.mgrid is much faster than creating a list via itertools.
# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:1000j, ymin:ymax:1000j]
positions = np.vstack([xx.ravel(), yy.ravel()])
f1 = np.reshape(kernel1(positions).T, xx.shape)
f2 = np.reshape(kernel2(positions).T, xx.shape)
intersection = np.minimum(f1, f2)
print('volumes via the mean value multiplied by the area:')
print(' volume_f1 =', np.sum(f1) / f1.size * ((xmax - xmin)*(ymax - ymin)))
print(' volume_f2 =', np.sum(f2) / f2.size * ((xmax - xmin)*(ymax - ymin)))
print(' volume_intersection =', np.sum(intersection) / intersection.size * ((xmax - xmin)*(ymax - ymin)))
Example output:
volumes via scipy`s dblquad (volume):
volume_f1 = 0.9946974276169385
volume_f2 = 0.9928998852123891
volume_intersection = 0.9046421634401607
volumes via the mean value multiplied by the area:
volume_f1 = 0.9927873844924111
volume_f2 = 0.9910132867915901
volume_intersection = 0.9028999384136771
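As a sanity check of the "mean value times area" method, here is a small sketch that applies it to two densities whose overlap is known in closed form, instead of to fitted KDEs. The helper gauss2d and the grid limits are my own choices for illustration. For two unit-covariance normals whose means are 1 apart, the overlap is 2*Phi(-1/2), about 0.617, so the printed intersection should land close to that.

```python
import numpy as np

def gauss2d(x, y, mx=0.0, my=0.0):
    """Density of a bivariate normal with identity covariance and mean (mx, my)."""
    return np.exp(-0.5 * ((x - mx) ** 2 + (y - my) ** 2)) / (2 * np.pi)

# Grid wide enough that almost all probability mass of both densities lies inside
xmin, xmax, ymin, ymax = -6.0, 7.0, -6.0, 6.0
xx, yy = np.mgrid[xmin:xmax:500j, ymin:ymax:500j]
area = (xmax - xmin) * (ymax - ymin)

f1 = gauss2d(xx, yy)           # centred at (0, 0)
f2 = gauss2d(xx, yy, mx=1.0)   # centred at (1, 0)

# Mean value over the grid, multiplied by the area of the grid
volume_f1 = f1.mean() * area
volume_f2 = f2.mean() * area
volume_intersection = np.minimum(f1, f2).mean() * area

print(' volume_f1 =', volume_f1)
print(' volume_f2 =', volume_f2)
print(' volume_intersection =', volume_intersection)
```

The results come out slightly below the exact values because the grid includes both end points of each axis; a finer grid shrinks that gap.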

How to visualize feasible region for linear programming (with arbitrary inequalities) in Numpy/MatplotLib?

I need to implement a solver for linear programming problems. All of the restrictions are <= ones such as
5x + 10y <= 10
There can be an arbitrary number of these restrictions. Also, x >= 0 and y >= 0 implicitly.
I need to find the optimal solution (max) and show the feasible region in matplotlib. I've found the optimal solution by implementing the simplex method, but I can't figure out how to draw the graph.
Some approaches I've found:
This link finds the minimum of the y points from each function and uses plt.fill_between() to draw the region. But it doesn't work when I change the order of the equations, and I'm not sure which y values to take the minimum of, so I can't use it for arbitrary restrictions.
Find solution for every pair of restrictions and draw a polygon. Not efficient.
An easier approach might be to have matplotlib compute the feasible region on its own (with you only providing the constraints) and then simply overlay the "constraint" lines on top.
import numpy as np
import matplotlib.pyplot as plt
# plot the feasible region
d = np.linspace(-2,16,300)
x,y = np.meshgrid(d,d)
plt.imshow( ((y>=2) & (2*y<=25-x) & (4*y>=2*x-8) & (y<=2*x-5)).astype(int) ,
            extent=(x.min(),x.max(),y.min(),y.max()),origin="lower", cmap="Greys", alpha = 0.3);
# plot the lines defining the constraints
x = np.linspace(0, 16, 2000)
# y >= 2
y1 = (x*0) + 2
# 2y <= 25 - x
y2 = (25-x)/2.0
# 4y >= 2x - 8
y3 = (2*x-8)/4.0
# y <= 2x - 5
y4 = 2 * x -5
# Make plot
plt.plot(x, y1, label=r'$y\geq2$')
plt.plot(x, y2, label=r'$2y\leq25-x$')
plt.plot(x, y3, label=r'$4y\geq 2x - 8$')
plt.plot(x, y4, label=r'$y\leq 2x-5$')
plt.xlim(0,16)
plt.ylim(0,11)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
This is a vertex enumeration problem. You can use the function lineqs which visualizes the system of inequalities A x >= b for any number of lines. The function will also display the vertices on which the graph was plotted.
The last two rows of A and b encode x, y >= 0.
from intvalpy import lineqs
import numpy as np
A = -np.array([[5, 10],
[-1, 0],
[0, -1]])
b = -np.array([10, 0, 0])
lineqs(A, b, title='Solution', color='gray', alpha=0.5, s=10, size=(15,15), save=False, show=True)
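If you would rather not depend on an extra package, the "solve every pair of restrictions" idea from the question can be sketched with plain NumPy. The function feasible_vertices below is a hypothetical helper of mine, not part of any library. It assumes constraints in the form A x <= b, two variables, and a bounded region. It checks every pair of constraints, which the question already notes is not efficient, but that cost is small for a handful of constraints.

```python
import itertools
import numpy as np

def feasible_vertices(A, b, tol=1e-9):
    """Vertices of {p : A @ p <= b} in 2D, ordered counter-clockwise."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    points = []
    for i, j in itertools.combinations(range(len(b)), 2):
        M = A[[i, j]]
        if abs(np.linalg.det(M)) < tol:   # parallel boundary lines never intersect
            continue
        p = np.linalg.solve(M, b[[i, j]])
        if np.all(A @ p <= b + tol):      # keep intersections that satisfy every constraint
            points.append(p)
    if not points:
        return np.empty((0, 2))
    # Round and drop duplicates; adding 0.0 turns any -0.0 into 0.0
    pts = np.unique(np.round(points, 9) + 0.0, axis=0)
    centre = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0])
    return pts[np.argsort(angles)]

# 5x + 10y <= 10, plus the implicit x >= 0 and y >= 0 written as -x <= 0 and -y <= 0
A = [[5, 10], [-1, 0], [0, -1]]
b = [10, 0, 0]
vertices = feasible_vertices(A, b)
print(vertices)
```

The ordered vertices can then be shaded with plt.fill(vertices[:, 0], vertices[:, 1], alpha=0.3).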

Plotting all of a trigonometric function (x^2 + y^2 == 1) with matplotlib and python

As an exercise in learning Matplotlib and improving my math/coding I decided to try and plot a trigonometric function (x squared plus y squared equals one).
Trigonometric functions are also called "circular" functions but I am only producing half the circle.
#Attempt to plot equation x^2 + y^2 == 1
import numpy as np
import matplotlib.pyplot as plt
import math
x = np.linspace(-1, 1, 21) #generate np.array of X values -1 to 1 in 0.1 increments
x_sq = [i**2 for i in x]
y = [math.sqrt(1-(math.pow(i, 2))) for i in x] #calculate y for each value in x
y_sq = [i**2 for i in y]
#Print for debugging / sanity check
for i,j in zip(x_sq, y_sq):
    print('x: {:1.4f} y: {:1.4f} x^2: {:1.4f} y^2: {:1.4f} x^2 + Y^2 = {:1.4f}'.format(math.sqrt(i), math.sqrt(j), i, j, i+j))
#Format how the chart displays
plt.figure(figsize=(6, 4))
plt.axhline(y=0, color='y')
plt.axvline(x=0, color='y')
plt.grid()
plt.plot(x, y, 'rx')
plt.show()
I want to plot the full circle. My code only produces the positive y values and I want to plot the full circle.
Here is how the full plot should look. I used Wolfram Alpha to generate it.
Ideally I don't want solutions where the lifting is done for me such as using matplotlib.pyplot.contour. As a learning exercise, I want to "see the working" so to speak. Namely I ideally want to generate all the values and plot them "manually".
The only method I can think of is to re-arrange the equation and generate a set of negative y values with calculated x values then plot them separately. I am sure there is a better way to achieve the outcome and I am sure one of the gurus on Stack Overflow will know what those options are.
Any help will be gratefully received. :-)
The equation x**2 + y**2 = 1 describes a circle with radius 1 around the origin.
But suppose you didn't know this already; you can still try to write this equation in polar coordinates,
x = r*cos(phi)
y = r*sin(phi)
(r*cos(phi))**2 + (r*sin(phi))**2 == 1
r**2*(cos(phi)**2 + sin(phi)**2) == 1
Due to the trigonometric identity cos(phi)**2 + sin(phi)**2 == 1 this reduces to
r**2 == 1
and since r should be real,
r == 1
(for any phi).
Plugging this into python:
import numpy as np
import matplotlib.pyplot as plt
phi = np.linspace(0, 2*np.pi, 200)
r = 1
x = r*np.cos(phi)
y = r*np.sin(phi)
plt.plot(x,y)
plt.axis("equal")
plt.show()
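In the spirit of the sanity check in the question, here is a short sketch that verifies numerically that the points generated this way satisfy x**2 + y**2 == 1 and cover both the positive and the negative y values. The variable name residual is mine.

```python
import numpy as np

phi = np.linspace(0, 2 * np.pi, 200)
x = np.cos(phi)
y = np.sin(phi)

# Largest deviation of x^2 + y^2 from 1 over all generated points
residual = np.abs(x**2 + y**2 - 1)
print('max |x^2 + y^2 - 1| =', residual.max())
print('y ranges from', y.min(), 'to', y.max())
```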
This happens because the square root returns only the positive value, so you also need to plot the negated values to get the lower half of the circle.
You can do something like this:
import numpy as np
import matplotlib.pyplot as plt
r = 1 # radius
x = np.linspace(-r, r, 1000)
y = np.sqrt(r**2 - x**2)
plt.figure(figsize=(5,5), dpi=100) # figsize=(n,n), n needs to be equal so the image doesn't flatten out
plt.grid(linestyle='-', linewidth=2)
plt.plot(x, y, color='g')
plt.plot(x, -y, color='r')
plt.legend(['Positive y', 'Negative y'], loc='lower right')
plt.axhline(y=0, color='b')
plt.axvline(x=0, color='b')
plt.show()
That should draw the full circle, with the upper half in green and the lower half in red.

python - spectrogram divide by zero encountered in log10 warning

I tried to generate a spectrogram for each axis in my dataset.
Here is what I tried:
import numpy as np
import matplotlib.pyplot as plt
dataset = np.loadtxt("trainingdataset.txt", delimiter=",", dtype = np.int32)
fake_size = 1415684
time = np.arange(fake_size)/1415684 # 1kHz
base_freq = 2 * np.pi * 100
x = dataset[:,2]
y = dataset[:,3]
z = dataset[:,4]
xyz_magnitude = x**2 + y**2 + z**2
to_plot = [('x', x), ('y', y), ('z', z), ('xyz', xyz_magnitude)]
for chl, data in to_plot:
    plt.figure(); plt.title(chl)
    d = plt.specgram(data, Fs=1000)
    plt.xlabel('Time [s]'); plt.ylabel('Frequency [Hz]')
plt.show()
But it gives the following warning:
Warning (from warnings module):
File "C:\Users\hadeer.elziaat\AppData\Local\Programs\Python\Python36\lib\site-packages\matplotlib\axes\_axes.py", line 7221
Z = 10. * np.log10(spec)
RuntimeWarning: divide by zero encountered in log10
the dataset headers
(patient number, time/millisecond, x-axis, y-axis, z-axis, label)
1,15,70,39,-970,0
1,31,70,39,-970,0
1,46,60,49,-960,0
1,62,60,49,-960,0
1,78,50,39,-960,0
1,93,50,39,-960,0
1,109,60,39,-990,0
According to the manual, the default scaling of specgram is dB. If the calculated spectrogram contains zero values, evaluating the logarithm produces this warning, because log10(0) is not finite (NumPy returns -inf).
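A minimal sketch that reproduces the warning outside matplotlib, using a made-up array of power values in place of the real spectrogram. It also shows one possible workaround, clipping the values to a tiny positive floor before taking the logarithm (the floor eps is my choice). Within matplotlib itself, passing scale='linear' to plt.specgram should skip the dB conversion altogether.

```python
import warnings
import numpy as np

# Hypothetical power values; the last one is exactly zero
spec = np.array([4.0, 1.0, 0.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    db = 10.0 * np.log10(spec)   # same expression as in the warning message

print(db)                                   # the zero entry becomes -inf
print([str(w.message) for w in caught])     # the RuntimeWarning text

# Workaround: clip to the smallest positive normal float before the log
eps = np.finfo(float).tiny
db_safe = 10.0 * np.log10(np.maximum(spec, eps))
print(db_safe)
```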

Calculate the volume of 3d plot

The data is from a measurement. The picture of the plotted data
I tried using trapz twice, but I get an error: "ValueError: operands could not be broadcast together with shapes (1,255) (256,531)"
x has 256 points and y has 532 points, and Z is a 2D array of size 256 by 532. The code is below:
import numpy as np
img=np.loadtxt('focus_x.txt')
m=0
m=np.max(img)
Z=img/m
X=np.loadtxt("pixelx.txt",float)
Y=np.loadtxt("pixely.txt",float)
[X, Y] = np.meshgrid(X, Y)
volume=np.trapz(X,np.trapz(Y,Z))
The docs state that trapz should be used like this
intermediate = np.trapz(Z, x)
result = np.trapz(intermediate, y)
trapz reduces the dimensionality of its operand (by default along the last axis), optionally using a 1D array of abscissae to determine the subintervals of integration; it does not use a mesh grid for its operation.
A complete example.
First we compute, using sympy, the integral of a simple bilinear function over a rectangular domain (0, 5) × (0, 7)
In [1]: import sympy as sp, numpy as np
In [2]: x, y = sp.symbols('x y')
In [3]: f = 1 + 2*x + y + x*y
In [4]: f.integrate((x, 0, 5)).integrate((y, 0, 7))
Out[4]: 2555/4
Now we compute the trapezoidal approximation to the integral (as it happens, the approximation is exact for a bilinear function). We need coordinate arrays
In [5]: x, y = np.linspace(0, 5, 11), np.linspace(0, 7, 22)
(note that the sampling is different in the two directions and different from the default spacing used by trapz). We need a mesh grid to compute the integrand, and we need to compute the integrand
In [6]: X, Y = np.meshgrid(x, y)
In [7]: z = 1 + 2*X + Y + X*Y
and eventually we compute the integral
In [8]: 4*np.trapz(np.trapz(z, x), y)
Out[8]: 2555.0
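Applied to the shapes in the question (256 x-points, 532 y-points, Z of size 256 by 532), the same idea might look like the sketch below. The coordinate arrays and Z are synthetic stand-ins for pixelx.txt, pixely.txt and focus_x.txt, and I am assuming that the rows of Z follow x and the columns follow y. The axis argument tells trapz which dimension each coordinate array belongs to. Newer NumPy releases call the function np.trapezoid, so the sketch uses whichever name is available.

```python
import numpy as np

# Use np.trapezoid where it exists, otherwise fall back to np.trapz
trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz

# Synthetic stand-ins for the coordinate files
x = np.linspace(0, 5, 256)
y = np.linspace(0, 7, 532)

# Synthetic stand-in for the image: rows follow x, columns follow y
X, Y = np.meshgrid(x, y, indexing="ij")
Z = 1 + 2 * X + Y + X * Y          # shape (256, 532)

inner = trapezoid(Z, y, axis=1)    # integrate over y, result has shape (256,)
volume = trapezoid(inner, x)       # integrate over x, result is a scalar
print(inner.shape, volume)         # volume should match 2555/4 = 638.75
```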
