I have been following lectures of MIT open course on Application of Mathematics in Finance. I am trying to code out the concepts for better understanding.
According to lectures(from what I understand), if random variable X is normally distributed then exp(X) is log-normally distributed and vice versa (please correct me if I am wrong here).
Here is what I tried:
I have list of integers that are normally distributed:
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
X = np.array(l)
mu = np.mean(X)
sigma = np.std(X)
count, bins, ignored = plt.hist(X, 35, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2)
),linewidth=2, color='r')
plt.show()
Output:
Normally distributed curve
Now I want to get log-normal distribution from above data, here is what I have tried:
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
X = np.array(l)
ln = []
for x in X:
val = np.e**x
ln.append(val)
X_ln = np.array(ln)
X_ln = np.array(X_ln) / np.min(X_ln)
mu = np.mean(X_ln)
sigma = np.std(X_ln)
count, bins, ignored = plt.hist(X_ln, 10, density=True)
x = np.linspace(min(bins), max(bins), 10000)
pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi)))
plt.plot(x, pdf, color='r', linewidth=2)
plt.show()
Output :
Not so clean Output
I know there is a better way to do this, but I can't figure out how. Any suggestions would be highly appreciated.
Here are couple of references:
Log normal distribution in Python
MIT lecture notes(topic-1.1)
In case this is relevant, here is a list of elements I am trying to process:
List of elements
Update 1:
I have normalized X before adding values to ln. This fixed the distribution of histogram, however, I can't seem to fix to get red line to show log-normal distribution. Also the new histogram distribution is not very different from normal distribution. I can't think of any suitable reason for that.
This is the block of code I have added:
def normalize(v):
norm=np.linalg.norm(v, ord=1)
if norm==0:
norm=np.finfo(v.dtype).eps
return v/norm
X = np.array(l)
X = normalize(X)
New Output:
Slightly better result
I have 2 sets of datapoints:
import random
import pandas as pd
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
For each one of these dataset I can produce the jointplot like this:
import seaborn as sns
sns.jointplot(x=A["x"], y=A["y"], kind='kde')
sns.jointplot(x=B["x"], y=B["y"], kind='kde')
Is there a way to calculate the "common area" between these 2 joint plots ?
By common area, I mean, if you put one joint plot "inside" the other, what is the total area of intersection. So if you imagine these 2 joint plots as mountains, and you put one mountain inside the other, how much does one fall inside the other ?
EDIT
To make my question more clear:
import matplotlib.pyplot as plt
import scipy.stats as st
def plot_2d_kde(df):
# Extract x and y
x = df['x']
y = df['y']
# Define the borders
deltaX = (max(x) - min(x))/10
deltaY = (max(y) - min(y))/10
xmin = min(x) - deltaX
xmax = max(x) + deltaX
ymin = min(y) - deltaY
ymax = max(y) + deltaY
# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
# We will fit a gaussian kernel using the scipy’s gaussian_kde method
positions = np.vstack([xx.ravel(), yy.ravel()])
values = np.vstack([x, y])
kernel = st.gaussian_kde(values)
f = np.reshape(kernel(positions).T, xx.shape)
fig = plt.figure(figsize=(13, 7))
ax = plt.axes(projection='3d')
surf = ax.plot_surface(xx, yy, f, rstride=1, cstride=1, cmap='coolwarm', edgecolor='none')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('PDF')
ax.set_title('Surface plot of Gaussian 2D KDE')
fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
ax.view_init(60, 35)
I am interested in finding the interection/common volume (just the number) of these 2 kde plots:
plot_2d_kde(A)
plot_2d_kde(B)
Credits: The code for the kde plots is from here
I believe this is what you're looking for. I'm basically calculating the space (integration) of the intersection (overlay) of the two KDE distributions.
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
# KDE fro both A and B
kde_a = scipy.stats.gaussian_kde([A.x, A.y])
kde_b = scipy.stats.gaussian_kde([B.x, B.y])
min_x = min(A.x.min(), B.x.min())
min_y = min(A.y.min(), B.y.min())
max_x = max(A.x.max(), B.x.max())
max_y = max(A.y.max(), B.y.max())
print(f"x is from {min_x} to {max_x}")
print(f"y is from {min_y} to {max_y}")
x = [a[0] for a in itertools.product(np.arange(min_x, max_x, 0.01), np.arange(min_y, max_y, 0.01))]
y = [a[1] for a in itertools.product(np.arange(min_x, max_x, 0.01), np.arange(min_y, max_y, 0.01))]
# sample across 100x100 points.
a_dist = kde_a([x, y])
b_dist = kde_b([x, y])
print(a_dist.sum() / len(x)) # intergral of A
print(b_dist.sum() / len(x)) # intergral of B
print(np.minimum(a_dist, b_dist).sum() / len(x)) # intergral of the intersection between A and B
The following code compares calculating the volume of the intersection either via scipy's dblquad or via taking the average value over a grid.
Remarks:
For the 2D case (and with only 100 sample points), it seems the delta's need to be quite larger than 10%. The code below uses 25%. With a delta of 10%, the calculated values for f1 and f2 are about 0.90, while in theory they should be 1.0. With a delta of 25%, these values are around 0.994.
To approximate the volume the simple way, the average needs to be multiplied by the area (here (xmax - xmin)*(ymax - ymin)). Also, the more grid points are considered, the better the approximation. The code below uses 1000x1000 grid points.
Scipy has some special functions to calculate the integral, such as scipy.integrate.dblquad. This is much slower than the 'simple' method, but a bit more precise. The default precision didn't work, so the code below reduces that precision considerably. (dblquad outputs two numbers: the approximate integral and an indication of the error. To only get the integral, dblquad()[0] is used in the code.)
The same approach can be used for more dimensions. For the 'simple' method, create a more dimensional grid (xx, yy, zz = np.mgrid[xmin:xmax:100j, ymin:ymax:100j, zmin:zmax:100j]). Note that a subdivision by 1000 in each dimension would create a grid that's too large to work with.
When using scipy.integrate, dblquad needs to be replaced by tplquad for 3 dimensions or nquad for N dimensions. This probably will also be rather slow, so the accuracy needs to be reduced further.
import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.integrate import dblquad
df1 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
df2 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
# Extract x and y
x1 = df1['x']
y1 = df1['y']
x2 = df2['x']
y2 = df2['y']
# Define the borders
deltaX = (np.max([x1, x2]) - np.min([x1, x2])) / 4
deltaY = (np.max([y1, y2]) - np.min([y1, y2])) / 4
xmin = np.min([x1, x2]) - deltaX
xmax = np.max([x1, x2]) + deltaX
ymin = np.min([y1, y2]) - deltaY
ymax = np.max([y1, y2]) + deltaY
# fit a gaussian kernel using scipy’s gaussian_kde method
kernel1 = st.gaussian_kde(np.vstack([x1, y1]))
kernel2 = st.gaussian_kde(np.vstack([x2, y2]))
print('volumes via scipy`s dblquad (volume):')
print(' volume_f1 =', dblquad(lambda y, x: kernel1((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print(' volume_f2 =', dblquad(lambda y, x: kernel2((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print(' volume_intersection =',
dblquad(lambda y, x: np.minimum(kernel1((x, y)), kernel2((x, y))), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
Alternatively, one can calculate the mean value over a grid of points, and multiply the result by the area of the grid. Note that np.mgrid is much faster than creating a list via itertools.
# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:1000j, ymin:ymax:1000j]
positions = np.vstack([xx.ravel(), yy.ravel()])
f1 = np.reshape(kernel1(positions).T, xx.shape)
f2 = np.reshape(kernel2(positions).T, xx.shape)
intersection = np.minimum(f1, f2)
print('volumes via the mean value multiplied by the area:')
print(' volume_f1 =', np.sum(f1) / f1.size * ((xmax - xmin)*(ymax - ymin)))
print(' volume_f2 =', np.sum(f2) / f2.size * ((xmax - xmin)*(ymax - ymin)))
print(' volume_intersection =', np.sum(intersection) / intersection.size * ((xmax - xmin)*(ymax - ymin)))
Example output:
volumes via scipy`s dblquad (volume):
volume_f1 = 0.9946974276169385
volume_f2 = 0.9928998852123891
volume_intersection = 0.9046421634401607
volumes via the mean value multiplied by the area:
volume_f1 = 0.9927873844924111
volume_f2 = 0.9910132867915901
volume_intersection = 0.9028999384136771
I have a variable (P) which is a function of angle (theta):
In this equation the K is a constant, theta_p is equal to zero and I is the modified Bessel function of the first kind (order 0) which is defined as:
Now, I want to plot the P versus theta for different values of constant K. First I calculated the parameter I and then plug it into the first equation to calculate P for different angles theta. I mapped it into a Cartesian coordinate by putting :
x = P*cos(theta)
y = P*sin(theta)
Here is my python implementation using matplotlib and scipy when the constant k=2.0:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad
def integrand(x, a, k):
return a*np.exp(k*np.cos(x))
theta = (np.arange(0, 362, 2))
theta_p = 0.0
X = []
Y = []
for i in range(len(theta)):
a = (1 / np.pi)
k = 2.0
Bessel = quad(integrand, 0, np.pi, args=(a, k))
I = list(Bessel)[0]
P = (1 / (np.pi * I)) * np.exp(k * np.cos(2 * (theta[i]*np.pi/180. - theta_p)))
x = P*np.cos(theta[i]*np.pi/180.)
y = P*np.sin(theta[i]*np.pi/180.)
X.append(x)
Y.append(y)
plt.plot(X,Y, linestyle='-', linewidth=3, color='red')
axes = plt.gca()
plt.show()
I should get a set of graphs like the below figure for different K values:
(Note that the distributions were plotted on a circle of unit 1 to ease visualization)
However it seems like the graphs produced by the above code are not similar to the above figure.
Any idea what is the issue with the above implementation?
Thanks in advance for your help.
Here is how it looks like (for k=2):
The reference for these formulas are the equation 5 and 6 that you could find here
You had a mistake in your formula.
Your formula gives the delta of your function above a unit circle. So in your function to get the plot you want, simply add 1 to it.
Here is what you want, with some tidied up python. ...note you can do the whole calculation of the 'P' values as a numpy vector line, you don't need to loop over the indicies. ...also you can just do a polar plot directly in matplotlib - you don't need to transform it into cartesian.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad
theta = np.arange(0, 2*np.pi+0.1, 2*np.pi/100)
def integrand(x, a, k):
return a*np.exp(k*np.cos(x))
for k in np.arange(0, 5, 0.5):
a = (1 / np.pi)
Bessel = quad(integrand, 0, np.pi, args=(a, k))
I = Bessel[0]
P = 1 + (1/(np.pi * I)) * np.exp(k * np.cos(2 * theta))
plt.polar(theta, P)
plt.show()
How can I get from a plot in Python an exact value on y - axis? I have two arrays vertical_data and gradient(temperature_data) and I plotted them as:
plt.plot(gradient(temperature_data),vertical_data)
plt.show()
Plot shown here:
I need the zero value but it is not exactly zero, it's a float.
I did not find a good answer to the question of how to find the roots or zeros of a numpy array, so here is a solution, using simple linear interpolation.
import numpy as np
N = 750
x = .4+np.sort(np.random.rand(N))*3.5
y = (x-4)*np.cos(x*9.)*np.cos(x*6+0.05)+0.1
def find_roots(x,y):
s = np.abs(np.diff(np.sign(y))).astype(bool)
return x[:-1][s] + np.diff(x)[s]/(np.abs(y[1:][s]/y[:-1][s])+1)
z = find_roots(x,y)
import matplotlib.pyplot as plt
plt.plot(x,y)
plt.plot(z, np.zeros(len(z)), marker="o", ls="", ms=4)
plt.show()
Of course you can invert the roles of x and y to get
plt.plot(y,x)
plt.plot(np.zeros(len(z)),z, marker="o", ls="", ms=4)
Because people where asking how to get the intercepts at non-zero values y0, note that one may simply find the zeros of y-y0 then.
y0 = 1.4
z = find_roots(x,y-y0)
# ...
plt.plot(z, np.zeros(len(z))+y0)
People were also asking how to get the intersection between two curves. In that case it's again about finding the roots of the difference between the two, e.g.
x = .4 + np.sort(np.random.rand(N)) * 3.5
y1 = (x - 4) * np.cos(x * 9.) * np.cos(x * 6 + 0.05) + 0.1
y2 = (x - 2) * np.cos(x * 8.) * np.cos(x * 5 + 0.03) + 0.3
z = find_roots(x,y2-y1)
plt.plot(x,y1)
plt.plot(x,y2, color="C2")
plt.plot(z, np.interp(z, x, y1), marker="o", ls="", ms=4, color="C1")
I'm trying to get the four first order histogram statistics (mean,
variance, skewness and kurtosis) from a histogram.
I have this code that calculates the histogram:
import cv2
from matplotlib import pyplot as plt
img1 = 'img.jpg'
gray_img = cv2.imread(img1, cv2.IMREAD_GRAYSCALE)
plt.hist(gray_img.ravel(),256,[0,256])
plt.title('Histogram for gray scale picture')
plt.show()
How can I get that statistics?
Based on my answer here
def mean_h(val, freq):
return np.average(val, weights = freq)
def var_h(val, freq):
dev = freq * (val - mean_h(val, freq)) ** 2
return dev.sum() / freq.sum()
def moment_h(val, freq, n):
n = (freq * (val - mean_h(val, freq)) ** n).sum() / freq.sum()
d = var_h(val, freq) ** (n / 2)
return n / d
skewness and kurtosis are just the 3rd and 4th moments
If the number of bins is reasonable, you should be able to just count the values manually, put in a vector; and calculate all those moments.