Normalizing CDF in Python - python-3.x

I want to calculate and plot the cumulative distribution function (CDF) of a given sample, new_dO18, and then overlay the CDF of a normal distribution with a given mean and standard deviation on the same plot. I am having problems normalizing the CDF. I should have values ranging between 0 and 1 on the x axis. Can someone show me where I went wrong? I'm sure it's a simple fix, but I'm very new to Python. I've included my steps so far. Thanks!
# Use np.histogram to get counts in each bin. See the help page or
# documentation on how to use this function, and what it returns.
# normalize the data new_dO18 using a for loop
norm_newdO18 = []
for element in new_dO18:
    x = element
    y = (x - np.mean(new_dO18))/np.std(new_dO18)
    norm_newdO18.append(y)
print ('normalized dO18 values, excluding outliers:', norm_newdO18)
print()
# Use the histogram function to bin the data
num_bins = 20
counts, bin_edges = np.histogram(norm_newdO18, bins=num_bins)
# Calculate and plot CDF of sample
cdf = np.cumsum(counts)
scale = 1.0/cdf[-1]
norm_cdf = scale * cdf
plt.plot(bin_edges[1:], norm_cdf, label = 'dO18 values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.xlabel('normalized dO18 data')
plt.ylabel('frequency')
# Calculate and overlay the CDF of a normal distribution with sample mean and std
# as parameters.
# specific normally distributed function with mean and st. dev
mu, sigma = np.mean(new_dO18), np.std(new_dO18)
norm_theoretical = np.random.normal(mu, sigma, 1000)
# Calculate and plot CDF of theoretical sample
counts1, bin_edges1 = np.histogram(norm_theoretical, bins=20)
cdft= np.cumsum(counts1)
scale = 1.0/cdft[-1]
norm_cdft = scale * cdf
plt.plot(bin_edges[1:], norm_cdft, label = 'theoretical values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.show()
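A possible way around the normalization problem, offered as a sketch rather than a definitive answer: in the code above the second curve is built from `cdf` and `bin_edges` instead of `cdft` and `bin_edges1`, and the sample is standardized while the theoretical draw is not, so the two curves do not share an x scale. The version below skips the histogram entirely. It computes the empirical CDF from the sorted sample and evaluates the normal CDF analytically with the error function. The helper names `empirical_cdf` and `normal_cdf` and the synthetic `new_dO18` values are assumptions made for illustration.

```python
import math
import numpy as np

def empirical_cdf(sample):
    """Return the sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size   # last value is exactly 1.0
    return x, y

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, evaluated with the error function."""
    z = (np.asarray(x, dtype=float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

# Placeholder data standing in for new_dO18 (an assumption for this sketch).
rng = np.random.default_rng(0)
new_dO18 = rng.normal(-5.0, 1.2, size=200)

mu, sigma = np.mean(new_dO18), np.std(new_dO18)
x_emp, y_emp = empirical_cdf(new_dO18)
y_theo = normal_cdf(x_emp, mu, sigma)

# To draw both curves on the same axes:
# import matplotlib.pyplot as plt
# plt.step(x_emp, y_emp, where='post', label='dO18 values')
# plt.plot(x_emp, y_theo, label='normal CDF')
# plt.xlabel('dO18'); plt.ylabel('cumulative probability')
# plt.legend(); plt.show()
```

Both curves are evaluated at the same x values, so the cumulative probability on the y axis runs from 0 to 1 for each of them.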

Related

Skewed random sample from Numpy random generator sample (numpy.random.Generator.choice)

I have written some Python code to generate a mixture of normal distributions, and I want to sample from it. Since the result is my probability density function, I want the sample to be representative of the original distribution.
So I have developed this function to create the PDF:
def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if len(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            normal_array += coeff[index] * np.exp(-0.5 * ((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4*std, mean + 4*std, sample_int)
        normal_array = coeff * np.exp(-0.5 * ((absciss_array - mean) / std) ** 2)
    # normalize so the values sum to 1 and can be used as probabilities
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested the sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then get this result:
The dark lines mark the samples that have been drawn from the distribution. The problem is that, even though I pass the probability array to the NumPy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds, but the result does not change.
I expect to have more samples in the area where the probability density is greater.
Could someone please help me? Am I missing something here?
Thanks in advance.
Well, actually the answer was to sample from a uniformly spaced array of values and to use the PDF only as the probabilities. Thanks to @amzon-ex for pointing it out.
The code is then:
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
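As a hedged illustration of the same idea, the sketch below returns the grid of x values together with the probabilities, so that `rng.choice` draws x positions rather than PDF values. It assumes the abscissa should span the support on which the mixture was evaluated (about 4 standard deviations around the outermost means). The function name `mixture_pdf_on_grid` is made up for this example.

```python
import numpy as np

def mixture_pdf_on_grid(amplitude, mean, std, n_points):
    """Return (grid, probabilities) for a Gaussian mixture on a uniform grid."""
    amplitude, mean, std = (np.asarray(a, dtype=float) for a in (amplitude, mean, std))
    grid = np.linspace(np.min(mean - 4 * std), np.max(mean + 4 * std), n_points)
    # One row per mixture component, then sum over the components.
    z = (grid[None, :] - mean[:, None]) / std[:, None]
    coeff = amplitude[:, None] / (std[:, None] * np.sqrt(2 * np.pi))
    pdf = (coeff * np.exp(-0.5 * z ** 2)).sum(axis=0)
    # Normalize to a discrete probability vector that sums to 1.
    return grid, pdf / pdf.sum()

grid, prob = mixture_pdf_on_grid([1, 2, 1], [0.5, 1.0, 2.5], [0.1, 0.2, 0.3], 10_000)

rng = np.random.default_rng(424242)
# Draw x positions from the grid, weighted by the mixture probabilities.
sample = rng.choice(grid, size=100, replace=True, p=prob)
```

Plotting `prob` against `grid` and marking `sample` on the same x axis should then show more tick marks where the density is higher.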

Time series anomaly detection plot

I have a pandas dataframe of size (1280,2). The head of the data looks as follows:
I'm using a clustering based anomaly detection method using k-means. It creates 'k' similar clusters of data points. Data points that fall outside of these groups are marked as anomalies.
def getDistanceByPoint(data, model):
    distance = pd.Series(dtype=float)
    for i in range(0, len(data)):
        Xa = np.array(data.loc[i])
        # centre of the cluster that point i was assigned to
        Xb = model.cluster_centers_[model.labels_[i]]
        distance.loc[i] = np.linalg.norm(Xa - Xb)
    return distance
kmeans = KMeans(n_clusters=9).fit(data)
outliers_fraction = 0.01
distance = getDistanceByPoint(data, kmeans)
number_of_outliers = int(outliers_fraction*len(distance))
threshold = distance.nlargest(number_of_outliers).min()
# 0: normal, 1: anomaly
df['anomaly1'] = (distance >= threshold).astype(int)
I want to plot the data frame with the x-axis as time elapsed and the y-axis as value. I would like to plot the normal data values in blue and the anomaly values in red. How could I plot this?
This is what you need. Remember to change time and value to match your column names.
fig, ax = plt.subplots()
a = df.loc[df['anomaly1'] == 1, ['time', 'value']]
ax.plot(df['time'], df['value'], color='blue')
ax.scatter(a['time'], a['value'], color='red')
plt.show()
Check this notebook out for more information.
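The distance-and-threshold step described in the question can also be written without a Python loop. The sketch below is one possible vectorized version using only NumPy. The function name `flag_anomalies` and the toy arrays are assumptions for illustration. With a fitted scikit-learn `KMeans` model, `centers` and `labels` would come from `model.cluster_centers_` and `model.labels_`.

```python
import numpy as np

def flag_anomalies(points, centers, labels, outliers_fraction=0.01):
    """Flag the points that lie farthest from their assigned cluster centre."""
    points = np.asarray(points, dtype=float)
    assigned = np.asarray(centers, dtype=float)[np.asarray(labels)]
    # Euclidean distance from each point to its own cluster centre.
    distance = np.linalg.norm(points - assigned, axis=1)
    n_outliers = max(1, int(outliers_fraction * len(distance)))
    # The threshold is the smallest of the n_outliers largest distances.
    threshold = np.sort(distance)[-n_outliers]
    return distance, (distance >= threshold).astype(int)

# Toy data (made up): two tight clusters plus one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

distance, anomaly = flag_anomalies(points, centers, labels, outliers_fraction=0.2)
```

The returned 0/1 array can be stored as a data frame column and used as a mask to scatter the anomalous rows in red over the blue line.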

Create a baseline to line up plots and then fit a Gaussian to each

I am trying to line up the attached plots so I can properly fit Gaussians to them but am not sure how to do so. I want them to have the same baseline. This is how I read in the data:
fig = plt.figure(figsize=(16, 8), dpi=360)
plt.xlabel('wavenumbers (cm^-1)', fontsize=18)
plt.ylabel('absorbance', fontsize=16)
plt.xlim(4000,650)
for i in range((len(data.columns))):
    if i%2 == 0:
        xdata = data.iloc[100:,i]
        ydata = data.iloc[100:,i+1]
        plt.plot(xdata,ydata, label = str(data.columns[i+1]))
Then I use a Gaussian function to fit to my data but it doesn't work.
#locations of peaks to bias model
peak_locations = [985] #enter values here for peak locations
numb_of_peaks = len(peak_locations)
def gaussian_basis(x, xk , sigma = 1):
    # x is a vector
    # xk is the centre of the basis function
    # sigma is the standard deviation
    return np.exp(-((x - xk)**2/(2*sigma**2)))
The image of my data is attached. How can I give the plots the same baseline so that the increase in the peaks is visible?
https://i.stack.imgur.com/VcsZW.jpg
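One way to approach this, sketched under stated assumptions: fit a straight line to a region of each spectrum that is assumed to contain no peak, subtract it so every curve sits on a zero baseline, and then fit the height of a Gaussian at each known peak location by linear least squares. The helper names `subtract_linear_baseline` and `fit_amplitudes`, the fixed width `sigma`, the choice of peak-free region, and the synthetic spectrum are all illustrative. Real spectra may need a curved baseline or a nonlinear fit of the width and position.

```python
import numpy as np

def gaussian_basis(x, xk, sigma=1.0):
    """Unit-height Gaussian centred at xk."""
    return np.exp(-((x - xk) ** 2) / (2 * sigma ** 2))

def subtract_linear_baseline(x, y, baseline_mask):
    """Fit a line to the points selected by baseline_mask and subtract it."""
    slope, intercept = np.polyfit(x[baseline_mask], y[baseline_mask], deg=1)
    return y - (slope * x + intercept)

def fit_amplitudes(x, y, peak_locations, sigma):
    """Least-squares heights for Gaussians with fixed centres and width."""
    design = np.column_stack([gaussian_basis(x, xk, sigma) for xk in peak_locations])
    amplitudes, *_ = np.linalg.lstsq(design, y, rcond=None)
    return amplitudes

# Synthetic spectrum (made up): one peak at 985 on a sloping baseline.
x = np.linspace(650.0, 4000.0, 3351)
y = 0.8 * gaussian_basis(x, 985.0, 15.0) + 2e-5 * x + 0.1

off_peak = np.abs(x - 985.0) > 100.0   # assumed peak-free region
y_flat = subtract_linear_baseline(x, y, off_peak)
amplitudes = fit_amplitudes(x, y_flat, [985.0], sigma=15.0)
```

Applying the same two steps to every column pair puts all spectra on a common baseline, so the fitted heights can be compared directly.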

log x axis on matplotlib histogram with imshow()

I have some x data following a log-normal distribution, and I would like to plot a histogram of it. I have this data for different parameters, so I would like those parameters on my y-axis and the bins of my histogram on my x-axis (in log scale with reasonable ticks such as 10^0 to 10^3), exactly like ax.hist(), only using imshow() to get a more compact and better-looking result.
mu, sigma = 3., 2.
img = []
for i in range(20):
    dat = np.random.lognormal(mu, sigma + 1/10, 10000)
    hist = np.histogram(dat, bins = 10**(np.arange(0, 3, step = 0.1)))[0]
    img.append(hist)
plt.imshow(img)
plt.show()
The result looks like this, but I would like the x-axis to be logarithmic and to match the bins.
I also have data for y, but that is not so much of a problem.
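A possible approach, given here as a sketch: `imshow()` draws equally sized pixels, so it cannot place columns at the true bin edges. `pcolormesh()` accepts the edges directly, and the axis can then be switched to log scale. In the example below the bin edges come from `np.logspace(0, 3, 31)`, which differs slightly from the edges in the question, and each row uses `sigma + i / 10`, which is a guess at the intended parameter sweep. The function names are illustrative.

```python
import numpy as np

def build_histogram_image(n_rows=20, mu=3.0, sigma=2.0, n_samples=10_000, seed=0):
    """Return (img, edges): one log-binned histogram per row."""
    rng = np.random.default_rng(seed)
    edges = np.logspace(0, 3, 31)   # 30 bins from 10^0 to 10^3
    img = np.array([
        np.histogram(rng.lognormal(mu, sigma + i / 10, n_samples), bins=edges)[0]
        for i in range(n_rows)
    ])
    return img, edges

def plot_histogram_image(img, edges):
    """Draw img so each column spans its true bin on a logarithmic x axis."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    y_edges = np.arange(img.shape[0] + 1)   # one cell per parameter row
    mesh = ax.pcolormesh(edges, y_edges, img)
    ax.set_xscale('log')
    ax.set_xlabel('x')
    ax.set_ylabel('parameter index')
    fig.colorbar(mesh, ax=ax, label='counts')
    return fig, ax

img, edges = build_histogram_image()
```

Calling `plot_histogram_image(img, edges)` followed by `plt.show()` displays the result. The y edges can be replaced by the actual parameter values if they are monotonic.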

Fitting distribution functions to dataset in Python 3

I'm trying to find the probability distribution that best fits my data. I've tried the code I found in different threads, but the results are not what I'm expecting.
The descriptive statistics and histogram for my data are as follows:
Data Histogram
count 865.000000
mean 43.476713
std 12.486362
min 4.075682
25% 34.934609
50% 41.917304
75% 51.271708
max 88.843940
I tried to find a proper distribution function using the following code, but the results are not what I expected.
size = 865
kappa=99
x = scipy.arange(size)
y = scipy.int_(scipy.round_(st.vonmises.rvs(kappa,size=size)*100))
h = plt.hist(df['spreadMaizChicagoAtlantico'],bins=100,color='b')
dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,100)
plt.legend(loc='upper right')
plt.show()
Data histogram with functions
Can anyone please tell me what I'm doing wrong and help me understand these solutions better?
Thanks to the earlier reply, I found my mistake.
I took all the values from the DataFrame and made a NumPy array.
ser=df.values
Then I ran code similar to before, this time fitting the distributions to the actual data:
size = 867
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*60))
h = plt.hist(ser, bins=range(80))
dist_names = ['beta', 'rayleigh', 'norm']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(ser)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,100)
plt.legend(loc='upper right')
plt.show()
The result is as follows, showing the histogram and three probability density functions.
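To compare the candidates numerically rather than by eye, one option is to rank them by a goodness-of-fit statistic. The sketch below fits each distribution by maximum likelihood with `scipy.stats` and sorts by the Kolmogorov-Smirnov statistic. The function name `rank_distributions`, the candidate list, and the synthetic data (chosen to roughly match the summary statistics above, not the real series) are assumptions. Because the parameters are estimated from the same data, the statistic is best read as a relative ranking only.

```python
import numpy as np
import scipy.stats as st

def rank_distributions(data, dist_names=("norm", "gamma", "rayleigh")):
    """Fit each candidate by maximum likelihood and rank by KS statistic."""
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(data)   # (shape parameters..., loc, scale)
        ks_stat, _ = st.kstest(data, name, args=params)
        results.append((name, ks_stat, params))
    # Smaller KS statistic means the fitted CDF is closer to the data.
    return sorted(results, key=lambda r: r[1])

# Placeholder sample, not the real data.
rng = np.random.default_rng(1)
data = rng.normal(43.5, 12.5, size=865)
ranking = rank_distributions(data)

# Density of the best candidate over the data range, ready to overlay on
# plt.hist(data, bins=..., density=True).
best_name, best_ks, best_params = ranking[0]
x = np.linspace(data.min(), data.max(), 200)
pdf = getattr(st, best_name).pdf(x, *best_params)
```

Plotting the histogram with `density=True` keeps the bars and the fitted PDF on the same scale, which avoids multiplying the PDF by the sample size.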
The distfit library can do this job as it searches for the best fit among 89 theoretical distributions.
pip install distfit
import numpy as np
from distfit import distfit
# Example data
X = np.random.normal(10, 3, 2000)
# Initialize
dfit = distfit()
# Search for best theoretical fit on your empirical data
dfit.fit_transform(X)
# The plot function will now also include the predictions of y
dfit.plot(chart='PDF',
          emp_properties={'linewidth': 4, 'color': 'k'},
          bar_properties={'edgecolor':'k', 'color':'g'},
          pdf_properties={'linewidth': 4, 'color': 'r'})
