Let's say I have a dataset (sinusoidal curve in this example):
import matplotlib.pyplot as plt
import numpy as np
T = 1
Fs = 10000
N = T*Fs
t = np.linspace(0,T,N)
x = 10 * np.sin(2*np.pi*2*t)
plt.figure(figsize=(8,8))
plt.plot(t,x,'k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
How do I figure out the nature of the distribution (normal/weibull/uniform/exponential/etc.) of 'x'?
Basically you have to run a goodness-of-fit test iteratively over candidate distributions to see which one best fits your sample data.
Luckily, fitter not only automates that iteration using SciPy (meaning you could also do it manually with SciPy alone) but also displays a plot and a table of the fit statistics.
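For reference, a minimal sketch of what that manual SciPy route could look like (the candidate list and the use of the Kolmogorov-Smirnov statistic as the score are my own choices, not part of the original question):
import numpy as np
from scipy import stats
data = np.random.normal(10, 2, 5_000)  # stand-in sample; replace with your own data
candidates = ['norm', 'expon', 'uniform', 'rayleigh']
scores = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)  # maximum-likelihood fit of this candidate
    # Kolmogorov-Smirnov statistic as the goodness-of-fit score (smaller is better)
    scores[name] = stats.kstest(data, name, args=params).statistic
print(sorted(scores.items(), key=lambda kv: kv[1]))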
Below are some np.random example distributions and the sine function from your question, along with the respective code.
Note the heads-up sections at the end.
Pre-sets:
import numpy as np
from fitter import Fitter, get_common_distributions
distributions_set = get_common_distributions()
distributions_set.extend(['arcsine', 'cosine', 'expon', 'weibull_max',
'weibull_min', 'dweibull', 't', 'pareto',
'exponnorm', 'lognorm'])
Sine (from your example):
# the values of a sampled sine wave follow an arcsine distribution (arcsine = inverse sine)
T = 1
Fs = 10_000
N = T*Fs
t = np.linspace(0,T,N)
np_sine_arr = 10 * np.sin(2*np.pi*2*t)
f_sine = Fitter(np_sine_arr, distributions = distributions_set)
f_sine.fit()
f_sine.summary()
Normal: note that the t distribution also shows up as a close fit for normally distributed data.
# normal
mu, sigma = 0.0, 0.1 # mean and standard deviation
np_normal_arr = np.random.normal(mu, sigma, 10_000)
f_normal = Fitter(np_normal_arr, distributions = distributions_set)
f_normal.fit()
f_normal.summary()
Rayleigh:
# rayleigh
meanvalue = 1
modevalue = np.sqrt(2 / np.pi) * meanvalue # shape
np_rayleigh_arr = np.random.rayleigh(modevalue, 10_000)
f_rayleigh = Fitter(np_rayleigh_arr, distributions = distributions_set)
f_rayleigh.fit()
f_rayleigh.summary()
Pareto:
# pareto
a, m = 3., 2. # shape and mode
np_pareto_arr = (np.random.pareto(a, 10_000) + 1) * m
f_pareto = Fitter(np_pareto_arr, distributions = distributions_set)
f_pareto.fit()
f_pareto.summary()
Weibull:
# weibull
a = 5. # shape
np_weibull_arr = np.random.weibull(a, 10_000)
f_weibull = Fitter(np_weibull_arr, distributions = distributions_set)
f_weibull.fit()
f_weibull.summary()
Exponential:
# exp
np_exp_arr = np.random.exponential(scale=1.0, size=10_000)
f_exp = Fitter(np_exp_arr, distributions = distributions_set)
f_exp.fit()
f_exp.summary()
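If you also want the winning distribution programmatically rather than just the summary plot and table, Fitter exposes a get_best() method (shown here on the exponential example):
# Best-fitting distribution and its fitted parameters, e.g. {'expon': {...}}
print(f_exp.get_best())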
Heads-up 1) Make sure the latest fitter version is installed (currently 1.4.1).
You may also have to install some dependencies.
import fitter
print(fitter.version)
# 1.4.1
If you get a logging error, that is likely because you have an older version.
For me the install was conda install -c bioconda fitter.
Heads-up 2) fitter has a lot of distributions to test, which takes a long time if you go for all of them.
It is best to reduce the set to the common distributions plus any you think are likely for your data (as done in the pre-sets section of the code above).
To get a list of all available distributions:
from fitter import get_distributions
get_distributions()
Heads-up 3) Depending on the data, several very similar distributions can come up close together; you can see that in some of the examples above.
Also, especially when a distribution is slightly altered (e.g. its mean), a different one can often fit just as well; see e.g. the Wikipedia gamma distribution probability density plot, which can look like a lot of other distributions depending on the parameters.
Related
I have some (unknown) numbers that follow a log-normal distribution. What I know is that the mean value is 3 and the coefficient of variation is 0.5.
This means the range of the standard deviation varies by an order of magnitude.
How can I generate 100 random variables in Python from the mean and the coefficient of variation?
From the desired mean mu_d and coefficient of variation coeff_var, the desired standard deviation is std_d = mu_d * coeff_var, so the desired variance is var_d = std_d**2.
For a log-normal distribution, mean = exp(mu_x + var_x/2) and variance = (exp(var_x) - 1) * exp(2*mu_x + var_x), where mu_x and var_x are the mean and variance of the underlying normal distribution.
Solving these expressions for mu_x and var_x gives
var_x = log(1 + var_d / mu_d**2)
mu_x = log(mu_d) - var_x / 2
With mu_x and var_x for the underlying normal distribution in hand, sample with np.random.lognormal:
import numpy as np
# Mean and variance of underlying normal distribution
mu_x = 0
var_x = 1
sigma_x = np.sqrt(var_x)
# Samples from the distribution
s = np.random.lognormal(mu_x, sigma_x, 100)
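As a concrete sketch for the numbers in the question (mean 3, coefficient of variation 0.5; the variable names are mine):
import numpy as np
mu_d, coeff_var = 3.0, 0.5            # desired mean and coefficient of variation
var_d = (mu_d * coeff_var) ** 2       # desired variance of the log-normal samples
var_x = np.log(1 + var_d / mu_d**2)   # variance of the underlying normal
mu_x = np.log(mu_d) - var_x / 2       # mean of the underlying normal
s = np.random.lognormal(mu_x, np.sqrt(var_x), 100)
print(s.mean(), s.std() / s.mean())   # should come out near 3 and 0.5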
Is this what you're looking for?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html
import numpy as np
mean = 3 # mean
var_coef = 0.5 # coefficient of variation
std = var_coef * mean # standard deviation
# Note: np.random.lognormal expects the mean and sigma of the underlying
# normal distribution, not of the log-normal samples themselves.
s = np.random.lognormal(mean, std, 100)
print(s)
I have been following the lectures of the MIT open course on the Application of Mathematics in Finance. I am trying to code out the concepts for better understanding.
According to the lectures (from what I understand), if a random variable X is normally distributed then exp(X) is log-normally distributed, and vice versa (please correct me if I am wrong here).
Here is what I tried:
I have a list of integers that are normally distributed:
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
X = np.array(l)
mu = np.mean(X)
sigma = np.std(X)
count, bins, ignored = plt.hist(X, 35, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.show()
Output:
Normally distributed curve
Now I want to get log-normal distribution from above data, here is what I have tried:
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
X = np.array(l)
ln = []
for x in X:
    val = np.e**x
    ln.append(val)
X_ln = np.array(ln)
X_ln = X_ln / np.min(X_ln)
mu = np.mean(X_ln)
sigma = np.std(X_ln)
count, bins, ignored = plt.hist(X_ln, 10, density=True)
x = np.linspace(min(bins), max(bins), 10000)
pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi)))
plt.plot(x, pdf, color='r', linewidth=2)
plt.show()
Output :
Not so clean Output
I know there is a better way to do this, but I can't figure out how. Any suggestions would be highly appreciated.
Here are a couple of references:
Log normal distribution in Python
MIT lecture notes(topic-1.1)
In case this is relevant, here is a list of elements I am trying to process:
List of elements
Update 1:
I have normalized X before adding values to ln. This fixed the distribution of the histogram; however, I can't seem to get the red line to show the log-normal distribution. Also, the new histogram distribution is not very different from the normal distribution. I can't think of any suitable reason for that.
This is the block of code I have added:
def normalize(v):
    norm = np.linalg.norm(v, ord=1)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm

X = np.array(l)
X = normalize(X)
New Output:
Slightly better result
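For what it's worth, here is a minimal sketch of a possible fix (assuming l is the list of normally distributed elements linked above): the parameters mu and sigma of the log-normal pdf have to be the mean and standard deviation of log(X_ln), i.e. of the original normal sample, not of the exponentiated values.
import numpy as np
import matplotlib.pyplot as plt
X = np.array(l)                      # original, normally distributed sample
X_ln = np.exp(X)                     # exp of a normal sample is log-normal
# Parameters of the log-normal pdf come from log(X_ln), i.e. from X itself
mu = np.mean(np.log(X_ln))
sigma = np.std(np.log(X_ln))
count, bins, ignored = plt.hist(X_ln, 35, density=True)
x = np.linspace(bins[0], bins[-1], 10000)
pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))
plt.plot(x, pdf, color='r', linewidth=2)
plt.show()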
I have been looking for a while, but could not find the answer to this specific question anywhere, sorry if it is a duplicate!
I have started to build a Python package based on the xarray-simlab framework with the goal of providing a modular toolbox for building reproducible and flexible marine ecosystem models. xarray-simlab at the moment only supports explicit step sizes to solve the model functions. In order to solve complex models more safely and efficiently, I have instead started using GEKKO as a solver backend, as the model syntax seems well suited. (Note: at the moment I only need functionality to solve the model equations over time, but I would like to make use of GEKKO's optimization functionality to fit model parameters to field or lab data at later stages.)
The current prototype of the package creates an xsimlab process class that passes the GEKKO model instance m to all sub-processes. Process classes that inherit the model instance initialize m.SV, m.Param or define m.Intermediates based on the processes added to the model and the parameters (incl. SV dimensions) supplied at runtime. In the next step all initialized intermediates are accumulated onto the affected state variables in m.Equations. Once successfully solved, the GEKKO variables are repackaged into an xarray data structure that includes relevant metadata and can be analysed further. The package prototype can solve basic models using IMODE=7, but I have come across one issue related to the time steps of that solver:
I was expecting functionality similar to scipy's odeint, with adaptive time-step evaluation, but this does not seem to be the case; instead the model is evaluated at the discrete time steps supplied.
The package is still under heavy development, and there are plenty of features that I am still trying to improve, so below is a minimal code example of a simple chemostat model. The model describes a phytoplankton state variable growing on a nutrient in a simplified flow-through system. The nutrient flows in at a constant rate, and phytoplankton dies and is lost from the system at a constant rate:
import numpy as np
from gekko import GEKKO
import matplotlib.pyplot as plt
m = GEKKO() # create GEKKO model
halfsat_const = m.Param(0.1)
N0 = m.Param(1.)
inflow_rate = m.Param(0.1)
mortality_rate = m.Param(0.1)
N = m.SV(1)
P = m.SV(0.1)
t = np.arange(0,10,0.01)
m.time = t
# Growth under nutrient limitation is described via Monod / Michaelis-Menten kinetics
nutlim = m.Intermediate(N/(N+halfsat_const)*P)
N_influx = m.Intermediate(N0 * inflow_rate)
mortality = m.Intermediate(P * mortality_rate)
m.Equation(N.dt()==N_influx - nutlim)
m.Equation(P.dt()==nutlim - mortality)
m.options.IMODE = 7
m.solve(disp=False)
plt.plot(m.time, N, label='N')
plt.plot(m.time, P, label='P')
plt.legend()
This works perfectly for the supplied time-step, but e.g. m.time = np.arange(0,10) returns a nonsensical solution (two divergent lines reaching >1e7). Odeint has no problem solving it:
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
halfsat = 0.1
N0 = 1.
inflow = 0.1
mortality_rate = 0.1
def model(y, t):
    N, P = y
    nutlim = N/(N+halfsat)*P
    influx = N0 * inflow
    mortality = P * mortality_rate
    dNdt = influx - nutlim
    dPdt = nutlim - mortality
    return [dNdt, dPdt]
model_time = np.arange(0,10)
out = odeint(model,[1,0.1],model_time)
plt.plot(model_time,out[:,0], label='N')
plt.plot(model_time,out[:,1], label='P')
plt.legend()
The models I am building with my package can get relatively complex, with hundreds of state variables, and a much larger number of interactions, yielding highly non-linear results. I am not sure how I can be sure that my supplied time step is appropriate, since smaller time steps significantly increase computational time.
Is there a solver included with GEKKO (or compatible with the GEKKO model syntax) that provides a similar solver to odeint with adaptive step size? Or is there another approach that is better suited to deal with ecological models based on ODEs (or spatially-discretized PDE systems)?
Any help is very much appreciated!
Try to increase the number of nodes per segment with:
m.options.NODES = 3
This gives a more accurate solution because a higher order collocation method is used. In this case the time points [0,1,2,...9,10] are too coarse for an accurate solution but [0,0.5,1,...9.5,10] works fine.
Additionally, setting the lower bound of the state variables to zero via m.SV(lb=0) improves solver stability. This reflects a basic assumption of ecosystem models: components tracked as e.g. biomass cannot become negative.
I typically recommend a grid independence test where you reduce the step size until the solution doesn't change or compare with an adaptive step-size solver such as ODEINT. Gekko does do adaptive step sizes for IMODE=7 but only when the solver fails on a step. It is up to the user to decide the discretization. The strength of Gekko is in optimization and an adaptive step size in optimization requires a multi-level strategy that can be very slow. However, there has been recent progress. If you'd like to have an adaptive step size with IMODE=7 and error checking, please consider a feature request.
import numpy as np
from gekko import GEKKO
from scipy.integrate import odeint
import matplotlib.pyplot as plt
m = GEKKO(remote=False) # create GEKKO model
halfsat_const = m.Param(0.1)
N0 = m.Param(1.)
inflow_rate = m.Param(0.1)
mortality_rate = m.Param(0.1)
N = m.SV(1, lb=0)
P = m.SV(0.1, lb=0)
t = np.arange(0,10,0.2)
m.time = t
# Growth under nutrient limitation is described via Monod / Michaelis-Menten kinetics
nutlim = m.Intermediate(N/(N+halfsat_const)*P)
N_influx = m.Intermediate(N0 * inflow_rate)
mortality = m.Intermediate(P * mortality_rate)
m.Equation(N.dt()==N_influx - nutlim)
m.Equation(P.dt()==nutlim - mortality)
m.options.NODES = 3
m.options.IMODE = 7
m.solve(disp=False)
halfsat = 0.1
N0 = 1.
inflow = 0.1
mortality_rate = 0.1
def model(y, t):
    N, P = y
    nutlim = N/(N+halfsat)*P
    influx = N0 * inflow
    mortality = P * mortality_rate
    dNdt = influx - nutlim
    dPdt = nutlim - mortality
    return [dNdt, dPdt]
model_time = np.arange(0,10)
out = odeint(model,[1,0.1],model_time)
plt.plot(model_time,out[:,0], 'ro', label='N ODEINT')
plt.plot(model_time,out[:,1], 'bx', label='P ODEINT')
plt.plot(m.time, N, 'r--', label='N Gekko')
plt.plot(m.time, P, 'b--', label='P Gekko')
plt.legend()
plt.show()
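As a rough sketch of the grid independence test mentioned above (my own loop, re-using the Gekko model m and the variables N and P defined in this answer), one could keep halving the step size and watch the end-point values converge:
# Hypothetical grid-independence check: refine the time grid until the
# end-point values of N and P stop changing noticeably.
for dt in [1.0, 0.5, 0.25, 0.125]:
    m.time = np.arange(0, 10 + dt, dt)
    m.solve(disp=False)
    print(f'dt={dt}: N(10)={np.array(N.value)[-1]:.4f}, '
          f'P(10)={np.array(P.value)[-1]:.4f}')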
I have made a piece of Python code to generate a mixture of normal distributions and I want to sample from it. As the result is my probability density function, I want the sample to be representative of the original distribution.
So I have developed the function to create the pdf:
import numpy as np
import matplotlib.pyplot as plt

def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if len(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            normal_array += coeff[index] * np.exp(-((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4*std, mean + 4*std, sample_int)
        normal_array = coeff * np.exp(-((absciss_array - mean) / 2*std) ** 2)
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then have the result:
You can see from the dark lines the samples that have been extracted from the distribution. The problem is that, even though I specify the probability array in the numpy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds, but the result does not change...
I expect to have more samples in the area where the probability density is greater...
Would someone please help me? Am I missing something here?
Thanks in advance.
Well, actually the answer was to sample from the uniformly spaced support (the abscissa array), using the pdf values as the weights, rather than sampling the pdf values themselves. Thanks to @amzon-ex for pointing it out.
The code is then:
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
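As a side note, a more direct way to sample a Gaussian mixture (my own sketch, re-using the amplitude, mean and std arrays from main()) is to first pick a component according to its weight and then draw from that component's normal distribution:
rng = np.random.default_rng(424242)
weights = amplitude / amplitude.sum()                    # mixture weights
component = rng.choice(len(amplitude), size=100, p=weights)
sample_direct = rng.normal(mean[component], std[component])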
I'm trying to find the probability distribution that best fits my data. I've tried the code I've found in different threads, but the results are not what I'm expecting.
The descriptive statistics and histogram for my data are as follows:
Data Histogram
count 865.000000
mean 43.476713
std 12.486362
min 4.075682
25% 34.934609
50% 41.917304
75% 51.271708
max 88.843940
I tried to find a proper distribution function using the following code, but the results are not what I expected.
size = 865
kappa = 99
x = scipy.arange(size)
y = scipy.int_(scipy.round_(st.vonmises.rvs(kappa, size=size)*100))
h = plt.hist(df['spreadMaizChicagoAtlantico'], bins=100, color='b')
dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0, 100)
plt.legend(loc='upper right')
plt.show()
Data histogram with functions
Can anyone please tell me what I'm doing wrong and guide me towards a better understanding of these solutions?
Thanks to the earlier reply I found my mistake.
I got all the values from the DataFrame and made a numpy array.
ser = df.values
Then I ran code similar to the above, this time fitting the distributions to the proper data (ser instead of the generated y):
size = 867
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5, size=size)*60))
h = plt.hist(ser, bins=range(80))
dist_names = ['beta', 'rayleigh', 'norm']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(ser)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0, 100)
plt.legend(loc='upper right')
plt.show()
The result is as follows, showing the histogram and three probability density functions.
The distfit library can do this job as it searches for the best fit among 89 theoretical distributions.
pip install distfit
import numpy as np
from distfit import distfit
# Example data
X = np.random.normal(10, 3, 2000)
# Initialize
dfit = distfit()
# Search for best theoretical fit on your empirical data
dfit.fit_transform(X)
# Plot the empirical data (histogram) together with the best theoretical fit (PDF)
dfit.plot(chart='PDF',
          emp_properties={'linewidth': 4, 'color': 'k'},
          bar_properties={'edgecolor': 'k', 'color': 'g'},
          pdf_properties={'linewidth': 4, 'color': 'r'})
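If you also want the result programmatically, the fitted object keeps it (attribute names as I recall them from the distfit documentation; please verify against your installed version):
# Best-fit distribution with its parameters, and the ranked summary table
print(dfit.model)
print(dfit.summary)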