random selection of numbers with known log-normal distribution - statistics

I have some (unknown) numbers that follow a log-normal distribution. What I know is that the mean value is 3 and the coefficient of variation is 0.5.
This means the range of the standard deviation varies by an order of magnitude.
How can I generate 100 random variables in Python from the mean and the coefficient of variation?

From the desired mean mu_d and coefficient of variation coeff_var, the desired standard deviation and variance are
std_d = coeff_var * mu_d
var_d = std_d**2
For a log-normal variable whose underlying normal distribution has mean mu_x and variance var_x, the moments are
mu_d = exp(mu_x + var_x/2)
var_d = (exp(var_x) - 1) * exp(2*mu_x + var_x)
Solving these two expressions for mu_x and var_x gives
var_x = log(1 + var_d/mu_d**2)
mu_x = log(mu_d) - var_x/2
With that mean mu_x and variance var_x of the underlying normal distribution:
import numpy as np
# Desired mean and coefficient of variation of the log-normal samples
mu_d = 3.0
coeff_var = 0.5
var_d = (coeff_var * mu_d) ** 2
# Mean and variance of the underlying normal distribution
var_x = np.log(1 + var_d / mu_d**2)
mu_x = np.log(mu_d) - var_x / 2
sigma_x = np.sqrt(var_x)
# 100 samples from the log-normal distribution
s = np.random.lognormal(mu_x, sigma_x, 100)
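As a quick check (a small sketch reusing s, mu_d and coeff_var from above), the sample mean and coefficient of variation should come out roughly at the requested 3 and 0.5:
print(s.mean())            # roughly mu_d = 3
print(s.std() / s.mean())  # roughly coeff_var = 0.5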

Is this what you're looking for?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html
import numpy as np
mean = 3 # mean
var_coef = 0.5 # coefficient of variation
std = var_coef * mean # standard deviation (CV = std / mean)
s = np.random.lognormal(mean, std, 100)
print(s)
Note, however, that np.random.lognormal expects the mean and sigma of the underlying normal distribution, not of the log-normal samples themselves, so to hit the requested mean and CV exactly these parameters still need the conversion shown in the answer above.

Related

Understand the nature of distribution for a dataset in Python?

Let's say I have a dataset (sinusoidal curve in this example):
import matplotlib.pyplot as plt
import numpy as np
T = 1
Fs = 10000
N = T*Fs
t = np.linspace(0,T,N)
x = 10 * np.sin(2*np.pi*2*t)
plt.figure(figsize=(8,8))
plt.plot(t,x,'k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
How do I figure out the nature of the distribution (normal/Weibull/uniform/exponential/etc.) of 'x'?
Basically you have to run a goodness-of-fit test iteratively over candidate distributions to see which one best fits your sample data.
Luckily, fitter not only provides that iteration process using SciPy (meaning you could do it manually with SciPy as well) but also displays a plot and a table of the fit statistics.
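For reference, the manual SciPy route could look roughly like this (a sketch, separate from the fitter workflow below; the candidate list and the Kolmogorov-Smirnov test are just one possible choice):
import numpy as np
from scipy import stats
data = np.random.normal(10.0, 1.0, 10_000)    # any 1-D sample
for name in ['norm', 'expon', 'uniform', 'lognorm']:
    dist = getattr(stats, name)
    params = dist.fit(data)                   # maximum-likelihood fit of this candidate
    ks_stat, p_value = stats.kstest(data, name, args=params)
    print(name, ks_stat, p_value)             # smaller KS statistic = better fit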
Below are some np.random example distributions and the sine function from your question, along with the respective code.
Note the heads-up sections at the end.
Pre-sets:
import numpy as np
from fitter import Fitter, get_common_distributions
distributions_set = get_common_distributions()
distributions_set.extend(['arcsine', 'cosine', 'expon', 'weibull_max',
'weibull_min', 'dweibull', 't', 'pareto',
'exponnorm', 'lognorm'])
Sine (from your example):
# arcsine = inverse sine
T = 1
Fs = 10_000
N = T*Fs
t = np.linspace(0,T,N)
np_sine_arr = 10 * np.sin(2*np.pi*2*t)
f_sine = Fitter(np_sine_arr, distributions = distributions_set)
f_sine.fit()
f_sine.summary()
Normal: note that the t distribution also tends to come up as a close fit for normal data.
# normal
mu, sigma = 0.0, 0.1 # mean and standard deviation
np_normal_arr = np.random.normal(mu, sigma, 10_000)
f_normal = Fitter(np_normal_arr, distributions = distributions_set)
f_normal.fit()
f_normal.summary()
Rayleigh:
# rayleigh
meanvalue = 1
modevalue = np.sqrt(2 / np.pi) * meanvalue # shape
np_rayleigh_arr = np.random.rayleigh(modevalue, 10_000)
f_rayleigh = Fitter(np_rayleigh_arr, distributions = distributions_set)
f_rayleigh.fit()
f_rayleigh.summary()
Pareto:
# pareto
a, m = 3., 2. # shape and mode
np_pareto_arr = (np.random.pareto(a, 10_000) + 1) * m
f_pareto = Fitter(np_pareto_arr, distributions = distributions_set)
f_pareto.fit()
f_pareto.summary()
Weibull:
# weibull
a = 5. # shape
np_weibull_arr = np.random.weibull(a, 10_000)
f_weibull = Fitter(np_weibull_arr, distributions = distributions_set)
f_weibull.fit()
f_weibull.summary()
Exponential:
# exp
np_exp_arr = np.random.exponential(scale=1.0, size=10_000)
f_exp = Fitter(np_exp_arr, distributions = distributions_set)
f_exp.fit()
f_exp.summary()
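If you want the best-fitting distribution programmatically rather than reading it off the summary table, fitter also has a get_best method (a small sketch; the exact return format may differ between versions):
# e.g. for the normal example above
best = f_normal.get_best(method='sumsquare_error')
print(best)  # name of the best candidate and its fitted parameters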
Heads-up 1) Make sure the latest fitter version is installed - currently 1.4.1
You may also have to install some dependencies.
import fitter
print(fitter.version)
# 1.4.1
If you get a logging error, that is likely because you have an older version.
For me it was conda install -c bioconda fitter
Heads-up 2) fitter has a lot of distributions to test, which takes a long time if you go for all of them.
It is best to reduce the distributions to the common ones plus some you think are likely for your data (as done in the pre-sets section of the code above).
To get a list of all available distributions:
from fitter import get_distributions
get_distributions()
Heads-up 3) Depending on the distribution, several very similar ones can come up close together. You can see that in some of the examples above as well.
Also, especially when a distribution is slightly altered (e.g. its mean), a different one can often fit as well; see e.g. the Wikipedia plot of the Gamma distribution probability density, which can look like a lot of other distributions depending on the parameters.

Skewed random sample from Numpy random generator sample (numpy.random.Generator.choice)

I have made a piece of Python code to generate a mixture of normal distributions and I want to sample from it. As the result is my probability density function, I want the sample to be representative of the original distribution.
So I have developed the function to create the pdf:
def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if len(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            normal_array += coeff[index] * np.exp(-((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4*std, mean + 4*std, sample_int)
        normal_array = coeff * np.exp(-((absciss_array - mean) / (2*std)) ** 2)
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then get the following result:
You can see from the dark lines the samples that have been extracted from the distribution. The problem is that, even if I specify the probability array in the numpy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds but the result does not change...
I expect to have more samples in the area where the probability density is greater...
Would someone please help me? Am I missing something here?
Thanks in advance.
Well, actually the answer was to sample from the (uniformly spaced) abscissa values, weighted by the probability array, rather than from the pdf values themselves. Thanks to @amzon-ex for pointing it out.
The code is then:
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
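As a quick sanity check (a small sketch reusing rng, absciss and gaussian_array from the script above), drawing a much larger sample and histogramming it as a density should reproduce the shape of the mixture pdf:
big_sample = rng.choice(a=absciss, size=100_000, replace=True, p=gaussian_array)
fig, ax = plt.subplots()
ax.hist(big_sample, bins=100, density=True, alpha=0.5, label='sampled')
# rescale the discrete probabilities to a density for comparison
bin_width = absciss[1] - absciss[0]
ax.plot(absciss, gaussian_array / bin_width, label='target pdf')
ax.legend()
plt.show()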

Fill NaN values in a column within a specific range of values

I am wanting to do the following:
Fill NaN values in a single column using values within a specific range.
The range I am wanting to use is the mean of the non-NaN values in the column +/- one standard deviation of those values.
NOTE If possible, I would like to be able to use multiples of the std dev by simply multiplying it by
a constant.
I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that
I am generating values well outside my desired range. In fact, I am generating numbers outside
the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics work. A Gaussian normal distribution has a mean and a std, but values can be drawn far away from mean +/- std; they are just less likely. By definition of a normal distribution, 68 % of all values are within +/- 1*std, 95 % are within +/- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +/- std or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
from matplotlib import pyplot as plt
mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000) # I used a size of 2000 as an example
a[a<(mu-sigma)] = mu-sigma
a[a>(mu+sigma)] = mu+sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with an upper and a lower boundary. You find this function in the scipy.stats module. It works a bit differently though: you first create the distribution by normalizing the lower and upper clip and then you create a number of random variates rvs from it like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
The constant for multiples of sigma is easily implemented. You can just change your lower and upper clip like
lower_clip = mu-x*sigma
with x being your constant.
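Tying this back to the original fillNaN_with_unifrand helper, a NaN-filling version based on the truncated normal could look roughly like this (a sketch; the helper name fillNaN_with_truncnorm is made up and n_sigma is the constant mentioned above):
import numpy as np
import scipy.stats as stats
def fillNaN_with_truncnorm(col, n_sigma=1):
    a = col.values
    m = np.isnan(a)                    # mask of NaNs
    mu, sigma = col.mean(), col.std()  # computed from the non-NaN values
    lower_clip = mu - n_sigma * sigma
    upper_clip = mu + n_sigma * sigma
    dist = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma,
                           loc=mu, scale=sigma)
    a[m] = dist.rvs(size=m.sum())      # draws stay inside the clip range
    return col
# fillNaN_with_truncnorm(df['C'], n_sigma=1)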

How to calculate geometric mean in a differentiable way?

How do I calculate the geometric mean along a dimension using PyTorch? Some numbers can be negative. The function must be differentiable.
A known (reasonably) numerically-stable version of the geometric mean is:
import torch
def gmean(input_x, dim):
    log_x = torch.log(input_x)
    return torch.exp(torch.mean(log_x, dim=dim))
x = torch.Tensor([2.0] * 1000).requires_grad_(True)
print(gmean(x, dim=0))
# tensor(2.0000, grad_fn=<ExpBackward>)
This kind of implementation can be found, for example, in SciPy (see here), which is a quite stable library.
The implementation above does not handle zeros and negative numbers. Some will argue that the geometric mean with negative numbers is not well-defined, at least when not all of them are negative.
torch.prod() helps:
import torch
x = torch.FloatTensor(3).uniform_().requires_grad_(True)
print(x)
y = x.prod() ** (1.0/x.shape[0])
print(y)
y.backward()
print(x.grad)
# tensor([0.5692, 0.7495, 0.1702], requires_grad=True)
# tensor(0.4172, grad_fn=<PowBackward0>)
# tensor([0.2443, 0.1856, 0.8169])
EDIT: what about
y = (x.abs() ** (1.0/x.shape[0]) * x.sign() ).prod()
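For what it's worth, here is a small sketch of that sign-aware variant wrapped in a function (the name signed_gmean is made up; it combines the geometric mean of the magnitudes with the product of the signs and stays differentiable away from zero):
import torch
def signed_gmean(x, dim=0):
    n = x.shape[dim]
    return (x.abs() ** (1.0 / n) * x.sign()).prod(dim=dim)
x = torch.tensor([-2.0, 8.0], requires_grad=True)
y = signed_gmean(x)
y.backward()
print(y)       # -4: geometric-mean magnitude 4, with one negative sign
print(x.grad)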

scikit learn: how to check coefficients significance

I tried to do a logistic regression with scikit-learn on a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K lines in my dataset), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
If you still want to stick to scikit-learn's LogisticRegression, you can use an asymptotic approximation to the distribution of the maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
    """ Calculate z-scores for scikit-learn LogisticRegression.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x: matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.matrix(np.insert(np.array(x), 0, 1, axis=1))
    ans = np.zeros((m, m))
    for i in range(n):
        ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(np.matrix(ans))
    se = np.sqrt(np.diag(vcov))
    t = coefs / se
    p = (1 - norm.cdf(abs(t))) * 2
    return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

Resources