I want to do the following:
Fill NaN values in a single column using values within a specific range.
The range I want to use is the mean of the non-NaN values in the column +/- one standard
deviation of those values.
NOTE: If possible, I would like to be able to use multiples of the std dev by simply multiplying it by
a constant.
I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that
I am generating values well outside my desired range. In fact, I am generating numbers outside
the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)                    # mask of NaNs
    mu, sigma = col.mean(), col.std()  # mean and std of the non-NaN values
    a[m] = np.random.normal(mu, sigma, size=m.sum())  # one normal draw per NaN
    return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics work. A Gaussian normal distribution has a mean and a std, but values can be drawn far away from mean +/- std; they are just less likely. By definition of a normal distribution, 68% of all values are within +/- 1*std, 95% are within +/- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +/- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
import numpy as np
from matplotlib import pyplot as plt
mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000) # I used a size of 2000 as an example
a[a<(mu-sigma)] = mu-sigma
a[a>(mu+sigma)] = mu+sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with an upper and a lower boundary. You find this function in the scipy.stats module. It works a bit differently, though: you first create the distribution by standardizing the lower and upper clip, and then you create a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
Multiples of sigma are easily implemented. You can just change your lower and upper clip, e.g.
lower_clip = mu-x*sigma
with x being your constant.
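Applied back to the question's fill function, a minimal sketch could look like this (the function name is illustrative; it assumes the same df and column 'C' as in the question, with x as the std-dev multiplier):
import numpy as np
import scipy.stats as stats
def fill_nan_truncnorm(col, x=1.0):
    a = col.values
    m = np.isnan(a)                    # mask of NaNs
    mu, sigma = col.mean(), col.std()  # stats of the non-NaN values
    lower, upper = mu - x*sigma, mu + x*sigma
    dist = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)
    a[m] = dist.rvs(size=m.sum())      # one draw per NaN, all within the clips
    return col
# fill_nan_truncnorm(df['C'], x=1)  # fill values now stay within mean +/- 1*std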
Related
I have some (unknown) numbers that follow a log-normal distribution. What I know is that the mean value is 3 and the coefficient of variation is 0.5.
This means the range of the standard deviation varies by an order of magnitude.
How can I generate 100 random variables in Python from the mean and the coefficient of variation?
From the desired mean mu_d and coefficient of variation coeff_var, the desired standard deviation is std_d = mu_d * coeff_var, so var_d = std_d**2.
Solve the log-normal moment relations for mu_x and var_x, the mean and variance of the underlying normal distribution:
var_x = log(1 + var_d / mu_d**2)
mu_x = log(mu_d) - var_x / 2
Then draw samples with the given mean mu_x and variance var_x of the underlying normal distribution.
import numpy as np
# Mean and variance of underlying normal distribution
mu_x = 0
var_x = 1
sigma_x = np.sqrt(var_x)
# Samples from the distribution
s = np.random.lognormal(mu_x, sigma_x, 100)
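For the numbers in the question (mean 3, coefficient of variation 0.5), a hedged sketch of this recipe might look like the following (variable names are only illustrative):
import numpy as np
mu_d = 3.0        # desired mean of the log-normal values
coeff_var = 0.5   # desired coefficient of variation (std / mean)
var_d = (coeff_var * mu_d)**2            # desired variance
# underlying normal parameters from the log-normal moment relations
var_x = np.log(1 + var_d / mu_d**2)
mu_x = np.log(mu_d) - var_x / 2
s = np.random.lognormal(mu_x, np.sqrt(var_x), 100)
print(s.mean(), s.std() / s.mean())      # should come out close to 3 and 0.5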
Is this what you're looking for?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html
import numpy as np
mean = 3         # mean
var_coef = 0.5   # coefficient of variation (std / mean)
std = var_coef * mean   # standard deviation
# note: np.random.lognormal expects the mean and sigma of the underlying
# normal distribution, not of the log-normal values themselves
s = np.random.lognormal(mean, std, 100)
print(s)
I am trying to find the locations (i.e., the x-values) of the minimum, start of season, peak growing season, maximum growth, senescence, end of season, and minimum (i.e., the inflection points) in a vegetation curve. I am using a normal curve here as an example. I did come across a few code examples for finding the change in slope and the 1st/2nd order derivatives, but I was not able to implement them for my case. Please direct me to any relevant example; your help is appreciated. Thanks!
## Version 2 code
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
x_min = 0.0
x_max = 16.0
mean = 8
std = 2
x = np.linspace(x_min, x_max, 100)
y = norm.pdf(x, mean, std)
# Slice the group in 3
def group_in_threes(slicable):
    for i in range(len(slicable)-2):
        yield slicable[i:i+3]
# Locate the change in slope
def turns(L):
    for index, three in enumerate(group_in_threes(L)):
        if (three[0] > three[1] < three[2]) or (three[0] < three[1] > three[2]):
            yield index + 1
# 1st inflection point estimation
dy = np.diff(y, n=1) # first derivative
idx_max_dy = np.argmax(dy)
ix = list(turns(dy))
print(ix)
# All inflection point estimation
dy2 = np.diff(dy, n=2) # Second derivative?
idx_max_dy2 = np.argmax(dy2)
ix2 = list(turns(dy2))
print(ix2)
# Graph
plt.plot(x, y)
#plt.plot(x[ix], y[ix], 'or', label='estimated inflection point')
plt.plot(x[ix2], y[ix2], 'or', label='estimated inflection point - 2')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(loc='best');
Here is a very simple and not robust method to find the inflection point of a non-noisy curve:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
x_min = 0.0
x_max = 16.0
mean = 8
std = 2
x = np.linspace(x_min, x_max, 100)
y = norm.pdf(x, mean, std)
# 1st inflection point estimation
dy = np.diff(y) # first derivative
idx_max_dy = np.argmax(dy)
# Graph
plt.plot(x, y)
plt.plot(x[idx_max_dy], y[idx_max_dy], 'or', label='estimated inflection point')
plt.xlabel('x'); plt.ylabel('y'); plt.legend();
The actual position of the inflection point is x1 = mean - std for a Gaussian curve.
For this to work with real data, the data has to be smoothed before looking for the max, for example with a simple moving average, a Gaussian filter, or a Savitzky-Golay filter (which can directly output the second derivative). The choice of the right filter depends on the data.
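With noisy data, a rough sketch of that smoothing step (assuming scipy.signal.savgol_filter and the same Gaussian example; the window length, polynomial order and noise level are illustrative choices) might be:
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import norm
x = np.linspace(0.0, 16.0, 100)
y = norm.pdf(x, 8, 2) + np.random.normal(0, 0.002, size=x.size)  # noisy Gaussian curve
y_smooth = savgol_filter(y, window_length=15, polyorder=3)  # smooth before differentiating
dy = np.diff(y_smooth)      # first derivative of the smoothed curve
idx_rise = np.argmax(dy)    # steepest rise -> inflection near mean - std
idx_fall = np.argmin(dy)    # steepest fall -> inflection near mean + std
print(x[idx_rise], x[idx_fall])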
I am trying to play with some time series data. I would like to plot the area with the maximum number of changes within some interval.
I have written some sample code, but I am not able to move forward with highlighting the region.
import pandas as pd
import numpy as np
import seaborn as sns
f = pd.DataFrame(np.random.randint(0,50,size=(300, 1)))
sns.tsplot(f[0])
I want to highlight the region with the most changes, say with a window size of 30.
Here is one approach that performs most of the operations in numpy, and then displays the region with matplotlib.axvspan:
import matplotlib.pyplot as plt  # needed for axvspan below
f = pd.DataFrame(np.random.randint(0,50,size=(300, 1))) # dataframe
y = f[0].values # working vector in numpy
thr = 5 # criterion for counting as a change
chunk_size = 30 # window length
chunks = np.array_split(y, y.shape[0] // chunk_size) # split into 30-element chunks
# compute how many elements differ from one element to the next
diffs_by_chunk = [(np.abs(np.ediff1d(chunk)) > thr).sum() for chunk in chunks]
ix = np.argmax(diffs_by_chunk) # chunk with most differences
sns.tsplot(f[0])
plt.axvspan(ix * chunk_size, (ix+1) * chunk_size, alpha=0.5)
With a baseline of uniform random data, it is difficult to relate this to a use case, but alternative criteria for what to maximise over might be useful, e.g. just looking at the sum of absolute changes, rather than the number that exceed a threshold:
diffs_by_chunk = [(np.abs(np.ediff1d(chunk))).sum() for chunk in chunks] # criterion #2
It would also be possible to show multiple regions that all have enough differences:
for i, df in enumerate(diffs_by_chunk):
    if df >= 25:
        sns.mpl.pyplot.axvspan(i*chunk_size, (i+1)*chunk_size, alpha=0.5)
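As a variant (not part of the answer above), a rolling window instead of non-overlapping chunks could locate the 30-sample span with the largest total absolute change; a sketch assuming the same f and chunk_size as above:
import matplotlib.pyplot as plt
abs_diff = f[0].diff().abs()                        # absolute change between consecutive samples
rolling_change = abs_diff.rolling(chunk_size).sum() # total change over a sliding window
end = rolling_change.idxmax()                       # right edge of the busiest window
f[0].plot()
plt.axvspan(end - chunk_size, end, alpha=0.5)
plt.show()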
I have a dataframe called 'games':
Game_id Goals P_value
1 2 0.4
2 3 0.321
45 0 0.64
I need to split the P value into 0.05 steps, bin the rows per P value, and then create a line graph that shows the sum per P value.
What I currently have:
games.set_index('p value', inplace=True)
games.sort_index()
np.cumsum(games['goals']).plot()
But the plot I get is not what I want.
No matter what I tried, I couldn't group the P values and show the sum of goals per P value.
I also tried to use matplotlib.pyplot, but then I couldn't use the cumsum function.
If I understood you correctly, you want to have discrete steps in the p-value of width 0.05 and show the cumulative sum?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create some random example data
df = pd.DataFrame({
    'goals': np.random.poisson(3, size=1000),
    'p_value': np.random.uniform(0, 1, size=1000)
})
# define binning in p-value
bin_edges = np.arange(0, 1.025, 0.05)
bin_center = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bin_width = np.diff(bin_edges)
# find the p_value bin, each row belongs to
# 0 is underflow, len(edges) is overflow bin
df['bin'] = np.digitize(df['p_value'], bins=bin_edges)
# get the number of goals per p_value bin
goals_per_bin = df.groupby('bin')['goals'].sum()
print(goals_per_bin)
# not every bin might be filled, so we will use pandas index
# matching to align the summed goals with all the bins
binned = pd.DataFrame({
    'center': bin_center,
    'width': bin_width,
    'goals': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
binned['goals'] = goals_per_bin
plt.step(
    binned['center'],
    binned['goals'],
    where='mid',
)
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()
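As an alternative sketch (not part of the answer above), pandas' own pd.cut could do the binning and grouping more directly, assuming the same df, bin width and imports as above:
# bin the p-values into 0.05-wide intervals and sum the goals per bin
df['bin'] = pd.cut(df['p_value'], bins=np.arange(0, 1.05, 0.05))
goals_per_bin = df.groupby('bin')['goals'].sum()
# plot against the bin midpoints
centers = [interval.mid for interval in goals_per_bin.index]
plt.step(centers, goals_per_bin.values, where='mid')
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()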
I am trying to find the equation of a line within a DF
Here is a fake data set to explain:
Clicks Sales
5 10
5 11
10 16
10 20
10 18
15 28
15 26
... ...
100 200
What I am trying to do:
Calculate the equation of the line so that I can input a number of clicks and get a predicted sales output at any level. The thing I am trying to wrap my brain around is that I have many different line functions (e.g. there are multiple sales values for each number of clicks). How can I iterate through my DF to calculate just one aggregate line function?
Here's what I have, but it only accepts ONE input at a time; I would like to create an average or aggregate...
class Point:  # minimal class assumed from the Point(5, 10) usage below
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def slope(self, target):
        # rise over run between this point and the target point
        return (target.y - self.y) / (target.x - self.x)

    def y_int(self, target):  # <= here's the magic
        return self.y - self.slope(target)*self.x

    def line_function(self, target):
        slope = self.slope(target)
        y_int = self.y_int(target)
        def fn(x):
            return slope*x + y_int
        return fn

a = Point(5, 10)   # I am stuck here since - what to input!?
b = Point(10, 16)  # I am stuck here since - what to input!?
line = a.line_function(b)
print(line(x=10))
Use the scipy function scipy.stats.linregress to fit your data.
Maybe also check https://en.wikipedia.org/wiki/Linear_regression to better understand linear regression.
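A hedged sketch of what that could look like with the sample data (column names taken from the question):
import pandas as pd
from scipy import stats
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
                   'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
# fit one aggregate line through all (Clicks, Sales) pairs
result = stats.linregress(df['Clicks'], df['Sales'])
print(result.slope, result.intercept)
# predicted sales for a given number of clicks
clicks = 10
print(result.slope * clicks + result.intercept)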
You could group by Clicks and take the average of the Sales per group:
In [307]: sales = df.groupby('Clicks')['Sales'].mean(); sales
Out[307]:
Clicks
5 10.5
10 18.0
15 27.0
100 200.0
Name: Sales, dtype: float64
Then form the piecewise linear interpolating function based on
the groupwise-averaged data above using interpolate.interp1d:
from scipy import interpolate
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
For example,
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
sales = df.groupby('Clicks')['Sales'].mean()
Once you have the groupwise-averaged sales, you can compute the interpolated sales
a number of ways. One way is to use np.interp:
newx = [10]
print(np.interp(newx, sales.index, sales.values))
# [ 18.] <-- The interpolated sales when the number of clicks is 10 (newx)
The problem with np.interp is that you are passing sales.index and sales.values to np.interp every time you call it -- it has no memory of the interpolating function. It is re-computing the interpolating function every time you call it.
If you have scipy, then you could create the interpolating function once and then use it as many times as you like later:
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
print(fn(newx))
# [ 18.]
For example, you could evaluate the interpolating function at a whole bunch of points (and plot the result) like this:
newx = np.linspace(5, 100, 100)
plt.plot(newx, fn(newx))
plt.plot(df['Clicks'], df['Sales'], 'o')
plt.show()
Pandas Series (and DataFrames) have an interpolate method too. To use it, you reindex the Series to include the points where you wish to interpolate:
In [308]: sales.reindex(sales.index.union([14]))
Out[308]:
5 10.5
10 18.0
14 NaN
15 27.0
100 200.0
Name: Sales, dtype: float64
and then interpolate fills in the interpolated values where the Series is NaN:
In [295]: sales.reindex(sales.index.union([14])).interpolate('values')
Out[295]:
5 10.5
10 18.0
14 25.2 # <-- interpolated value
15 27.0
100 200.0
Name: Sales, dtype: float64
But I think it is perhaps not appropriate for your problem since it does not
return just the interpolated values you are looking for; it returns a whole
Series.
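If you do want to go that route anyway, the single interpolated value could be pulled out afterwards, e.g.:
s = sales.reindex(sales.index.union([14])).interpolate('values')
print(s.loc[14])  # 25.2 -- just the interpolated value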