I have a DataFrame with groundwater level time series and I am trying to remove outliers from the data. I want to do it with a rolling window, using the Generalized Extreme Studentized Deviate (ESD) test as the outlier removal method. Because my time series are sometimes not normally distributed, I want to apply this method over a specific time window (12 or 24 months) of monthly data to get better results.
from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl
# Convert data given at:
# http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
# to array.
x = np.array([float(x) for x in "-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 \
1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 \
1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 \
2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93 3.21 3.26 3.30 3.59 \
3.68 4.30 4.64 5.34 5.42 6.01".split()])
# Apply the generalized ESD
r = pyasl.generalizedESD(x, 10, 0.05, fullOutput=True)
print("Number of outliers: ", r[0])
print("Indices of outliers: ", r[1])
print(" R Lambda")
for i in range(len(r[2])):
    print("%2d  %8.5f  %8.5f" % ((i + 1), r[2][i], r[3][i]))
# Plot the "data"
plt.plot(x, 'b.')
# and mark the outliers.
for i in range(r[0]):
    plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()
I simply want to apply the code above to a rolling window in my DataFrame and remove the outliers.
Thank you.
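Here is one way the rolling application could be sketched (assuming monthly data in a Series `s`; the window size, threshold, and the simple median/MAD rule used as the per-window detector are all illustrative — swap in `pyasl.generalizedESD` on each window's values to use the ESD test itself):

```python
import numpy as np
import pandas as pd

def window_outliers(values, thresh=3.5):
    # Stand-in detector: flag points more than `thresh` robust z-scores
    # from the window median (replace this with pyasl.generalizedESD).
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs(values - med) / (1.4826 * mad) > thresh

def remove_rolling_outliers(s, window=12):
    # Slide a fixed-size window over the series, collect the indices
    # flagged in any window, and drop them at the end.
    vals = s.to_numpy()
    flagged = np.zeros(len(vals), dtype=bool)
    for start in range(len(vals) - window + 1):
        flagged[start:start + window] |= window_outliers(vals[start:start + window])
    return s[~flagged]

# Tiny demo: 24 months of synthetic levels with one injected spike.
idx = pd.date_range('2020-01-01', periods=24, freq='MS')
s = pd.Series(10 + 0.1 * np.arange(24), index=idx)
s.iloc[5] = 50.0  # injected outlier
clean = remove_rolling_outliers(s, window=12)
print(len(s), len(clean))  # 24 23
```

Running the detector per window (rather than on the whole series) is what keeps the normality assumption local, as asked in the question.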
I am generating line charts with the following syntax:
df2 = df2[['runtime','per','dev','var']]
op = "/tmp/image.png"
fig, ax = plt.subplots(facecolor='darkslategrey')
df2.plot(x='runtime', xlabel="Date", kind='line', marker='o', linewidth=2,
         alpha=.7, subplots=True, color=['khaki', 'lightcyan', 'thistle'])
plt.style.use('dark_background')
plt.suptitle('Historical Data:', fontsize=12, fontname='monospace')
#file output
plt.savefig(op, transparent=False,bbox_inches="tight")
plt.close('all')
Where df2 dataframe sample:
runtime per dev var
1 2021-05-28 50.85 2.11 2.13
1 2021-05-30 50.85 2.11 2.13
1 2021-06-02 51.13 2.16 2.11
1 2021-06-04 51.13 2.16 2.11
1 2021-06-07 51.13 2.16 2.11
1 2021-06-09 51.11 2.13 2.10
1 2021-06-10 51.11 2.13 2.10
1 2021-06-14 51.11 2.13 2.10
1 2021-06-16 51.34 2.12 2.10
1 2021-06-18 51.34 2.12 2.10
1 2021-06-21 51.34 2.12 2.10
1 2021-06-23 51.69 1.97 2.17
1 2021-06-25 51.69 1.97 2.17
1 2021-06-28 51.69 1.97 2.17
1 2021-06-30 56.46 1.74 2.14
1 2021-07-02 56.46 1.74 2.14
1 2021-07-05 56.46 1.74 2.14
1 2021-07-07 55.10 1.84 2.08
1 2021-07-09 55.10 1.84 2.08
1 2021-07-12 55.10 1.84 2.08
1 2021-07-14 54.58 1.85 2.07
1 2021-07-16 54.58 1.85 2.07
1 2021-07-19 54.58 1.85 2.07
1 2021-07-21 54.33 1.87 2.06
1 2021-07-23 54.33 1.87 2.06
1 2021-07-26 54.33 1.87 2.06
1 2021-07-28 54.98 1.91 2.19
1 2021-07-30 54.98 1.91 2.19
This works great.
Now, I would like to change the color of points whose values are "abnormal": specifically, if per < 10.00 or per > 90.00, or if dev > 10.00, or if var > 10.00, color the point RED.
Is this possible?
Instead of drawing the 3 subplots in one call, they could be drawn one-by-one. First draw the subplot as before, and on top of it a scatter plot, only with the "abnormal" points. zorder=3 makes sure that the scatter dots appear on top of the existing dots.
Here is some example code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'runtime': pd.date_range('20210101', freq='D', periods=100),
                    'per': np.random.uniform(1, 99, 100),
                    'dev': np.random.uniform(1, 11, 100),
                    'var': np.random.uniform(2, 11, 100)})
fig, axs = plt.subplots(nrows=3, figsize=(6, 10), facecolor='darkslategrey', sharex=True)
for ax, column, color, (min_normal, max_normal) in zip(
        axs,
        ['per', 'dev', 'var'],
        ['khaki', 'lightcyan', 'thistle'],
        [(10, 90), (-np.inf, 10), (-np.inf, 10)]):
    df2.plot(x='runtime', xlabel="Date", y=column, ylabel=column,
             kind='line', marker='o', linewidth=2, alpha=.7, color=color, legend=False, ax=ax)
    df_abnormal = df2[(df2[column] < min_normal) | (df2[column] > max_normal)]
    df_abnormal.plot(x='runtime', xlabel="Date", y=column, ylabel=column,
                     kind='scatter', marker='o', color='red', legend=False, zorder=3, ax=ax)
plt.style.use('dark_background')
plt.suptitle('Historical Data:', fontsize=12, fontname='monospace')
plt.tight_layout()
plt.show()
I parsed a table from a website using Selenium (by xpath), then used pd.read_html on the table element, and now I'm left with what looks like a list that makes up the table. It looks like this:
[Empty DataFrame
Columns: [Symbol, Expiration, Strike, Last, Open, High, Low, Change, Volume]
Index: [], Symbol Expiration Strike Last Open High Low Change Volume
0 XPEV Dec20 12/18/2020 46.5 3.40 3.00 5.05 2.49 1.08 696.0
1 XPEV Dec20 12/18/2020 47.0 3.15 3.10 4.80 2.00 1.02 2359.0
2 XPEV Dec20 12/18/2020 47.5 2.80 2.67 4.50 1.89 0.91 2231.0
3 XPEV Dec20 12/18/2020 48.0 2.51 2.50 4.29 1.66 0.85 3887.0
4 XPEV Dec20 12/18/2020 48.5 2.22 2.34 3.80 1.51 0.72 2862.0
5 XPEV Dec20 12/18/2020 49.0 1.84 2.00 3.55 1.34 0.49 4382.0
6 XPEV Dec20 12/18/2020 50.0 1.36 1.76 3.10 1.02 0.30 14578.0
7 XPEV Dec20 12/18/2020 51.0 1.14 1.26 2.62 0.78 0.31 4429.0
8 XPEV Dec20 12/18/2020 52.0 0.85 0.95 2.20 0.62 0.19 2775.0
9 XPEV Dec20 12/18/2020 53.0 0.63 0.79 1.85 0.50 0.13 1542.0]
How do I turn this into an actual dataframe, with the "Symbol, Expiration, etc..." as the header, and the far left column as the index?
I've been trying several different things, but to no avail. Where I left off was trying:
# From reading the html of the table step
dfs = pd.read_html(table.get_attribute('outerHTML'))
dfs = pd.DataFrame(dfs)
... and when I print the new dfs, I get this:
0 Empty DataFrame
Columns: [Symbol, Expiration, ...
1 Symbol Expiration Strike Last Open ...
Per pandas.read_html docs,
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
According to your list output, the non-empty DataFrame is the second element of that list, so retrieve it by indexing (remember Python uses zero for the first index of iterables). Note that you can also work with DataFrames stored in lists or dicts directly.
dfs[1].head()
dfs[1].tail()
dfs[1].describe()
...
single_df = dfs[1].copy()
del dfs
Or index in the same call:
single_df = pd.read_html(...)[1]
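For illustration, here is the indexing step on a hand-built list that mimics `read_html`'s output (the columns are abbreviated). Note that the far-left column in your printed output is just the default `RangeIndex`, which `dfs[1]` already carries, so no extra step is needed for the index:

```python
import pandas as pd

cols = ['Symbol', 'Expiration', 'Strike', 'Last']
# read_html always returns a list of DataFrames; mimic its output here
# with an empty frame followed by the actual table.
dfs = [pd.DataFrame(columns=cols),
       pd.DataFrame([['XPEV Dec20', '12/18/2020', 46.5, 3.40],
                     ['XPEV Dec20', '12/18/2020', 47.0, 3.15]], columns=cols)]

single_df = dfs[1]               # the non-empty table is the second element
print(single_df.shape)           # (2, 4)
print(list(single_df.index))     # [0, 1] -- already the default integer index
```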
I am trying to use the summation expression in Gnuplot but it is not working properly. I have the following data structure with many rows:
t x1 y1 z1 x2 y2 z2 x3 y3 z3 ... x98 y98 z98
I would like to plot the following equation:
u = (sqrt(sum(x)**2 + sum(y)**2 + sum(z)**2))/98
where 98 is the number of (x,y,z) points.
What I have until now is how to plot the average of columns x1, x2, x3.. as following:
plot 'data file' u 1:((sum[i=0:ColCount-1] column(i*ColStep+ColStart))/ColCount) w lines ls 4 notitle
Where ColCount = 98, ColStep = 3 and ColStart=2.
I have been trying to plot the equation itself without success. I would really appreciate any help.
What the following script does:
It takes the square root of the sum of (x1+x2+x3)**2, (y1+y2+y3)**2 and (z1+z2+z3)**2; you can adapt this to your column numbers.
But I'm still not sure whether this is what you want. Please clarify.
Code:
### summing up columns
reset session
$Data <<EOD
#t x1 y1 z1 x2 y2 z2 x3 y3 z3
1 1.11 1.21 1.31 2.11 2.21 2.31 3.11 3.21 3.31
2 1.12 1.22 1.32 2.12 2.22 2.32 3.12 3.22 3.32
3 1.13 1.23 1.33 2.13 2.23 2.33 3.13 3.23 3.33
4 1.14 1.24 1.34 2.14 2.24 2.34 3.14 3.24 3.34
5 1.15 1.25 1.35 2.15 2.25 2.35 3.15 3.25 3.35
6 1.16 1.26 1.36 2.16 2.26 2.36 3.16 3.26 3.36
7 1.17 1.27 1.37 2.17 2.27 2.37 3.17 3.27 3.37
8 1.18 1.28 1.38 2.18 2.28 2.38 3.18 3.28 3.38
9 1.19 1.29 1.39 2.19 2.29 2.39 3.19 3.29 3.39
EOD
ColStep = 3
ColCount = 3
mySum(ColStart) = sum[i=0:ColCount-1] column(i*ColStep+ColStart)
plot $Data u 1:(sqrt(mySum(2)**2 + mySum(3)**2 + mySum(4)**2)) w lp pt 7 notitle
### end of code
Result: (plot omitted)
I consider using the lifelines package to fit a Cox-Proportional-Hazards-Model. I read that lifelines uses a nonparametric approach to fit the baseline hazard, which results in different baseline_hazards for some time points (see code example below). For my application, I need an
exponential distribution leading to a baseline hazard h0(t) = lambda which is constant across time.
So my question is: is it (in the meantime) possible to run a Cox-Proportional-Hazards-Model with an exponential distribution for the baseline hazard in lifelines or another Python package?
Example code:
from lifelines import CoxPHFitter
import pandas as pd
df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
                   'event': [0, 0, 0, 1, 1, 1],
                   'cat': [0, 1, 0, 1, 0, 1]})
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.baseline_hazard_
gives
baseline hazard
T
4.0 0.160573
5.0 0.278119
6.0 0.658032
👋lifelines author here.
So, this model is not natively in lifelines, but you can easily implement it yourself (and maybe something I'll do for a future release). This idea relies on the intersection of proportional hazard models and AFT (accelerated failure time) models. In the cox-ph model with exponential hazard (i.e. constant baseline hazard), the hazard looks like:
h(t|x) = lambda_0(t) * exp(beta * x) = lambda_0 * exp(beta * x)
In the AFT specification for an exponential distribution, the hazard looks like:
h(t|x) = exp(-beta * x - beta_0) = exp(-beta * x) * exp(-beta_0) = exp(-beta * x) * lambda_0
Note the negative sign difference!
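The equivalence can be checked numerically without fitting anything: with the AFT intercept set to beta_0 = -log(lambda_0) and the AFT coefficient equal to the negated PH coefficient, the two hazards coincide exactly (the numbers below are arbitrary illustrative values):

```python
import math

# Illustrative values only: any lambda_0 > 0, beta, x will do.
lambda_0, beta, x = 0.017, 0.09, 3.0

ph_hazard = lambda_0 * math.exp(beta * x)       # Cox-PH with constant baseline

beta_aft = -beta                 # the AFT coefficient is the PH one, negated
beta_0 = -math.log(lambda_0)     # the AFT intercept absorbs the baseline rate
aft_hazard = math.exp(-beta_aft * x - beta_0)   # exponential AFT hazard

print(abs(ph_hazard - aft_hazard) < 1e-12)      # True
```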
So instead of doing a CoxPH, we can do an Exponential AFT fit (and flip the signs if we want the same interpretation as the CoxPH). We can use the custom regression model syntax to do this:
from lifelines.fitters import ParametricRegressionFitter
from autograd import numpy as np
class ExponentialAFTFitter(ParametricRegressionFitter):
    # this is necessary, and should always be a non-empty list of strings.
    _fitted_parameter_names = ['lambda_']

    def _cumulative_hazard(self, params, T, Xs):
        # params is a dictionary that maps unknown parameters to a numpy vector.
        # Xs is a dictionary that maps unknown parameters to a numpy 2d array
        lambda_ = np.exp(np.dot(Xs['lambda_'], params['lambda_']))
        return T / lambda_
Testing this,
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi = load_rossi()
rossi['intercept'] = 1
regressors = {'lambda_': rossi.columns}
eaf = ExponentialAFTFitter().fit(rossi, "week", "arrest", regressors=regressors)
eaf.print_summary()
"""
<lifelines.ExponentialAFTFitter: fitted with 432 observations, 318 censored>
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -686.37
time fit was run = 2019-06-27 15:13:18 UTC
---
coef exp(coef) se(coef) z p -log2(p) lower 0.95 upper 0.95
lambda_ fin 0.37 1.44 0.19 1.92 0.06 4.18 -0.01 0.74
age 0.06 1.06 0.02 2.55 0.01 6.52 0.01 0.10
race -0.30 0.74 0.31 -0.99 0.32 1.63 -0.91 0.30
wexp 0.15 1.16 0.21 0.69 0.49 1.03 -0.27 0.56
mar 0.43 1.53 0.38 1.12 0.26 1.93 -0.32 1.17
paro 0.08 1.09 0.20 0.42 0.67 0.57 -0.30 0.47
prio -0.09 0.92 0.03 -3.03 <0.005 8.65 -0.14 -0.03
_intercept 4.05 57.44 0.59 6.91 <0.005 37.61 2.90 5.20
_fixed _intercept 0.00 1.00 0.00 nan nan nan 0.00 0.00
---
"""
CoxPHFitter().fit(load_rossi(), 'week', 'arrest').print_summary()
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
partial log-likelihood = -658.75
time fit was run = 2019-06-27 15:17:41 UTC
---
coef exp(coef) se(coef) z p -log2(p) lower 0.95 upper 0.95
fin -0.38 0.68 0.19 -1.98 0.05 4.40 -0.75 -0.00
age -0.06 0.94 0.02 -2.61 0.01 6.79 -0.10 -0.01
race 0.31 1.37 0.31 1.02 0.31 1.70 -0.29 0.92
wexp -0.15 0.86 0.21 -0.71 0.48 1.06 -0.57 0.27
mar -0.43 0.65 0.38 -1.14 0.26 1.97 -1.18 0.31
paro -0.08 0.92 0.20 -0.43 0.66 0.59 -0.47 0.30
prio 0.09 1.10 0.03 3.19 <0.005 9.48 0.04 0.15
---
Concordance = 0.64
Log-likelihood ratio test = 33.27 on 7 df, -log2(p)=15.37
"""
Notice the sign change! So if you want the constant baseline hazard in the model, it's exp(-4.05).
Floating point numbers with finite precision are sometimes printed with different precision under identical conditions. This happens on Python 3.x under both Linux and Windows, and the error carries over into subsequent calculations.
for i in range(100):
    k = 1 + i / 100
    print(k)
1.0
1.01
1.02
...
1.13
1.1400000000000001
1.15
...
1.35
1.3599999999999999
1.37
1.38
1.3900000000000001
1.4
...
1.56
1.5699999999999998
1.58
1.5899999999999999
1.6
1.6099999999999999
...
1.98
1.99
It is possible to control the displayed precision with string formatting:

for i in range(100):
    k = 1 + i / 100
    print("%.2f" % k)

where the 2 in "%.2f" is the number of decimal places to show. Keep in mind that the stored value is unchanged; formatting only affects how it is printed, and usually a few decimal places are enough.
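If the rounding should live in the value itself rather than just in the printed text, `round` or the standard-library `decimal` module can be used (a small sketch based on the 1.14 case above):

```python
from decimal import Decimal

# round() changes the stored value, not just how it is displayed.
k = 1 + 14 / 100
print(k)            # 1.1400000000000001
print(round(k, 2))  # 1.14

# Decimal arithmetic stays exact for decimal steps like 0.01.
d = Decimal('1') + Decimal('14') / Decimal('100')
print(d)            # 1.14
```

`Decimal` avoids the binary representation issue entirely, at the cost of slower arithmetic.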