I have a DataFrame of groundwater level time series and I am trying to remove the outliers from the data. The outlier removal method I want to use is the Generalized Extreme Studentized Deviate (ESD) test. Because my time series are sometimes not normally distributed, I intend to apply it over a rolling window (12 or 24 months of monthly data) rather than to the whole series at once, to get better results.
from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl
# Convert data given at:
# http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
# to array.
x = np.array([float(x) for x in "-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 \
1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 \
1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 \
2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93 3.21 3.26 3.30 3.59 \
3.68 4.30 4.64 5.34 5.42 6.01".split()])
# Apply the generalized ESD
r = pyasl.generalizedESD(x, 10, 0.05, fullOutput=True)
print("Number of outliers: ", r[0])
print("Indices of outliers: ", r[1])
print(" R Lambda")
for i in range(len(r[2])):
    print("%2d %8.5f %8.5f" % ((i+1), r[2][i], r[3][i]))
# Plot the "data"
plt.plot(x, 'b.')
# and mark the outliers.
for i in range(r[0]):
    plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()
I simply want to apply the code above to a rolling window in my DataFrame and remove the outliers.
Thank you.
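One possible way to structure this, as a rough sketch rather than a tested solution: slide a fixed-length window over the series, run a detector on each window, and flag every point that is reported in at least one window. The helper names `rolling_outlier_mask` and `mad_detector`, the `min_flags` parameter and the synthetic series are all made up for illustration. The detector used here is a simple modified z-score stand-in so that the example runs without PyAstronomy; it is not the generalized ESD test. The sketch also assumes the series has no missing values.

```python
import numpy as np
import pandas as pd


def mad_detector(window, threshold=3.5):
    """Stand-in detector (modified z-score), NOT the generalized ESD test.

    Returns the positions inside `window` that it considers outliers.
    """
    median = np.median(window)
    mad = np.median(np.abs(window - median))
    if mad == 0:
        return np.array([], dtype=int)
    score = 0.6745 * np.abs(window - median) / mad
    return np.flatnonzero(score > threshold)


def rolling_outlier_mask(series, window, detect, min_flags=1):
    """Run `detect` on every full window; flag points hit at least `min_flags` times."""
    values = series.to_numpy(dtype=float)
    hits = np.zeros(len(values), dtype=int)
    for start in range(len(values) - window + 1):
        for pos in detect(values[start:start + window]):
            hits[start + int(pos)] += 1  # map window position back to series position
    return pd.Series(hits >= min_flags, index=series.index)


# Hypothetical monthly series: a seasonal cycle with one artificial spike.
dates = pd.date_range("2000-01-01", periods=36, freq="MS")
level = pd.Series(10 + 0.5 * np.sin(2 * np.pi * np.arange(36) / 12), index=dates)
level.iloc[20] = 25.0

mask = rolling_outlier_mask(level, window=12, detect=mad_detector)
cleaned = level.where(~mask)  # outliers become NaN; use level[~mask] to drop them

# With PyAstronomy installed, the ESD call from the question could be plugged in
# instead (the maximum of 3 outliers per window is an arbitrary choice):
# detect = lambda w: pyasl.generalizedESD(w, 3, 0.05)[1]
```

Because each point belongs to several overlapping windows, `min_flags` lets you require that a point is reported more than once before it is removed. Whether that is appropriate depends on the data.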
Below is the code that I wrote for a project.
The plots I am getting from matplotlib are not correct.
It would be very helpful to know what I am doing wrong.
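import random
import numpy as np

randomFL = []
for i in range(0, 100):
    x = round(random.uniform(0.1, 10.1), 2)
    randomFL.append(x)
# Row k of indexer holds positions k..k+9: eight overlapping windows of length 10.
indexer = np.arange(10)[None, :] + np.arange(8)[:, None]
a = np.array(randomFL)
xyz = a[indexer]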
print(randomFL)
[2.84, 5.03, 3.67, 3.65, 9.61, 8.76, 3.18, 8.12, 7.95, 0.36, 4.16, 8.44, 9.99, 0.44, 5.24, 4.5, 9.63, 6.75, 1.82, 6.14, 0.92, 8.02, 8.08, 6.57, 4.93, 2.96, 5.18, 1.55, 0.74, 6.64, 6.57, 8.96, 8.3, 3.04, 3.5, 3.94, 8.42, 0.45, 8.63, 6.09, 8.2, 0.93, 7.48, 6.73, 7.09, 4.88, 9.14, 5.33, 3.4, 9.45, 2.15, 8.54, 7.08, 8.25, 9.9, 0.43, 7.74, 9.56, 8.98, 2.24, 1.54, 0.93, 9.56, 9.08, 6.62, 4.71, 4.08, 0.63, 2.48, 8.34, 2.24, 5.67, 9.97, 7.92, 6.23, 7.76, 2.89, 0.8, 3.18, 7.9, 7.49, 5.54, 8.44, 0.13, 8.31, 6.21, 2.43, 10.02, 8.38, 1.93, 6.79, 5.28, 7.35, 2.23, 6.16, 9.04, 5.57, 2.7, 2.22, 4.4]
print(indexer)
[[ 0 1 2 3 4 5 6 7 8 9]
[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]
[ 5 6 7 8 9 10 11 12 13 14]
[ 6 7 8 9 10 11 12 13 14 15]
[ 7 8 9 10 11 12 13 14 15 16]]
print(xyz)
[[2.84 5.03 3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36]
[5.03 3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16]
[3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44]
[3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99]
[9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44]
[8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24]
[3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24 4.5 ]
[8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24 4.5 9.63]]
import matplotlib.pyplot as plt
plt.figure()
plt.plot(xyz[0,:], indexer[0,:])
plt.show()
Similarly, I am not getting correct plots for the subsequent rows either.
What am I doing wrong here?
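A guess at what may be going on, since the question does not say what the expected plot looks like: `plt.plot(x, y)` takes the horizontal coordinates first, so `plt.plot(xyz[0,:], indexer[0,:])` puts the random values on the x-axis and the positions on the y-axis. If the intent is to draw each window's values against their positions, the arguments would be swapped, as in this sketch. The seeded generator and the axis labels are added here for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Reproducible stand-in for randomFL (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
values = np.round(rng.uniform(0.1, 10.1, size=100), 2)

# Same windowing as in the question: 8 overlapping windows of length 10.
indexer = np.arange(10)[None, :] + np.arange(8)[:, None]
windows = values[indexer]

fig, ax = plt.subplots()
for row_positions, row_values in zip(indexer, windows):
    # positions on the x-axis, values on the y-axis
    ax.plot(row_positions, row_values, marker="o")
ax.set_xlabel("position in randomFL")
ax.set_ylabel("value")
```

Calling `plt.show()` afterwards displays the figure. Overlapping windows share most of their points, so the eight curves will largely coincide.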
I am trying to use the summation expression in gnuplot, but it is not working properly. My data has the following column structure, with many rows:
t x1 y1 z1 x2 y2 z2 x3 y3 z3 ... x98 y98 z98
I would like to plot the following equation:
u = (sqrt(sum(x)**2 + sum(y)**2 + sum(z)**2))/98
Here 98 is the number of points (x,y,z).
What I have so far is a plot of the average of columns x1, x2, x3, ... as follows:
plot 'data file' u 1:((sum[i=0:ColCount-1] column(i*ColStep+ColStart))/ColCount) w lines ls 4 notitle
Where ColCount = 98, ColStep = 3 and ColStart=2.
I have been trying to plot the equation above, but it is not working. I would really appreciate any help.
What the following script does:
For each row, it takes the square root of (x1+x2+x3)**2 + (y1+y2+y3)**2 + (z1+z2+z3)**2. You can adapt this to your column numbers. Note that it does not divide by the number of points, which the formula in the question does.
But I'm still not sure whether this is what you want. Please clarify.
Code:
### summing up columns
reset session
$Data <<EOD
#t x1 y1 z1 x2 y2 z2 x3 y3 z3
1 1.11 1.21 1.31 2.11 2.21 2.31 3.11 3.21 3.31
2 1.12 1.22 1.32 2.12 2.22 2.32 3.12 3.22 3.32
3 1.13 1.23 1.33 2.13 2.23 2.33 3.13 3.23 3.33
4 1.14 1.24 1.34 2.14 2.24 2.34 3.14 3.24 3.34
5 1.15 1.25 1.35 2.15 2.25 2.35 3.15 3.25 3.35
6 1.16 1.26 1.36 2.16 2.26 2.36 3.16 3.26 3.36
7 1.17 1.27 1.37 2.17 2.27 2.37 3.17 3.27 3.37
8 1.18 1.28 1.38 2.18 2.28 2.38 3.18 3.28 3.38
9 1.19 1.29 1.39 2.19 2.29 2.39 3.19 3.29 3.39
EOD
ColStep = 3
ColCount = 3
mySum(ColStart) = sum[i=0:ColCount-1] column(i*ColStep+ColStart)
plot $Data u 1:(sqrt(mySum(2)**2 + mySum(3)**2 + mySum(4)**2)) w lp pt 7 notitle
### end of code
Result:
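For checking the numbers outside gnuplot, the same column arithmetic can be reproduced in NumPy. This is only a sketch: it uses the first two rows of the example data, the helper name `column_sum` is made up, and it includes the division by the number of points from the question's formula, which the gnuplot script above leaves out.

```python
import numpy as np

# First two rows of the example data: t x1 y1 z1 x2 y2 z2 x3 y3 z3
data = np.array([
    [1, 1.11, 1.21, 1.31, 2.11, 2.21, 2.31, 3.11, 3.21, 3.31],
    [2, 1.12, 1.22, 1.32, 2.12, 2.22, 2.32, 3.12, 3.22, 3.32],
])
col_step = 3
col_count = 3  # would be 98 for the data described in the question


def column_sum(col_start):
    """Row-wise sum of columns col_start, col_start+col_step, ... (1-based, as in gnuplot)."""
    cols = (col_start - 1) + col_step * np.arange(col_count)
    return data[:, cols].sum(axis=1)


# u = sqrt(sum(x)**2 + sum(y)**2 + sum(z)**2) / number of points, one value per row
u = np.sqrt(column_sum(2) ** 2 + column_sum(3) ** 2 + column_sum(4) ** 2) / col_count
print(u)
```

Dropping the final `/ col_count` gives the quantity that the gnuplot script plots.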
I have a data file with 109 columns and around 3000 rows. I would like to plot the average of x1 through x108 (ignoring y and z). The data looks like:
time x1 y1 z1 x2 y2 z2 x3 y3 z3 ... x108 y108 z108
With just a few columns it works well, for example:
time x1 y1 z1 x2 y2 z2 x3 y3 z3
plot 'file.dat' u 1:(($2+$5+$8)/3) with lines ls 4
But the problem comes when the data is bigger, for instance 108 columns or more. I would not like to do it manually, because the data will get bigger later.
I have tried:
plot for [i=2:108:3] 'file.dat' u 1:(column(i)) with lines ls 4
But then I get a separate curve for each column, which is not what I want. So, how could I plot just the average of x1 ... x108 (ignoring y and z)?
Thanks.
The following should do what you're looking for. The example below averages z1, ..., z3. In your case the parameters for averaging x1, ..., x108 would be ColStart=2, ColStep=3, and ColCount=108.
Also check help summation.
Code:
### average over several columns
reset session
$Data <<EOD
#n x1 y1 z1 x2 y2 z2 x3 y3 z3
1 1.11 1.21 1.31 2.11 2.21 2.31 3.11 3.21 3.31
2 1.12 1.22 1.32 2.12 2.22 2.32 3.12 3.22 3.32
3 1.13 1.23 1.33 2.13 2.23 2.33 3.13 3.23 3.33
4 1.14 1.24 1.34 2.14 2.24 2.34 3.14 3.24 3.34
5 1.15 1.25 1.35 2.15 2.25 2.35 3.15 3.25 3.35
6 1.16 1.26 1.36 2.16 2.26 2.36 3.16 3.26 3.36
7 1.17 1.27 1.37 2.17 2.27 2.37 3.17 3.27 3.37
8 1.18 1.28 1.38 2.18 2.28 2.38 3.18 3.28 3.38
9 1.19 1.29 1.39 2.19 2.29 2.39 3.19 3.29 3.39
EOD
ColStart = 4
ColStep = 3
ColCount = 3
plot $Data u 1:((sum[i=0:ColCount-1] column(i*ColStep+ColStart))/ColCount) w lp pt 7 notitle
### end of code
Result:
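To see which columns the expression `column(i*ColStep+ColStart)` picks up, here is a small NumPy sketch of the same selection using a strided slice. It uses the first two rows of the example data and the parameter values from the script above. The only thing to watch is that gnuplot counts columns from 1 and NumPy from 0.

```python
import numpy as np

# First two rows of the example data: n x1 y1 z1 x2 y2 z2 x3 y3 z3
data = np.array([
    [1, 1.11, 1.21, 1.31, 2.11, 2.21, 2.31, 3.11, 3.21, 3.31],
    [2, 1.12, 1.22, 1.32, 2.12, 2.22, 2.32, 3.12, 3.22, 3.32],
])
col_start, col_step, col_count = 4, 3, 3  # z1, z2, z3 as in the script

# gnuplot column col_start is NumPy column col_start - 1
first = col_start - 1
selected = data[:, first:first + col_step * col_count:col_step]
average = selected.mean(axis=1)  # one average per row

print(selected)
print(average)
```

With `col_start = 2` the slice selects x1, x2, x3 instead, which corresponds to `($2+$5+$8)/3` from the question.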
I want to find the rows of pandas Dataframe_1 whose value in the fourth column appears anywhere in the first column of Dataframe_2. I need to copy these rows to a new table.
EDIT
Here I also include the dataframes:
Dataframe_1:
1 2 3 4
0
chr1 128611 128681 cuffs_1_128645 .
chr1 186868 186933 cuffs_2_186901 .
chr1 186978 187035 cuffs_3_187015 .
chr1 187054 187122 cuffs_4_187082 .
chr1 262712 262773 cuffs_5_262742 .
Dataframe_2:
1 2 3 4 5 6 7 8
0
cuffs_100001_101338862 1.24 3.11 1.86 11.19 5.59 8.08 0.62 0
cuffs_100004_101354225 2.49 0.62 1.86 1.86 2.49 1.24 0.00 0
cuffs_100045_101386584 14.92 14.92 3.11 10.57 5.59 15.54 0.62 0
cuffs_100089_101719129 2.49 0.62 1.86 5.59 1.86 1.86 0.00 0
cuffs_100111_101726996 6.84 0.00 3.73 3.11 6.84 2.49 0.62 0
Both dataframes are imported from .csv and are huge, so here I've put only a few rows and columns.
This is what I tried:
import pandas as pd
df1 = pd.DataFrame.from_csv(Dataframe_1, sep = '\t', index_col=list(range(0,1,2)), header = None)
df2 = pd.DataFrame.from_csv(Dataframe_2, sep = '\t', index_col=list(range(0,1,2)), header = None)
df1 = df1[df1[3] == df2[0]]
df1.to_csv(fileout, sep = '\t', header = False)
When I run this I get eight (or so) lines of error output referring to the pandas package files index.pyx and hashtable.pyx, which I don't understand.
Got it!
Apparently, none of the filtering commands I tested, be it df1 = df1[df1[3].isin(df2[0])] or df1 = df1[df1[3] == df2[0]], recognises the "0" column, because it had been imported as the row index rather than as a regular column. The way out is to import Dataframe_2 with the index_col range set to (1,2,3) instead of (0,1,2), which gives df2 the following layout:
0 2 3 4 5 6 7 8
1
1.24 cuffs_100001_101338862 3.11 1.86 11.19 5.59 8.08 0.62 0
2.49 cuffs_100004_101354225 0.62 1.86 1.86 2.49 1.24 0.00 0
14.92 cuffs_100045_101386584 14.92 3.11 10.57 5.59 15.54 0.62 0
2.49 cuffs_100089_101719129 0.62 1.86 5.59 1.86 1.86 0.00 0
6.84 cuffs_100111_101726996 0.00 3.73 3.11 6.84 2.49 0.62 0
Here the "0" column is no longer the row index, so we can apply df1 = df1[df1[3].isin(df2[0])]. NOTE: df1 = df1[df1[3] == df2[0]] still fails, with the error message "Series lengths must match to compare".
Thanks!
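For anyone landing here later, a minimal self-contained sketch of the `isin` filter. `DataFrame.from_csv` has been removed from recent pandas versions, so this uses `pd.read_csv` and, as an assumption about what is wanted, keeps every column as a regular column instead of setting an index. The rows are shortened, made-up variants of the sample data (the real sample rows share no names), read from in-memory strings rather than files.

```python
import io
import pandas as pd

# Hypothetical tab-separated inputs; the second name in df1 also occurs in df2.
df1_text = (
    "chr1\t128611\t128681\tcuffs_1_128645\t.\n"
    "chr1\t186868\t186933\tcuffs_2_186901\t.\n"
    "chr1\t186978\t187035\tcuffs_3_187015\t.\n"
)
df2_text = (
    "cuffs_2_186901\t1.24\t3.11\n"
    "cuffs_9_999999\t2.49\t0.62\n"
)

# header=None gives integer column labels 0, 1, 2, ... and a default row index.
df1 = pd.read_csv(io.StringIO(df1_text), sep="\t", header=None)
df2 = pd.read_csv(io.StringIO(df2_text), sep="\t", header=None)

# Keep the rows of df1 whose fourth column (label 3) occurs in df2's first column.
matched = df1[df1[3].isin(df2[0])]
print(matched)
```

`isin` tests membership value by value, so the two frames can have different lengths, which is why it works where the element-wise `==` comparison does not.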