I have a DataFrame with groundwater level time series and I am trying to remove the outliers from the data. I intend to do it using a rolling window, and the outlier removal method I want to use is the Generalized Extreme Studentized Deviate (ESD) test. Because my time series are sometimes not normally distributed, I want to apply this method within a specific time window (12 months or 24 months) of monthly data to get better results.
from __future__ import print_function, division
import numpy as np
import matplotlib.pylab as plt
from PyAstronomy import pyasl
# Convert data given at:
# http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
# to array.
x = np.array([float(x) for x in "-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 \
1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 \
1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 \
2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93 3.21 3.26 3.30 3.59 \
3.68 4.30 4.64 5.34 5.42 6.01".split()])
# Apply the generalized ESD
r = pyasl.generalizedESD(x, 10, 0.05, fullOutput=True)
print("Number of outliers: ", r[0])
print("Indices of outliers: ", r[1])
print(" R Lambda")
for i in range(len(r[2])):
    print("%2d %8.5f %8.5f" % ((i+1), r[2][i], r[3][i]))
# Plot the "data"
plt.plot(x, 'b.')
# and mark the outliers.
for i in range(r[0]):
    plt.plot(r[1][i], x[r[1][i]], 'rp')
plt.show()
I simply want to apply the code above to a rolling window in my DataFrame and remove the outliers.
Thank you.
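One possible way to run an outlier test per window is to slide a fixed-length window over the series and flag every point that the test marks in at least one window. The sketch below only illustrates that idea under stated assumptions: `rolling_outlier_mask` and `mad_detect` are hypothetical helper names, the series is made up, and `mad_detect` is a simple median/MAD stand-in, not the generalized ESD test. Any function that returns outlier positions within a window can be passed as `detect`.

```python
import numpy as np

def mad_detect(window, thresh=3.5):
    """Stand-in detector (NOT generalized ESD): positions whose
    median/MAD-based score exceeds `thresh`."""
    med = np.median(window)
    mad = np.median(np.abs(window - med))
    if mad == 0:
        return np.array([], dtype=int)
    score = 0.6745 * np.abs(window - med) / mad
    return np.flatnonzero(score > thresh)

def rolling_outlier_mask(values, window, detect=mad_detect):
    """Boolean mask, True where `detect` flags a point in at least
    one window of length `window`."""
    values = np.asarray(values, dtype=float)
    mask = np.zeros(values.size, dtype=bool)
    for start in range(values.size - window + 1):
        # `detect` returns positions relative to the window start
        idx = np.asarray(detect(values[start:start + window]), dtype=int)
        mask[start + idx] = True
    return mask

# Made-up monthly series: a linear trend with one spike at position 20.
levels = 0.1 * np.arange(48)
levels[20] += 10.0

mask = rolling_outlier_mask(levels, window=12)
flagged = np.flatnonzero(mask).tolist()
print(flagged)  # → [20]

# Replace flagged values with NaN instead of dropping them,
# so the monthly spacing of the series is preserved.
cleaned = np.where(mask, np.nan, levels)

# Assumption: to use ESD instead of the stand-in, a wrapper such as
#   lambda w: pyasl.generalizedESD(w, 2, 0.05)[1]
# could be passed as `detect` (r[1] holds the outlier indices above).
```

For a DataFrame, the same mask could be computed from one column's values and then used to blank out or drop the flagged rows. Flagging a point as soon as a single window marks it is a deliberately strict choice; requiring several windows to agree would be a reasonable alternative.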
Below is code that I wrote for a project.
The plots I obtain via matplotlib are not correct.
It would be very helpful to know what I am doing wrong.
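import random
import numpy as np
randomFL = []
for i in range(0, 100):
    x = round(random.uniform(0.1, 10.1), 2)
    randomFL.append(x)
indexer = np.arange(10)[None, :] + np.arange(8)[:, None]
xyz = np.array(randomFL)[indexer]  # sliding windows; inferred from the output below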
[2.84, 5.03, 3.67, 3.65, 9.61, 8.76, 3.18, 8.12, 7.95, 0.36, 4.16, 8.44, 9.99, 0.44, 5.24, 4.5, 9.63, 6.75, 1.82, 6.14, 0.92, 8.02, 8.08, 6.57, 4.93, 2.96, 5.18, 1.55, 0.74, 6.64, 6.57, 8.96, 8.3, 3.04, 3.5, 3.94, 8.42, 0.45, 8.63, 6.09, 8.2, 0.93, 7.48, 6.73, 7.09, 4.88, 9.14, 5.33, 3.4, 9.45, 2.15, 8.54, 7.08, 8.25, 9.9, 0.43, 7.74, 9.56, 8.98, 2.24, 1.54, 0.93, 9.56, 9.08, 6.62, 4.71, 4.08, 0.63, 2.48, 8.34, 2.24, 5.67, 9.97, 7.92, 6.23, 7.76, 2.89, 0.8, 3.18, 7.9, 7.49, 5.54, 8.44, 0.13, 8.31, 6.21, 2.43, 10.02, 8.38, 1.93, 6.79, 5.28, 7.35, 2.23, 6.16, 9.04, 5.57, 2.7, 2.22, 4.4]
[[ 0 1 2 3 4 5 6 7 8 9]
[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]
[ 5 6 7 8 9 10 11 12 13 14]
[ 6 7 8 9 10 11 12 13 14 15]
[ 7 8 9 10 11 12 13 14 15 16]]
[[2.84 5.03 3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36]
[5.03 3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16]
[3.67 3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44]
[3.65 9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99]
[9.61 8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44]
[8.76 3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24]
[3.18 8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24 4.5 ]
[8.12 7.95 0.36 4.16 8.44 9.99 0.44 5.24 4.5 9.63]]
import matplotlib.pyplot as plt
plt.plot(xyz[0,:], indexer[0,:])
Similarly, I am not getting the correct plots for the subsequent rows either.
Where am I going wrong?
I am trying to use the summation expression in gnuplot but it is not working properly. My data file has the following structure, with many rows:
t x1 y1 z1 x2 y2 z2 x3 y3 z3 ... x98 y98 z98
I would like to plot the following equation:
u = (sqrt(sum(x)**2 + sum(y)**2 + sum(z)**2))/98
98 is the number of points (x,y,z).
What I have so far is how to plot the average of columns x1, x2, x3, ... as follows:
plot 'data file' u 1:((sum[i=0:ColCount-1] column(i*ColStep+ColStart))/ColCount) w lines ls 4 notitle
Where ColCount = 98, ColStep = 3 and ColStart=2.
I have been trying to plot the equation above, but it is not working. I would really appreciate any help.
What the following script does:
It takes the square root of the sum of (x1+x2+x3)**2, (y1+y2+y3)**2 and (z1+z2+z3)**2. You can adapt it to your column numbers.
But I'm still not sure whether this is what you want. Please clarify.
### summing up columns
reset session
$Data <<EOD
#t x1 y1 z1 x2 y2 z2 x3 y3 z3
1 1.11 1.21 1.31 2.11 2.21 2.31 3.11 3.21 3.31
2 1.12 1.22 1.32 2.12 2.22 2.32 3.12 3.22 3.32
3 1.13 1.23 1.33 2.13 2.23 2.33 3.13 3.23 3.33
4 1.14 1.24 1.34 2.14 2.24 2.34 3.14 3.24 3.34
5 1.15 1.25 1.35 2.15 2.25 2.35 3.15 3.25 3.35
6 1.16 1.26 1.36 2.16 2.26 2.36 3.16 3.26 3.36
7 1.17 1.27 1.37 2.17 2.27 2.37 3.17 3.27 3.37
8 1.18 1.28 1.38 2.18 2.28 2.38 3.18 3.28 3.38
9 1.19 1.29 1.39 2.19 2.29 2.39 3.19 3.29 3.39
EOD
ColStep = 3
ColCount = 3
mySum(ColStart) = sum[i=0:ColCount-1] column(i*ColStep+ColStart)
plot $Data u 1:(sqrt(mySum(2)**2 + mySum(3)**2 + mySum(4)**2)) w lp pt 7 notitle
### end of code
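As a cross-check of the numbers gnuplot should produce, the same per-row calculation can be sketched in Python. This is only an illustration: `summed_magnitude` is a hypothetical helper name, and the row is the first line of the sample data block. Unlike the plot command in the script, the sketch also divides by the number of points, as in the formula with /98.

```python
import math

def summed_magnitude(row, col_step=3, col_count=3):
    """row[0] is t, followed by x/y/z triplets.

    Returns sqrt(sum(x)**2 + sum(y)**2 + sum(z)**2) / col_count.
    """
    def col_sum(col_start):
        # gnuplot column numbers are 1-based, Python lists are 0-based
        return sum(row[i * col_step + col_start - 1] for i in range(col_count))

    total = math.sqrt(col_sum(2) ** 2 + col_sum(3) ** 2 + col_sum(4) ** 2)
    return total / col_count

# First row of the sample data: t, then (x, y, z) for three points.
first_row = [1, 1.11, 1.21, 1.31, 2.11, 2.21, 2.31, 3.11, 3.21, 3.31]
u = summed_magnitude(first_row)
print(round(u, 4))
```

For the real file with 98 points, the call would use col_count=98. Whether the division by the point count belongs in the plot depends on what the quantity is meant to represent.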
I have data with 109 columns and around 3000 rows. I would like to plot the average of x1 through x108 (ignoring the y and z columns). The data looks like:
time x1 y1 z1 x2 y2 z2 x3 y3 z3 ... x108 y108 z108
With just a few columns it works well, for example:
time x1 y1 z1 x2 y2 z2 x3 y3 z3
plot 'file.dat' u 1:(($2+$5+$8)/3) with lines ls 4
But the problem comes when the data is bigger, for instance 108 columns or more. I would not like to do it manually, because the data will get bigger later.
I have tried:
plot for [i=2:108:3] 'file.dat' u 1:(column(i)) with lines ls 4
But then I would get plots for each combination, which is not what I want. So, how could I plot just the average of x1 ... x108 (ignoring the y and z)?
The following should do what you're looking for. The example below averages z1, ..., z3. In your case the parameters for averaging x1, ..., x108 would be ColStart=2, ColStep=3, and ColCount=108.
Also check "help summation" in gnuplot.
### average over several columns
reset session
$Data <<EOD
#n x1 y1 z1 x2 y2 z2 x3 y3 z3
1 1.11 1.21 1.31 2.11 2.21 2.31 3.11 3.21 3.31
2 1.12 1.22 1.32 2.12 2.22 2.32 3.12 3.22 3.32
3 1.13 1.23 1.33 2.13 2.23 2.33 3.13 3.23 3.33
4 1.14 1.24 1.34 2.14 2.24 2.34 3.14 3.24 3.34
5 1.15 1.25 1.35 2.15 2.25 2.35 3.15 3.25 3.35
6 1.16 1.26 1.36 2.16 2.26 2.36 3.16 3.26 3.36
7 1.17 1.27 1.37 2.17 2.27 2.37 3.17 3.27 3.37
8 1.18 1.28 1.38 2.18 2.28 2.38 3.18 3.28 3.38
9 1.19 1.29 1.39 2.19 2.29 2.39 3.19 3.29 3.39
EOD
ColStart = 4
ColStep = 3
ColCount = 3
plot $Data u 1:((sum[i=0:ColCount-1] column(i*ColStep+ColStart))/ColCount) w lp pt 7 notitle
### end of code
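Before pointing the summation at a large file, it can help to check which columns the index expression i*ColStep+ColStart actually visits. The Python sketch below mirrors that arithmetic; `selected_columns` and `row_average` are hypothetical helper names, and the row is the first line of the sample data block.

```python
def selected_columns(col_start, col_step, col_count):
    """1-based column numbers visited by
    sum[i=0:ColCount-1] column(i*ColStep+ColStart)."""
    return [i * col_step + col_start for i in range(col_count)]

def row_average(row, col_start, col_step, col_count):
    """Average of the selected columns for one data row."""
    cols = selected_columns(col_start, col_step, col_count)
    # gnuplot columns are 1-based, Python lists are 0-based
    return sum(row[c - 1] for c in cols) / col_count

first_row = [1, 1.11, 1.21, 1.31, 2.11, 2.21, 2.31, 3.11, 3.21, 3.31]

z_cols = selected_columns(4, 3, 3)
print(z_cols)  # → [4, 7, 10]

z_mean = row_average(first_row, 4, 3, 3)
print(round(z_mean, 2))  # → 2.31
```

The largest selected column is (ColCount-1)*ColStep+ColStart, so that value should not exceed the number of columns in the file. If it does, ColCount is too large for the data.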
I want to select the rows of pandas Dataframe_1 whose value in the fourth column appears anywhere in the first column of Dataframe_2, and copy these rows to a new table.
Here I also include the dataframes:
1 2 3 4
chr1 128611 128681 cuffs_1_128645 .
chr1 186868 186933 cuffs_2_186901 .
chr1 186978 187035 cuffs_3_187015 .
chr1 187054 187122 cuffs_4_187082 .
chr1 262712 262773 cuffs_5_262742 .
1 2 3 4 5 6 7 8
cuffs_100001_101338862 1.24 3.11 1.86 11.19 5.59 8.08 0.62 0
cuffs_100004_101354225 2.49 0.62 1.86 1.86 2.49 1.24 0.00 0
cuffs_100045_101386584 14.92 14.92 3.11 10.57 5.59 15.54 0.62 0
cuffs_100089_101719129 2.49 0.62 1.86 5.59 1.86 1.86 0.00 0
cuffs_100111_101726996 6.84 0.00 3.73 3.11 6.84 2.49 0.62 0
Both dataframes are imported from .csv and are huge, so here I've put only a few rows and columns.
This is what I tried:
import pandas as pd
df1 = pd.DataFrame.from_csv(Dataframe_1, sep = '\t', index_col=list(range(0,1,2)), header = None)
df2 = pd.DataFrame.from_csv(Dataframe_2, sep = '\t', index_col=list(range(0,1,2)), header = None)
df1 = df1[df1[3] == df2[0]]
df1.to_csv(fileout, sep = '\t', header = False)
When I run this I get a traceback of eight or so lines referring to the pandas package files index.pyx and hashtable.pyx, which I don't understand.
Got it!
Apparently, none of the filtering commands I tested, be it df1 = df1[df1[3].isin(df2[0])] or df1 = df1[df1[3] == df2[0]], can find the "0" column, because it had been used as the row index: list(range(0,1,2)) evaluates to [0], so index_col made column 0 the index. The way out is to import Dataframe_2 with the range (1,2,3) instead of (0,1,2), which leads to the following layout of df2:
0 2 3 4 5 6 7 8
1.24 cuffs_100001_101338862 3.11 1.86 11.19 5.59 8.08 0.62 0
2.49 cuffs_100004_101354225 0.62 1.86 1.86 2.49 1.24 0.00 0
14.92 cuffs_100045_101386584 14.92 3.11 10.57 5.59 15.54 0.62 0
2.49 cuffs_100089_101719129 0.62 1.86 5.59 1.86 1.86 0.00 0
6.84 cuffs_100111_101726996 0.00 3.73 3.11 6.84 2.49 0.62 0
Here the "0" column is no longer the row index, so we can apply df1 = df1[df1[3].isin(df2[0])]. NOTE: applying df1 = df1[df1[3] == df2[0]] will raise the error message "Series lengths must match to compare".
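A minimal sketch of the same filter with pd.read_csv, since DataFrame.from_csv was removed in pandas 1.0. It assumes tab-separated input without a header row. The two in-memory files are stand-ins: the first reuses rows from the question, while the second is made up so that some identifiers match.

```python
import io
import pandas as pd

# Stand-ins for the two tab-separated files.
file_1 = io.StringIO(
    "chr1\t128611\t128681\tcuffs_1_128645\t.\n"
    "chr1\t186868\t186933\tcuffs_2_186901\t.\n"
    "chr1\t186978\t187035\tcuffs_3_187015\t.\n"
)
file_2 = io.StringIO(
    "cuffs_1_128645\t1.24\t3.11\n"
    "cuffs_3_187015\t2.49\t0.62\n"
)

# header=None numbers the columns 0, 1, 2, ...; no index_col is passed,
# so the identifier columns stay ordinary columns.
df1 = pd.read_csv(file_1, sep="\t", header=None)
df2 = pd.read_csv(file_2, sep="\t", header=None)

# Keep the rows of df1 whose column 3 value occurs anywhere in df2 column 0.
matches = df1[df1[3].isin(df2[0])]
print(matches[3].tolist())  # → ['cuffs_1_128645', 'cuffs_3_187015']
```

Because isin tests membership rather than comparing the two Series element by element, the frames do not need to have the same length.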