I have a bash script as below:
day=(58 34 107 91 43 39 41 76 37 47 70 74 56 19 95 38 48 96 50 76 89 79 46 105 26 88 69 87 23 82 99 77 114 52 87 63 33 52 57 45 48 49 55 60 34 107 48 40 25 20 16)
year=(1952 1953 1954 1955 1956 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004)
for dom in $day; do
for yrs in $year; do
ncks -O -d time,$dom imdJJAS$yrs.nc ac_$yrs.nc
done
done
Basically, I am trying to extract the time dimension for each year using the NCO ncks command. The script runs, but the outputs are incorrect. For year 1951, it successfully extracted the 58th time value, but from 1952 onwards it extracts the last value in the day array (16), which is incorrect.
I've tried setting {$day[a]} since it's an array, but if I use this, it extracts the last value in the array for all years instead.
I am not too sure what I'm doing wrong. I've looked through quite a few posts regarding this, but it doesn't seem to be working.
I'd appreciate any help.
Cheers!
$array by itself expands to only the first element of the array. To expand to the full array, use ${array[@]}:
for dom in "${day[@]}"; do
    for yrs in "${year[@]}"; do
        ncks -O -d "time,${dom}" "imdJJAS${yrs}.nc" "ac_${yrs}.nc"
    done
done
I also quoted your variable expansions and changed $dom and $yrs to ${dom} and ${yrs}. The latter prevents mistakenly referring to an undefined variable: $dom_abc is not the same as ${dom}_abc.
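If I understand your intention correctly, you are trying to use corresponding values from both arrays. In that case you need a numerical index, because for VAR in ARRAY iterates over all values of the array. Loop over the indices instead with for i in "${!year[@]}" and use "${day[i]}" and "${year[i]}" inside the loop.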
In Excel I did a two-way ANOVA with replication (is this the same as a two-way repeated-measures ANOVA?) and need to do a post hoc Tukey test. How do I do this in Excel 2016?
The day columns hold the score on the day the measurement was taken.
treatment  day6  day7  day10  day11
1          20    30    500    490
1          2     400   900    500
1          3     32    1000   145
2          67    56    45     89
2          54    67    67     23
2          78    77    68     90
3          32    32    34     99
3          56    58    103    23
3          17    45    115    1043
I have time series data of ice thickness. The plot is only useful for winter months, and there is no interest in seeing big gaps during summer months. Would it be possible to skip the summer months (say April to October) on the x-axis and instead show a smaller area in a different color, labeled Summer?
Let's take this data:
import numpy as np
import pandas as pd

n_samples = 20
index = pd.date_range(start='1/1/2018', periods=n_samples, freq='M')
values = np.random.randint(0, 100, size=n_samples)
data = pd.Series(values, index=index)
print(data)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-04-30 87
2018-05-31 51
2018-06-30 67
2018-07-31 22
2018-08-31 66
2018-09-30 55
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
2019-04-30 50
2019-05-31 75
2019-06-30 80
2019-07-31 84
2019-08-31 19
Freq: M, dtype: int64
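You can filter out the data that falls in the unwanted months: take the index of the Series, take its month, check whether it is in the range, and negate the result (with ~). Note that range(4,10) covers April through September only; use range(4,11) if October should be dropped as well.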
filtered1 = data[~data.index.month.isin(range(4,10))]
print(filtered1)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
If you plot that,
filtered1.plot()
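you get a plot in which the line is drawn straight across the removed months (image not preserved),
so you need to set the frequency, in this case monthly (M), which reinserts the missing months as NaN and leaves a gap in the line: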
filtered1.asfreq('M').plot()
Additionally, you could use filters like:
filtered2 = data[data.index.month.isin([1,2,3,11,12])]
filtered3 = data[~ data.index.month.isin([4,5,6,7,8,9,10])]
if you need to keep or drop specific months.
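To make the gap behaviour concrete, here is a minimal sketch on deterministic stand-in data. The consecutive integer values and the month-start ('MS') timestamps are assumptions chosen only so the result is reproducible; they are not the data from the question. It drops April through October and then re-expands the result to a regular monthly grid, so the removed months come back as NaN.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 20 month-start timestamps from January 2018,
# with values 0..19 instead of random numbers.
index = pd.date_range(start="2018-01-01", periods=20, freq="MS")
data = pd.Series(np.arange(20), index=index)

# April (4) through October (10); the stop value of range() is exclusive.
summer = range(4, 11)
winter_only = data[~data.index.month.isin(summer)]

# Re-expanding to a regular monthly grid reinserts the dropped months as NaN,
# which is what makes a line plot break instead of bridging the gap.
gapped = winter_only.asfreq("MS")

print(winter_only)
print(gapped)
```

Calling gapped.plot() instead of winter_only.plot() should then leave the summer months empty rather than connecting March to November.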
I have the following issue with python pandas (I am relatively new to it): I have a simple dataset with a column for date, and a corresponding column of values. I am able to sort this Dataframe by date and value by doing the following:
df = df.sort_values(['date', 'value'],ascending=False)
I obtain this:
date value
2019-11 100
2019-11 89
2019-11 87
2019-11 86
2019_11 45
2019_11 33
2019_11 24
2019_11 11
2019_11 8
2019_11 5
2019-10 100
2019-10 98
2019-10 96
2019-10 94
2019_10 94
2019_10 78
2019_10 74
2019_10 12
2019_10 3
2019_10 1
Now, what I would like to do, is to get rid of the lowest fifth percentile for the value column for EACH month (each group). I know that I should use a groupby method, and perhaps also a function:
df = df.sort_values(['date', 'value'],ascending=False).groupby('date', group_keys=False).apply(<???>)
The ??? is where I am struggling. I know how to suppress the lowest 5th percentile on a sorted Dataframe as a WHOLE, for instance by doing:
df = df[df.value > df.value.quantile(.05)]
This was the object of another post on StackOverflow. I know that I can also use numpy to do this, and that it is much faster, but my issue is really how to apply that to EACH GROUP independently (each portion of the value column sorted by month) in the Dataframe, not just the whole Dataframe.
Any help would be greatly appreciated
Thank you so very much,
Kind regards,
Berti
Use GroupBy.transform with a lambda function to get a Series of the same size as the original DataFrame, which makes it possible to filter by boolean indexing:
df = df.sort_values(['date', 'value'],ascending=False)
q = df.groupby('date')['value'].transform(lambda x: x.quantile(.05))
df = df[df.value > q]
print (df)
date value
4 2019_11 45
5 2019_11 33
6 2019_11 24
7 2019_11 11
8 2019_11 8
14 2019_10 94
15 2019_10 78
16 2019_10 74
17 2019_10 12
18 2019_10 3
0 2019-11 100
1 2019-11 89
2 2019-11 87
10 2019-10 100
11 2019-10 98
12 2019-10 96
You could create your own function and apply it:
import numpy as np

def remove_bottom_5_pct(arr):
    thresh = np.percentile(arr, 5)  # 5th percentile of this group's values
    return arr[arr > thresh]

df.groupby('date', sort=False)['value'].apply(remove_bottom_5_pct)
[out]
date
2019-11 0 100
1 89
2 87
3 86
4 45
5 33
6 24
7 11
8 8
2019-10 10 100
11 98
12 96
13 94
14 94
15 78
16 74
17 12
18 3
Name: value, dtype: int64
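As a small self-contained check of the per-group idea, the sketch below applies the transform-based filter to a made-up DataFrame. The ten rows are illustrative assumptions, not the data from the question. Each row is compared against the 5th percentile of its own month, so exactly the lowest value in each month is dropped here.

```python
import pandas as pd

# Illustrative data (assumption): two months with five values each.
df = pd.DataFrame({
    "date": ["2019-11"] * 5 + ["2019-10"] * 5,
    "value": [100, 89, 45, 11, 5, 98, 94, 74, 12, 1],
})

# Per-row threshold: every row receives the 5th percentile of its own month.
threshold = df.groupby("date")["value"].transform(lambda s: s.quantile(0.05))

# Keep only the rows strictly above their group's threshold.
trimmed = df[df["value"] > threshold]

print(trimmed)
```

Because transform returns a Series aligned with the original index, the result keeps all columns of df, whereas the apply-based variant returns only the value column with a (date, row) MultiIndex.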
Column A holds the dates, and columns B and C hold the measurements.
Dates Measurements
1 56 15
2 45 25
3 62 76
4 15 42
5 165 56
6 16 79
7 45 46
8 47 79
9 24 47
10 12 14
11 147 47
12 195 19
13 443 79
14 642 43
15 462 75
16 156 87
17 794 49
Start Date:2
Measurement:45
Code used to solve for the measurement
=VLOOKUP(B21,A2:C18,2,FALSE)
end date:14
Measure:642
=VLOOKUP(B22,A2:C18,2,FALSE)
I used vlookup to find me the values that I desire, but now I want to find the median values of that range from the start to end date in each column.
How can I code it so that once it selects the values, it can select the whole range and find the median values?
Since the values in your column A are sorted in ascending order, we can use the very efficient:
=MEDIAN(INDEX(B2:B18,MATCH(B21,A2:A18)):INDEX(B2:B18,MATCH(B22,A2:A18,0)))
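If you want to cross-check the result outside Excel, the same range selection can be sketched in Python. The function name median_between is a hypothetical helper; the lists simply restate the table from the question. It locates the start and end dates, slices the measurements between them (inclusive) and takes the median.

```python
from statistics import median

# Table from the question: dates in column A, measurements in columns B and C.
dates = list(range(1, 18))
col_b = [56, 45, 62, 15, 165, 16, 45, 47, 24, 12, 147, 195, 443, 642, 462, 156, 794]
col_c = [15, 25, 76, 42, 56, 79, 46, 79, 47, 14, 47, 19, 79, 43, 75, 87, 49]

def median_between(values, start_date, end_date):
    """Median of the values whose date lies in [start_date, end_date]."""
    first = dates.index(start_date)  # position of the start date
    last = dates.index(end_date)     # position of the end date
    return median(values[first:last + 1])

print(median_between(col_b, 2, 14))
print(median_between(col_c, 2, 14))
```

For the start date 2 and end date 14 used in the question, both columns give a median of 47.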
Regards
I am trying to calculate the 95th percentile of the time in which calls came back. Below is my result set. I am working with Excel 2010.
Milliseconds Number
0 1702
1 15036
2 14262
3 13190
4 9137
5 5635
6 3742
7 2628
8 1899
9 1298
10 963
11 727
12 503
13 415
14 311
15 235
16 204
17 140
18 109
19 83
20 72
21 55
22 52
23 35
24 33
25 25
26 15
27 18
28 14
29 15
30 13
31 19
32 23
33 19
34 21
35 20
36 25
37 26
38 13
39 12
40 10
41 17
42 6
43 7
44 8
45 4
46 7
47 9
48 11
49 12
50 9
51 9
52 9
53 8
54 10
55 10
56 11
57 3
58 7
59 7
60 2
61 5
62 7
63 5
64 5
65 2
66 3
67 2
68 1
70 1
71 2
72 1
73 4
74 1
75 1
76 1
77 3
80 1
81 1
85 1
87 2
93 1
96 1
100 1
107 1
112 1
116 1
125 1
190 1
356 1
450 1
492 1
497 1
554 1
957 1
Some background on what the above data means:
1702 calls came back in 0 milliseconds
15036 calls came back in 1 milliseconds
14262 calls came back in 2 milliseconds
etc etc
So to calculate the 95th percentile from the above data, I am using this formula in excel 2010-
=PERCENTILE.EXC(IF(TRANSPOSE(ROW(INDIRECT("1:"&MAX(H$2:H$96))))<=H$2:H$96,A$2:A$96),0.95)
Can anyone tell me whether the way I am doing this in Excel 2010 is right or not?
I am getting 95th percentile as 10 by using the above scenario.
Thanks for the help.
That's essentially the same question you asked here, and the formula is the one I suggested. As per my last comments on that question, the formula should work OK as long as you enter it correctly with CTRL+SHIFT+ENTER. I get 10 as the answer for this example using that formula.
I think you can verify manually that this is indeed the correct answer. If you keep a running total in an adjacent column, you can see where the 95th percentile is reached.
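The running-total check can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: percentile_from_counts is a hypothetical helper, it uses the simple nearest-rank definition (the first value whose running total reaches the requested share of all calls) rather than the interpolation that PERCENTILE.EXC performs, and the four-row table below is made up to keep the example short.

```python
from itertools import accumulate

def percentile_from_counts(values, counts, q):
    """Nearest-rank percentile from a frequency table.

    values must be sorted ascending; counts[i] is how often values[i] occurred.
    Returns the first value whose running total reaches q of all observations.
    """
    total = sum(counts)
    target = q * total
    for value, running in zip(values, accumulate(counts)):
        if running >= target:
            return value
    raise ValueError("q must be between 0 and 1")

# Made-up example (assumption): 20 calls in total.
milliseconds = [0, 1, 2, 3]
number = [5, 10, 4, 1]

# Running totals are 5, 15, 19, 20; 95% of 20 calls is 19, reached at 2 ms.
print(percentile_from_counts(milliseconds, number, 0.95))
```

Feeding in the Milliseconds and Number columns from the question in place of the made-up lists gives the same kind of check as the adjacent running-total column.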