Python dataframes - How to track recent lowest value

I have a dataframe called data with values like the ones below:
SN value
1 895.1
2 900.94
3 920.26
4 918.9
5 927.23
6 919.32
7 923.33
8 896.42
9 898.72
10 881.03
11 879.56
12 882.68
13 879.13
14 901.05
15 905.84
16 850.2
17 870.14
I need to keep track of the most recent lowest value.
I tried:
# create a column to track the recent low, initialised to the first value,
# so it can be updated whenever a new low is found
data['new_low'] = data.iat[0, 0]
# compare the current value against the tracked low and flag new lows
data['is_new_low'] = data.value.lt(data.new_low)
# if a new low is found, take the current value, otherwise keep the previous value
data['new_low'] = np.where(data['is_new_low'], data.value, data.new_low.shift())
This code works for the first pass, but when the trend flips a second time it stops updating. See the row with SN 14.
My code output:
SN value new_low is_new_low
1 895.1 NaN FALSE
2 900.94 895.1 FALSE
3 920.26 895.1 FALSE
4 918.9 895.1 FALSE
5 927.23 895.1 FALSE
6 919.32 895.1 FALSE
7 923.33 895.1 FALSE
8 896.42 895.1 FALSE
9 898.72 895.1 FALSE
10 881.03 881.03 TRUE
11 879.56 879.56 TRUE
12 882.68 882.68 TRUE
13 879.13 879.13 TRUE
14 901.05 895.1 FALSE # should be 879.13, but 895.1 is produced
15 905.84 895.1 FALSE
16 850.2 895.1 TRUE
17 870.14 895.1 TRUE
Desired output:
SN value new_low is_new_low
1 895.1 NaN FALSE
2 900.94 895.1 FALSE
3 920.26 895.1 FALSE
4 918.9 895.1 FALSE
5 927.23 895.1 FALSE
6 919.32 895.1 FALSE
7 923.33 895.1 FALSE
8 896.42 895.1 FALSE
9 898.72 895.1 FALSE
10 881.03 881.03 TRUE
11 879.56 879.56 TRUE
12 882.68 882.68 TRUE
13 879.13 879.13 TRUE
14 901.05 879.13 FALSE
15 905.84 879.13 FALSE
16 850.2 850.2 TRUE
17 870.14 870.14 FALSE
How can I achieve this?

Your np.where version fails because it is not recursive: each row's new_low is computed from the column as it was before the update (shifted by one row), so a new low cannot propagate forward across later rows. Instead, you can use df.itertuples() to iterate through the rows and update new_low as you go: for every row, check whether its value is less than the new_low of the previous row; if so, set new_low to the current value, otherwise carry over the previous row's new_low.
for row in data.itertuples():
    if row.Index:  # skip the first row, which has no previous row to compare against
        if data.at[row.Index, "value"] < data.at[row.Index - 1, "new_low"]:
            data.at[row.Index, "new_low"] = data.at[row.Index, "value"]
        else:
            data.at[row.Index, "new_low"] = data.at[row.Index - 1, "new_low"]
You can also do something like this:
df['new_low'] = df['value'].cummin()
And if you'd like to use numpy for better performance, you can use:
np.minimum.accumulate(df.value)
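For reference, a minimal self-contained sketch (on a shortened version of the question's data) that rebuilds both columns with the vectorised approach. Note that new_low here is the cummin itself, so the first row shows the value rather than the NaN in the desired output:
import pandas as pd

data = pd.DataFrame({'value': [895.1, 900.94, 881.03, 901.05]})

running_min = data['value'].cummin()                        # lowest value seen so far
data['is_new_low'] = data['value'].lt(running_min.shift())  # strictly below every earlier value
data['new_low'] = running_min
print(data)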

Related

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to count successive True or False values (not the overall totals; the count must restart each time the value toggles). From these run lengths I eventually want to calculate the mean(), max() and min() number of days. Is it possible to do this in pandas?
Solution if all datetimes are consecutive:
You can create helper Series for consecutive groups by Series.shift and Series.cumsum, then get counts by GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False  True_False
False       2             3
            4             1
True        1             2
            3             4
dtype: int64
Last, aggregate min, max and mean per the first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
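For intuition, a minimal standalone sketch of the shift/cumsum labelling trick on five values: each time the value changes, the cumulative sum ticks up, so every run gets its own group id.
import pandas as pd

s = pd.Series([True, True, False, False, False])
g = s.ne(s.shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 2] -> one run of True, then one run of False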
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2

pandas dataframe groups check the number of unique values of a column is one but exclude empty strings

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces. I want to use
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but also treat groups whose invoice_no values are spaces or empty strings as same_invoice_no = False. I am wondering how to do that. The result should look like:
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True
Empty strings are counted by nunique (so the id 3 group evaluates to True), but NaNs are not. Replace the empty strings with NumPy NaN:
df.replace('', np.nan, inplace = True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
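Since the question mentions cells that contain spaces as well as truly empty strings, a whitespace-aware regex replace is a safer first step; a small sketch, assuming invoice_no is a string column:
import numpy as np

# treat empty and whitespace-only cells as missing before counting unique values
df['invoice_no'] = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)
df['same_invoice_no'] = df.groupby('id')['invoice_no'].transform('nunique') == 1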

panda dataframe row wise iteration with referencing previous row values for conditional matching

I have to find out how many times a bike was overspeeding, and for each instance, for how long (for simplicity, for how many km).
df = pd.DataFrame({'bike':['b1']*15, 'km':list(range(1,16)), 'speed':[20,30,38,33,28,39,26,33,35,46,53,27,37,42,20]})
>>> df
bike km speed
0 b1 1 20
1 b1 2 30
2 b1 3 38
3 b1 4 33
4 b1 5 28
5 b1 6 39
6 b1 7 26
7 b1 8 33
8 b1 9 35
9 b1 10 46
10 b1 11 53
11 b1 12 27
12 b1 13 37
13 b1 14 42
14 b1 15 20
#Expected result is
bike last_OS_loc for_how_long_on_OS
b1 4 2km
b1 11 5km
b1 15 1km
The logic:
Flag speed >= 30 as Overspeed_Flag.
If the speed stays above 30 for 1 km or more, those consecutive rows are treated as one overspeed session (e.g. b1 was overspeeding from km 2 to 4, km 8 to 11 and km 13 to 14; km 6 on its own is NOT an overspeed session, since it is a single row with no continuation above 30).
For each session, measure how long (how many km) the bike stayed over the limit; refer to the expected result above.
Also find the last km mark of each overspeed session.
Kindly suggest how I can achieve this, and let me know if anything in the question is unclear.
P.S.: I am also trying myself, but it is a little complex for me (I am mostly confused about how to mark whether a row is a continuation of the overspeed flag or a single instance). I will report back if I succeed. Thanks in advance.
You can use:
#boolean mask
mask = df['speed'] >= 30
#consecutive groups
df['g'] = mask.ne(mask.shift()).cumsum()
#get size of each group
df['count'] = mask.groupby(df['g']).transform('size')
#filter by mask and remove unique rows
df = df[mask & (df['count'] > 1)]
print (df)
bike km speed g count
1 b1 2 30 2 3
2 b1 3 38 2 3
3 b1 4 33 2 3
7 b1 8 33 6 4
8 b1 9 35 6 4
9 b1 10 46 6 4
10 b1 11 53 6 4
12 b1 13 37 8 2
13 b1 14 42 8 2
#aggregate the first and last km values per session
df1 = df.groupby(['bike','g'])['km'].agg([('last_OS_loc', 'last'),
                                          ('for_how_long_on_OS', 'first')])
#subtract the first km from the last
df1['for_how_long_on_OS'] = df1['last_OS_loc'] - df1['for_how_long_on_OS']
#data cleaning
df1 = df1.reset_index(level=1, drop=True).reset_index()
print (df1)
bike last_OS_loc for_how_long_on_OS
0 b1 4 2
1 b1 11 3
2 b1 14 1
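On pandas 0.25 or newer, the same aggregation step can also be written with named aggregation, which some find easier to read. A sketch of the equivalent step, assuming the filtered df from above (first_km is a helper column name introduced here):
df1 = (df.groupby(['bike', 'g'], as_index=False)
         .agg(last_OS_loc=('km', 'last'), first_km=('km', 'first')))
df1['for_how_long_on_OS'] = df1['last_OS_loc'] - df1['first_km']
df1 = df1.drop(columns=['g', 'first_km'])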
EDIT:
print (pd.concat([mask,
                  mask.shift(),
                  mask.ne(mask.shift()),
                  mask.ne(mask.shift()).cumsum()], axis=1,
                 keys=('mask', 'shifted', 'not equal (!=)', 'cumsum')))
mask shifted not equal (!=) cumsum
0 False NaN True 1
1 True False True 2
2 True True False 2
3 True True False 2
4 False True True 3
5 True False True 4
6 False True True 5
7 True False True 6
8 True True False 6
9 True True False 6
10 True True False 6
11 False True True 7
12 True False True 8
13 True True False 8
14 False True True 9
Here is another approach using a couple of helper Series and a lambda func:
#rows that belong to an overspeed run of at least 2 km
os_session = (df['speed'].ge(30) & (df['speed'].shift(-1).ge(30) | df['speed'].shift().ge(30))).astype(int)
#consecutive groups
groups = (os_session.diff(1) != 0).astype('int').cumsum()
f_how_long = lambda x: x.max() - x.min()
grouped_df = (df.groupby([os_session, groups, 'bike'])['km']
                .agg([('last_OS_loc', 'max'),
                      ('for_how_long_on_OS', f_how_long)])
                .xs(1, level=0)  # keep only the overspeed groups (os_session == 1)
                .reset_index(level=0, drop=True))
print(grouped_df)
print(grouped_df)
last_OS_loc for_how_long_on_OS
bike
b1 4 2
b1 11 3
b1 14 1

Pandas first date condition is met while another condition is active

I have a dataframe with a time series of scores. My goal is to detect when the score is larger than a certain threshold th, and then to find when the score goes back to 0. It is quite easy to find each condition separately:
dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())
However, I don't know the most pythonic way to filter dates_2 so that it only flags dates after an 'active' dates_1 event has been observed.
Perhaps I could use an auxiliary column 'active', set to True whenever score > th and reset to False once the condition for dates_2 is met; that way I could ask for the change in sign AND active == True. However, that approach requires iteration, and I'm wondering whether there is a vectorized solution to my problem.
Any thoughts on how to improve my approach?
Sample data:
date score
2010-01-04 0.0
2010-01-05 -0.3667779798467592
2010-01-06 -1.9641427199568868
2010-01-07 -0.49976215445519134
2010-01-08 -0.7069108074548405
2010-01-11 -1.4624766212523337
2010-01-12 -0.9132777669357441
2010-01-13 0.16204588193577152
2010-01-14 0.958085568609925
2010-01-15 1.4683022129399834
2010-01-19 3.036016680985081
2010-01-20 2.2357911432637345
2010-01-21 2.8827438241030707
2010-01-22 -3.395977874791837
Expected output, with th = 0.94:
date active
2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 False
2010-01-11 False
2010-01-12 False
2010-01-13 False
2010-01-14 True
2010-01-15 True
2010-01-19 True
2010-01-20 True
2010-01-21 True
2010-01-22 False
Not Vectorized!
def alt_cond(s, th):
    active = False
    for x in s:
        # while inactive, switch on when x >= th; while active, stay on while x > 0
        active = [x >= th, x > 0][int(active)]
        yield active

df.assign(A=[*alt_cond(df.score, 0.94)])
date score A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
Vectorized (Sort Of)
I used Numba to really speed things up. Still a loop but should be very fast if you can install numba
import numpy as np
from numba import njit

@njit
def alt_cond(s, th):
    active = False
    out = np.zeros(len(s), dtype=np.bool_)
    for i, x in enumerate(s):
        if active:
            if x <= 0:
                active = False
        else:
            if x >= th:
                active = True
        out[i] = active
    return out

df.assign(A=alt_cond(df.score.values, .94))
Response to Comment
You can use a dictionary of column names and threshold values and iterate:
th = {'score': 0.94}
df.join(pd.DataFrame(
    np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
    df.index, [f"{k}_A" for k in th]
))
date score score_A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
I'm assuming your data is in a pandas dataframe and 'date' is your index column. Then this is how I'd do it:
th = 0.94 # Threshold value
i = df[df.score>th].index[0] # Check the index for the first condition
df[i:][df.score<0].index[0] # Check the index for the second condition, after the index of the first condition
So: use conditional indexing to find the index for the first condition (df.score > th), then check for the second condition (df.score < 0), but begin looking only from the index found for the first condition (df[i:]).

Skipping every nth row in pandas

I am trying to slice my dataframe by skipping every 4th row. The best way I could find is to get the index of every 4th row and then select all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.
One possible solution is to create a mask by modulo and filter by boolean indexing:
df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, you can use a slightly changed version of @jpp's solution from the comments:
b = df.drop(df.index[::4], axis=0)
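A positional variant that also works when the index values are duplicated: iloc accepts a boolean array, so the modulo mask can be applied directly (a sketch under the same setup, with numpy already imported as np):
b = df.iloc[np.arange(len(df)) % 4 != 0]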
