Function to calculate timespan for a certain event - python-3.x

I have a pandas dataframe which looks like this
timestamp phase
2019-07-01 07:10:00 a
2019-07-01 07:11:00 a
2019-07-01 07:12:00 b
2019-07-01 07:13:00 b
2019-07-01 07:17:00 a
2019-07-01 07:19:00 a
2019-07-01 07:20:00 c
I am working on a function that creates a dataframe with a duration for every phase, i.e. how long each phase lasts until it hits the next phase.
I already have a solution, but I have no clue how to write this as a user-defined function, as I am new to Python.
This is my "static" solution:
df['prev_phase'] = df["phase"].shift(1)
df['next_phase'] = df["phase"].shift(-1)
# keep only the first and last row of each phase run
dfshift = df[df.next_phase != df.prev_phase].copy()
dfshift["delta"] = (dfshift["timestamp"] - dfshift["timestamp"].shift()).fillna(pd.Timedelta(0))
dfshift["helpcolumn"] = dfshift["phase"].shift(1)
# keep the row that closes each run (same phase as the previous remaining row)
dfshift2 = dfshift[dfshift.helpcolumn == dfshift["phase"]]
dfshift3 = dfshift2[["timestamp", "phase", "delta"]].copy()
dfshift3["deltaminutes"] = dfshift3['delta'] / np.timedelta64(60, 's')
This gives me this as output (example):
timestamp phase delta deltaminutes
2019-05-01 06:44:00 a 0 days 04:51:00 291.0
2019-05-01 07:25:00 b 0 days 00:40:00 40.0
2019-05-01 21:58:00 a 0 days 14:32:00 872.0
2019-05-01 22:07:00 c 0 days 00:08:00 8.0
I just need this in a function.
Thanks in advance
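For reference, a minimal sketch that wraps the "static" steps above into a user-defined function (the name phase_durations is made up; it assumes the frame has timestamp and phase columns as in the example):
import numpy as np
import pandas as pd

def phase_durations(df):
    # Mark the first and last row of each consecutive phase run
    df = df.copy()
    df['prev_phase'] = df['phase'].shift(1)
    df['next_phase'] = df['phase'].shift(-1)
    edges = df[df.next_phase != df.prev_phase].copy()
    # Time elapsed between consecutive edge rows
    edges['delta'] = edges['timestamp'].diff().fillna(pd.Timedelta(0))
    # Keep the row that closes a run: its phase equals the previous edge's phase
    closes = edges[edges['phase'].shift(1) == edges['phase']]
    out = closes[['timestamp', 'phase', 'delta']].copy()
    out['deltaminutes'] = out['delta'] / np.timedelta64(60, 's')
    return out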
Edit for @Tom
timestamp phase
2019-05-05 08:58:00 a
2019-05-05 08:59:00 a
2019-05-05 09:00:00 b
2019-05-05 09:01:00 b
2019-05-05 09:02:00 b
2019-05-05 09:03:00 b
...
...
2019-05-05 09:38:00 b
2019-05-05 09:39:00 c
2019-05-05 09:40:00 c
2019-05-05 09:41:00 c
Those are the two columns plus the index.

df = pd.DataFrame({"timestamp": ["2019-07-01 07:10:00",
"2019-07-01 07:11:00",
"2019-07-01 07:12:00",
"2019-07-01 07:13:00",
"2019-07-01 07:17:00",
"2019-07-01 07:19:00",
"2019-07-01 07:20:00"],
"phase": ["a", "a", "b", "b", "a" ,"a", "c"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])
# Create a 'phase_id' column to track when phase changes
df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
# Groupby new 'phase_id' variable and get time range for each phase
df_tdiff = df.groupby("phase_id")["timestamp"].diff().reset_index()
df_tdiff.columns = ['phase_id', 'timediff']
# Merge this to old dataframe
df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
This then gives:
df_new
timestamp phase phase_id timediff
0 2019-07-01 07:10:00 a 1 00:01:00
1 2019-07-01 07:11:00 a 1 00:01:00
2 2019-07-01 07:12:00 b 3 00:01:00
3 2019-07-01 07:13:00 b 3 00:01:00
4 2019-07-01 07:17:00 a 5 00:02:00
5 2019-07-01 07:19:00 a 5 00:02:00
6 2019-07-01 07:20:00 c 7 NaT
Finally:
df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
df_new
timestamp phase timediff
0 2019-07-01 07:10:00 a 00:01:00
1 2019-07-01 07:12:00 b 00:01:00
2 2019-07-01 07:17:00 a 00:02:00
3 2019-07-01 07:20:00 c NaT
Of course, if you need that all as a function (as originally requested), then:
def get_phase_timediff(df):
    # Create a 'phase_id' column to track when the phase changes
    df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
    # Group by the new 'phase_id' variable and get the time step within each phase
    df_tdiff = df.groupby("phase_id")["timestamp"].diff().reset_index()
    df_tdiff.columns = ['phase_id', 'timediff']
    # Merge this back onto the old dataframe
    df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
    # Group by 'phase_id' again for the final output
    df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
    return df_new
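Usage on the sample frame built above (a quick check, not part of the original answer):
df_out = get_phase_timediff(df)
print(df_out)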

Related

Iterate over unique date and hour in the pandas dataframe to run a function

Hi, I am currently running a for loop over the unique dates in the dataframe and passing each subset to a function.
However, what I want is to iterate over the unique date and hour combinations (e.g. 2020-12-18 15:00, 2020-12-18 16:00) in my dataframe. Is there any possible way to do this?
This is my code and a sample of my dataframe.
for day in df['DateTime'].dt.day.unique():
    testdf = df[df['DateTime'].dt.day == day]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
DateTime Values
0 2020-12-18 15:00:00 554.0
1 2020-12-18 15:00:00 594.0
2 2020-12-18 15:00:00 513.0
3 2020-12-18 16:00:00 651.0
4 2020-12-18 16:00:00 593.0
5 2020-12-18 17:00:00 521.0
6 2020-12-18 17:00:00 539.0
7 2020-12-18 17:00:00 534.0
8 2020-12-18 18:00:00 562.0
9 2020-12-19 08:00:00 511.0
10 2020-12-19 09:00:00 512.0
11 2020-12-19 09:00:00 584.0
12 2020-12-19 09:00:00 597.0
13 2020-12-22 09:00:00 585.0
14 2020-12-22 09:00:00 620.0
15 2020-12-22 09:00:00 593.0
You can use groupby if you need to filter by all the dates in the DataFrame:
for day, testdf in df.groupby('DateTime'):
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT: If you need to filter only some dates from a list, use:
for date in ['2020-12-18 15:00', '2020-12-18 16:00']:
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT1:
for date in df['DateTime'].unique():
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
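If the timestamps were not already aligned to the hour, a small sketch of the same groupby idea after bucketing each timestamp to its hour (assumes pandas' dt.floor; the variable names are illustrative):
# Floor each timestamp to the hour, then iterate one group per unique date+hour
for bucket, testdf in df.groupby(df['DateTime'].dt.floor('h')):
    print(bucket, len(testdf))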

Select rows where every even column's value is negative and convert the values in the next column to 0 in python3

I have the dataframe below:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0.151357 -0.103219 0.410599
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0.205158 0.313068 0.854096
Suppose an even column in a row contains a negative value (there may be multiple conditions, e.g. the value is negative or greater than 10); then I want to set the next odd column's value in that row to 0.
Expected output:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0 -0.103219 0
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0 0.313068 0.854096
A one-liner solution would be best, but a function for this would also work.
This solution requires the date column to be set as the index:
df.set_index('date', inplace=True)
df[df.shift(axis=1) < 0] = 0
df.reset_index(inplace=True)
df.shift(axis=1) returns a new dataframe with all the columns shifted one position to the right (the shift distance can be changed using the periods parameter). This enables you to compare each cell with the one to its left.
Source: DataFrame.shift
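A hedged alternative that restricts the rule to explicit (source, target) column pairs, assuming B/D are the columns to test and C/E the ones to zero out, as in the expected output:
# Zero out the target column wherever its source column is negative
for src, dst in [('B', 'C'), ('D', 'E')]:
    df.loc[df[src] < 0, dst] = 0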

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
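If df_lookup is a one-column DataFrame rather than a Series, a hedged variant of the .loc approach above squeezes it to a Series first (squeeze and to_numpy just sidestep column-name and index-alignment issues):
# Squeeze the one-column frame to a Series, look up by hour, multiply positionally
df['data-3'] = df['data-2'] * df_lookup.squeeze().loc[df.index.hour].to_numpy()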

Indicate whether datetime of row is in a daterange

I'm trying to get dummy variables for holidays in a dataset. I have a couple of date ranges (pd.date_range()) with holidays and a dataframe to which I would like to append a dummy indicating whether the datetime of that row falls in one of the specified holiday ranges.
Small example:
ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if block is never entered, so matching the dates doesn't work. Is there any way to do this? Alternative methods to arrive at the dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
Better to use dt.date with Series.isin, and last convert the boolean mask to integers - Trues become 1:
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare with a DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = df.index.date.isin(ChristmasBreak).astype(int)
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print (df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
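The same isin idea extends to several holiday ranges at once; a small sketch with a second, made-up range added purely for illustration:
# Combine several holiday ranges into one lookup list (the spring range is hypothetical)
holidays = ChristmasBreak + list(pd.date_range('2015-04-03', '2015-04-06').date)
df['Holiday'] = df['Date'].dt.date.isin(holidays).astype(int)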

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The data set is very large; I am pasting dummy data below that describes the problem.
I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with the solution of concatenating these columns and creating a multiindex based on the new column and the date column, but I think there should be a better way. In the real dataset I have more than 3 columns that I am using to calculate the row difference.
df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby with 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                            .sample(N, replace=True),
                   'weight': np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name', 'date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name', 'date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print(df.head(20))
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name','date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print (df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434
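For a true day-over-day difference per pet+name (diffs taken across dates rather than within a single date), a minimal sketch building on the groupby answers above:
# Sort so rows within each pet+name group are in date order,
# then diff across all rows of the group rather than within one date
df = df.sort_values(['pet', 'name', 'date'])
df['daily_diff'] = df.groupby(['pet', 'name'])['weight'].diff()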
