I have a pandas DataFrame that looks like this:
timestamp phase
2019-07-01 07:10:00 a
2019-07-01 07:11:00 a
2019-07-01 07:12:00 b
2019-07-01 07:13:00 b
2019-07-01 07:17:00 a
2019-07-01 07:19:00 a
2019-07-01 07:20:00 c
I am working on a function that creates a DataFrame with a duration for every phase, i.e. how long each phase lasts until the next phase begins.
I already have a solution, but I am new to Python and have no clue how to write it as a user-defined function.
This is my "static" solution:
df['prev_phase'] = df['phase'].shift(1)
df['next_phase'] = df['phase'].shift(-1)
# keep only the rows where the phase changes; .copy() avoids SettingWithCopyWarning
dfshift = df[df.next_phase != df.prev_phase].copy()
dfshift['delta'] = (dfshift['timestamp'] - dfshift['timestamp'].shift()).fillna(pd.Timedelta(0))
dfshift['helpcolumn'] = dfshift['phase'].shift(1)
dfshift2 = dfshift[dfshift.helpcolumn == dfshift['phase']].copy()
dfshift3 = dfshift2[['timestamp', 'phase', 'delta']].copy()
dfshift3['deltaminutes'] = dfshift3['delta'] / np.timedelta64(60, 's')
This gives me this as output (example):
timestamp phase delta deltaminutes
2019-05-01 06:44:00 a 0 days 04:51:00 291.0
2019-05-01 07:25:00 b 0 days 00:40:00 40.0
2019-05-01 21:58:00 a 0 days 14:32:00 872.0
2019-05-01 22:07:00 c 0 days 00:08:00 8.0
I just need this in a function.
Thanks in advance
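For reference, the shift logic above can also be folded into a user-defined function. The sketch below takes a slightly different route, grouping consecutive phase blocks via cumsum instead of chained shifts; the function name `phase_durations` and the reading of "duration" as last-minus-first timestamp within each block are my assumptions:

```python
import numpy as np
import pandas as pd

def phase_durations(df):
    """Sketch: one row per consecutive phase block, with its duration."""
    df = df.copy()
    # a new block starts whenever the phase differs from the previous row
    block = df['phase'].ne(df['phase'].shift()).cumsum()
    g = df.groupby(block)
    out = pd.DataFrame({'timestamp': g['timestamp'].first(),
                        'phase': g['phase'].first(),
                        'delta': g['timestamp'].last() - g['timestamp'].first()})
    out['deltaminutes'] = out['delta'] / np.timedelta64(60, 's')
    return out.reset_index(drop=True)

df = pd.DataFrame({'timestamp': pd.to_datetime(
                       ['2019-07-01 07:10:00', '2019-07-01 07:11:00',
                        '2019-07-01 07:12:00', '2019-07-01 07:13:00',
                        '2019-07-01 07:17:00', '2019-07-01 07:19:00',
                        '2019-07-01 07:20:00']),
                   'phase': ['a', 'a', 'b', 'b', 'a', 'a', 'c']})
result = phase_durations(df)
```

Calling it on the sample frame yields one row per a/b/a/c block with its start timestamp and duration.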
Edit for @Tom
timestamp phase
2019-05-05 08:58:00 a
2019-05-05 08:59:00 a
2019-05-05 09:00:00 b
2019-05-05 09:01:00 b
2019-05-05 09:02:00 b
2019-05-05 09:03:00 b
...
...
2019-05-05 09:38:00 b
2019-05-05 09:39:00 c
2019-05-05 09:40:00 c
2019-05-05 09:41:00 c
Those are the two columns + index.
df = pd.DataFrame({"timestamp": ["2019-07-01 07:10:00",
                                 "2019-07-01 07:11:00",
                                 "2019-07-01 07:12:00",
                                 "2019-07-01 07:13:00",
                                 "2019-07-01 07:17:00",
                                 "2019-07-01 07:19:00",
                                 "2019-07-01 07:20:00"],
                   "phase": ["a", "a", "b", "b", "a", "a", "c"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])
# Create a 'phase_id' column to track when phase changes
df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
# Groupby new 'phase_id' variable and get time range for each phase
df_tdiff = df.groupby("phase_id").diff().reset_index()
df_tdiff.columns = ['phase_id', 'timediff']
# Merge this to old dataframe
df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
This then gives:
df_new
timestamp phase phase_id timediff
0 2019-07-01 07:10:00 a 1 00:01:00
1 2019-07-01 07:11:00 a 1 00:01:00
2 2019-07-01 07:12:00 b 3 00:01:00
3 2019-07-01 07:13:00 b 3 00:01:00
4 2019-07-01 07:17:00 a 5 00:02:00
5 2019-07-01 07:19:00 a 5 00:02:00
6 2019-07-01 07:20:00 c 7 NaT
Finally:
df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
df_new
timestamp phase timediff
0 2019-07-01 07:10:00 a 00:01:00
1 2019-07-01 07:12:00 b 00:01:00
2 2019-07-01 07:17:00 a 00:02:00
3 2019-07-01 07:20:00 c NaT
Of course, if you need that all as a function (as originally requested), then:
def get_phase_timediff(df):
    # Create a 'phase_id' column to track when phase changes
    df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
    # Groupby new 'phase_id' variable and get time range for each phase
    df_tdiff = df.groupby("phase_id").diff().reset_index()
    df_tdiff.columns = ['phase_id', 'timediff']
    # Merge this to old dataframe
    df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
    # Groupby 'phase_id' again for final output
    df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
    return df_new
Related
Hi, I am currently running a for loop over the unique dates in my DataFrame and passing each subset to a function.
However, what I want is to iterate over the unique date-and-hour combinations (e.g. 2020-12-18 15:00, 2020-12-18 16:00) in my DataFrame. Is there any possible way to do this?
This is my code and a sample of my DataFrame:
for day in df['DateTime'].dt.day.unique():
    testdf = df[df['DateTime'].dt.day == day]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s",
                      6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
DateTime Values
0 2020-12-18 15:00:00 554.0
1 2020-12-18 15:00:00 594.0
2 2020-12-18 15:00:00 513.0
3 2020-12-18 16:00:00 651.0
4 2020-12-18 16:00:00 593.0
5 2020-12-18 17:00:00 521.0
6 2020-12-18 17:00:00 539.0
7 2020-12-18 17:00:00 534.0
8 2020-12-18 18:00:00 562.0
9 2020-12-19 08:00:00 511.0
10 2020-12-19 09:00:00 512.0
11 2020-12-19 09:00:00 584.0
12 2020-12-19 09:00:00 597.0
13 2020-12-22 09:00:00 585.0
14 2020-12-22 09:00:00 620.0
15 2020-12-22 09:00:00 593.0
You can use groupby if you need to filter by all the dates in the DataFrame:
for day, testdf in df.groupby('DateTime'):
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s",
                      6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
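A self-contained illustration of this groupby iteration, with a hypothetical `summarize` standing in for `mk.original_test` (which requires the pymannkendall package) and made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-12-18 15:00', '2020-12-18 15:00',
                                '2020-12-18 16:00', '2020-12-18 17:00']),
    'Values': [554.0, 594.0, 651.0, 521.0],
})

def summarize(sub):
    # hypothetical placeholder for mk.original_test
    return {'n': len(sub), 'mean': sub['Values'].mean()}

rows = []
for ts, testdf in df.groupby('DateTime'):   # one group per unique timestamp
    rows.append({'DateTime': ts, **summarize(testdf)})

# building the frame once from a list avoids repeated DataFrame.append calls
result_df = pd.DataFrame(rows)
```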
EDIT: If you need to filter only some dates from a list, use:
for date in ['2020-12-18 15:00', '2020-12-18 16:00']:
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s",
                      6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT1:
for date in df['DateTime'].unique():   # unique() avoids processing the same timestamp twice
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s",
                      6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
I have the DataFrame below:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0.151357 -0.103219 0.410599
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0.205158 0.313068 0.854096
Whenever an even-numbered column in a row (B, D) contains a negative value (the condition may vary, e.g. the value is negative or greater than 10), I want to set the next odd-numbered column in that row (C, E) to 0.
Expected output:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0 -0.103219 0
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0 0.313068 0.854096
A one-liner solution would be best, but a function for this would also work.
This solution requires the date column to be set as the index:
df.set_index('date', inplace=True)
df[df.shift(axis=1) < 0] = 0
df.reset_index(inplace=True)
df.shift(axis=1) returns a new DataFrame with all the columns shifted one position to the right (the default shift of 1 can be changed using the periods parameter). This lets you compare each cell with the one to its left.
Source: DataFrame.shift
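Applied end-to-end to a shortened version of the sample data above, the approach behaves like this (a minimal runnable sketch):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019-07-01 00:00', '2019-07-01 00:10'],
                   'B': [0.40, -0.95],
                   'C': [0.97, 0.15],
                   'D': [2.24, -0.10],
                   'E': [1.86, 0.41]})
df.set_index('date', inplace=True)
# mask is True wherever the cell to the left is negative; those cells become 0
df[df.shift(axis=1) < 0] = 0
df.reset_index(inplace=True)
```

In the second row, C and E are zeroed (their left-hand neighbours B and D are negative), while B and D themselves keep their values.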
I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
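A runnable sketch of this vectorised lookup, with made-up multipliers and only three rows:

```python
import pandas as pd

df_lookup = pd.Series([0.5, 2.0], index=[0, 1])   # hypothetical weight per hour of day
df = pd.DataFrame({'data-2': [100.0, 200.0, 300.0]},
                  index=pd.to_datetime(['2015-08-09 00:00',
                                        '2015-08-09 00:00',
                                        '2015-08-09 01:00']))
# df.index.hour picks the matching lookup row for every timestamp in one shot
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
```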
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
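The join approach can be sketched end-to-end like this (illustrative numbers; `df_lookup` here is a tiny stand-in for the real lookup table):

```python
import pandas as pd

df_lookup = pd.DataFrame({'multiplier': [0.5, 2.0]})   # row label = hour of day
df = pd.DataFrame({'data-2': [100.0, 300.0]},
                  index=pd.to_datetime(['2015-08-09 00:00', '2015-08-09 01:00']))
df['hour'] = df.index.hour
# join matches the 'hour' column of df against the index of df_lookup
df = df.join(df_lookup, how='left', on='hour')
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
```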
I'm trying to create dummy variables for holidays in a dataset. I have a couple of date ranges (pd.date_range()) containing holidays, and a DataFrame to which I would like to append a dummy that indicates whether the datetime of each row falls inside one of those holiday ranges.
Small example:
ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if branch is never entered, so the dates are never matched. Is there any way to do this? Alternative methods to produce the dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
It is better to use dt.date with Series.isin, and finally convert the boolean mask to integers (True becomes 1):
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare against a DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = df.index.date.isin(ChristmasBreak).astype(int)
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print (df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
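A quick runnable check of this approach. Note that it matches whole dates, i.e. midnight timestamps; for an index with intraday timestamps, compare `df.index.date` instead:

```python
import pandas as pd

ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
# whole-date index straddling the start of the break
df = pd.DataFrame({'Number': [1, 2, 3, 4]},
                  index=pd.date_range('2014-12-18', '2014-12-21', freq='D'))
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
```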
I am trying to calculate the difference between rows based on multiple columns. The data set is very large; I am pasting dummy data below that describes the problem.
I want to calculate the daily difference in weight at a pet + name level. So far I have only come up with the solution of concatenating these columns and creating a MultiIndex based on the new column and the date column, but I think there should be a better way. In the real dataset I use more than 3 columns to calculate the row difference.
df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
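A tiny runnable check of this groupby diff, with made-up pets and weights:

```python
import pandas as pd

df = pd.DataFrame({'pet': ['dog', 'dog', 'cat'],
                   'name': ['rex', 'rex', 'tom'],
                   'weight': [10.0, 10.5, 4.0]})
df['pet_name'] = df.pet + df.name
# within each pet_name group, diff subtracts the previous row's weight
df['diffs'] = df.groupby('pet_name')['weight'].diff()
```

The first row of each group has no predecessor, so its diff is NaN.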
Or use groupby with the 2 columns directly:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                            .sample(N, replace=True),
                   'weight': np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name','date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name','date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print (df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434