I am trying to create a report that is grouped by day of week for each year.
I have a df that looks like this:
                          s1       s2  srd
dt
2004-02-04 11:21:00  2365.79  2372.37 -7.0
2004-02-05 10:15:00  2365.79  2368.03 -2.0
2004-02-17 06:43:00  2421.05  2425.26 -4.0
2004-02-17 12:43:00  2418.42  2420.53 -2.0
2004-02-17 12:44:00  2420.39  2420.53 -0.0
The dt index is in datetime format.
What I am looking for is a dataframe that looks like this (I only need the srd column, and the aggregation function can be anything: sum, count, etc.):
                srd
dayOfWeek year
Mon       2004   10
          2005   11
          2006    8
          2007  120
Tues      2004  105
          2005  105
I have tried dayOfWeekDf = df.resample('B'), but I get a dataframe that looks like it is split by week number.
I also tried df.groupby([df.index.weekday, df.index.year])['srd'].transform('sum'), but it does not actually group the rows, as I get the following (Feb 17th appears 3 times):
srd
dt
2004-02-04 11:21:00 81.0
2004-02-05 10:15:00 203.0
2004-02-17 06:43:00 37.0
2004-02-17 12:43:00 37.0
2004-02-17 12:44:00 37.0
If you want the dayOfWeek and year names in the index, you can assign them:
>>> df.assign(year=df.index.year, dayOfWeek=df.index.weekday_name).groupby(['dayOfWeek','year']).srd.sum()
dayOfWeek year
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
Otherwise, you can keep the approach you were using but omit the transform; transform broadcasts each group's aggregate back onto every original row, which is why Feb 17th appeared three times:
>>> df.groupby([df.index.weekday_name, df.index.year])['srd'].sum()
dt dt
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
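Note that weekday_name was removed in pandas 1.0; a minimal runnable sketch of the same grouping with its replacement, day_name() (the frame is rebuilt here from the question's sample values):
import pandas as pd

# Rebuild the question's frame (srd column only) for illustration
idx = pd.to_datetime(['2004-02-04 11:21', '2004-02-05 10:15',
                      '2004-02-17 06:43', '2004-02-17 12:43',
                      '2004-02-17 12:44'])
df = pd.DataFrame({'srd': [-7.0, -2.0, -4.0, -2.0, -0.0]},
                  index=idx.rename('dt'))

# day_name() is the modern replacement for weekday_name
print(df.groupby([df.index.day_name(), df.index.year])['srd'].sum())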
Having a dataframe like this (wide format; reconstructed here from the outputs in the answers below):
Time      Day1  Day2  Day3
6am-2pm   15.4  13.4  45.0
2pm-10pm  15.0   2.1   3.4
10pm-6am  14.0  22.0  35.0
I would like to know the most efficient way to transform it into a long format with Day, Time and Value columns, one row per Day-Time pair.
I tried to generate all the combinations between the Time column and the days and then manually create the Value column by looking up the corresponding Day-Time cell, but I'm sure there must be a more efficient way.
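For reference, a minimal construction of that input frame, so the snippets below can be run as-is:
import pandas as pd

df = pd.DataFrame({
    'Time': ['6am-2pm', '2pm-10pm', '10pm-6am'],
    'Day1': [15.4, 15.0, 14.0],
    'Day2': [13.4, 2.1, 22.0],
    'Day3': [45.0, 3.4, 35.0],
})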
If the original index is not important to you, you could also use the .melt() method, which has the advantage of grouping the days, so you get all the values for one day after another:
df1 = df.melt(id_vars='Time', var_name='Day', value_name='Value')
Result:
       Time   Day  Value
0   6am-2pm  Day1   15.4
1  2pm-10pm  Day1   15.0
2  10pm-6am  Day1   14.0
3   6am-2pm  Day2   13.4
4  2pm-10pm  Day2    2.1
5  10pm-6am  Day2   22.0
6   6am-2pm  Day3   45.0
7  2pm-10pm  Day3    3.4
8  10pm-6am  Day3   35.0
You could even rearrange the columns like this, which in my opinion makes it more readable:
df1 = df1.reindex(columns=['Day','Time','Value'])
Result:
    Day      Time  Value
0  Day1   6am-2pm   15.4
1  Day1  2pm-10pm   15.0
2  Day1  10pm-6am   14.0
3  Day2   6am-2pm   13.4
4  Day2  2pm-10pm    2.1
5  Day2  10pm-6am   22.0
6  Day3   6am-2pm   45.0
7  Day3  2pm-10pm    3.4
8  Day3  10pm-6am   35.0
Alternatively, use set_index and stack (note this orders the rows by Time rather than by Day):
out = (df.set_index('Time').stack().rename_axis(index=['Time', 'Day'])
.rename('Value').reset_index())
print(out)
# Output
       Time   Day  Value
0   6am-2pm  Day1   15.4
1   6am-2pm  Day2   13.4
2   6am-2pm  Day3   45.0
3  2pm-10pm  Day1   15.0
4  2pm-10pm  Day2    2.1
5  2pm-10pm  Day3    3.4
6  10pm-6am  Day1   14.0
7  10pm-6am  Day2   22.0
8  10pm-6am  Day3   35.0
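If you prefer the Day-grouped ordering of the melt approach, a stable sort restores it (a small sketch):
# mergesort is stable, so the within-day Time order is preserved
out = out.sort_values('Day', kind='mergesort').reset_index(drop=True)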
I have been dealing with a time series with one entry per hour over a year. To better analyse the data, I have been resampling by month with pandas and summing the results with df = df.resample('M').sum()
As the last hour of the last day runs from 23:00 31/12 to 00:00 01/01 of the following year, the final hour is resampled into January of the following year (e.g. my time series is for 2020, so the last hour of 31/12/2020 is resampled into January 2021). This means I lose data for December.
I have considered adding the data back into December, but is there a better way to achieve this?
Unfortunately you need to add it to the previous hour, e.g. by:
import pandas as pd

rng = pd.date_range('2017-12-31 12:00:00', periods=13, freq='H')
df = pd.DataFrame({'a': range(len(rng))}, index=rng)

# Shift only the timestamps that spilled into the latest year back one hour
y = df.index.year
df.index = df.index.where(y != y.max(), df.index - pd.Timedelta(1, unit='H'))
print (df)
a
2017-12-31 12:00:00 0
2017-12-31 13:00:00 1
2017-12-31 14:00:00 2
2017-12-31 15:00:00 3
2017-12-31 16:00:00 4
2017-12-31 17:00:00 5
2017-12-31 18:00:00 6
2017-12-31 19:00:00 7
2017-12-31 20:00:00 8
2017-12-31 21:00:00 9
2017-12-31 22:00:00 10
2017-12-31 23:00:00 11
2017-12-31 23:00:00 12
df = df.resample('M').sum()
print (df)
a
2017-12-31 78
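If the timestamps label the end of each hourly interval, another option (my own sketch, not from the original answer) is to shift the whole index back one hour before resampling:
import pandas as pd

rng = pd.date_range('2017-12-31 12:00:00', periods=13, freq='H')
df = pd.DataFrame({'a': range(len(rng))}, index=rng)

# Move every end-of-hour label back into the interval it describes
df.index = df.index - pd.Timedelta(1, unit='H')
print(df.resample('M').sum())  # everything now lands in 2017-12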
I have an hourly time series that starts in 2013 and ends in 2020, as below, and I want to plot only the day on which the system load reached its peak:
date_time system_load
2013-01-01 00:00:00 1.0
2013-01-01 01:00:00 0.9
2013-01-01 02:00:00 0.5
...
2020-12-31 21:00:00 2.1
2020-12-31 22:00:00 1.8
2020-12-31 23:00:00 0.8
The intended dataframe has one day (24 hours) per year:
date_time system_load
2013-07-09 00:00:00 3.1
2013-07-09 02:00:00 3.0
2013-07-09 03:00:00 4.8
2013-07-09 04:00:00 2.6
...
2013-07-09 21:00:00 3.7
2013-07-09 22:00:00 3.9
2013-07-09 23:00:00 5.1
2014-09-09 00:00:00 4.1
2014-09-09 02:00:00 5.3
2014-09-09 03:00:00 6.0
2014-09-09 04:00:00 4.8
...
2014-09-09 21:00:00 3.5
2014-09-09 22:00:00 2.6
2014-09-09 23:00:00 1.6
...
...
2020-06-01 00:00:00 4.2
2020-06-01 02:00:00 3.6
2020-06-01 03:00:00 3.9
2020-06-01 04:00:00 2.8
...
2020-06-01 21:00:00 2.7
2020-06-01 22:00:00 4.8
2020-06-01 23:00:00 3.8
1. Get only the date and year parts from the date_time column.
2. Group by the year column and find the rows containing the max value of the system_load column in each group.
3. Get all the rows from the original dataframe whose date matches a date on which system_load hit its max.
4. Plot the bar chart.
df['date_time'] = pd.to_datetime(df['date_time']) # Ensure the `date_time` column is datetime type
df['just_date'] = df['date_time'].dt.date
df['year'] = df['date_time'].dt.year
idx = df.groupby(['year'])['system_load'].transform('max') == df['system_load']
df[df['just_date'].isin(df[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
If several days within one year share the same max system_load value, the above code returns all of them. If you want to keep only the first such day, you can use pandas.DataFrame.idxmax():
idx = df.groupby(['year'])['system_load'].idxmax()
df[df['just_date'].isin(df.loc[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
Here's an approach to solve your problem:
Let sourcedf contain the input data in the form of two columns, 'TimeStamp' and 'Load'.
Then do the following:
sourcedf['Date'] = sourcedf['TimeStamp'].dt.date  # calendar date of each reading
dfg = sourcedf.groupby('Date')
ldList = dfg['Load'].max().to_list()              # daily maxima
tgtDate = dfg.max().index.to_list()[ldList.index(max(ldList))]  # date of the overall max
dfout = sourcedf[sourcedf['Date'] == tgtDate]
dfout will then contain just the day on which the max load was experienced.
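A more direct equivalent (a sketch under the same column assumptions) collapses the list bookkeeping into idxmax:
# Date whose daily maximum load is the overall maximum
tgtDate = sourcedf.groupby('Date')['Load'].max().idxmax()
dfout = sourcedf[sourcedf['Date'] == tgtDate]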
I hope you can help me with this:
I'm trying to append the missing weekend dates to the df['StartDate'] column, keeping the data in the rest of the columns, with Hours showing 0 or NaN for the added rows.
I don't need every single missing date between the dates shown in df['StartDate']; I only need to add 'Saturday' and 'Sunday' wherever they are missing.
Original Dataframe:
EmployeeId StartDate weekday Hours
111 1/20/2017 Friday 6
111 1/25/2017 Wednesday 5
111 1/30/2017 Monday 2
The desired final output would look like this:
EmployeeId StartDate weekday Hours
111 1/20/2017 Friday 6
111 1/21/2017 Saturday NaN
111 1/22/2017 Sunday NaN
111 1/25/2017 Wednesday 5
111 1/28/2017 Saturday NaN
111 1/29/2017 Sunday NaN
111 1/30/2017 Monday 2
One way is to create a separate dataframe covering the min-to-max date range of your dataframe, filter it down to weekends, and concatenate both frames. Duplicate dates can be handled by dropping them with keep='first', which keeps the rows from your original df.
df["StartDate"] = pd.to_datetime(df["StartDate"])  # ensure datetimes before building the range
s = pd.DataFrame(
    {"StartDate": pd.date_range(df.StartDate.min(), df.StartDate.max(), freq="D")}
)
s["weekday"] = s.StartDate.dt.day_name()
s = s.loc[s["weekday"].isin(["Saturday", "Sunday"])]
df_new = (
    pd.concat([df, s], sort=False)
    .drop_duplicates(subset="StartDate", keep="first")  # keep original rows when a date clashes
    .sort_values("StartDate")
)
print(df_new)
EmployeeId StartDate weekday Hours
0 111.0 2017-01-20 Friday 6.0
1 NaN 2017-01-21 Saturday NaN
2 NaN 2017-01-22 Sunday NaN
1 111.0 2017-01-25 Wednesday 5.0
8 NaN 2017-01-28 Saturday NaN
9 NaN 2017-01-29 Sunday NaN
2 111.0 2017-01-30 Monday 2.0
To fill the NaN EmployeeId values with the ones above them, you can forward-fill:
df_new['EmployeeId'] = df_new['EmployeeId'].ffill()
print(df_new)
EmployeeId StartDate weekday Hours
0 111.0 2017-01-20 Friday 6.0
1 111.0 2017-01-21 Saturday NaN
2 111.0 2017-01-22 Sunday NaN
1 111.0 2017-01-25 Wednesday 5.0
8 111.0 2017-01-28 Saturday NaN
9 111.0 2017-01-29 Sunday NaN
2 111.0 2017-01-30 Monday 2.0
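An alternative (my own sketch, assuming a single employee as in the sample) reindexes over the weekend dates directly:
import pandas as pd

df = pd.DataFrame({
    'EmployeeId': [111, 111, 111],
    'StartDate': pd.to_datetime(['1/20/2017', '1/25/2017', '1/30/2017']),
    'weekday': ['Friday', 'Wednesday', 'Monday'],
    'Hours': [6, 5, 2],
})

# All weekend days in the range, plus the dates already present
full = pd.date_range(df['StartDate'].min(), df['StartDate'].max(), freq='D')
wanted = full[full.day_name().isin(['Saturday', 'Sunday'])].union(df['StartDate'])

df_new = (df.set_index('StartDate').reindex(wanted)   # NaN rows for the added weekends
            .assign(weekday=lambda d: d.index.day_name(),
                    EmployeeId=lambda d: d['EmployeeId'].ffill())
            .rename_axis('StartDate').reset_index())
print(df_new)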
I have a dataframe with a RangeIndex, timestamps in the first column and several thousands hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate by a daily Grouper, using idxmax and idxmin to get the timestamps of each day's max and min temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
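If you want friendlier column names, named aggregation (pandas 0.25+) can label them directly; a sketch applied to the original hourly frame, with column names of my own choosing:
out = (df.groupby(pd.Grouper(freq='D'))['T-C']
         .agg(Tmax_time='idxmax', Tmin_time='idxmin', Tmax='max', Tmin='min'))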