I have the following dataset:
cod date value
0 1O8 2015-01-01 00:00:00 2.1
1 1O8 2015-01-01 01:00:00 2.3
2 1O8 2015-01-01 02:00:00 3.5
3 1O8 2015-01-01 03:00:00 4.5
4 1O8 2015-01-01 04:00:00 4.4
5 1O8 2015-01-01 05:00:00 3.2
6 1O9 2015-01-01 00:00:00 1.4
7 1O9 2015-01-01 01:00:00 8.6
8 1O9 2015-01-01 02:00:00 3.3
10 1O9 2015-01-01 03:00:00 1.5
11 1O9 2015-01-01 04:00:00 2.4
12 1O9 2015-01-01 05:00:00 7.2
I want to aggregate by cod and by month of date, taking the average of value, like this:
value
cod date
1O8 2015-01-01 3.3
1O9 2015-01-01 4.9
My data have the following dtypes: object(1), datetime64[ns](1), float64(1).
I tried to use the .groupby() function to aggregate:
df.groupby(['cod', 'date', 'value']).size().reset_index().groupby('value').mean()
But it didn't produce the correct result.
Use a Grouper:
df.groupby(["cod", pd.Grouper(key="date", freq="MS")]).mean()
Extra info on pbpython.com
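A minimal runnable sketch, assuming a frame built from the sample data above:

import pandas as pd

df = pd.DataFrame({
    'cod': ['1O8'] * 6 + ['1O9'] * 6,
    'date': list(pd.date_range('2015-01-01', periods=6, freq='H')) * 2,
    'value': [2.1, 2.3, 3.5, 4.5, 4.4, 3.2,
              1.4, 8.6, 3.3, 1.5, 2.4, 7.2],
})

# Group by cod and by calendar month ('MS' labels each group with the month start)
print(df.groupby(['cod', pd.Grouper(key='date', freq='MS')])['value'].mean())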
I have an hourly time series that starts in 2013 and ends in 2020, as below, and I want to plot only the day in which the system load reached its peak:
date_time system_load
2013-01-01 00:00:00 1.0
2013-01-01 01:00:00 0.9
2013-01-01 02:00:00 0.5
...
2020-12-31 21:00:00 2.1
2020-12-31 22:00:00 1.8
2020-12-31 23:00:00 0.8
The intended dataframe has one day (24 hours) per year:
date_time system_load
2013-07-09 00:00:00 3.1
2013-07-09 02:00:00 3.0
2013-07-09 03:00:00 4.8
2013-07-09 04:00:00 2.6
...
2013-07-09 21:00:00 3.7
2013-07-09 22:00:00 3.9
2013-07-09 23:00:00 5.1
2014-09-09 00:00:00 4.1
2014-09-09 02:00:00 5.3
2014-09-09 03:00:00 6.0
2014-09-09 04:00:00 4.8
...
2014-09-09 21:00:00 3.5
2014-09-09 22:00:00 2.6
2014-09-09 23:00:00 1.6
...
...
2020-06-01 00:00:00 4.2
2020-06-01 02:00:00 3.6
2020-06-01 03:00:00 3.9
2020-06-01 04:00:00 2.8
...
2020-06-01 21:00:00 2.7
2020-06-01 22:00:00 4.8
2020-06-01 23:00:00 3.8
1. Get only the date and year parts from the date_time column.
2. Group by the year column and get the row containing the max value of system_load in each group.
3. Get all the rows from the original dataframe whose date matches the date on which system_load is at its max.
4. Plot the bar chart.
df['date_time'] = pd.to_datetime(df['date_time']) # Ensure the `date_time` column is datetime type
df['just_date'] = df['date_time'].dt.date
df['year'] = df['date_time'].dt.year
idx = df.groupby(['year'])['system_load'].transform('max') == df['system_load']
df[df['just_date'].isin(df[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
If several days in one year share the same max system_load value, the above code returns all of them. If you want to keep only the first such day, you can use pandas.DataFrame.idxmax():
idx = df.groupby(['year'])['system_load'].idxmax()
df[df['just_date'].isin(df.loc[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
Here's an approach to solve your problem:
Let sourcedf contain the input data in the form of two columns, 'Date_Time' and 'Load'.
Then do the following:
sourcedf['Date'] = sourcedf['Date_Time'].dt.date  # strip the time part
dfg = sourcedf.groupby('Date')
ldList = dfg['Load'].max().to_list()
tgtDate = dfg.max().index.to_list()[ldList.index(max(ldList))]  # date of the overall peak
dfout = sourcedf[sourcedf['Date'] == tgtDate]
dfout will then contain just the rows for the date on which the max load was experienced.
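Note that this returns the single overall peak day across the whole series. For one peak day per year, as the question asks, a sketch along the same lines (same assumed column names, with a Year helper column added here) could be:

sourcedf['Year'] = sourcedf['Date_Time'].dt.year
# Date of the row holding each year's maximum load
peak_dates = sourcedf.loc[sourcedf.groupby('Year')['Load'].idxmax(), 'Date']
dfout = sourcedf[sourcedf['Date'].isin(peak_dates)]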
I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for one timestamp, but how could I apply it to those two columns?
Thanks for your kind help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
Use DataFrame.apply with the list of column names and to_datetime with the parameter unit='ms'. (The division attempt fails because to_datetime without a unit interprets the numbers as nanoseconds, so every value lands on 1970-01-01.)
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates add lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
My dataset df looks like this:
DateTimeVal Open
2017-01-01 17:00:00 5.1532
2017-01-01 17:01:00 5.3522
2017-01-01 17:02:00 5.4535
2017-01-01 17:03:00 5.3567
2017-01-01 17:04:00 5.1512
....
It is a minute-frequency dataset.
In my calculation, a single day (24 hours) is defined as:
17:00:00 Sunday to 16:59:00 Monday and so on for other days
What I want to do is find the AVG and STD of each 24-hour window, from 17:00:00 Sunday to 16:59:00 Monday, and so on for all days.
What did I do?
I did a rolling calculation to find the AVG, but it works over a fixed number of rows rather than over the 17:00-to-16:59 time range:
# day avg
# 7 day rolling avg
df = (
    df.assign(DAY_AVG=df.rolling(window=1*24*60)['Open'].mean(),
              DAY7_AVG=df.rolling(window=7*24*60)['Open'].mean())  # 7DAY_AVG is not a valid identifier
      .groupby(df['DateTimeVal'].dt.date)
      .last()
)
I need help with these 2 things:
1. How do I find the AVG and STD within a fixed time period?
2. How do I find the AVG and STD within that fixed time period for 7-day and 14-day rolling windows?
Use resample with base:
#Create empty dataframe for 2 days
df = pd.DataFrame(index = pd.date_range('2017-07-01', periods=48, freq='1H'))
#Set value equal to 1 from 17:00 to 16:59 next day
df.loc['2017-07-01 17:00:00': '2017-07-02 16:59:59', 'Value'] = 1
print(df)
Output:
Value
2017-07-01 00:00:00 NaN
2017-07-01 01:00:00 NaN
2017-07-01 02:00:00 NaN
2017-07-01 03:00:00 NaN
2017-07-01 04:00:00 NaN
2017-07-01 05:00:00 NaN
2017-07-01 06:00:00 NaN
2017-07-01 07:00:00 NaN
2017-07-01 08:00:00 NaN
2017-07-01 09:00:00 NaN
2017-07-01 10:00:00 NaN
2017-07-01 11:00:00 NaN
2017-07-01 12:00:00 NaN
2017-07-01 13:00:00 NaN
2017-07-01 14:00:00 NaN
2017-07-01 15:00:00 NaN
2017-07-01 16:00:00 NaN
2017-07-01 17:00:00 1.0
2017-07-01 18:00:00 1.0
2017-07-01 19:00:00 1.0
2017-07-01 20:00:00 1.0
2017-07-01 21:00:00 1.0
2017-07-01 22:00:00 1.0
2017-07-01 23:00:00 1.0
2017-07-02 00:00:00 1.0
2017-07-02 01:00:00 1.0
2017-07-02 02:00:00 1.0
2017-07-02 03:00:00 1.0
2017-07-02 04:00:00 1.0
2017-07-02 05:00:00 1.0
2017-07-02 06:00:00 1.0
2017-07-02 07:00:00 1.0
2017-07-02 08:00:00 1.0
2017-07-02 09:00:00 1.0
2017-07-02 10:00:00 1.0
2017-07-02 11:00:00 1.0
2017-07-02 12:00:00 1.0
2017-07-02 13:00:00 1.0
2017-07-02 14:00:00 1.0
2017-07-02 15:00:00 1.0
2017-07-02 16:00:00 1.0
2017-07-02 17:00:00 NaN
2017-07-02 18:00:00 NaN
2017-07-02 19:00:00 NaN
2017-07-02 20:00:00 NaN
2017-07-02 21:00:00 NaN
2017-07-02 22:00:00 NaN
2017-07-02 23:00:00 NaN
Now use resample with base=17:
df.resample('24H', base=17).sum()
Output:
Value
2017-06-30 17:00:00 0.0
2017-07-01 17:00:00 24.0
2017-07-02 17:00:00 0.0
Update for minute sampling:
df = pd.DataFrame({'Value': 0}, index = pd.date_range('2018-10-01', '2018-10-03', freq='1T'))
df.loc['2018-10-01 15:00:00':'2018-10-02 18:59:50', 'Value'] = 1
df.resample('24H', base=17).agg(['sum','mean'])
Output:
Value
sum mean
2018-09-30 17:00:00 120 0.117647
2018-10-01 17:00:00 1440 1.000000
2018-10-02 17:00:00 120 0.285036
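Note that base is deprecated since pandas 1.1 in favour of offset (or origin). Applied back to the original question, a sketch covering the AVG/STD part as well, assuming DateTimeVal is a regular column in df:

df = df.set_index('DateTimeVal')
# Per-"day" (17:00 to 16:59) mean and std of Open
daily = df.resample('24H', offset='17h')['Open'].agg(['mean', 'std'])
# 7-day and 14-day rolling statistics over those daily means
daily['7D_mean'] = daily['mean'].rolling(7).mean()
daily['14D_mean'] = daily['mean'].rolling(14).mean()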
I have a Pandas series, time, of dates and times, like:
UTC:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
4 2015-01-01 04:00:00
Name: DT, dtype: datetime64[ns]
That I'd like to convert to another timezone:
time2 = time.dt.tz_localize('UTC').dt.tz_convert('Europe/Rome')
print("CET: ",'\n', time2)
CET:
0 2015-01-01 01:00:00+01:00
1 2015-01-01 02:00:00+01:00
2 2015-01-01 03:00:00+01:00
3 2015-01-01 04:00:00+01:00
4 2015-01-01 05:00:00+01:00
Name: DT, dtype: datetime64[ns, Europe/Rome]
But the result is not what I need. I want it in the form 2015-01-01 02:00:00 (the local time at UTC 01:00:00), not 2015-01-01 01:00:00+01:00.
How can I do that?
EDIT: While there is another question that deal with this issue (Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone), I think this question is more to the point, providing a clear and concise example for what appears a common problem.
It turns out that my question already has an answer here:
Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone
I just wasn't able to phrase my question correctly. Anyway, what works is:
time3 = time2.dt.tz_localize(None)
print("Naive: ",'\n', time3)
Naive:
0 2015-01-01 01:00:00
1 2015-01-01 02:00:00
2 2015-01-01 03:00:00
3 2015-01-01 04:00:00
4 2015-01-01 05:00:00
Name: DT, dtype: datetime64[ns]
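Equivalently, the whole round trip can be written as a single chain:

time2 = time.dt.tz_localize('UTC').dt.tz_convert('Europe/Rome').dt.tz_localize(None)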
I have a dataframe with a RangeIndex, timestamps in the first column and several thousands hourly temperature observations in the second.
It is easy enough to group the observations in blocks of 24 and find each day's Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes it impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate with a daily Grouper, using idxmax and idxmin to get the datetimes of the max and min temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
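For readability, the generic column names produced by agg can be renamed afterwards, e.g.:

df = df.rename(columns={'idxmax': 'Tmax_time', 'idxmin': 'Tmin_time',
                        'max': 'Tmax', 'min': 'Tmin'})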