Pandas: How to Convert UTC Time to Local Time? - python-3.x

I have a Pandas series time of dates and times, like:
UTC:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
4 2015-01-01 04:00:00
Name: DT, dtype: datetime64[ns]
That I'd like to convert to another timezone:
time2 = time.dt.tz_localize('UTC').dt.tz_convert('Europe/Rome')
print("CET: ",'\n', time2)
CET:
0 2015-01-01 01:00:00+01:00
1 2015-01-01 02:00:00+01:00
2 2015-01-01 03:00:00+01:00
3 2015-01-01 04:00:00+01:00
4 2015-01-01 05:00:00+01:00
Name: DT, dtype: datetime64[ns, Europe/Rome]
But, the result is not what I need. I want it in the form 2015-01-01 02:00:00 (the local time at UTC 01:00:00), not 2015-01-01 01:00:00+01:00.
How can I do that?
EDIT: While there is another question that deals with this issue (Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone), I think this question is more to the point, providing a clear and concise example of what appears to be a common problem.

It turns out that my question already has an answer here:
Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone
I just wasn't able to phrase my question correctly. Anyway, what works is to drop the timezone information after converting, which keeps the local wall-clock time:
time3 = time2.dt.tz_localize(None)
print("Naive: ",'\n', time3)
Naive:
0 2015-01-01 01:00:00
1 2015-01-01 02:00:00
2 2015-01-01 03:00:00
3 2015-01-01 04:00:00
4 2015-01-01 05:00:00
Name: DT, dtype: datetime64[ns]
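For reference, the two steps can be chained into one expression. The following is a minimal sketch on made-up sample data that mirrors the series in the question; the variable names utc_naive and local_naive are illustrative only.

```python
import pandas as pd

# Hypothetical sample data: five naive timestamps that are known to be UTC.
utc_naive = pd.Series(pd.date_range("2015-01-01", periods=5, freq="h"), name="DT")

local_naive = (
    utc_naive.dt.tz_localize("UTC")   # declare the naive values to be UTC
    .dt.tz_convert("Europe/Rome")     # shift to the target timezone
    .dt.tz_localize(None)             # drop the tz info, keep the local wall time
)
print(local_naive)
```

Passing None to tz_localize removes the timezone while preserving the local time, so the result is back to a plain datetime64 dtype.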

Related

How to extract date in which the hour of the peak value occurred?

I have an hourly time series that starts in 2013 and ends in 2020, as below, and for each year I want to plot only the day on which the system load reached its peak:
date_time system_load
2013-01-01 00:00:00 1.0
2013-01-01 01:00:00 0.9
2013-01-01 02:00:00 0.5
...
2020-12-31 21:00:00 2.1
2020-12-31 22:00:00 1.8
2020-12-31 23:00:00 0.8
The intended dataframe has one day (24 hours) per year:
date_time system_load
2013-07-09 00:00:00 3.1
2013-07-09 02:00:00 3.0
2013-07-09 03:00:00 4.8
2013-07-09 04:00:00 2.6
...
2013-07-09 21:00:00 3.7
2013-07-09 22:00:00 3.9
2013-07-09 23:00:00 5.1
2014-09-09 00:00:00 4.1
2014-09-09 02:00:00 5.3
2014-09-09 03:00:00 6.0
2014-09-09 04:00:00 4.8
...
2014-09-09 21:00:00 3.5
2014-09-09 22:00:00 2.6
2014-09-09 23:00:00 1.6
...
...
2020-06-01 00:00:00 4.2
2020-06-01 02:00:00 3.6
2020-06-01 03:00:00 3.9
2020-06-01 04:00:00 2.8
...
2020-06-01 21:00:00 2.7
2020-06-01 22:00:00 4.8
2020-06-01 23:00:00 3.8
1. Get the date and year parts from the date_time column.
2. Group by the year column and find the row containing the max value of the system_load column in each group.
3. Select all rows from the original dataframe whose date matches a date on which system_load is at its yearly max.
4. Plot the bar chart.
df['date_time'] = pd.to_datetime(df['date_time']) # Ensure the `date_time` column is datetime type
df['just_date'] = df['date_time'].dt.date
df['year'] = df['date_time'].dt.year
idx = df.groupby(['year'])['system_load'].transform(max) == df['system_load']
df[df['just_date'].isin(df[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
If several days in one year share the same max system_load value, the above code returns all of them. If you want to keep only the first such day, you can use idxmax() on the grouped column:
idx = df.groupby(['year'])['system_load'].idxmax()
df[df['just_date'].isin(df.loc[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
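The idxmax variant can be checked end to end on synthetic data. This is only a sketch: the random values, the two-year range, and the names peak_rows, peak_days and peak_df are assumptions made for illustration, and plotting is left out.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data covering two full years (values are made up).
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", "2014-12-31 23:00", freq="h")
df = pd.DataFrame({"date_time": idx, "system_load": rng.random(len(idx))})

# Row label of the first yearly maximum, one label per year.
peak_rows = df.groupby(df["date_time"].dt.year)["system_load"].idxmax()

# Calendar days (midnight timestamps) on which those maxima fall.
peak_days = df.loc[peak_rows, "date_time"].dt.normalize()

# Keep every hourly row that belongs to one of the peak days.
peak_df = df[df["date_time"].dt.normalize().isin(peak_days)]
print(peak_df.groupby(peak_df["date_time"].dt.year).size())
```

Using dt.normalize() instead of dt.date keeps the column as datetime64, which avoids comparing Python date objects row by row.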
Here's an approach to solve your problem:
Let sourcedf contain the input data in the form of two columns, 'Date_Time' and 'Load'.
Then do the following:
sourcedf['Date'] = sourcedf.apply(lambda row: row['Date_Time'].date(), axis = 1)
dfg = sourcedf.groupby('Date')
ldList = dfg['Load'].max().to_list()  # daily maximum load, one entry per date
tgtDate = dfg.max().index.to_list()[ldList.index(max(ldList))]  # date with the highest daily maximum
dfout = sourcedf[sourcedf['Date'] == tgtDate]
dfout will then contain just the rows for the date on which the overall max load was experienced. To get one such day per year, apply the same steps to each year's data separately.

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes it impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then group by day with a Grouper and aggregate with idxmax and idxmin to get the timestamps of the max and min temperature, alongside max and min for the values themselves:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
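If the original RangeIndex labels are wanted rather than the timestamps, one option is to leave the index untouched and group on the calendar date instead. The sketch below uses a shortened, hand-picked subset of the sample data; by_day, max_rows and min_rows are illustrative names.

```python
import pandas as pd

# Shortened version of the sample data, three observations per day.
df = pd.DataFrame({
    "DT": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 05:00",
        "2015-01-02 00:00", "2015-01-02 03:00", "2015-01-02 05:00",
    ]),
    "T-C": [-2.5, -2.1, -2.0, 1.1, 0.5, 0.7],
})

# Group on the calendar date; the RangeIndex is kept as-is.
by_day = df.groupby(df["DT"].dt.date)["T-C"]
max_rows = by_day.idxmax()  # original integer label of each day's maximum
min_rows = by_day.idxmin()  # original integer label of each day's minimum

# Look the full rows (timestamp and value) back up in the original frame.
print(df.loc[max_rows])
print(df.loc[min_rows])
```

Because idxmax and idxmin return index labels, df.loc recovers both the timestamp and the temperature in one step.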

Groupby more than 1 column and then average over another column

I have the following dataset:
cod date value
0 1O8 2015-01-01 00:00:00 2.1
1 1O8 2015-01-01 01:00:00 2.3
2 1O8 2015-01-01 02:00:00 3.5
3 1O8 2015-01-01 03:00:00 4.5
4 1O8 2015-01-01 04:00:00 4.4
5 1O8 2015-01-01 05:00:00 3.2
6 1O9 2015-01-01 00:00:00 1.4
7 1O9 2015-01-01 01:00:00 8.6
8 1O9 2015-01-01 02:00:00 3.3
10 1O9 2015-01-01 03:00:00 1.5
11 1O9 2015-01-01 04:00:00 2.4
12 1O9 2015-01-01 05:00:00 7.2
I want to aggregate by cod and date (month) and take the average of value, like this:
value
cod date
1O8 2015-01-01 3.3
1O9 2015-01-01 4.9
My data has the following dtypes: object(1), datetime64[ns](1), float64(1)
I tried to aggregate with the .groupby() function:
df.groupby(['cod', 'date', 'value']).size().reset_index().groupby('value').mean()
But it didn't produce the correct result.
Use a Grouper, which buckets the date column by month start ("MS") while grouping on cod as usual:
df.groupby(["cod", pd.Grouper(key="date", freq="MS")]).mean()
Extra info on pbpython.com
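To see what the Grouper does across a month boundary, here is a small sketch on made-up data (the values and the February rows are assumptions added for illustration; the question's sample only covers January).

```python
import pandas as pd

# Made-up data: two codes, observations in January and February 2015.
df = pd.DataFrame({
    "cod": ["1O8"] * 3 + ["1O9"] * 3,
    "date": pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 01:00", "2015-02-01 00:00"] * 2
    ),
    "value": [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
})

# Group by code and by month start, then average the value column.
monthly = df.groupby(["cod", pd.Grouper(key="date", freq="MS")])["value"].mean()
print(monthly)
```

Selecting ["value"] before mean() makes the aggregated column explicit, which matters once the frame contains other non-numeric columns.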

Transform CSV structure with pandas dataframe

My CSV contains rows such as:
entryTime entryPrice exitTime exitPrice
06/01/2009 04:00 93.565 06/01/2009 06:00 93.825
I want to load them into a Dataframe that will have two rows per CSV row, in the following format:
datetime signal price
06/01/2009 04:00 entry 93.565
06/01/2009 06:00 exit 93.825
indexed by datetime column. What would be a fast way to do it?
Use numpy.tile with numpy.ravel:
print (df)
entryTime entryPrice exitTime exitPrice
0 01/01/2009 04:00 90.565 02/01/2009 06:00 91.825
1 03/01/2009 04:00 92.565 04/01/2009 06:00 93.825
2 05/01/2009 04:00 94.565 06/01/2009 06:00 95.825
3 07/01/2009 04:00 96.565 08/01/2009 07:00 97.825
4 09/01/2009 04:00 98.565 10/01/2009 06:00 99.825
a = np.tile(['entry','exit'], len(df))
b = df[['entryTime','exitTime']].values.ravel()
c = df[['entryPrice','exitPrice']].values.ravel()
df = pd.DataFrame({'price':c, 'signal':a},
index=pd.to_datetime(b),
columns=['signal','price'])
print (df)
signal price
2009-01-01 04:00:00 entry 90.565
2009-02-01 06:00:00 exit 91.825
2009-03-01 04:00:00 entry 92.565
2009-04-01 06:00:00 exit 93.825
2009-05-01 04:00:00 entry 94.565
2009-06-01 06:00:00 exit 95.825
2009-07-01 04:00:00 entry 96.565
2009-08-01 07:00:00 exit 97.825
2009-09-01 04:00:00 entry 98.565
2009-10-01 06:00:00 exit 99.825
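Note that the output above parses 02/01/2009 as 1 February, i.e. month first. If the CSV is actually day-first, passing an explicit format avoids the ambiguity. The sketch below assumes day/month/year and uses a made-up second row; signal, times, prices and long_df are illustrative names.

```python
import numpy as np
import pandas as pd

# In-memory stand-in for the CSV; the second row is invented for illustration.
df = pd.DataFrame({
    "entryTime": ["06/01/2009 04:00", "07/01/2009 04:00"],
    "entryPrice": [93.565, 94.100],
    "exitTime": ["06/01/2009 06:00", "07/01/2009 09:00"],
    "exitPrice": [93.825, 94.300],
})

# Interleave entry/exit values row by row: ravel() flattens in row-major order.
signal = np.tile(["entry", "exit"], len(df))
times = df[["entryTime", "exitTime"]].to_numpy().ravel()
prices = df[["entryPrice", "exitPrice"]].to_numpy().ravel()

# Assumption: dates are day/month/year; adjust the format if they are month-first.
index = pd.to_datetime(times, format="%d/%m/%Y %H:%M")
long_df = pd.DataFrame({"signal": signal, "price": prices}, index=index)
long_df.index.name = "datetime"
print(long_df)
```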

Setting start time from previous night without dates from CSV using pandas

I would like to run time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can I set the time so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: pandas adds the date automatically when parsing the times (it shows up when df['Time'] is displayed), so I used the same date-plus-time format for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should consider your timestamps as pd.Timedeltas and add a day to the samples before your start time.
Create some example data:
import numpy as np
import pandas as pd
d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])  # use numpy directly; pd.np is not available in recent pandas
df.to_csv('data.csv')
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read in, make index a timedelta, add a day to timedeltas before the start time, then assign back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.DatetimeIndex(s + pd.Timestamp(0))  # epoch + offset; recent pandas may reject pd.to_datetime on timedeltas
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
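The whole procedure can also be done in one pass by adding the timedeltas directly to the chosen start date. This is a sketch without the CSV round trip; the five sample times, the integer values, and the names offsets and start_day are assumptions for illustration.

```python
import pandas as pd

# Time-only labels for an overnight recording (shortened, made-up sample).
times = ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "09:00:00"]
df = pd.DataFrame({"value": range(len(times))}, index=times)

# Treat each time as an offset from midnight.
offsets = pd.Series(pd.to_timedelta(df.index))

# Times earlier than the 22:00 start belong to the following day.
offsets[offsets < pd.Timedelta("22:00:00")] += pd.Timedelta(days=1)

# Anchor the offsets to the (assumed) date of the first evening.
start_day = pd.Timestamp("2015-06-06")
df.index = pd.DatetimeIndex(start_day + offsets)
print(df)
```

With a monotonically increasing DatetimeIndex, plots start at 22:00 and no longer wrap around midnight.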
