I have a data frame with multiple users and time zones, like so:
import pandas as pd

cols = ['user', 'zone_name', 'utc_datetime']
data = [
    [1, 'Europe/Amsterdam', pd.to_datetime('2019-11-13 11:14:15')],
    [2, 'Europe/London', pd.to_datetime('2019-11-13 11:14:15')],
]
df = pd.DataFrame(data, columns=cols)
Based on this other post, I apply the following to get each user's local datetime:
df['local_datetime'] = df.groupby('zone_name')[
    'utc_datetime'
].transform(lambda x: x.dt.tz_localize(x.name))
Which outputs this:
user zone_name utc_datetime local_datetime
1 Europe/Amsterdam 2019-11-13 11:14:15 2019-11-13 11:14:15+01:00
2 Europe/London 2019-11-13 11:14:15 2019-11-13 11:14:15+00:00
However, the local_datetime column is of dtype object, and I cannot find a way to get it as datetime64[ns] and in the following format (desired output):
user zone_name utc_datetime local_datetime
1 Europe/Amsterdam 2019-11-13 11:14:15 2019-11-13 12:14:15
2 Europe/London 2019-11-13 11:14:15 2019-11-13 11:14:15
I think you need Series.dt.tz_convert in the lambda function:
df['local_datetime'] = (pd.to_datetime(df.groupby('zone_name')['utc_datetime']
                          .transform(lambda x: x.dt.tz_localize('UTC').dt.tz_convert(x.name))
                          .astype(str).str[:-6]))
print(df)
user zone_name utc_datetime local_datetime
0 1 Europe/Amsterdam 2019-11-13 11:14:15 2019-11-13 12:14:15
1 2 Europe/London 2019-11-13 11:14:15 2019-11-13 11:14:15
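If you want to avoid the round-trip through strings, a variant of the same transform can drop the UTC offset with tz_localize(None) instead of slicing it off; the column then comes out as datetime64[ns] directly. A sketch using the question's sample frame:

```python
import pandas as pd

cols = ['user', 'zone_name', 'utc_datetime']
data = [
    [1, 'Europe/Amsterdam', pd.to_datetime('2019-11-13 11:14:15')],
    [2, 'Europe/London', pd.to_datetime('2019-11-13 11:14:15')],
]
df = pd.DataFrame(data, columns=cols)

# convert each group to its own zone, then strip the tz info entirely
df['local_datetime'] = df.groupby('zone_name')['utc_datetime'].transform(
    lambda x: x.dt.tz_localize('UTC').dt.tz_convert(x.name).dt.tz_localize(None)
)
print(df['local_datetime'].dtype)  # datetime64[ns]
```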
A relatively shorter answer using DataFrame.apply:
df['local_datetime'] = df.apply(lambda x: x.utc_datetime.tz_localize(tz = "UTC").tz_convert(x.zone_name), axis = 1)
print(df)
user zone_name utc_datetime local_datetime
0 1 Europe/Amsterdam 2019-11-13 11:14:15 2019-11-13 12:14:15+01:00
1 2 Europe/London 2019-11-13 11:14:15 2019-11-13 11:14:15+00:00
If you want to remove the time zone information, you can localize the times by passing None:
df['local_datetime'] = df.apply(lambda x: x.utc_datetime.tz_localize(tz = "UTC").tz_convert(x.zone_name).tz_localize(None), axis = 1)
print(df)
user zone_name utc_datetime local_datetime
0 1 Europe/Amsterdam 2019-11-13 11:14:15 2019-11-13 12:14:15
1 2 Europe/London 2019-11-13 11:14:15 2019-11-13 11:14:15
I have a df describing transactions like this:
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
1 1.457423e+09 1821.0 1732
2 1.457389e+09 35577.0 18397
3 1.457425e+09 2.0 0
[...]
I assume the charged_energy is linear over the transaction. I would like to transform it into a time series with the granularity of a day; charged_energy within a day should be summed up, as should duration.
day sum_duration_in_s sum_charged_energy_in_wh
2016-03-16 00:00 123 456
2016-03-17 00:00 456 789
2016-03-18 00:00 789 012
[...]
Any idea? I am struggling with the boundaries between days. For example, this transaction
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
500 1620777300 600 1000
should be divided equally into
day sum_duration_in_s sum_charged_energy_in_wh
2021-05-11 00:00 300 500
2021-05-12 00:00 300 500
This did it for me. Slow, but it works:
import pandas as pd
from datetime_truncate import truncate  # third-party package for flooring datetimes

day_in_s = 60 * 60 * 24
rows = []  # collect dicts and build the frame once; DataFrame.append was removed in pandas 2.0
for row in df.itertuples(index=False):
    start = row.start_in_s_since_epoch
    time = row.duration_in_s
    energy_per_s = row.charged_energy_in_wh / row.duration_in_s
    # seconds from the transaction start until the next midnight
    till_midnight_in_s = truncate(pd.to_datetime(start + day_in_s, unit='s'), 'day').timestamp() - start
    rest_in_s = time - till_midnight_in_s
    rows.append({'day': truncate(pd.to_datetime(start, unit='s'), 'day'),
                 'sum_duration_in_s': min(time, till_midnight_in_s),
                 'sum_charged_energy_in_wh': min(time, till_midnight_in_s) * energy_per_s})
    while rest_in_s > 0:
        start += day_in_s
        rows.append({'day': truncate(pd.to_datetime(start, unit='s'), 'day'),
                     'sum_duration_in_s': min(rest_in_s, day_in_s),
                     'sum_charged_energy_in_wh': min(rest_in_s, day_in_s) * energy_per_s})
        rest_in_s -= day_in_s

df_tmp = pd.DataFrame(rows)
df_ts = df_tmp.groupby('day').agg({'sum_charged_energy_in_wh': 'sum',
                                   'sum_duration_in_s': 'sum'}).sort_index()
df_ts = df_ts.asfreq('D', fill_value=0)
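For comparison, a sketch of the same splitting idea using only pandas (no datetime_truncate): cut each transaction at every midnight it crosses, spread the energy linearly, and aggregate. split_by_day is a made-up helper name, and it assumes the question's column names and a positive duration_in_s.

```python
import pandas as pd

def split_by_day(df):
    """Split each transaction into per-day segments and aggregate per day
    (hypothetical helper; assumes duration_in_s > 0)."""
    rows = []
    for r in df.itertuples(index=False):
        start = pd.to_datetime(r.start_in_s_since_epoch, unit='s')
        end = start + pd.Timedelta(seconds=r.duration_in_s)
        power = r.charged_energy_in_wh / r.duration_in_s  # Wh per second
        # cut points: the endpoints plus every midnight strictly inside the interval
        first_midnight = start.normalize() + pd.Timedelta(days=1)
        edges = sorted({start, end, *pd.date_range(first_midnight, end, freq='D')})
        for a, b in zip(edges[:-1], edges[1:]):
            secs = (b - a).total_seconds()
            rows.append({'day': a.normalize(),
                         'sum_duration_in_s': secs,
                         'sum_charged_energy_in_wh': secs * power})
    out = pd.DataFrame(rows).groupby('day').sum().sort_index()
    return out.asfreq('D', fill_value=0)

# the boundary-crossing transaction from the question
tx = pd.DataFrame([{'start_in_s_since_epoch': 1620777300,
                    'duration_in_s': 600,
                    'charged_energy_in_wh': 1000}])
print(split_by_day(tx))
```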
So, I want to calculate the time differences; the file looks something like this:
id  message_id     send_date            status
0   5f74b996a2b7e  2020-10-01 00:00:07  sent
1   5f74b996a2b7e  2020-10-01 00:00:09  delivered
2   5f74b99e85b3c  2020-10-02 02:00:14  sent
3   5f74b99e85b3c  2020-10-02 02:01:16  delivered
4   5f74b99e85b3c  2020-10-02 08:06:49  read
5   5f74b996a2b7e  2020-10-02 15:16:32  read
6   5f9d97ff1af9e  2020-10-14 13:45:43  sent
7   5f9d97ff1af9e  2020-10-14 13:45:45  delivered
8   5f9d97ff1af9e  2020-10-14 13:50:48  read
9   5f74b9a35b6c5  2020-10-16 19:01:19  sent
10  5f74b9a35b6c5  2020-10-16 19:01:25  deleted
The columns: id is an auto-incrementing index, message_id is unique to each message, send_date is the timestamp, and status is the message status (one of five: sent, delivered, read, failed, and deleted).
I want to calculate the time difference from sent to delivered and, where the message was delivered, from delivered to read.
I know something like this can be handy, but I wasn't sure how to apply it per message_id:
from datetime import datetime
s1 = '2020-10-14 13:45:45'
s2 = '2020-10-14 13:50:48' # for example
FMT = '%Y-%m-%d %H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
print(tdelta)
Ref: https://stackoverflow.com/questions/3096953/how-to-calculate-the-time-interval-between-two-time-strings
The expected output would be:
   message_id     delivered_diff  read_diff        deleted_diff
0  5f74b996a2b7e  00:00:02        1 day, 15:16:23
1  5f74b99e85b3c  00:01:02        6:05:33
2  5f9d97ff1af9e  00:00:02        0:05:03
3  5f74b9a35b6c5                                   0:00:06
You can do this with pandas and datetime. The code is commented for clarity and was written with Python 3.8.
import datetime
import pandas as pd

def time_delta(a, b):
    # calculate the timedelta between two timestamp strings
    fmt = '%Y-%m-%d %H:%M:%S'
    return datetime.datetime.strptime(b, fmt) - datetime.datetime.strptime(a, fmt)

def calculate_diff(val, first_status, second_status):
    # check that both statuses exist for this message
    if not val['status'].str.contains(first_status).any() \
            or not val['status'].str.contains(second_status).any():
        return ''
    a = val.loc[val['status'] == first_status, 'send_date'].values[0]   # first send_date for the first status
    b = val.loc[val['status'] == second_status, 'send_date'].values[0]  # first send_date for the second status
    return time_delta(a, b)  # calculate the delta

df = pd.read_csv('test.csv', sep=';')  # load csv file with ; as separator
grouped = df.groupby('message_id')     # group by message ids

rows = []  # collect the results per group; DataFrame.append is deprecated, so build the frame once at the end
for message_id, values in grouped:
    rows.append({
        'message_id': message_id,
        'delivered_diff': calculate_diff(values, 'sent', 'delivered'),  # delta between sent and delivered
        'read_diff': calculate_diff(values, 'delivered', 'read'),       # delta between delivered and read
        'deleted_diff': calculate_diff(values, 'sent', 'deleted'),      # delta between sent and deleted
    })

final_df = pd.DataFrame(rows, columns=['message_id', 'delivered_diff', 'read_diff', 'deleted_diff'])
# print final result
print(final_df)
The result:
      message_id   delivered_diff        read_diff     deleted_diff
0  5f74b996a2b7e  0 days 00:00:02  1 days 15:16:23
1  5f74b99e85b3c  0 days 00:01:02  0 days 06:05:33
2  5f74b9a35b6c5                                    0 days 00:00:06
3  5f9d97ff1af9e  0 days 00:00:02  0 days 00:05:03
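A loop-free alternative sketch: since each (message_id, status) pair occurs at most once, you can pivot send_date on status and subtract whole columns. The inline CSV below just reproduces the question's sample; missing statuses come out as NaT rather than an empty string.

```python
import io
import pandas as pd

csv = io.StringIO("""id;message_id;send_date;status
0;5f74b996a2b7e;2020-10-01 00:00:07;sent
1;5f74b996a2b7e;2020-10-01 00:00:09;delivered
2;5f74b99e85b3c;2020-10-02 02:00:14;sent
3;5f74b99e85b3c;2020-10-02 02:01:16;delivered
4;5f74b99e85b3c;2020-10-02 08:06:49;read
5;5f74b996a2b7e;2020-10-02 15:16:32;read
6;5f9d97ff1af9e;2020-10-14 13:45:43;sent
7;5f9d97ff1af9e;2020-10-14 13:45:45;delivered
8;5f9d97ff1af9e;2020-10-14 13:50:48;read
9;5f74b9a35b6c5;2020-10-16 19:01:19;sent
10;5f74b9a35b6c5;2020-10-16 19:01:25;deleted
""")
df = pd.read_csv(csv, sep=';', parse_dates=['send_date'])

# one row per message, one column per status
wide = df.pivot(index='message_id', columns='status', values='send_date')
result = pd.DataFrame({
    'delivered_diff': wide['delivered'] - wide['sent'],
    'read_diff': wide['read'] - wide['delivered'],
    'deleted_diff': wide['deleted'] - wide['sent'],
})
print(result)
```

Note that pivot requires unique index/column pairs, which holds here because a message reaches each status at most once.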
import pandas as pd

final_dict = []
data = pd.read_csv('data.csv', names=['id', 'unique_id', 'time', 'status'])
data['time'] = pd.to_datetime(data['time'])
# data.info()

for name, group in data.groupby('unique_id'):
    for _, row in group.iterrows():
        if row['status'] == 'sent':
            sent = row['time']  # remember when this message was sent
        elif row['status'] == 'read':
            final_dict.append({row['unique_id']: {'read': str(row['time'] - sent)}})
        elif row['status'] == 'delivered':
            final_dict.append({row['unique_id']: {'delivered': str(row['time'] - sent)}})
        elif row['status'] == 'deleted':
            final_dict.append({row['unique_id']: {'deleted': str(row['time'] - sent)}})
print(final_dict)
Data Sample for CSV
The input is a range of dates, for which we need to find the starting and ending date of the month for every date in between the interval. An example is given below.
input:
start date: 2018-6-15
end date: 2019-3-20
desired output:
[
["month starting date","month ending date"],
["2018-6-15","2018-6-30"],
["2018-7-1","2018-7-31"],
["2018-8-1","2018-8-31"],
["2018-9-1","2018-9-30"],
["2018-10-1","2018-10-31"],
["2018-11-1","2018-11-30"],
["2018-12-1","2018-12-31"],
["2019-1-1","2019-1-31"],
["2019-2-1","2019-2-28"],
["2019-3-1","2019-3-20"]
]
An option using pandas: create a date_range from the start to the end date and extract the month numbers from it as a pandas.Series. Compare that series with itself shifted one element forward and one element backward to get boolean masks marking where the month changes (ne). With those masks you can build a DataFrame to work with, or a list of lists if you like.
Ex:
import pandas as pd
start_date, end_date = '2018-6-15', '2019-3-20'
dtrange = pd.date_range(start=start_date, end=end_date, freq='d')
months = pd.Series(dtrange.month)
starts, ends = months.ne(months.shift(1)), months.ne(months.shift(-1))
df = pd.DataFrame({'month_starting_date': dtrange[starts].strftime('%Y-%m-%d'),
'month_ending_date': dtrange[ends].strftime('%Y-%m-%d')})
# df
# month_starting_date month_ending_date
# 0 2018-06-15 2018-06-30
# 1 2018-07-01 2018-07-31
# 2 2018-08-01 2018-08-31
# 3 2018-09-01 2018-09-30
# 4 2018-10-01 2018-10-31
# 5 2018-11-01 2018-11-30
# 6 2018-12-01 2018-12-31
# 7 2019-01-01 2019-01-31
# 8 2019-02-01 2019-02-28
# 9 2019-03-01 2019-03-20
# as a list of lists:
l = [df.columns.values.tolist()] + df.values.tolist()
# l
# [['month_starting_date', 'month_ending_date'],
# ['2018-06-15', '2018-06-30'],
# ['2018-07-01', '2018-07-31'],
# ['2018-08-01', '2018-08-31'],
# ['2018-09-01', '2018-09-30'],
# ['2018-10-01', '2018-10-31'],
# ['2018-11-01', '2018-11-30'],
# ['2018-12-01', '2018-12-31'],
# ['2019-01-01', '2019-01-31'],
# ['2019-02-01', '2019-02-28'],
# ['2019-03-01', '2019-03-20']]
Note that I use strftime when I create the DataFrame. Do this if you want the output to be of dtype string. If you want to continue to work with datetime objects (timestamps), don't apply strftime.
This code is simple and uses only standard Python packages.
import calendar
from datetime import timedelta

def get_time_range_list(start_date, end_date):
    date_range_list = []
    while True:
        month_end = start_date.replace(day=calendar.monthrange(start_date.year, start_date.month)[1])
        next_month_start = month_end + timedelta(days=1)
        if next_month_start <= end_date:
            date_range_list.append((start_date, month_end))
            start_date = next_month_start
        else:
            date_range_list.append((start_date, end_date))
            return date_range_list
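A quick usage check with the question's dates; the function is repeated here so the snippet runs on its own.

```python
import calendar
from datetime import date, timedelta

def get_time_range_list(start_date, end_date):
    # same function as above
    date_range_list = []
    while True:
        month_end = start_date.replace(day=calendar.monthrange(start_date.year, start_date.month)[1])
        next_month_start = month_end + timedelta(days=1)
        if next_month_start <= end_date:
            date_range_list.append((start_date, month_end))
            start_date = next_month_start
        else:
            date_range_list.append((start_date, end_date))
            return date_range_list

ranges = get_time_range_list(date(2018, 6, 15), date(2019, 3, 20))
print(ranges[0])   # (datetime.date(2018, 6, 15), datetime.date(2018, 6, 30))
print(ranges[-1])  # (datetime.date(2019, 3, 1), datetime.date(2019, 3, 20))
```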
This is quant work. In previous work I filtered out some desired stocks (candidates) with technical analysis based on moving averages, MACD, KDJ, etc. Now I want to check my candidates' fundamental values, in this case ROE. Here is my code:
import os
import pandas as pd

root_path = '.\\__fundamentals__'
df = pd.read_csv("C:\\candidates.csv", encoding='GBK')  # 14 candidates this time
for code in list(df['code']):
    i = str(code).zfill(6)
    for root, dirs, files in os.walk(root_path):
        for csv in files:
            if csv.startswith('{}.csv'.format(i)):
                csv_path = os.path.join(root, csv)  # look up the DuPont values for each candidate
                df2 = pd.read_csv(csv_path, encoding='GBK')
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float)/100  # 净资产收益率 = return on equity
                ROE = [df2['ROE'].mean().round(decimals=3)]
                df3 = pd.DataFrame({'ROE_Mean': ROE})
                print(df3)
Here is the output:
C:\Users\Mike_Leigh\.conda\envs\LEIGH\python.exe "P:/LEIGH PYTHON/Codes/Quant/analyze_stock.py"
ROE_Mean
0 -0.218
ROE_Mean
0 0.121
ROE_Mean
0 0.043
ROE_Mean
0 0.197
ROE_Mean
0 0.095
ROE_Mean
0 0.085
...
ROE_Mean
0 0.178
Process finished with exit code 0
my desired output would be like this:
ROE_Mean
0 -0.218
1 0.121
2 0.043
3 0.197
4 0.095
5 0.085
...
14 0.178
Would you please give me a hint on this? Thanks a lot, much appreciated!
Actually, solving the issue wasn't that bad.
First, make a list outside the loop, at the very outside, before df is created:
roe_avg = []
df = pd.read_csv("C:\\candidates.csv", encoding='GBK')
....
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float) / 100
                ROE_avg = df2['ROE'].mean().round(decimals=3)
                roe_avg.append(ROE_avg)
df['ROE_avg'] = roe_avg
print(df)
Output:
name code ROE_avg
1 仙鹤股份 603733 0.121
3 泸州老窖 568 0.197
4 兴蓉环境 598 0.095
...
15 濮阳惠成 300481 0.148
16 中科创达 300496 0.101
17 森霸传感 300701 0.178
Process finished with exit code 0
Thanks to @filippo.
I am trying to build a function which transforms a dataframe based on certain conditions, but I am getting a SyntaxError. I am not sure what I am doing wrong. Any help will be appreciated. Thank you!
import pandas as pd
from datetime import datetime
from datetime import timedelta
df=pd.read_csv('example1.csv')
df.columns =(['dtime','kW'])
df['dtime'] = pd.to_datetime(df['dtime'])
df.head(5)
dtime kW
0 2019-08-27 23:30:00 0.016
1 2019-08-27 23:00:00 0
2 2019-08-27 22:30:00 0.016
3 2019-08-27 22:00:00 0.016
4 2019-08-27 21:30:00 0
def transdf(df):
    a = df.loc[0, 'dtime']
    b = df.loc[1, 'dtime']
    c = a - b
    minutes = c.total_seconds() / 60
    d = int(minutes)
    # d can only be 15, 30 or 60
    if d == 15:
        return df=df.set_index('dtime').asfreq('-15T', fill_value='Missing')
    elif d == 30:
        return df=df.set_index('dtime').asfreq('-30T', fill_value='Missing')
    elif d == 60:
        return df=df.set_index('dtime').asfreq('-60T', fill_value='Missing')
    else:
        return None
First, it is cleaner to have a single return statement after the if/elif chain, at the end of your function; inside each case just update the value of df. return takes an expression, not an assignment, so "return df = ..." is invalid syntax; that's why you are getting errors.
def transform(df):
    a = df.loc[0, 'dtime']
    b = df.loc[1, 'dtime']
    c = a - b
    minutes = c.total_seconds() / 60
    d = int(minutes)
    # d can only be 15, 30 or 60
    if d == 15:
        df = df.set_index('dtime').asfreq('-15T', fill_value='Missing')
    elif d == 30:
        df = df.set_index('dtime').asfreq('-30T', fill_value='Missing')
    elif d == 60:
        df = df.set_index('dtime').asfreq('-60T', fill_value='Missing')
    else:
        df = None
    return df