Transform df to time series

I have a df describing transactions like
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
1 1.457423e+09 1821.0 1732
2 1.457389e+09 35577.0 18397
3 1.457425e+09 2.0 0
[...]
I assume the charged_energy is linear through the transaction. I would like to transform it to a time series with the granularity of a day. charged_energy within a day should be summed up as well as duration.
day sum_duration_in_s sum_charged_energy_in_wh
2016-03-16 00:00 123 456
2016-03-17 00:00 456 789
2016-03-18 00:00 789 012
[...]
Any idea? I am struggling with the borders between days. This transaction with
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
500 1620777300 600 1000
should be equally divided to
day sum_duration_in_s sum_charged_energy_in_wh
2021-05-11 00:00 300 500
2021-05-12 00:00 300 500

This did it for me. Slow af but works (indented properly, with the groupby column fixed from 'date' to 'day', and rows collected in a list instead of the deprecated DataFrame.append):
import pandas as pd
from datetime_truncate import truncate

day_in_s = 60 * 60 * 24
rows = []
for index, row in df.iterrows():
    start = row.start_in_s_since_epoch
    time = row.duration_in_s
    energy_per_s = row.charged_energy_in_wh / row.duration_in_s
    # seconds from the transaction start until the next midnight
    till_midnight_in_s = truncate(pd.to_datetime(start + day_in_s, unit='s'), 'day').timestamp() - start
    rest_in_s = time - till_midnight_in_s
    rows.append({'day': truncate(pd.to_datetime(start, unit='s'), 'day'),
                 'sum_duration_in_s': min(time, till_midnight_in_s),
                 'sum_charged_energy_in_wh': min(time, till_midnight_in_s) * energy_per_s})
    while rest_in_s > 0:
        start += day_in_s
        rows.append({'day': truncate(pd.to_datetime(start, unit='s'), 'day'),
                     'sum_duration_in_s': min(rest_in_s, day_in_s),
                     'sum_charged_energy_in_wh': min(rest_in_s, day_in_s) * energy_per_s})
        rest_in_s = rest_in_s - day_in_s
df_tmp = pd.DataFrame(rows)
df_ts = df_tmp.groupby('day').agg({'sum_charged_energy_in_wh': 'sum',
                                   'sum_duration_in_s': 'sum'}).sort_index()
df_ts = df_ts.asfreq('D', fill_value=0)


Subtracting specific values from column in pandas

I am new to the pandas library.
I am working on a data set which looks like this:
Suppose I want to subtract from the **point score** column in the table:
I want to subtract 100 if the score is below 1000,
and subtract 200 if the score is 1000 or above. How do I do this?
code :
import pandas as pd
df = pd.read_csv("files/Soccer_Football Clubs Ranking.csv")
df.head(4)
Use:
import numpy as np
np.where(df['point score']<1000, df['point score']-100, df['point score']-200)
Demonstration:
test = pd.DataFrame({'point score':[14001,200,1500,750]})
np.where(test['point score']<1000, test['point score']-100, test['point score']-200)
Output:
array([13801,   100,  1300,   650])
Based on the comment:
temp = test[test['point score']<1000]
temp=temp['point score']-100
temp2 = test[test['point score']>=1000]
temp2=temp2['point score']-200
pd.concat([temp, temp2])
Another solution:
out = []
for cell in test['point score']:
    if cell < 1000:
        out.append(cell-100)
    else:
        out.append(cell-200)
test['res'] = out
Fourth solution:
test['point score']-(test['point score']<1000).replace(False,200).replace(True,100)
You can keep the code vectorial without explicitly using numpy (although pandas uses numpy under the hood anyway):
# example 1
df['map'] = df['point score'] - df['point score'].gt(1000).map({True: 200, False: 100})
# example 2, multiple criteria
# -100 up to 1000, -200 up to 2000, -300 above 2000
df['cut'] = df['point score'] - pd.cut(df['point score'], bins=[0,1000,2000,float('inf')], labels=[100, 200, 300]).astype(int)
Output:
point score map cut
0 800 700 700
1 900 800 800
2 1000 900 900
3 2000 1800 1800
4 3000 2800 2700
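For more than two thresholds, np.select is another option worth knowing; this is a small sketch (the 100/200/300 deductions mirror the pd.cut example above, and the sample values are made up):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'point score': [800, 1500, 2500]})
# conditions are checked in order; default covers everything above 2000
conditions = [test['point score'] < 1000, test['point score'] < 2000]
deductions = [100, 200]
test['res'] = test['point score'] - np.select(conditions, deductions, default=300)
```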

How do I resample to 5 min correctly

I am trying to resample 1 min bars to 5 min but I am getting incorrect results.
1 min data:
I am using this to resample:
df2.resample("5min").agg({'open':'first',
'high':'max',
'low':'min',
'close':'last'})
I get:
For the second row bar (00:00:00) the high should be 110.34 not 110.35, and the close should be 110.33.
How do I fix this?
EDIT 1 To create data:
import datetime
import pandas as pd
idx = pd.date_range("2021-09-23 23:55", periods=11, freq="1min")
df = pd.DataFrame(index = idx)
data = [110.34,
110.33,110.34,110.33,110.33,110.33,
110.32,110.35,110.34,110.32,110.33,
]
df['open'] = data
df['high'] = data
df['low'] = data
df['close'] = data
df2 = df.resample("5min").agg({'open':'first',
'high':'max',
'low':'min',
'close':'last'})
print(df)
print("----")
print(df2)
We can specify the closed='right' and label='right' optional keyword arguments:
d = {'open':'first','high':'max',
'low':'min','close':'last'}
df.resample("5min", closed='right', label='right').agg(d)
open high low close
2021-09-23 23:55:00 110.34 110.34 110.34 110.34
2021-09-24 00:00:00 110.33 110.34 110.33 110.33
2021-09-24 00:05:00 110.32 110.35 110.32 110.33
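Running both variants on the sample data from the EDIT shows why the defaults looked wrong: with the default left-closed bins the 00:00:00 bar covers 00:00-00:04 and picks up the 110.35 print, while closed='right', label='right' makes it cover 23:56-00:00. This just reproduces the answer above, nothing new is assumed:

```python
import pandas as pd

idx = pd.date_range("2021-09-23 23:55", periods=11, freq="1min")
data = [110.34,
        110.33, 110.34, 110.33, 110.33, 110.33,
        110.32, 110.35, 110.34, 110.32, 110.33]
df = pd.DataFrame({'open': data, 'high': data, 'low': data, 'close': data}, index=idx)
d = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'}

left = df.resample("5min").agg(d)                                  # default binning
right = df.resample("5min", closed='right', label='right').agg(d)  # what the question wants
```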

Finding time difference between the message status in python

So, I want to calculate the time differences; and the file looks something like this
id  message_id     send_date            status
0   5f74b996a2b7e  2020-10-01 00:00:07  sent
1   5f74b996a2b7e  2020-10-01 00:00:09  delivered
2   5f74b99e85b3c  2020-10-02 02:00:14  sent
3   5f74b99e85b3c  2020-10-02 02:01:16  delivered
4   5f74b99e85b3c  2020-10-02 08:06:49  read
5   5f74b996a2b7e  2020-10-02 15:16:32  read
6   5f9d97ff1af9e  2020-10-14 13:45:43  sent
7   5f9d97ff1af9e  2020-10-14 13:45:45  delivered
8   5f9d97ff1af9e  2020-10-14 13:50:48  read
9   5f74b9a35b6c5  2020-10-16 19:01:19  sent
10  5f74b9a35b6c5  2020-10-16 19:01:25  deleted
Inside is id which increment, message_id is unique to each message, send_date is the time, status is the message status (it has 5 statuses; sent, delivered, read, failed, and deleted).
I wanted to calculate the time differences when the message was sent then delivered, if delivered then read.
I know something like this can be handy, but I wasn't sure how to assign it uniquely to each of the message_id
from datetime import datetime
s1 = '2020-10-14 13:45:45'
s2 = '2020-10-14 13:50:48' # for example
FMT = '%Y-%m-%d %H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
print(tdelta)
Ref: https://stackoverflow.com/questions/3096953/how-to-calculate-the-time-interval-between-two-time-strings
The expected output would be,
   message_id     delivered_diff   read_diff        deleted_diff
0  5f74b996a2b7e  00:00:02         1 day, 15:16:23
1  5f74b99e85b3c  00:01:02         6:05:33
2  5f9d97ff1af9e  00:00:02         0:05:03
3  5f74b9a35b6c5                                    0:00:06
You can use pandas and datetime to do this.
The code is commented for better understanding and was written with Python 3.8.
import datetime
import pandas as pd

def time_delta(a, b):
    # calculate the timedelta between two date strings
    fmt = '%Y-%m-%d %H:%M:%S'
    return datetime.datetime.strptime(b, fmt) - datetime.datetime.strptime(a, fmt)

def calculate_diff(val, first_status, second_status):
    # check that both statuses exist for this message
    if not val['status'].str.contains(first_status).any() or not val['status'].str.contains(second_status).any():
        return ''
    a = val.loc[val['status'] == first_status, 'send_date'].values[0]   # send_date of the first status
    b = val.loc[val['status'] == second_status, 'send_date'].values[0]  # send_date of the second status
    return time_delta(a, b)  # calculate the delta

df = pd.read_csv('test.csv', sep=';')  # load csv file with ; as separator
grouped = df.groupby('message_id')     # group by message id
rows = []
for message_id, values in grouped:     # calculate the results for each group
    rows.append({
        'message_id': message_id,
        'delivered_diff': calculate_diff(values, 'sent', 'delivered'),  # sent -> delivered
        'read_diff': calculate_diff(values, 'delivered', 'read'),       # delivered -> read
        'deleted_diff': calculate_diff(values, 'sent', 'deleted'),      # sent -> deleted
    })
final_df = pd.DataFrame(rows, columns=['message_id', 'delivered_diff', 'read_diff', 'deleted_diff'])
# print final result
print(final_df)
The result:
      message_id   delivered_diff        read_diff     deleted_diff
0  5f74b996a2b7e  0 days 00:00:02  1 days 15:16:23
1  5f74b99e85b3c  0 days 00:01:02  0 days 06:05:33
2  5f74b9a35b6c5                                   0 days 00:00:06
3  5f9d97ff1af9e  0 days 00:00:02  0 days 00:05:03
import pandas as pd

final_dict = []
data = pd.read_csv('data.csv', names=['id', 'unique_id', 'time', 'status'])
data['time'] = pd.to_datetime(data['time'])
# data.info()
groupByUniqueId = data.groupby('unique_id')
for name, group in groupByUniqueId:
    for _, row in group.iterrows():
        if row['status'] == "sent":
            sent = row['time']
        elif row['status'] == "read":
            final_dict.append({row['unique_id']: {"read": str(row['time'] - sent)}})
        elif row['status'] == "delivered":
            final_dict.append({row['unique_id']: {"delivered": str(row['time'] - sent)}})
        elif row['status'] == "deleted":
            final_dict.append({row['unique_id']: {"deleted": str(row['time'] - sent)}})
print(final_dict)
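Both answers loop row by row; a vectorized alternative is to pivot the data into one datetime column per status and subtract. This is a sketch with inline sample data taken from the question (it assumes each message has at most one row per status, which DataFrame.pivot requires):

```python
import pandas as pd

df = pd.DataFrame({
    'message_id': ['5f74b996a2b7e'] * 3 + ['5f74b9a35b6c5'] * 2,
    'send_date': pd.to_datetime(['2020-10-01 00:00:07', '2020-10-01 00:00:09',
                                 '2020-10-02 15:16:32', '2020-10-16 19:01:19',
                                 '2020-10-16 19:01:25']),
    'status': ['sent', 'delivered', 'read', 'sent', 'deleted'],
})

# one row per message, one datetime column per status (NaT where absent)
wide = df.pivot(index='message_id', columns='status', values='send_date')
wide['delivered_diff'] = wide['delivered'] - wide['sent']
wide['read_diff'] = wide['read'] - wide['delivered']
wide['deleted_diff'] = wide['deleted'] - wide['sent']
```

Missing statuses simply come out as NaT instead of an empty string.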

How to transform a dataframe based on if,else conditions?

I am trying to build a function which transforms a dataframe based on certain conditions, but I am getting a SyntaxError. I am not sure what I am doing wrong. Any help will be appreciated. Thank you!
import pandas as pd
from datetime import datetime
from datetime import timedelta
df=pd.read_csv('example1.csv')
df.columns =(['dtime','kW'])
df['dtime'] = pd.to_datetime(df['dtime'])
df.head(5)
dtime kW
0 2019-08-27 23:30:00 0.016
1 2019-08-27 23:00:00 0
2 2019-08-27 22:30:00 0.016
3 2019-08-27 22:00:00 0.016
4 2019-08-27 21:30:00 0
def transdf(df):
    a=df.loc[0,'dtime']
    b=df.loc[1,'dtime']
    c=a-b
    minutes = c.total_seconds() / 60
    d=int(minutes)
    #d can be only 15, 30 or 60
    if d==15:
        return df=df.set_index('dtime').asfreq('-15T',fill_value='Missing')  # SyntaxError here
    elif d==30:
        return df=df.set_index('dtime').asfreq('-30T',fill_value='Missing')
    elif d==60:
        return df=df.set_index('dtime').asfreq('-60T',fill_value='Missing')
    else:
        return None
First: `return df = ...` tries to combine an assignment with a return statement, which is invalid Python syntax; that is why you are getting the error. Inside each of the cases just update the value of df, and put a single return statement after the if/else at the end of your function.
def transform(df):
    a = df.loc[0, 'dtime']
    b = df.loc[1, 'dtime']
    c = a - b
    minutes = c.total_seconds() / 60
    d = int(minutes)
    # d can only be 15, 30 or 60
    if d == 15:
        df = df.set_index('dtime').asfreq('-15T', fill_value='Missing')
    elif d == 30:
        df = df.set_index('dtime').asfreq('-30T', fill_value='Missing')
    elif d == 60:
        df = df.set_index('dtime').asfreq('-60T', fill_value='Missing')
    else:
        df = None
    return df
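The branching can also be collapsed entirely, since the three cases only differ in the number of minutes. A sketch with two stated deviations from the original: it sorts the index ascending and uses a positive frequency string (the original used descending '-15T' style frequencies on a descending index):

```python
import pandas as pd

def transform(df):
    # infer the sampling interval in minutes from the first two rows
    minutes = int(abs((df.loc[0, 'dtime'] - df.loc[1, 'dtime']).total_seconds()) // 60)
    if minutes not in (15, 30, 60):
        return None
    # reindex at the detected frequency; gaps become the string 'Missing'
    return (df.set_index('dtime')
              .sort_index()
              .asfreq(f'{minutes}min', fill_value='Missing'))

sample = pd.DataFrame({
    'dtime': pd.to_datetime(['2019-08-27 23:30:00', '2019-08-27 23:00:00',
                             '2019-08-27 22:00:00']),  # 22:30 is missing
    'kW': [0.016, 0, 0.016],
})
out = transform(sample)
```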

Dark Sky API Iterate daily through one year in Python 3

I am just trying to get weather data for a time range.
I want to get daily OR hourly data for a whole year.
I just tried the following code:
from forecastiopy import *
from datetime import date, datetime, timedelta

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2015, 1, 1)
end_date = date(2015, 12, 31)
for single_date in daterange(start_date, end_date):
    time = single_date.strftime("%Y-%m-%d")
    print('DATE: ', time)

city = [40.730610, -73.935242]
fio = ForecastIO.ForecastIO(apikey,
                            units=ForecastIO.ForecastIO.UNITS_SI,
                            lang=ForecastIO.ForecastIO.LANG_ENGLISH,
                            latitude=city[0], longitude=city[1])
print('Latitude:', fio.latitude, 'Longitude:', fio.longitude)
print('Timezone', fio.timezone, 'Offset', fio.offset)
print(fio.get_url())  # You might want to see the request url
if fio.has_hourly() is True:
    hourly = FIOHourly.FIOHourly(fio)
    print('Hourly')
    print('Summary:', hourly.summary)
    print('Icon:', hourly.icon)
    for hour in range(0, hourly.hours()):
        print('Hour', hour+1)
        for item in hourly.get_hour(hour).keys():
            print(item + ' : ' + str(hourly.get_hour(hour)[item]))
        # Or access attributes directly for a given minute.
        print(hourly.hour_5_time)
else:
    print('No Hourly data')
I get:
DATE: 2015-01-01
DATE: 2015-01-02
DATE: 2015-01-03
...
DATE: 2015-12-29
DATE: 2015-12-30
Latitude: 40.73061 Longitude: -73.935242
Timezone America/New_York Offset -4
Hourly
Summary: Light rain starting this afternoon.
Icon: rain
Hour 1
visibility : 16.09
humidity : 0.52
...
Hour 49
visibility : 16.09
humidity : 0.57
apparentTemperature : 23.52
icon : partly-cloudy-day
precipProbability : 0
windGust : 2.7
uvIndex : 2
time : 1498395600
precipIntensity : 0
windSpeed : 2.07
pressure : 1014.84
summary : Mostly Cloudy
windBearing : 37
temperature : 23.34
ozone : 308.33
cloudCover : 0.65
dewPoint : 14.43
1498237200
How can I use for the time parameter each day of a specific year to get 365 daily reports or 365 * 24 hourly reports? I am not a specialist in python.
This blog provides some code to query between dates: https://nipunbatra.github.io/blog/2013/download_weather.html
import datetime
import forecastio
import pandas as pd

attributes = ['temperature', 'humidity']  # pick the hourly fields you need

times = []
data = {}
for attr in attributes:
    data[attr] = []

start = datetime.datetime(2015, 1, 1)
for offset in range(1, 60):
    forecast = forecastio.load_forecast(api_key, lat, lng,
                                        time=start + datetime.timedelta(offset),
                                        units="us")
    h = forecast.hourly()
    d = h.data
    for p in d:
        times.append(p.time)
        try:
            for i in attributes:
                data[i].append(p.d[i])
        except KeyError as e:
            print(e)

df = pd.DataFrame(data, index=times)
It works for me on Python 3.6. However, I was getting a KeyError: 'temperature' when querying dates around March 2019 for my coordinates, so in this code I added the try/except around the `for p in d` loop.
Hope this helps
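Independent of the client library, the missing piece in the question's loop is generating one date per day of the year and passing it as the time argument of each request. A library-free sketch of just the date iteration (daterange mirrors the generator from the question):

```python
from datetime import date, timedelta

def daterange(start_date, end_date):
    # yield every date from start_date up to, but not including, end_date
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(days=n)

# one entry per day of 2015 -> 365 daily requests (or 365 * 24 hourly rows)
days = list(daterange(date(2015, 1, 1), date(2016, 1, 1)))
```

Note that the range is end-exclusive: the question's end_date of date(2015, 12, 31) stops one day early, which is why its printed dates end at 2015-12-30.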
