What is the most performant way to slice a datetime in a multi-index? - python-3.x

What is the most 'performant' way to filter a DataFrame by time if the DataFrame has a multi-index containing a datetime index?
For example, how to filter for business hours only in a datetime index which is contained in a multi-index.

df.index.get_level_values(1).hour.isin([9,10,11,13,14,15,16])
That's just one example: filter the second level of the MultiIndex (a DatetimeIndex) to get a boolean mask that is True wherever the clock hour is 9–11 or 13–16, i.e. a 9-to-5 workday with the lunch hour excluded.
Need more precision?
dt = df.index.get_level_values(1)
# wrap in a Series: a bare Index has no .between method
minutes = pd.Series(dt.hour * 60 + dt.minute)
minutes.between(8*60 + 15, 17*60 + 45)
That's 8:15 to 17:45.
Siesta?
minutes.between(9*60+30, 15*60) | minutes.between(17*60+30, 20*60)
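As a minimal end-to-end sketch (the two-level index, names, and values below are made up for illustration):
import pandas as pd
import numpy as np

# hypothetical sample: level 0 is a ticker, level 1 a DatetimeIndex at 30-minute steps
idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2023-01-02 07:00", periods=24, freq="30min")],
    names=["ticker", "ts"],
)
df = pd.DataFrame({"price": np.arange(len(idx))}, index=idx)

dt = df.index.get_level_values(1)
mask = dt.hour.isin([9, 10, 11, 13, 14, 15, 16])  # business hours, lunch excluded
business_hours_only = df[mask]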

Related

Widening long table grouped on date

I have run into a problem transforming a DataFrame. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and to pivot it, but can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Assuming from the details you have provided, I think you are dealing with time-series data, acquired on different dates at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object:
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the date string, since %f parses at most microseconds (six digits) and the nanosecond part would complicate parsing. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new, empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value for 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
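Since you mention having tried pivot: for what it's worth, a more idiomatic sketch along those lines (reusing the same df as above; the helper column names day and time are arbitrary) that avoids the explicit loop and scales to the full 00:00–20:00 range of columns:
df["date"] = pd.to_datetime(df["date"])
wide = (df.assign(day=df["date"].dt.date, time=df["date"].dt.strftime("%H:%M"))
          .pivot(index="day", columns="time", values="value")
          .reset_index())
# wide now has one row per day and one column per time of day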

Python Pandas: Supporting 25 hours in datetime index

I want to use a date/time as an index for a dataframe in Pandas.
However, daylight saving time is not properly addressed in the database, so the date/time values for the day in which daylight saving time ends have 25 hours and are represented as such:
2019102700
2019102701
...
2019102724
I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
However, that gives an error:
ValueError: unconverted data remains: 4
Presumably because the to_datetime function is not expecting the hour to be 24. Similarly, the day in which daylight saving time starts only has 23 hours.
One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?
If you know the timezone, here's a way to calculate UTC timestamps. Parse only the date part, localize to the actual time zone the data "belongs" to, and convert that to UTC. Now you can parse the hour part and add it as a time delta - e.g.
import pandas as pd
df = pd.DataFrame({'date_time_str': ['2019102722','2019102723','2019102724',
'2019102800','2019102801','2019102802']})
df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
.dt.tz_localize('Europe/Berlin')
.dt.tz_convert('UTC'))
# note: .astype('timedelta64[h]') on strings is not supported in recent pandas; pd.to_timedelta is more robust
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')
# df['date_time']
# 0 2019-10-27 20:00:00+00:00
# 1 2019-10-27 21:00:00+00:00
# 2 2019-10-27 22:00:00+00:00
# 3 2019-10-27 23:00:00+00:00
# 4 2019-10-28 00:00:00+00:00
# 5 2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
I'm not sure if it is the most elegant or efficient solution, but I would:
# the 25-hour day runs from hour 00 to hour 24, so roll hour 24 over into hour 00 of the next day
mask = df.date_time.str[-2:] == '24'
df.loc[mask, 'date_time'] = (pd.to_numeric(df.date_time[mask]) + 100 - 24).apply(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
Pick the first and the last index, convert them to tz-aware datetimes, then generate a date_range that handles 25-hour days, and assign that date_range to your df index:
start = pd.to_datetime(df.index[0], format="%Y%m%d%H").tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1], format="%Y%m%d%H").tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="1h")  # the data is hourly, so use an hourly frequency
df = df.set_index(index_)
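A quick sanity check worth adding (a sketch, assuming the data really does contain one row per local hour with no gaps):
# the tz-aware hourly range repeats the extra hour on the DST-end day
# (2019-10-27 has 25 hourly stamps in Europe/Berlin), so the lengths
# only match if the raw data covers every local hour exactly once
assert len(index_) == len(df)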

Trying to convert a string into a time format using pandas

I'd like to import a dataset with pandas; this is my code:
data = pd.read_csv(source + 'clustering_CB\\AAA_tableau_jan_oct_19_cardiologia.txt' ,sep='\t' , engine='python')
Column Col10 contains string values that give the duration of a web visit; here's an example:
00:02:35
That is 2 minutes and 35 seconds.
What I'd like to do is import this column in a time format, so I can measure the duration of the web visit in seconds or minutes.
If I understand your question correctly, the column already contains timedelta values (in string format) - in this case, you can apply the pd.to_timedelta function to the column:
timedelta = pd.to_timedelta(df["Col10"])
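To then measure the duration numerically, a small follow-up sketch (assuming Col10 parses as above):
td = pd.to_timedelta(df["Col10"])  # "00:02:35" -> Timedelta('0 days 00:02:35')
seconds = td.dt.total_seconds()    # 155.0
minutes = seconds / 60             # ~2.583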

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you aren't passing x to it; you pass the whole csv_df['datetime'] Series instead, so the result is wrong.
Anyway, without your sample data, I guess that you want the difference between csv_df['datetime'] and today() (or the sum of those differences). To do this, you don't need apply; just do a direct vectorized operation on the Series.
I made a two-column DataFrame as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.today()  # works with the question's "from datetime import datetime"
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find the sum of all deltas and assign the same sum value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348

Creating a daily account log from a Pandas expense file in data frame format

I have an expense file that I am trying to read in and from this file create a daily log. A small subset of the file that extends over years is shown below, for a few days in January 2015.
Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-07,981.0,0.0,0.0,0.0
2015-01-07,3185.0,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-05,75.0,0.0,0.0,0.0
2015-01-03,287.0,0.0,0.0,0.0
2015-01-02,64.8,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-02,1280.0,0.0,0.0,0.0
2015-01-02,245.0,0.0,0.0,0.0
2015-01-01,45.0,0.0,0.0,0.0
In my code I start with the variables check_start and savings_start, which contain the starting values of the checking and savings accounts. I would like to give the code a start date and an end date and have it iterate through each day, check whether there was an expense on that day, subtract the checking and savings debits, and add the checking and savings additions. If there were no expenses on a day, it should keep the accounts at the same value as the previous day. In addition, I am trying to constrain myself to Pandas data frames in the implementation. So far my code looks like this.
import pandas as pd
from datetime import date
check_start = 8500.0
savings_start = 4000.0
start_date = date(2017, 1, 1)
end_date = date(2017, 1, 8)
df = pd.read_csv("file_name.csv", dtype={'Date': str, 'Checking_Debit': float,
'Checking_Addition': float,
'Savings_Debit': float,
'Savings_Addition': float})
In a Pythonic format with the Pandas module, how do I walk from the start date to the end date, one day at a time, see whether there are expenses on each date, and subtract them from the checking and savings balances? At the end I should have an array with the value of the checking account on each date, and the same for the savings account.
The result should be arrays written into another .csv file with the following format.
Date,Checking,Savings
2017-01-07,1865.1,3925.0
2017-01-06,6373.2,3925.0
2017-01-05,6373.2,3925.0
2017-01-04,6503.2,3925.0
2017-01-03,6503.2,3925.0
2017-01-02,6790.2,3925.0
2017-01-01,8455.0,4000.0
Start by reading in the data you provided, parsing the date column as datetimes:
import pandas as pd
df = pd.read_csv(r"dat.csv", parse_dates=[0],dtype={'Checking_Debit': float,
'Checking_Addition': float,
'Savings_Debit': float,
'Savings_Addition': float})
Set Date as index for better data manipulation.
df = df.set_index("Date")
Initialize all the variables for the loop
check_start = 8500.0
savings_start = 4000.0
start_date = pd.to_datetime('2015/1/1')
end_date = pd.to_datetime('2015/1/8')
delta = pd.Timedelta('1 days') # time that needs to be added to start date
Now group the expense data by date:
grp_df = df.groupby('Date').sum()
Now we run a while loop to create the expense report for each day:
expense_report = []
while start_date <= end_date:
    if start_date in df.index:
        savings_start += (grp_df.loc[start_date, "Savings_Addition"] - grp_df.loc[start_date, "Savings_Debit"])
        check_start += (grp_df.loc[start_date, "Checking_Addition"] - grp_df.loc[start_date, "Checking_Debit"])
        expense_report.append([start_date, check_start, savings_start])
    else:
        expense_report.append([start_date, check_start, savings_start])
    start_date += delta
Convert the expense_report list to a pandas DataFrame:
df_exp_rpt = pd.DataFrame(expense_report,columns=["Date","Checking","Savings"])
print(df_exp_rpt)
Date Checking Savings
0 2015-01-01 8455.0 4000.0
1 2015-01-02 6790.2 4075.0
2 2015-01-03 6503.2 4075.0
3 2015-01-04 6503.2 4075.0
4 2015-01-05 6373.2 4075.0
5 2015-01-06 6373.2 4075.0
6 2015-01-07 1865.1 4075.0
7 2015-01-08 1865.1 4075.0
You can save it to CSV with:
df_exp_rpt.to_csv("filename.csv")
Note: the Savings column values are 4075.0 instead of 3925.0 because of the 75.0 in the Savings_Addition column (on 2015-01-02) in your original data.
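For what it's worth, a vectorized sketch of the same daily-balance idea (reusing grp_df from above, with the original starting balances) that replaces the while loop with reindex and cumsum:
check_start, savings_start = 8500.0, 4000.0
days = pd.date_range("2015-01-01", "2015-01-08", freq="D")
daily = grp_df.reindex(days, fill_value=0.0)  # days with no expenses become all-zero rows
checking = check_start + (daily["Checking_Addition"] - daily["Checking_Debit"]).cumsum()
savings = savings_start + (daily["Savings_Addition"] - daily["Savings_Debit"]).cumsum()
df_exp_rpt = pd.DataFrame({"Date": days, "Checking": checking.values, "Savings": savings.values})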
