PANDAS date summarisation - python-3.x

I have a pandas dataframe that looks like:
import pandas as pd
df= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22']})
df
I want to aggregate the dates so it shows the first four quarters and then the year end. For the dataset above, I would expect:
df1= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22'],'Mat Group':['1Q18','1Q18','2Q18','FY19','FY20']})
df1
Right now I achieve this using a set of .loc statements such as:
df.loc[(df['Maturity Date'] >= '2018-01-01') & (df['Maturity Date'] <= '2018-03-31'), 'Mat Group'] = "1Q18"
df.loc[(df['Maturity Date'] >= '2018-04-01') & (df['Maturity Date'] <= '2018-06-30'), 'Mat Group'] = "2Q18"
Is there a more elegant way to achieve the same result? Perhaps the buckets could be kept in a list and iterated over, so that the bucketing is more flexible?

A bit specific, but here is what I would use:
the strftime format %y to get the two-digit year
the pandas built-in quarter attribute to get the quarter
the Python format function to construct the strings
a lambda to apply it to the column
Here is the result. Maybe there is a better answer, but this one is pretty concise.
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])  # the sample column holds strings, so convert first
df['Mat Group'] = df['Maturity Date'].apply(
    lambda x: '{}Q{:%y}'.format(x.quarter, x) if x.year < 2019
              else 'FY{:%y}'.format(x))
df
# Amount Maturity Date type Mat Group
# 0 10 2018-01-22 Asset 1Q18
# 1 -10 2018-02-22 Liability 1Q18
# 2 20 2018-06-22 Asset 2Q18
# 3 -20 2019-06-22 Liability FY19
# 4 5 2020-01-22 Asset FY20
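The "buckets in a list" idea from the question can also be sketched with pd.cut, which accepts datetime bin edges and labels as lists. The edges and labels below are assumptions chosen to match the sample data:
# Hypothetical bucket edges; pd.cut bins are right-closed by default,
# so each maturity date falls into the first interval whose right edge it does not exceed.
edges = pd.to_datetime(['2017-12-31', '2018-03-31', '2018-06-30',
                        '2018-09-30', '2018-12-31', '2019-12-31', '2020-12-31'])
labels = ['1Q18', '2Q18', '3Q18', '4Q18', 'FY19', 'FY20']
df['Mat Group'] = pd.cut(pd.to_datetime(df['Maturity Date']), bins=edges, labels=labels)
Adding a new period is then just a matter of appending one edge and one label.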

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and pivot it, but I can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day. So I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I think you are dealing with time-series data acquired at 02:00:00 and 03:00:00 on different dates. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the timestamp because %f only handles microseconds (six fractional digits). Each string is converted to a datetime object, and we keep the unique dates.
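As a side note, pd.to_datetime handles the nanosecond ISO timestamps directly, so the trimming could be skipped entirely (a sketch):
# pd.to_datetime parses the full '...T02:00:00.000000000' strings without manual trimming.
unique_days = pd.to_datetime(df["date"]).dt.date.unique()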
Now we create a new empty df in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value for 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # insert the row at the last position
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
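For what it's worth, the same reshaping can be done without an explicit loop using pivot. This sketch assumes the timestamps parse cleanly and that each date/time pair occurs once; pivot creates one column per distinct time, so the 00:00-20:00 case needs no loop either:
# Split the timestamp into a date part (rows) and a time part (columns), then pivot.
ts = pd.to_datetime(df["date"])
wide = (df.assign(date=ts.dt.date, time=ts.dt.strftime("%H:%M"))
          .pivot(index="date", columns="time", values="value")
          .reset_index())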

Extract rows from pandas dataframe

I have a data frame like the one below. I want to extract the rows of the data frame whose year and month are '1395/01'. I used the code below, but I know it is not correct, because it slices the Series itself rather than each string in it. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.
You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05
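Incidentally, the .str accessor also supports plain slicing, so the asker's original attempt works with one small change (a sketch):
# .str[:7] slices each string, not the first seven rows of the Series.
print(df[df['Date'].str[:7] == '1395/01'])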

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, then the following should help you. It does assume, however, that you always have data from 12 months earlier. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Step one day past the most recent date, then range up to now.
start, end = df.index[-1] + pd.DateOffset(days=1), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Then create the new DataFrame:
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
Then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
A slightly different variation:
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values
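For reference, a more compact variant of the same idea, still assuming every missing month has data exactly twelve months earlier:
# Build the full monthly index up to now, find the missing months,
# and append last year's rows relabelled with the missing dates.
full = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
missing = full.difference(df.index)
df = pd.concat([df, df.loc[missing - pd.DateOffset(years=1)].set_axis(missing)]).sort_index()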

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
from datetime import datetime

csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['datetime'].apply(lambda x: pd.to_datetime(x)) # convert the `datetime` strings to a datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame, csv_df['datetime'], where each cell holds a single date, and do the calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is that it calculates values for every row (673 rows) and puts all of those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you never pass x to it; you pass the whole Series instead. That is why the result is wrong.
Anyway, I guess that you want the difference between csv_df['datetime'] and today() (or the sum of those differences). To do this, you don't need apply. Just do a direct vectorized operation on the Series.
I made a two-column dataframe as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find the sum of all deltas and assign that same value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
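For completeness, the asker's apply approach can also be fixed by actually using x, the individual cell, inside the lambda (a sketch):
# Each x is a single Timestamp, so the subtraction yields one Timedelta per row.
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: (x - td).days)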

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame that contains two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the line below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968' we get df['date'] = 1968-08-01.
This is exactly what I wanted.
Problem at hand: my DataFrame has more than 200,000 rows, and I notice that for a few rows I also get a full Timestamp like the one below, which I want to avoid:
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series; I explicitly extracted only the date using the code below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y') #Line 1
df['date'] = df['date'].dt.date #Line 2
The problem was solved, except that Line 2 took five times as long as Line 1.
Question: Is there any way I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time- and resource-efficient manner?
AFAIK we don't have a date dtype in Pandas; we only have datetime, so we will always have a time part.
Even though Pandas shows: 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
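One caveat with the concatenation: it assumes month is already zero-padded. If it can be a bare digit like '8', padding it first keeps the format consistent (a sketch):
# str.zfill(2) turns '8' into '08' before building the date string.
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2) + '-01'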
UPDATE: be aware that .dt.date will give you an object-dtype column of Python datetime.date objects, not a datetime64 column:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
