How to create this year_month sales and previous year_month sales in two different columns? - python-3.x

I need to create two columns from transaction-level data: one for the current year-month's sales and one for the previous year-month's sales.
Data format:
Date | bill amount
2019-07-22 | 500
2019-07-25 | 200
2020-11-15 | 100
2020-11-06 | 900
2020-12-09 | 50
2020-12-21 | 600
Required format:
Year_month | This month sales | Prev month sales
2019_07 | 700 | -
2020_11 | 1000 | -
2020_12 | 650 | 1000

The relatively tricky bit is figuring out what the previous month is. We do it by computing the beginning of the month for each date and then rolling it back by one month. Note that this also takes care of the January -> December-of-the-previous-year case.
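A minimal standalone illustration of the year-boundary behaviour (the dates here are hypothetical, not taken from the sample data):

from datetime import datetime
from dateutil.relativedelta import relativedelta

jan = datetime(2021, 1, 1)              # first day of some January
print(jan + relativedelta(months=-1))   # rolled back one month -> 2020-12-01 00:00:00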
We start by creating a sample dataframe and importing some useful modules
import pandas as pd
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
""")
df = pd.read_csv(data,sep='|')
df['date'] = pd.to_datetime(df['date'])
df
we get
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Then we figure out the month start and the previous month start using datetime utilities
df['month_start'] = df['date'].apply(lambda d:datetime(year = d.year, month = d.month, day = 1))
df['prev_month_start'] = df['month_start'].apply(lambda d:d+relativedelta(months = -1))
Then we summarize monthly sales using groupby on month start
ms_df = df.drop(columns = 'date').groupby('month_start').agg({'prev_month_start':'first','amount':sum}).reset_index()
ms_df
so we get
month_start prev_month_start amount
0 2019-07-01 2019-06-01 700
1 2020-11-01 2020-10-01 1000
2 2020-12-01 2020-11-01 650
Then we join (merge) ms_df on itself by mapping 'prev_month_start' to 'month_start'
ms_df2 = ms_df.merge(ms_df, left_on='prev_month_start', right_on='month_start', how = 'left', suffixes = ('','_prev'))
We are more or less there; now we make it pretty by dropping the superfluous columns, adding a label, etc.
ms_df2['label'] = ms_df2['month_start'].dt.strftime('%Y_%m')
ms_df2 = ms_df2.drop(columns = ['month_start','prev_month_start','month_start_prev','prev_month_start_prev'])
columns = ['label','amount','amount_prev']
ms_df2 = ms_df2[columns]
and we get
| | label | amount | amount_prev |
|---:|--------:|---------:|--------------:|
| 0 | 2019_07 | 700 | nan |
| 1 | 2020_11 | 1000 | nan |
| 2 | 2020_12 | 650 | 1000 |
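For comparison, here is a more compact sketch of the same roll-back idea (not part of the original answer) using pandas Period arithmetic, where subtracting 1 from a monthly PeriodIndex steps back exactly one calendar month:

# assumes pandas imported as pd and the parsed df from above
monthly = df.groupby(df['date'].dt.to_period('M'))['amount'].sum()
prev = monthly.reindex(monthly.index - 1).to_numpy()   # previous month's total, NaN when absent
out = pd.DataFrame({'label': monthly.index.strftime('%Y_%m'),
                    'amount': monthly.to_numpy(),
                    'amount_prev': prev})
print(out)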

Using @piterbarg's data, we can use resample, combined with shift and concat, to get your desired data:
import pandas as pd
from io import StringIO
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
"""
)
df = pd.read_csv(data, sep="|", parse_dates=["date"])
df
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Get the sum for current sales:
data = df.resample(on="date", rule="1M").amount.sum().rename("This_month")
data
date
2019-07-31 700
2019-08-31 0
2019-09-30 0
2019-10-31 0
2019-11-30 0
2019-12-31 0
2020-01-31 0
2020-02-29 0
2020-03-31 0
2020-04-30 0
2020-05-31 0
2020-06-30 0
2020-07-31 0
2020-08-31 0
2020-09-30 0
2020-10-31 0
2020-11-30 1000
2020-12-31 650
Freq: M, Name: This_month, dtype: int64
Now we can shift the series to get the previous month's values, and drop rows whose total sales are 0 to get your final output:
(pd.concat([data, data.shift().rename("previous_month")], axis=1)
.query("This_month!=0")
.fillna(0))
This_month previous_month
date
2019-07-31 700 0.0
2020-11-30 1000 0.0
2020-12-31 650 1000.0
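A small compatibility note (an assumption about recent pandas versions, not something stated in the original answer): newer pandas releases deprecate the "M" alias for month-end resampling in favour of "ME", so on those versions the resample call above would read:

data = df.resample(on="date", rule="ME").amount.sum().rename("This_month")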

Related

How to solve the ValueError: Unstacked DataFrame is too big, causing int32 overflow in python?

I have a dataframe in dynamic format for each ID
df:
ID |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351
10 |01-Apr-20 |31-Mar-21| 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964
10 |01-Apr-20 |31-Mar-21| 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
11 |12-Dec-20 |11-Dec-21| 1503 |CSHLESS | 12-Jan-2021 | 15-Jan-2021 | 76137 | 50286
11 |12-Dec-20 |11-Dec-21| 1505 |CSHLESS | 05-Jan-2021 | 07-Jan-2021 | 30000 | 0
Based on the ID column, I am trying to convert all the dynamic variables into a static format so that I can have a single row for each ID.
Columns such as ID, Start Date,End date are static in nature and rest of the columns are dynamic in nature for each ID.
In order to achieve the output below:
ID |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351 | 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964 | 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
I am using the code below:
# Index columns
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount() + 1
# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])
# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
but it throws a ValueError: Unstacked DataFrame is too big, causing int32 overflow
Try this:
# Index columns (very similar to your code)
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
df['nrow'] = df.groupby(idx)['claim_no'].transform('rank')
df['nrow'] = df['nrow'].astype(int).astype(str)
Instead of stack & unstack, use melt and pivot; these functions give you better control over the columns:
df1 = pd.melt(df, id_vars=['nrow', *idx],
              value_vars=['claim_no', 'claim_type', 'Admission_date',
                          'Discharge_date', 'Claim_amt', 'Approved_amt'],
              value_name='var')
df2 = df1.pivot(index=[*idx], columns=['variable', 'nrow'], values='var')
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns.values]
print(df2)
claim_no_1 claim_no_2 claim_no_3 claim_type_1 claim_type_2 claim_type_3 Admission_date_1 Admission_date_2 Admission_date_3 Discharge_date_1 Discharge_date_2 Discharge_date_3 Claim_amt_1 Claim_amt_2 Claim_amt_3 Approved_amt_1 Approved_amt_2 Approved_amt_3
ID Start Date End date
10 01-Apr-20 31-Mar-21 1123 1212 1680 CSHLESS POSTHOSP CSHLESS 23-Aug-2020 30-Aug-2020 18-Mar-2021 25-Aug-2020 01-Sep-2020 23-Mar-2021 25406 4209 18002 19351 3964 0
11 12-Dec-20 11-Dec-21 1503 1505 NaN CSHLESS CSHLESS NaN 12-Jan-2021 05-Jan-2021 NaN 15-Jan-2021 07-Jan-2021 NaN 76137 30000 NaN 50286 0 NaN

Create "leakage-free" Variables in Python?

I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
| 1 | 20 | 25 |
| 1 | 40 | 15 |
| 1 | 10 | 30 |
| 2 | ... | ... |
I would like to do that with several variables and I would like to create mean, median and variance in such a way, so a computationally fast method might be good. If a group has only one row I would like to enter 0s in the leakage-free Variable.
As I am rather a beginner in Python, some piece of code might be very helpful. Thank You!!
With a one-liner:
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group').transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
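The one-liner works because of the leave-one-out identity: for a group with sum S and n rows, dropping observation x_i and re-averaging gives

mean_without_i = (S - x_i) / (n - 1)

which is exactly (x.sum() - x) / (len(x) - 1) inside the transform; the fillna(0) then covers single-row groups, where n - 1 is 0.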
Update:
For median and variance (not one-liners unfortunately):
df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    for i in x.index:
        z = x.loc[x.index != i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
df = df.join(df.groupby('Group').apply(f))
print(df)
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
Note: this solution will be faster than using apply; you can verify it yourself by running %%timeit on each approach.
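As a minimal sketch of how one might check that claim (my own benchmark setup with an illustrative synthetic frame; nothing here is from the original answer):

import timeit
import numpy as np
import pandas as pd

# a larger synthetic frame so timing differences are visible
rng = np.random.default_rng(0)
big = pd.DataFrame({'Group': rng.integers(0, 1_000, 100_000),
                    'Price': rng.random(100_000) * 100})

def via_transform(df):
    # vectorised leave-one-out mean using transform
    grp = df.groupby('Group')['Price']
    n = grp.transform('count')
    mean = grp.transform('mean')
    return (mean * n - df['Price']) / (n - 1)

def via_apply(df):
    # same quantity computed group-by-group with apply
    return df.groupby('Group')['Price'].apply(lambda x: (x.sum() - x) / (len(x) - 1))

print('transform:', timeit.timeit(lambda: via_transform(big), number=10))
print('apply:    ', timeit.timeit(lambda: via_apply(big), number=10))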

Cannot convert object to date after groupby

I was successful with this conversion while working with a different dataset a couple of days ago. However, I cannot apply the same technique to my current dataset. The dataset looks like this:
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
Datez Volumez
0 2016-09-19 6.300000e+07
1 2016-09-20 3.382694e+07
2 2016-09-26 4.000000e+05
3 2016-09-27 4.900000e+09
4 2016-09-28 5.324995e+08
totalHist.dtypes
Datez object
Volumez float64
dtype: object
This used to do the trick:
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'], format='%d-%m-%Y')
totalHist.dtypes
which now is giving me:
KeyError: 'Datez'
During handling of the above exception, another exception occurred:
How can I fix this? I am doing this groupby before trying:
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
totalHist.head()
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
You can just use .rename() to rename your columns
Generate some data (in same format as OP)
d = ['1/1/2018','1/2/2018','1/3/2018',
'1/3/2018','1/4/2018','1/2/2018','1/1/2018','1/5/2018']
df = pd.DataFrame(d, columns=['Date'])
df['Trading_Value'] = [1000,1005,1001,1001,1002,1009,1010,1002]
print(df)
Date Trading_Value
0 1/1/2018 1000
1 1/2/2018 1005
2 1/3/2018 1001
3 1/3/2018 1001
4 1/4/2018 1002
5 1/2/2018 1009
6 1/1/2018 1010
7 1/5/2018 1002
GROUP BY
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
print(totalHist.head())
Date Trading_Value
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Rename columns
totalHist.rename(columns={'Date':'Datez','Trading_Value':'Volumez'}, inplace=True)
print(totalHist)
Datez Volumez
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Finally, convert to datetime
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'])
print(totalHist.dtypes)
Datez datetime64[ns]
Volumez int64
dtype: object
This was done with Python 3.6.7 and pandas 0.23.4.
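A likely explanation for the original KeyError (my reading, not stated in the answer above): assigning into totalHist.columns.values mutates the Index's underlying array without rebuilding its lookup machinery, so the new label 'Datez' cannot be found afterwards. Two safer ways to rename after the groupby:

# rename selected columns by mapping old -> new names
totalHist = totalHist.rename(columns={'Date': 'Datez', 'Trading_Value': 'Volumez'})
# or replace the whole columns Index in one assignment
totalHist.columns = ['Datez', 'Volumez']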

grouping by weekly days pandas

I have a dataframe, df, containing
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum eventCount over all occurrences of each weekday and plot the total events per weekday (from MON to SUN), i.e. for example:
Summation of the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name.
df = df.groupby(df['Date & Time'].dt.weekday_name)['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
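A compatibility note (an assumption about newer pandas versions, not part of the original answer): Series.dt.weekday_name has since been removed; dt.day_name() produces the same grouping key:

df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()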

Aggregating past and current values(monthly data) of Target column using pandas

I have dataframe like this below in pandas,
EMP_ID| Date| Target_GWP
1 | Jan-2017| 100
2 | Jan 2017| 300
1 | Feb-2017| 500
2 | Feb-2017| 200
and I need my output to be printed in the form below.
EMP_ID| Date| Target_GWP | past_Target_GWP
1 | Feb-2017| 600 |100
2 | Feb-2017| 500 |300
Basically, I have monthly data coming in from Excel and I want to aggregate Target_GWP for each EMP_ID against the latest (current) month, and create a backup column in the pandas dataframe for the previous month's Target_GWP. So how do I bring back the previous month's Target_GWP and add it to the current month's Target_GWP?
Any leads on this would be appreciated.
Use:
#convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
#sorting and get last 2 rows
df = df.sort_values(['EMP_ID','Date']).groupby('EMP_ID').tail(2)
#aggregation
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['sum','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 600 100
1 2 2017-02-01 500 300
Or, if you need the latest value of Target_GWP instead of the sum, use 'last':
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['last','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 500 100
1 2 2017-02-01 200 300
