Pandas deposits and withdrawals over a time period with n-number of people - python-3.x

I'm trying to dynamically build a format in which to display the number of deposits compared to withdrawals in a timeline chart. Whenever a deposit is made, the graph goes up, and when a withdrawal is made, the graph goes down.
This is how far I've gotten:
df.head()
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Peter 2019-03-08 2019-03-19
Peter 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Peter 2019-03-14 2019-04-05
Here is the data manipulation to show the net movements for one person, Peter:
x = pd.Series(df.groupby('Deposits').size())
y = pd.Series(df.groupby('Withdrawals').size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(Peter=balance.net_mov.cumsum())
print(balance)
net_mov Peter
2019-03-07 1 1
2019-03-08 1 2
2019-03-11 -1 1
2019-03-12 2 3
2019-03-14 1 4
This works perfectly fine, and this is the format that I want. Now let's say I want to extend this beyond just Peter's deposits and withdrawals and add an arbitrary number of people. Let's assume that my dataframe looks like this:
df2.head()
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Anna 2019-03-08 2019-03-19
Anna 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Simon 2019-03-14 2019-04-05
The format I'm aiming for is shown below. I don't know beforehand which names will appear or how many columns there will be, so I can't hardcode the names or the number of columns; everything has to be generated dynamically.
net_mov1 Peter net_mov2 Anna net_mov3 Simon
2019-03-07 1 1 1 1 2 2
2019-03-08 1 2 2 3 -1 1
2019-03-11 -1 1 0 3 2 3
2019-03-12 2 3 -2 1 4 7
2019-03-14 1 4 3 4 -1 6
UPDATE:
First off, thanks for the help. I'm getting closer to my goal. This is the progress:
x = pd.Series(df.groupby(['Created', 'name']).size())
y = pd.Series(df.groupby(['Finished', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())
balance_byname = balance.groupby('name')
balance_byname.get_group("Peter")
Output:
net_mov balance
name Created Finished
Peter 2017-07-03 2017-07-06 1 1
2017-07-10 1 2
2017-07-13 0 2
2017-07-14 1 3
... ... ...
2020-07-29 2020-07-15 0 4581
2020-07-17 0 4581
2020-07-20 0 4581
2020-07-21 -1 4580
[399750 rows x 2 columns]
This is of course far too many rows; the dataset I'm working with has only around 2500 rows.
I've tried to unstack it, but that creates problems of its own.

Given df:
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Anna 2019-03-08 2019-03-19
Anna 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Simon 2019-03-14 2019-04-05
You can melt the dataframe, encode deposits as 1 and withdrawals as -1, and then pivot:
df = pd.DataFrame(
    {'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
     'Deposits': {0: '2019-03-07',
                  1: '2019-03-08',
                  2: '2019-03-12',
                  3: '2019-03-12',
                  4: '2019-03-14'},
     'Withdrawals': {0: '2019-03-11',
                     1: '2019-03-19',
                     2: '2019-05-22',
                     3: '2019-10-31',
                     4: '2019-04-05'}})

# use pivot_table with a sum aggregate rather than pivot,
# because there may be duplicate (date, name) pairs in the data
df2 = (df.melt('name')
         .assign(variable=lambda x: x.variable.map({'Deposits': 1, 'Withdrawals': -1}))
         .pivot_table(values='variable', index='value', columns='name', aggfunc='sum')
         .fillna(0)
         .rename(columns=lambda c: f'{c} netmov'))
The above gives the net change of balance per date:
name Anna netmov Peter netmov Simon netmov
value
2019-03-07 0.0 1.0 0.0
2019-03-08 1.0 0.0 0.0
2019-03-11 0.0 -1.0 0.0
2019-03-12 1.0 1.0 0.0
2019-03-14 0.0 0.0 1.0
2019-03-19 -1.0 0.0 0.0
2019-04-05 0.0 0.0 -1.0
2019-05-22 -1.0 0.0 0.0
2019-10-31 0.0 -1.0 0.0
Finally, calculate the balance using a cumulative sum and concatenate it with the previously calculated net changes:
df2 = (pd.concat([df2, df2.cumsum().rename(columns=lambda c: c.split()[0] + ' balance')],
                 axis=1)
         .sort_index(axis=1))
Result:
name Anna balance Anna netmov ... Simon balance Simon netmov
value ...
2019-03-07 0.0 0.0 ... 0.0 0.0
2019-03-08 1.0 1.0 ... 0.0 0.0
2019-03-11 1.0 0.0 ... 0.0 0.0
2019-03-12 2.0 1.0 ... 0.0 0.0
2019-03-14 2.0 0.0 ... 1.0 1.0
2019-03-19 1.0 -1.0 ... 1.0 0.0
2019-04-05 1.0 0.0 ... 0.0 -1.0
2019-05-22 0.0 -1.0 ... 0.0 0.0
2019-10-31 0.0 0.0 ... 0.0 0.0
[9 rows x 6 columns]
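Since the end goal is a timeline chart, here is a minimal plotting sketch (assuming matplotlib is installed; the column selection relies on the ' balance' suffix used above):
import matplotlib.pyplot as plt

df2.filter(like='balance').plot(drawstyle='steps-post')  # only the running-balance columns
plt.ylabel('balance')
plt.show()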

Try making use of a pandas MultiIndex. This is almost the same code as in your question, but with two changes: the name column is included in the groupby argument, and a .groupby('name') call is added before cumsum in the last line.
With the code:
x = pd.Series(df.groupby(['Deposits', 'name']).size())
y = pd.Series(df.groupby(['Withdrawals', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())
The groupby in the last line effectively tells pandas to treat each name as a separate dataframe before applying cumsum, so movements are accumulated separately for each account.
Now you can keep it in this shape, with only two columns and the name as a second level of the row MultiIndex. You can create a groupby object by calling
balance_byname = balance.groupby('name') # notice there is no aggregation nor transformation
to be used whenever you need to access a single account with .get_group(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.get_group.html#pandas.core.groupby.GroupBy.get_group
Alternatively, you can add a new line at the end:
balance = balance.unstack('name')
This gives a shape similar to your expected output. It will, however, possibly create a number of NaN values, since every date is combined with every name. That can drastically increase memory usage if there are many dates and many names, with each name having movements on only a few dates.
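For completeness, a minimal sketch (assuming the balance frame built above, and that the 'name' index level keeps its name through the subtraction) that turns the unstacked data into the wide layout from the question, treating dates without activity as zero net movement and rebuilding the running balance per person:
# net movement per name per date; missing combinations become 0
net = balance['net_mov'].unstack('name').fillna(0)
# running balance per person, placed next to the net movements
wide = pd.concat([net.add_suffix(' net_mov'),
                  net.cumsum().add_suffix(' balance')], axis=1).sort_index(axis=1)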

Related

Improving speed when using a for loop for each user group

Suppose we have the following dataset, where window_num is the desired output:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
The user number is represented by user1. The output is window_num, which is generated from the different_months and org_different_months columns. The different_months column is the difference in days between date[n] and date[n+1].
Previously, I was using groupby.apply to produce window_num; however, it became extremely slow as the dataset grew. The code was improved considerably by using shift on the entire dataset to calculate the different_months and org_different_months columns, and by sorting the entire dataset up front, as seen below:
data = data.sort_values(by=['user', 'ContractInceptionDateClean'], ascending=[True, True])
#data['user1'] = data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)
                                 - data['ContractInceptionDateClean']).dt.days)).fillna(0)
data.loc[data['different_months'] < 91, 'different_months'] = 0
data['shift_different_months'] = data['different_months'].shift(1)
data['org_different_months'] = data['different_months']
data.loc[(data['different_months'] == 0) | (data['shift_different_months'] == 0), 'different_months'] = 0
data = salesswindow_cal(data, list(data.user.unique()))
The code that I am currently struggling to improve the speed on is shown below:
def salesswindow_cal(data_, users):
    temp = pd.DataFrame()
    for u in range(0, len(users)):
        df = data_[data_['user'] == users[u]]
        df['different_months'].values[0] = df['org_different_months'].values[0]
        df['window_num'] = (df['different_months'].diff() != 0).cumsum()
        temp = pd.concat([df, temp], axis=0)
    return pd.DataFrame(temp)
A rule of thumb is not to loop through the users and extract df = data_[data_['user'] == user]. Instead, use groupby:
for u, df in data_.groupby('user'):
    do_some_stuff
Another issue is concatenating data iteratively; collect the pieces in a list and concatenate once at the end:
data_out = []
for user, df in data.groupby('user'):
    do_some_stuff
    data_out.append(sub_data)
out = pd.concat(data_out)
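A self-contained toy sketch of this collect-then-concat pattern (the per-group step here is just an illustrative window_num calculation on made-up data, not the full salesswindow logic):
import pandas as pd

data = pd.DataFrame({'user': [1, 1, 1, 2],
                     'different_months': [0.0, 0.0, 300.0, 427.0]})

pieces = []
for user, df in data.groupby('user'):
    df = df.copy()
    df['window_num'] = (df['different_months'].diff() != 0).cumsum()
    pieces.append(df)

out = pd.concat(pieces)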
In your case, you can write a function and use groupby().apply(); pandas will concatenate the data for you:
def group_func(df):
    d = df.copy()
    d['different_months'].values[0] = d['org_different_months'].values[0]
    d['window_num'] = d['different_months'].diff().ne(0).cumsum()
    return d
data.groupby('user').apply(group_func)
Update:
Let's try this vectorized approach, which modifies your data in place:
# update the first `different_months` of each user
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']
groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = groups['diff'].cumsum()
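A minimal check on toy data (hypothetical values, not the original dataset) that these vectorized steps reproduce window_num: it starts at 1 for each user and increments whenever different_months changes from that user's previous row:
import pandas as pd

toy = pd.DataFrame({'user': [1, 1, 1, 2, 2],
                    'different_months': [0.0, 0.0, 300.0, 427.0, 0.0]})
toy['window_num'] = (toy.groupby('user')['different_months']
                        .diff().ne(0)          # True on each user's first row and on every change
                        .groupby(toy['user'])
                        .cumsum())
print(toy)
#    user  different_months  window_num
# 0     1               0.0           1
# 1     1               0.0           1
# 2     1             300.0           2
# 3     2             427.0           1
# 4     2               0.0           2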

how to update rows based on previous row of dataframe python

I have time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data; I have included a simplified version with just the two columns {price, amount}. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes of each product based on its time index. If a previous date does not exist for a given product, I add NaN.
Is there a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
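If the original price/amount values need to stay alongside the relative changes, a small variation (the "_change" suffix is just an example name) adds new columns instead of overwriting:
# per-product differences as separate columns, joined back on the index
changes = df.groupby("product")[["price", "amount"]].diff().add_suffix("_change")
df = df.join(changes)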

Python - Create copies of rows based on column value and increase date by number of iterations

I have a dataframe in Python:
md
Out[94]:
Key_ID ronDt multidays
0 Actuals-788-8AA-0001 2017-01-01 1.0
11 Actuals-788-8AA-0012 2017-01-09 1.0
20 Actuals-788-8AA-0021 2017-01-16 1.0
33 Actuals-788-8AA-0034 2017-01-25 1.0
36 Actuals-788-8AA-0037 2017-01-28 1.0
... ... ...
55239 Actuals-789-8LY-0504 2020-02-12 1.0
55255 Actuals-788-T11-0001 2018-08-23 8.0
55257 Actuals-788-T11-0003 2018-09-01 543.0
55258 Actuals-788-T15-0001 2019-02-20 368.0
55259 Actuals-788-T15-0002 2020-02-24 2.0
I want to create an additional record for every multiday and increase the date (ronDt) by the number of times that record has been duplicated.
For example:
row[0] would be repeated one time, with the new date reading 2017-01-02.
row[55255] would be repeated 8 times, with the corresponding dates ranging from 2018-08-24 to 2018-08-31.
When I did this in VBA, I used loops, and in Alteryx I used multirow functions. What is the best way to achieve this in Python? Thanks.
Here's a way to do it in pandas:
# get the list of dates covered by each row (one date per multiday)
df['datecol'] = df.apply(lambda x: pd.date_range(start=x['ronDt'], periods=int(x['multidays']), freq='D'), axis=1)
# convert each list of dates into separate rows
df = df.explode('datecol').drop('ronDt', axis=1)
# rename the columns
df.rename(columns={'datecol': 'ronDt'}, inplace=True)
print(df)
Key_ID multidays ronDt
0 Actuals-788-8AA-0001 1.0 2017-01-01
1 Actuals-788-8AA-0012 1.0 2017-01-09
2 Actuals-788-8AA-0021 1.0 2017-01-16
3 Actuals-788-8AA-0034 1.0 2017-01-25
4 Actuals-788-8AA-0037 1.0 2017-01-28
.. ... ... ...
8 Actuals-788-T15-0001 368.0 2020-02-20
8 Actuals-788-T15-0001 368.0 2020-02-21
8 Actuals-788-T15-0001 368.0 2020-02-22
9 Actuals-788-T15-0002 2.0 2020-02-24
9 Actuals-788-T15-0002 2.0 2020-02-25
# Get the count of duplicates for each row, which corresponds to the multidays column
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'multidays'})
# Assume ronDt dtype is str, so convert it to a datetime object,
# then add the multidays column to ronDt
df['ronDt_new'] = pd.to_datetime(df['ronDt']) + pd.to_timedelta(df['multidays'], unit='d')

Perform arithmetic operation mainly subtraction and division over a pandas series on null values

Simply put, when I subtract or divide with a null value, I want to get back the original value (digit), e.g. 3 / np.nan = 3 or 2 - np.nan = 2.
Using np.nansum and np.nanprod I have handled addition and multiplication, but I don't know how to do the same for subtraction and division.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c=a-b d=a/b
0 1 1.0 0.0 1.0
1 2 2.0 0.0 1.0
2 3 NaN 3.0 3.0
3 4 NaN 4.0 4.0
The above is what I am actually looking for.
# Use a fill value of 0 for the subtraction
df['c'] = df.a.sub(df.b, fill_value=0)
# Use a fill value of 1 for the division
df['d'] = df.a.div(df.b, fill_value=1)
IIUC, use sub with fill_value:
df.a.sub(df.b,fill_value=0)
Out[251]:
0 0.0
1 0.0
2 3.0
3 4.0
dtype: float64
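The division case works the same way; a minimal sketch with fill_value=1, so a missing denominator behaves like 1 (i.e. a / NaN -> a):
df.a.div(df.b, fill_value=1)
# 0    1.0
# 1    1.0
# 2    3.0
# 3    4.0
# dtype: float64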

Replace missing values based on another column

I am trying to replace the missing values in a dataframe based on filtering by another column, "Country".
>>> data.head()
Country Advanced skiers, freeriders Snow parks
0 Greece NaN NaN
1 Switzerland 5.0 5.0
2 USA NaN NaN
3 Norway NaN NaN
4 Norway 3.0 4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by country and then calculating the mean of each column. When I print out the resulting array, it shows the expected values. However, when I pass it into the .fillna() method, the data appears unchanged.
I've tried #DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
Can anyone explain what I am doing wrong, and how to fix the code?
You can use transform to return a new DataFrame with the same size as the original, filled with the aggregated values:
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
Advanced skiers, freeriders Snow parks
0 0.0 0.0
1 5.0 5.0
2 0.0 0.0
3 3.0 4.0
4 3.0 4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
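An equivalent one-liner sketch (using the same listOfRatings as above): fill each column's NaNs with its country mean inside a single transform, then fall back to 0 for countries whose values are entirely NaN:
data[listOfRatings] = (data.groupby('Country')[listOfRatings]
                           .transform(lambda s: s.fillna(s.mean()))
                           .fillna(0))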
