Create Multiple Dataframes using Loop & function - python-3.x

I have a df of over 1M rows, similar to this:
ID Date Amount
x May 1 10
y May 2 20
z May 4 30
x May 1 40
y May 1 50
z May 2 60
x May 1 70
y May 5 80
a May 6 90
b May 8 100
x May 10 110
I have to sort the data by Date and then create new dataframes depending on how many times a value is present in the Amount column. So if x has made a purchase 3 times, I need it in 3 different dataframes. The first_purchase dataframe would have every ID that has purchased even once, irrespective of date or amount.
If an ID purchases 3 times, I need that ID to be in first purchase, then second, and then third, with Date and Amount.
Doing it manually is easy with:
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
The second dataframe would be created with:
after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]
How do I create a loop that provides me with each of these dataframes?

IIUC, I was able to achieve what you wanted.
import pandas as pd
import numpy as np
# source data for the dataframe
data = {
    "ID": ["x","y","z","x","y","z","x","y","a","b","x"],
    "Date": ["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
    "Amount": [10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data)
# convert the Date column to datetime and still maintain the format like "May 01"
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original Dataframe:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 40 May 01 x
4 70 May 01 x
5 110 May 10 x
6 50 May 01 y
7 20 May 02 y
8 80 May 05 y
9 60 May 02 z
10 30 May 04 z
# create a list of unique ids
list_id = sorted(set(df['ID']))
# create an empty list that will contain the dataframes
df_list = []
# number of purchases that must be separated out:
# for example, if we want to record 3 entries for each id,
# n_iter would be 3 (named n_iter rather than iter, which
# would shadow the builtin). This will create three new
# dataframes that hold the respective transactions.
n_iter = 3
for i in range(n_iter):
    df_list.append(pd.DataFrame())
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top n_iter(=3) rows to be distributed
    counter = np.minimum(tmp_df.shape[0], n_iter)
    for idx in range(counter):
        # DataFrame.append is deprecated in recent pandas; use pd.concat
        df_list[idx] = pd.concat([df_list[idx], tmp_df.loc[tmp_df.index == idx]])
for out_df in df_list:
    out_df.reset_index(drop=True, inplace=True)
    print(out_df)
Transaction #1:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 50 May 01 y
4 60 May 02 z
Transaction #2:
Amount Date ID
0 40 May 01 x
1 20 May 02 y
2 30 May 04 z
Transaction #3:
Amount Date ID
0 70 May 01 x
1 80 May 05 y
Note that in your data there are four transactions for 'x'. If, let's say, you wanted to track the 4th transaction as well, all you need to do is change the value of n_iter to 4 and you will get a fourth dataframe with the following values:
Amount Date ID
0 110 May 10 x
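As a side note, the repeated drop_duplicates steps can also be expressed with groupby().cumcount(), which ranks each ID's purchases after sorting by date. A compact alternative sketch (the same idea, not the code above):
# rank each ID's purchases: 0 = first, 1 = second, ...
df = df.sort_values('Date')
df['purchase_no'] = df.groupby('ID').cumcount()
# one dataframe per purchase rank; purchase_dfs[0] is first_purchase, etc.
purchase_dfs = [g.drop(columns='purchase_no').reset_index(drop=True)
                for _, g in df.groupby('purchase_no')]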

Related

Use Switch/Case Statement to build DF2, by Iterating Over Rows in DF1

I've loaded data from a tab-delimited file into a DF. The tab data is a form filled out with a template.
A critical concept is that a variable number of rows makes up one entry in the form. In DF1 below, every time the index is "A", a new record is starting. So the code will need to iterate through the rows to rebuild each record in DF2. Each record will be represented as one row in DF2.
Based on the fact that each "A" row in DF1 starts a new form entry (and a corresponding row in DF2), we can see in DF1 below that there are just two entries in my example, so there will be just two rows in DF2. Also important: there is a different number of pieces of data (columns) in each row. Z has 2 (then NAs), A has 3, B has 4.
All of this needs to be mapped to DF2 depending on the index letters Z, A, B (note there are more index letters but this is simplified for this example).
DF 1
- A B C D
Z xyz 5 NA NA
A COA aa bb NA
B RE 01 02 03
B DE 04 05 06
A COB dd ee NA
B RE 01 02 03
B DE 04 05 06
In the past I've done this type of thing in VBA and would have used a CASE statement to transform the data. I've found a good start using dictionaries in this thread:
Replacements for switch statement in Python?
One code example in the above thread suggests using a dictionary as a case statement:
return {
    'a': 1,
    'b': 2,
}[x]
This seems like it would work, although I'm not certain how to execute it in practice. In addition, for each A, B, etc. above, I need to output multiple instructions, depending on the index letter. For the most part, the instructions are where to map in DF2. For example, in my:
Index A:
Map column A to DF2.iloc[1]['B']
Map column B to DF2.iloc[1]['C']
Map column C to DF2.iloc[1]['D']
Index B:
Would have four instructions, similar to above.
DF2 would end up looking like so
- A B C D E F G H I J K L
1 xyz COA aa bb RE 01 02 03 DE 04 05 06
2 xyz COB dd ee RE 01 02 03 DE 04 05 06
So for each row in DF1, a different number of instructions is performed depending on the "index letter." All instructions tell the code where to put the data in DF2. The mapping instructions for each index letter will always be the same for the columns; only the row will change (some type of counter as you move from one record group to the next in DF2).
How can I handle the different number of instructions for each type of index letter in a switch/case type format?
Thank you
I think you can use:
import numpy as np

# filter only rows with index 2 or 3
df1 = df[df.index.isin([2,3])].copy()
# create new column: same value if 2 is in the index
df1['new'] = np.where(df1.index == 2, 'Z', df1.A)
# create groups by comparing against 2
df1['g'] = (df1.index == 2).cumsum()
# convert columns to index and reshape, then change order
df1 = (df1.set_index(['g','new']).unstack()
          .swaplevel(0,1, axis=1)
          .sort_index(axis=1, ascending=[False, True]))
# default column names
df1.columns = range(len(df1.columns))
print(df1)
0 1 2 3 4 5 6 7 8 9 10 11
g
1 ABC aa bb cc R 01 02 NaN D NaN 03 04
2 DEF dd ee ff R 01 02 NaN D NaN 03 04
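For reference, the dictionary-as-switch idea from the question can also be made to work directly. Below is a rough sketch, assuming (hypothetically) that the Z row's first value is shared by every record, that each 'A' row opens a new record, and that 'B' rows extend the current one; the handler names are invented for illustration:
import pandas as pd

def handle_Z(vals, state):
    state['shared'] = vals[:1]                 # e.g. ['xyz']; the trailing 5 is ignored here

def handle_A(vals, state):
    state['records'].append(state['shared'] + vals)   # start a new record

def handle_B(vals, state):
    state['records'][-1].extend(vals)          # extend the current record

handlers = {'Z': handle_Z, 'A': handle_A, 'B': handle_B}
state = {'shared': [], 'records': []}
for letter, row in df1.iterrows():             # df1 indexed by the letters Z/A/B
    handlers[letter](row.dropna().tolist(), state)
df2 = pd.DataFrame(state['records'], columns=list('ABCDEFGHIJKL'))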

Reshape a pandas DataFrame using combination of row values in two columns

I have data for multiple customers in a data frame as below:
Customer_id event_type month mins_spent
1 live CM 10
1 live CM1 10
1 catchup CM2 20
1 live CM2 30
2 live CM 45
2 live CM1 30
2 catchup CM2 20
2 live CM2 20
I need the result data frame to have one row for each customer, where each column is a combined value of the month and event_type columns and the value is mins_spent. Result data frame as below:
Customer_id CM_live CM_catchup CM1_live CM1_catchup CM2_live CM2_catchup
1 10 0 10 0 30 20
2 45 0 30 0 20 20
Is there an efficient way to do this instead of iterating over the input data frame and building the new data frame?
You can use pivot_table:
import numpy as np

# pivot your data frame
p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'], aggfunc=np.sum)
# flatten multi indexed columns with list comprehension
p.columns = ['_'.join(col) for col in p.columns]
CM_live CM1_live CM2_catchup CM2_live
Customer_id
1 10 10 20 30
2 45 30 20 20
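Note that pivot_table only creates columns for (month, event_type) pairs that actually occur, which is why the zero-filled CM_catchup and CM1_catchup columns from the expected output are missing here. A sketch of one way to add them: reindex against the full month/event_type product before the column-flattening step:
full_cols = pd.MultiIndex.from_product(
    [df['month'].unique(), df['event_type'].unique()])
p = p.reindex(columns=full_cols, fill_value=0)  # adds the all-zero combinations
p.columns = ['_'.join(col) for col in p.columns]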
You can create a new column (key) by concatenating columns month and event_type, and then use pivot() to reshape your data.
(df.assign(key = lambda d: d['month'] + '_' + d['event_type'])
   .pivot(
       index='Customer_id',
       columns='key',
       values='mins_spent'
   ))
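If a customer has no row for some key, pivot leaves NaN in that cell; chaining .fillna(0) turns those into zeros (keys that never occur at all still won't get a column, as noted for pivot_table above):
(df.assign(key=lambda d: d['month'] + '_' + d['event_type'])
   .pivot(index='Customer_id', columns='key', values='mins_spent')
   .fillna(0))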

Alternative to looping? Vectorisation, cython?

I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product needs to be delivered.
'First_Year_Del' tells you how many will be delivered in the first year. After this the delivery rate reverts to 'Del_rate', a flat rate that can be applied each year until all products are delivered.
The 'Yr_to_Use' column tells you the first year column to begin delivery from.
EXAMPLE: Ref1 has 100 to deliver. It will start delivering in 2020 and will deliver 5 in the first year, and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought I might use something like the below to reference which columns to use in turn, but I'm not even sure if that's helpful or not as it will depend on the solution (in the proper version, base_date.year is defined as the first column in the table, 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to Python and am not sure if I'm getting ahead of myself a bit...
The way I would think to go about it would be to use a for loop, or something using iterrows, but other posts seem to say this is a bad idea and that I should be using vectorisation, cython or lambdas. Of those 3 I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to suggest doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. It still does the looping in the script itself (not "externalized" to some numpy/pandas function), but should be fine for 5k rows, I'd guesstimate.
import pandas as pd
import numpy as np
# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 10],
                   'Del_rate': [10, 5, 16, 18, 30]})
# get number of full rates + remainder
n, r = np.divmod((df['Total'] - df['First_Year_Del']), df['Del_rate'])
# get the year of the last rate considering all rows
# (plain bool instead of the deprecated np.bool)
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])
# get the offsets for the start of delivery, year zero is 2019;
# subtracting the year zero lets you use this as an index
offset = df['Yr_to_Use'] - 2019
# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year + 1)
# prepare an n*m array to hold the rates for all years, initialized with zeros
# (n: number of rows of the df, m: number of years in which rates will be paid)
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))
# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: the first-year rate, all full yearly rates, and a final rate if there was a remainder
    if r[i]:  # if the rest is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]], [r[i]]])
    else:  # rest is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array
    out[i, offset[i]:offset[i]+rates.shape[0]] = rates
# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
You can accomplish this using a few user-defined functions and the apply method:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'id': ['ref1','ref2','ref3','ref4','ref5'],
                        'Total': [100, 20, 30, 40, 10],
                        'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
                        'First_Year_Del': [5, 2, 7, 9, 4],
                        'Del_rate': [10, 5, 16, 18, 30]})

def f(r):
    '''
    Computes the values per year and the respective years
    '''
    n = (r['Total'] - r['First_Year_Del']) // r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del']) % r['Del_rate']
    r['values'] = [r['First_Year_Del']] + [r['Del_rate'] for _ in range(n)]
    if leftover:  # only append a final partial delivery if there is a remainder
        r['values'].append(leftover)
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
    return r

df = df.apply(f, axis=1)

def get_year_range(r):
    '''
    Computes the min and max year for each row
    '''
    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r

df = df.apply(get_year_range, axis=1)
y_min = df['y_min'].min()
y_max = df['y_max'].max()

# Initialize each year's column to zero
for year in range(y_min, y_max+1):
    df[year] = 0

def expand(r):
    '''
    Update the value for each year
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v
    return r

# Apply and drop temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)
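Since the question explicitly asks about vectorisation: the per-row work can also be done with numpy broadcasting alone, with no Python loop or apply. A sketch, assuming a df with the same columns as above and 2019 as the first year column; the idea is to build each row's cumulative delivery schedule as a linear ramp clipped at Total, then take first differences to recover the per-year amounts:
import numpy as np
import pandas as pd

# number of full-rate years needed after the first-year delivery
n_pay = np.ceil((df['Total'] - df['First_Year_Del']) / df['Del_rate']).astype(int)
m = int((df['Yr_to_Use'] - 2019 + n_pay).max()) + 1        # number of year columns

j = np.arange(m)[None, :]                                  # (1, m) year grid
offset = (df['Yr_to_Use'] - 2019).to_numpy()[:, None]      # (n, 1) start offsets
first = df['First_Year_Del'].to_numpy()[:, None]
rate = df['Del_rate'].to_numpy()[:, None]
total = df['Total'].to_numpy()[:, None]

steps = j - offset                        # years since delivery started (negative = not started)
# cumulative amount delivered by the end of each year, clipped at Total
cum = np.where(steps >= 0, np.minimum(total, first + rate * steps), 0)
yearly = np.diff(cum, axis=1, prepend=0)  # per-year deliveries
df = pd.concat([df, pd.DataFrame(yearly, index=df.index,
                                 columns=np.arange(2019, 2019 + m).astype(str))],
               axis=1)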

Sample dataframe with number of records sampled per hour predefined

I have to sample a dataframe (df1) and I have another dataframe (df2) that tells me how many records I should retrieve from each hour of the day.
For example,
df1:
Hour number
0. 00 A
1. 00 B
2. 00 C
3. 01 D
4. 01 A
5. 01 B
6. 01 D
df2:
Hour number
0. 00 1
1. 01 2
So that in the end I would get, for example, record number 1 for midnight and records 3 and 5 for 1 am (or any other combination, so long as it respects the numbers in df2).
The thing is that I need to write this as a function so that I can call it inside another function.
So far I have
def sampling(frame):
    return np.random.choice(frame.index)
but I am failing to add the constraints from df2.
Could anybody help?
First we add the number of samples required as a new column using merge, then apply sample to each group of Hour values. Finally we remove the added column by returning all but the last column:
def sampling(df1, df2):
    return (df1.merge(df2, on='Hour')
               .groupby('Hour')
               .apply(lambda x: x.sample(x.Number.iloc[0]))
               .reset_index(level=0, drop=True)
               .iloc[:, :-1])
df1 = pd.DataFrame({'Hour': [0,0,0,1,1,1,1], 'Value': list('ABCDABD')})
df2 = pd.DataFrame({'Hour': [0,1], 'Number': [1,2]})
sampling(df1, df2)
Result:
Hour Value
2 0 C
4 1 A
5 1 B
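A merge-free variant (just a sketch) that avoids both the helper column and the index cleanup: look up the per-hour count directly and keep the original index with group_keys=False:
def sampling(df1, df2):
    counts = df2.set_index('Hour')['Number']   # per-hour sample sizes
    return (df1.groupby('Hour', group_keys=False)
               .apply(lambda g: g.sample(counts[g.name])))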

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but I didn't get my expected output. Is there any way to get my expected output?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = pd.pivot(df.index, df['Term'], df['value'])
print(df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name and Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
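Since pivot_table was already attempted: it does work here once Name is moved out of the index, and it covers the duplicate case from the EDIT as well; a sketch:
df = (df.reset_index()
        .pivot_table(index='Name', columns='Term', values='value',
                     aggfunc='sum', fill_value=0))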
