Pandas - Merging a grouped column to another dataframe - python-3.x

One of my dataframes contains columns
WR K ID
SP-RS-001 K001
SP-RS-001 K002
SP-RS-001 K006
SP-RS-002 K002
SP-RS-002 K007
SP-RS-002 K008
and the other has [EDIT]
U Code CO Code K ID
C001 C001.01 K001
C001 C001.02 K002
C001 C001.03 K006
C002 C002.01 K001
C002 C002.02 K006
I need another column in this dataframe which gives
U Code K ID WR
C001 K001, K002, K006 SP-RS-001, SP-RS-002
C002 K001, K006 SP-RS-001
C003 K002, K007 SP-RS-001, SP-RS-002
How can I do that? Thanks! :)

First of all, I'm assuming the C003 entry in your original question was a mistake. I believe the following will work for you. It wasn't apparent which type of merge you wanted, so I assumed an inner merge.
Load Dataframe:
import pandas as pd
df1 = pd.DataFrame({'WR': ['SP-RS-001', 'SP-RS-001', 'SP-RS-001', 'SP-RS-002', 'SP-RS-002', 'SP-RS-002'],
                    'K_ID': ['K001', 'K002', 'K006', 'K002', 'K007', 'K008']})
df2 = pd.DataFrame({'U_Code': ['C001', 'C001', 'C001', 'C002', 'C002'],
                    'CO_Code': ['C001.01', 'C001.02', 'C001.03', 'C002.01', 'C002.02'],
                    'K_ID': ['K001', 'K002', 'K006', 'K001', 'K006']})
Merge on K_ID:
df = df2.merge(df1, on='K_ID', how='inner')[['U_Code', 'K_ID', 'WR']]
This gives us:
and finally, a groupby on U_Code with the following aggregating function:
def f(x):
    # join the distinct K_ID and WR values of each group into one string
    return pd.Series(dict(K_ID=', '.join(x['K_ID'].unique()),
                          WR=', '.join(x['WR'].unique())))
df = df.groupby(['U_Code']).apply(f)
Which gives us:
Hope this helps.

I think you are looking for this:
df3 = df1.merge(df2, on='K ID')
df4 = df3.groupby('U Code')[['K ID', 'WR']].agg({'K ID': lambda x: ','.join(x), 'WR': lambda x: ','.join(x)})
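For anyone reading later, here is a minimal sketch of the same merge-then-group idea using named aggregation (available since pandas 0.25). It reuses the underscore column names from the first answer; `join_unique` is a hypothetical helper name of mine, not something from pandas, and the inner merge is the same assumption made above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'WR': ['SP-RS-001', 'SP-RS-001', 'SP-RS-001', 'SP-RS-002', 'SP-RS-002', 'SP-RS-002'],
    'K_ID': ['K001', 'K002', 'K006', 'K002', 'K007', 'K008'],
})
df2 = pd.DataFrame({
    'U_Code': ['C001', 'C001', 'C001', 'C002', 'C002'],
    'K_ID': ['K001', 'K002', 'K006', 'K001', 'K006'],
})

def join_unique(values):
    # hypothetical helper: drop duplicates (keeping first-seen order) and join
    return ', '.join(dict.fromkeys(values))

# assumption: an inner merge, so K_IDs missing from either frame are dropped
merged = df2.merge(df1, on='K_ID', how='inner')

# one row per U_Code, with the distinct K_ID and WR values joined as strings
result = (merged.groupby('U_Code', as_index=False)
                .agg(K_ID=('K_ID', join_unique), WR=('WR', join_unique)))
print(result)
```

Unlike the plain `','.join` lambdas, this removes repeated WR values within a group, which seems to be what the desired output in the question shows.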

Related

Transpose or consolidate Dataframe

Got a tricky situation. I tried my best with pivot and other methods but gave up. Please help if possible.
For each column, I'd like to find the row where the value is 1 and put that row's Date in the column.
After this mapping, the 'Date' field is no longer needed, so I am OK with deleting it.
My sample dataset:
df1 = pd.DataFrame({'Patient': ['John','John','John','Smith','Smith','Smith'],
'Date': [20200101, 20200102, 20200105,20220101, 20220102, 20220105],
'Ibrufen': ['NaN','NaN',1,'NaN','NaN',1],
'Tylenol': [1, 'NaN','NaN',1, 'NaN','NaN'],
})
My desired output:
df2 = pd.DataFrame({'Patient': ['John','Smith'],
'Ibrufen': ['20200105','20220105'],
'Tylenol': ['20200101','20220101'],
'Steroid': ['20200102','20220102'],
})
A possible solution, based on the idea of first creating an auxiliary column containing, for each row, the corresponding medicine:
df1['aux'] = df1.apply(lambda x:
'Ibrufen' if (x['Ibrufen'] == 1) else
'Tylenol' if (x['Tylenol'] == 1) else
'Steroid', axis=1)
(df1.pivot(index='Patient', columns='aux', values='Date')
.reset_index()
.rename_axis(None, axis=1))
Output:
Patient Ibrufen Steroid Tylenol
0 John 20200105 20200102 20200101
1 Smith 20220105 20220102 20220101
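The same auxiliary-column idea can also be sketched without a row-wise apply, using `numpy.select`. This is only a variation on the answer above under the same assumption: any row with no 1 in 'Ibrufen' or 'Tylenol' is treated as 'Steroid'. The names `conditions` and `wide` are mine.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Patient': ['John', 'John', 'John', 'Smith', 'Smith', 'Smith'],
                    'Date': [20200101, 20200102, 20200105, 20220101, 20220102, 20220105],
                    'Ibrufen': ['NaN', 'NaN', 1, 'NaN', 'NaN', 1],
                    'Tylenol': [1, 'NaN', 'NaN', 1, 'NaN', 'NaN']})

# one boolean mask per medicine column; rows matching neither fall to the default
conditions = [df1['Ibrufen'].eq(1), df1['Tylenol'].eq(1)]
df1['aux'] = np.select(conditions, ['Ibrufen', 'Tylenol'], default='Steroid')

# spread the dates into one column per medicine
wide = (df1.pivot(index='Patient', columns='aux', values='Date')
           .reset_index()
           .rename_axis(None, axis=1))
print(wide)
```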

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track but the underlying data in the DataFrame is 'nan'
thanks in advance!
Two things to notice: you want the command set_axis rather than reindex, and sorting by the original column order will ensure the correct label is assigned to the correct column (this is done in the sorted... key= bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
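An alternative sketch that avoids the sorting step: turn the tuples into a lookup from column name to top-level label, then build the MultiIndex in the DataFrame's own column order. It relies on the assumption stated in the question that every column of df appears in coltuples; `top_level`, `labelled` and `grouped` are names I made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 5)),
                  columns=['nyc', 'london', 'canada', 'chile', 'earth'])
coltuples = [('cities', 'nyc'), ('countries', 'canada'), ('countries', 'usa'),
             ('countries', 'chile'), ('planets', 'earth'), ('planets', 'mars'),
             ('cities', 'sf'), ('cities', 'london')]

# map each second-level name to its top-level label
top_level = {name: group for group, name in coltuples}

# build the tuples in df's column order, so labels line up with the data
multi_index = pd.MultiIndex.from_tuples(
    [(top_level[col], col) for col in df.columns], names=['level1', 'level2'])
labelled = df.set_axis(multi_index, axis=1)

# optional: sort so that columns sharing a top level sit next to each other
grouped = labelled.sort_index(axis=1)
print(grouped.head())
```

Extra pairs such as ('countries', 'usa') are simply never looked up, so they do not need to be filtered out first.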

How to create multiple dataframes using multiple functions

I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='M'), 10000)})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
    y = pd.concat([
        DF1.assign(**{x:'[Total]' for x in myList[i:]})\
        .groupby(myList).agg(sumz = (agg_column,'sum')) for i in range(1,len(myList)+1)]).sort_index().unstack(0)
    return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question: is there a better way to perform this task than writing out df1, df2, df3 with a separate function call for each? Should I be looking at a dictionary or some other data type to do this more Pythonically?
A dictionary would be my first choice:
variations = ([('Units_Sold', list_one), ('Dollars_Sold',list_two),
               ..., ('Title', some_list)])
df_variations = {}
for i, v in enumerate(variations):
    name = v[0]
    data = v[1]
    df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique, helpful titles for the variations, going beyond something like 'Units_Sold', which isn't unique in your case.
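To illustrate the point about descriptive keys, here is a small self-contained sketch. `summarise` is a hypothetical stand-in for some_fun (a plain grouped sum, not the sub-total logic from the question), and the toy data and job names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Group A', 'Group A', 'Group B', 'Group B'],
    'Product': ['Product 1', 'Product 2', 'Product 1', 'Product 2'],
    'Units_Sold': [10, 20, 30, 40],
    'Dollars_Sold': [100, 200, 300, 400],
})

def summarise(frame, agg_column, group_cols):
    # hypothetical stand-in for some_fun: sum agg_column per group
    return frame.groupby(group_cols)[agg_column].sum().to_frame('sumz')

# descriptive, unique key -> (column to aggregate, grouping columns)
jobs = {
    'units_by_category': ('Units_Sold', ['Category']),
    'dollars_by_category_product': ('Dollars_Sold', ['Category', 'Product']),
    'units_by_product': ('Units_Sold', ['Product']),
}

# one dataframe per job, retrievable by name instead of df1, df2, df3
reports = {name: summarise(df, column, cols) for name, (column, cols) in jobs.items()}
print(reports['units_by_category'])
```

The keys can then double as worksheet names when the results are written out.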
IIUC, as Thomas has suggested, we can use a dictionary to organise your data. With some minor modifications to your function, the dictionary can hold all the required data, which is then passed through to your function.
The idea is to store two kinds of entries: the lists of columns, and the arguments to your pd.Grouper call.
data_dict = {
"Units_Sold": {"key": "Date", "freq": "A"},
"Dollars_Sold": {"key": "Date", "freq": "A"},
"col_list_1": ["Category", "Product"],
"col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
"col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary,column_list, *args):
    key = dictionary[agg_col]["key"]
    frequency = dictionary[agg_col]["freq"]
    myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
    y = (
        pd.concat(
            [
                dataframe.assign(**{x: "[Total]" for x in myList[i:]})
                .groupby(myList)
                .agg(sumz=(agg_col, "sum"))
                for i in range(1, len(myList) + 1)
            ]
        )
        .sort_index()
        .unstack(0)
    )
    return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
sumz
Date 2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X Product 1 18308 17839 18776
Product 2 18067 19309 18077
Product 3 17943 19121 17675
[Total] 54318 56269 54528
Y Product 1 20699 18593 18103
Product 2 18642 19712 17122
Product 3 17701 19263 20123
[Total] 57042 57568 55348
Z Product 1 19077 17401 19138
Product 2 17207 21434 18817
Product 3 18405 17300 17462
[Total] 54689 56135 55417
[Total] [Total] 166049 169972 165293
As you want to automate writing the 10+ worksheets, we can again do that with a dictionary that maps each aggregation column to its column lists:
matches = {'Units_Sold': ['col_list_1','col_list_3'],
           'Dollars_Sold' : ['col_list_2']}
Then a simple for loop writes every dataframe to its own worksheet in a single Excel file; change this to match your required behavior.
with pd.ExcelWriter('finished_excel_file.xlsx') as writer:
    for key, value in matches.items():
        for items in value:
            dataframe = some_fun(df, key, data_dict, items)
            dataframe.to_excel(writer, sheet_name=f'{key}_{items}')

How to apply a function with multiple arguments to a specific column in Pandas?

I'm trying to apply a function to a specific column in this dataframe
datetime PM2.5 PM10 SO2 NO2
0 2013-03-01 7.125000 10.750000 11.708333 22.583333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667
2 2013-03-03 76.916667 120.541667 61.291667 81.000000
3 2013-03-04 22.708333 44.583333 22.854167 46.187500
4 2013-03-06 223.250000 265.166667 116.236700 142.059383
5 2013-03-07 263.375000 316.083333 97.541667 147.750000
6 2013-03-08 221.458333 297.958333 69.060400 120.092788
I'm trying to apply this function(below) to a specific column(PM10) of the above dataframe:
range1 = [list(range(0,50)),list(range(51,100)),list(range(101,200)),list(range(201,300)),list(range(301,400)),list(range(401,2000))]
def c1_c2(x,y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a)+1
            return max_val - min_val
Where "x" can be any column and "y" = Range1
Available Options
df.PM10.apply(c1_c2,args(df.PM10,range1),axis=1)
df.PM10.apply(c1_c2)
I've tried these couple of available options and none of them seems to be working. Any suggestions?
Not sure what the expected output of the function is, but to get the function called you can try the following:
from functools import partial
df.PM10.apply(partial(c1_c2, y=range1))
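For reference, a short sketch of the other ways to hand the extra argument to `Series.apply`: the `args=` tuple, a keyword argument, and a lambda. It uses made-up integer values so that `x in a` can actually match, and only the first three sub-lists of range1 to keep it short.

```python
import pandas as pd

range1 = [list(range(0, 50)), list(range(51, 100)), list(range(101, 200))]

def c1_c2(x, y):
    # return the span of the first sub-list that contains x (None if no match)
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val

values = pd.Series([10, 60, 150, 50])  # 50 falls in the gap between sub-lists

via_args = values.apply(c1_c2, args=(range1,))        # extra positional argument
via_kwargs = values.apply(c1_c2, y=range1)            # extra keyword argument
via_lambda = values.apply(lambda x: c1_c2(x, range1)) # wrap the call explicitly
print(via_args)
```

Note that `args` must be a tuple, hence the trailing comma, and that `Series.apply` works element by element, so there is no axis argument to pass.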
Update:
Ok, I think I understand a little better. This should work, but 'range1' is a list of lists of integers. Your data doesn't have integers and the new column comes up empty. I created another list based on your initial data that works. See below:
df = pd.read_csv('pm_data.txt', header=0)
range1= [[7.125000,10.750000,11.708333,22.583333],list(range(0,50)),list(range(51,100)),list(range(101,200)),
list(range(201,300)),list(range(301,400)),list(range(401,2000))]
def c1_c2(x,y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a)+1
            return max_val - min_val
df['function']=df.PM10.apply(lambda x: c1_c2(x,range1))
print(df.head(10))
datetime PM2.5 PM10 SO2 NO2 new_column function
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000 16.458333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167 NaN
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083 NaN
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167 NaN
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333 NaN
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167 NaN
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917 NaN
Only the first item in 'function' has a match, because 'if x in a' tests for exact membership and 10.75 is one of the values I copied into range1 from your initial data.
Old Code:
I'm also not sure what you are doing. But you can use a lambda to modify columns or create new ones.
Like this,
import pandas as pd
I created a data file to import from the data you posted above:
datetime,PM2.5,PM10,SO2,NO2
2013-03-01,7.125000,10.750000,11.708333,22.583333
2013-03-02,30.750000,42.083333,36.625000,66.666667
2013-03-03,76.916667,120.541667,61.291667,81.000000
2013-03-04,22.708333,44.583333,22.854167,46.187500
2013-03-06,223.250000,265.166667,116.236700,142.059383
2013-03-07,263.375000,316.083333,97.541667,147.750000
2013-03-08,221.458333,297.958333,69.060400,120.092788
Here is how I import it,
df = pd.read_csv('pm_data.txt', header=0)
and create a new column and apply a function to the data in 'PM10'
df['new_column'] = df['PM10'].apply(lambda x: x+15 if x < 30 else x/20)
which yields,
datetime PM2.5 PM10 SO2 NO2 new_column
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917
Let me know if this helps.
"I've tried these couple of available options and none of them seems to be working..."
What do you mean by this? What's your output, are you getting errors or what?
I see a couple of problems:
range1 lists contain int while your column values are float, so c1_c2() will return None.
Even if the data types were the same within range1 and the columns, c1_c2() would return None whenever the value is not in range1.
Below is how I would do it, assuming the data-types match:
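def c1_c2(x):
    # range1 is the list of lists from the question
    range1 = [list(range(0,50)),list(range(51,100)),list(range(101,200)),list(range(201,300)),list(range(301,400)),list(range(401,2000))]
    for a in range1:
        if x in a:
            min_val = min(a)
            max_val = max(a)+1
            return max_val - min_val
    return x # returns the original value if not in range1
df.PM10.apply(c1_c2)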
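Since the PM10 values are floats, exact membership in lists of integers will rarely match. If the intent is "find the band a value falls into and return its width", one possible sketch is to work with band edges instead. This is a different technique from the code above, and it assumes contiguous half-open bands [0, 50), [50, 100), and so on, which is my reading of range1 rather than something stated in the question; `edges` and `bin_width` are hypothetical names.

```python
import numpy as np
import pandas as pd

pm10 = pd.Series([10.75, 42.083333, 120.541667, 44.583333,
                  265.166667, 316.083333, 297.958333], name='PM10')

# assumed band edges; band i covers [edges[i], edges[i + 1])
edges = np.array([0, 50, 100, 200, 300, 400, 2000])
widths = np.diff(edges)

# index of the band each value falls into
idx = np.searchsorted(edges, pm10.to_numpy(), side='right') - 1
in_range = (idx >= 0) & (idx < len(widths))

# width of the matching band, NaN for values outside every band
bin_width = pd.Series(
    np.where(in_range, widths[np.clip(idx, 0, len(widths) - 1)], np.nan),
    index=pm10.index, name='bin_width')
print(bin_width)
```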

How can I create a new dataframe with the products of another dataframe which has been grouped?

Input:
dfB=dfA.groupby('labelA').labelB.nlargest(3)
Output:
labelA
G 5309 415004880.00
6016 268492764.00
5570 191452396.00
PG 6687 486295561.00
5943 400738009.00
5987 368061265.00
PG-13 6380 936662225.00
6391 652270625.00
5723 623357910.00
R 6616 363070709.00
6184 350126372.00
5569 254464305.00
Name: labelB, dtype: float64
I would now like to create a new data frame, which I can visualise, containing the mean of each group (G, PG, PG-13, R). I tried the following; however, as shown below, the output is the mean of all 4 groups combined.
Input:
dfB.mean()
Output:
442499751.75
dfB = dfA.groupby('labelA').labelB.apply(lambda x: x.nlargest(3).mean())
You can use apply to chain the mean-function to nlargest.
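If you already have the nlargest result, a second option is to group it again on its first index level. Below is a minimal sketch with made-up numbers (dfA here is toy data, not the poster's), showing that both routes give the same per-group means and how to turn the result into a data frame for plotting; `mean_top3` is a column name I picked.

```python
import pandas as pd

dfA = pd.DataFrame({
    'labelA': ['G', 'G', 'G', 'G', 'PG', 'PG', 'PG', 'PG'],
    'labelB': [4.0, 1.0, 3.0, 2.0, 10.0, 40.0, 20.0, 30.0],
})

# top 3 per group: a Series indexed by (labelA, original row label)
top3 = dfA.groupby('labelA')['labelB'].nlargest(3)

# route 1: regroup the existing result on the labelA index level
means_from_top3 = top3.groupby(level='labelA').mean()

# route 2: the chained apply from the answer above
means_direct = dfA.groupby('labelA')['labelB'].apply(lambda s: s.nlargest(3).mean())

# a two-column data frame, convenient for a bar chart
summary = means_from_top3.rename('mean_top3').reset_index()
print(summary)
```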
