pandas create a flag when merging two dataframes - python-3.x

I have two dataframes, df_a and df_b:
# df_a
number cur
1000 USD
2000 USD
3000 USD
# df_b
number amount deletion
1000 0.0 L
1000 10.0 X
1000 10.0 X
2000 20.0 X
2000 20.0 X
3000 0.0 L
3000 0.0 L
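For reference, a minimal sketch to rebuild these sample frames (values taken from the listings above):
import pandas as pd

df_a = pd.DataFrame({'number': [1000, 2000, 3000],
                     'cur': ['USD', 'USD', 'USD']})
df_b = pd.DataFrame({'number': [1000, 1000, 1000, 2000, 2000, 3000, 3000],
                     'amount': [0.0, 10.0, 10.0, 20.0, 20.0, 0.0, 0.0],
                     'deletion': ['L', 'X', 'X', 'X', 'X', 'L', 'L']})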
I want to left merge df_a with df_b,
df_a = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df_a.fillna(value={'amount':0}, inplace=True)
but also create a flag called deleted in the resulting df_a, with three possible values - full, partial and none:
full - if all rows associated with a particular number value have deletion = L;
partial - if some rows associated with a particular number value have deletion = L;
none - if no rows associated with a particular number value have deletion = L.
Also, when doing the merge, rows from df_b with deletion = L should not be considered, so the result looks like:
number amount deletion deleted cur
1000 10.0 X partial USD
1000 10.0 X partial USD
2000 20.0 X none USD
2000 20.0 X none USD
3000 0.0 NaN full USD
I am wondering how to achieve that.

The idea is to compare the deletion column with 'L', aggregate all and
any per number group, build a helper dictionary, and finally map it onto the new column:
g = df_b['deletion'].eq('L').groupby(df_b['number'])
m1 = g.any()
m2 = g.all()
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
# join dictionaries together
d = {**d1, **d2}
print (d)
{1000: 'partial', 3000: 'full'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d).fillna('none')
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
To make the none label explicit, create a dictionary for it as well:
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
d3 = dict.fromkeys(m2.index[~m1], 'none')
d = {**d1, **d2, **d3}
print (d)
{1000: 'partial', 3000: 'full', 2000: 'none'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d)
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
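As an alternative sketch, the three dictionaries can be collapsed into a single mapping with numpy.select over the grouped masks m1 and m2:
import numpy as np

# build the full/partial/none label per number in one step
status = pd.Series(np.select([m2, m1 & ~m2], ['full', 'partial'], default='none'),
                   index=m1.index)
df['deleted'] = df['number'].map(status)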

Related

How to check the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function should detect which unique values of a column are present in the corresponding column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The rest is where it gets complicated, because I would like to return which value is missing and from which table.
If you have any other leads or approaches, I'm also interested.
Here is my code :
def correspondance_cle(df1, df2, col):
    # keep only the unique keys of each table and flag their presence
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    # outer merge: keys missing on either side end up with NaN in test1/test2
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[df3['test1'].isna() | df3['test2'].isna(), :]
    df3.info()
    for _, row in df3.iterrows():
        if pd.isna(row['test1']):
            print(row[col], "is not in df1")
        else:
            print(row[col], 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join after removing duplicates with Series.drop_duplicates, and call Series.reset_index so the original indices are not lost:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
          .merge(df2[col].drop_duplicates().reset_index(),
                 indicator=True,
                 how='outer',
                 on=col))
print (df)
index_x a index_y _merge
0 0.0 1 NaN left_only
1 1.0 2 0.0 both
2 2.0 5 2.0 both
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
index_x a index_y _merge
0 0.0 1 NaN left_only
print (df[df['_merge'].eq('right_only')])
index_x a index_y _merge
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
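Wrapped back into the shape of the original function, a sketch that prints which side each stray key is missing from (names are only illustrative):
def correspondance_cle(df1, df2, col):
    # outer-merge the de-duplicated keys; the indicator column tells us which side a key came from
    merged = (df1[[col]].drop_duplicates()
                        .merge(df2[[col]].drop_duplicates(),
                               on=col, how='outer', indicator=True))
    for _, row in merged.loc[merged['_merge'] != 'both'].iterrows():
        side = 'df2' if row['_merge'] == 'left_only' else 'df1'
        print(row[col], 'is not in', side)
    return merged

correspondance_cle(df1, df2, 'a')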

Divide values of rows based on condition which are of running count

Sample of the table for one id; multiple ids exist in the original df.
id   legend  date        running_count
101  X       24-07-2021  3
101  Y       24-07-2021  5
101  X       25-07-2021  4
101  Y       25-07-2021  6
I want to create a new column that holds the division of running_count values on the basis of id, legend and date - (X/Y) for the date 24-07-2021 for a particular id, and so on.
How should I perform the calculation?
If the order X, Y is the same within each id, it is possible to use:
df['new'] = df['running_count'].div(df.groupby(['id','date'])['running_count'].shift(-1))
print (df)
id legend date running_count new
0 101 X 24-07-2021 3 0.600000
1 101 Y 24-07-2021 5 NaN
2 101 X 25-07-2021 4 0.666667
3 101 Y 25-07-2021 6 NaN
If a different output shape is acceptable:
df1 = df.pivot(index=['id','date'], columns='legend', values='running_count')
df1['new'] = df1['X'].div(df1['Y'])
df1 = df1.reset_index()
print (df1)
legend id date X Y new
0 101 24-07-2021 3 5 0.600000
1 101 25-07-2021 4 6 0.666667
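If the long shape should be kept while staying robust to row order, a sketch that computes the ratio on the pivoted frame and merges it back (the column name ratio is only illustrative):
# compute X/Y per (id, date) on the wide frame, then attach it to every long row
ratio = (df.pivot(index=['id', 'date'], columns='legend', values='running_count')
           .pipe(lambda t: t['X'].div(t['Y']))
           .rename('ratio')
           .reset_index())
out = df.merge(ratio, on=['id', 'date'], how='left')
print(out)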

Mean imputation based on certain Conditions

I have the below dataframe,
Category Value
A 100
A -
B -
C 50
D 200
D 400
D -
As you can see, some values are the hyphen symbol '-'. I want to replace those hyphens with the mean of the corresponding category.
In the example, there are two entries for "A" - one row with value 100 and the other with a hyphen, so the mean would be 100 itself. For B, since there are no valid values, the mean would be the mean of the entire column ((100+50+200+400)/4 = 187.5). For C, no changes, and for D, the hyphen will be replaced by 300 (same logic as for "A").
Output:
Category Value
A 100
A 100
B 187.5
C 50
D 200
D 400
D 300
Try:
df = df.replace("-", np.nan)
df["Value"] = pd.to_numeric(df["Value"])
avg = df["Value"].mean()
df["Value"] = df["Value"].fillna(
df.groupby("Category")["Value"].transform(
lambda x: avg if x.isna().all() else x.mean()
)
)
print(df)
Prints:
Category Value
0 A 100.0
1 A 100.0
2 B 187.5
3 C 50.0
4 D 200.0
5 D 400.0
6 D 300.0
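An alternative sketch with the same result, assuming the same replace/to_numeric preprocessing as above: fill with the per-category mean first, then fall back to the overall mean for categories with no valid values at all.
# two-step fillna: per-category mean, then the overall mean as fallback
df["Value"] = (df["Value"]
               .fillna(df.groupby("Category")["Value"].transform("mean"))
               .fillna(df["Value"].mean()))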

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values per column
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe showing feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing, 'missing_ratio': ratio_missing},
                        index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum
to count missing values by column. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
            .sum()
            .to_frame('missing_count')
            .assign(missing_ratio=lambda x: x['missing_count'] / len(df))
            .loc[df.isna().any()])
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
                     count.div(len(df)).rename('missing_ratio')], axis=1)
            .loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
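With around 150 columns it can also help to sort the summary, for example:
# show the columns with the most missing data first
print(new_df.sort_values('missing_ratio', ascending=False))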
IIUC, we can assign the missing and total counts to two variables, do some basic math, and assign the result back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals': a,
                           'missing_ratio': b})
print(missing_df)
missing_vals  missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
You can filter out columns that don't have any missing values:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals  missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s/len(df)], axis=1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667
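To drop the all-zero rows from this result as well, a short sketch:
# keep only columns that actually have missing values
print(result[result['missing_count'] > 0])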

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind'] == 'N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind'] == 'Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition the MarketValue is calculated correctly, but for the elif condition it's not giving the correct value.
Can anyone point out what I am doing wrong in this code?
I think you need numpy.select; apply can be removed and the columns multiplied with mul:
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
Sample:
df = pd.DataFrame({'Sold_price':[7,8,9,4,2,3],
                   'SharesUnits':[1,3,5,7,1,0],
                   'CurrentPrice':[5,3,6,9,2,4],
                   'Sell_Ind':list('NNYYTT')})
#print (df)
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
print (df)
CurrentPrice Sell_Ind SharesUnits Sold_price MarketValue
0 5 N 1 7 5.0
1 3 N 3 8 9.0
2 6 Y 5 9 45.0
3 9 Y 7 4 28.0
4 2 T 1 2 0.0
5 4 T 0 3 0.0
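An alternative sketch of the same branching with boolean .loc assignment instead of np.select, reusing m1 and m2 from above:
# masked assignment: start from the default, then overwrite each condition's rows
df['MarketValue'] = 0.0
df.loc[m1, 'MarketValue'] = (df.loc[m1, 'SharesUnits'] * df.loc[m1, 'CurrentPrice']).round(2)
df.loc[m2, 'MarketValue'] = (df.loc[m2, 'SharesUnits'] * df.loc[m2, 'Sold_price']).round(2)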
