I need to match the unique values of rows from one dataset to the matching columns in another dataset and provide the dataframe - python-3.x

Below is the example dataframe, where id is the index.
df:
id  A      B      C
1   False  False  NA
2   True   False  NA
3   False  True   True
df2:
A      B      C      D
True   False  NA     True
False  True   False  False
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
Output:
Match each row of df against the rows of df2 on the shared columns (ignoring column D of df2); where the df2 values match and are true, sum the matches per id of df and return a dataframe with the same index:
id  A      B      C     Sum of matched true values in columns of df2
1   False  False  NA    0
2   True   False  NA    2
3   False  True   True  6
What I tried (try_df, list_new and 'CIS Sub Controls' refer to my real data):
match_df = try_df.merge(df, on=list_new, how='outer', suffixes=('', '_y'))
match_df.drop(match_df.filter(regex='_y$').columns, axis=1, inplace=True)
df_grouped = match_df.groupby('CIS Sub Controls')[list_new].agg(['sum', 'count'])
df_final = pd.concat([df_grouped['col1']['sum'], df_grouped['col2']['sum'],
                      df_grouped['col3']['sum'], df_grouped['col4']['sum'],
                      df_grouped['col1']['count'], df_grouped['col2']['count'],
                      df_grouped['col3']['count'], df_grouped['col4']['count']],
                     axis=1).join(df_grouped.index)
This is not how it should go.

You can use value_counts and merge:
cols = df1.columns.intersection(df2.columns)
out = (df1.merge(df2[cols].value_counts(dropna=False).reset_index(name='sum'),
                 how='left')
          .fillna({'sum': 0}, downcast='infer')
       )
Output:
id A B C sum
0 1 False False NaN 0
1 2 True False NaN 1
2 3 False True True 6
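For reference, the intermediate frame that value_counts builds from the df2 above, before it is merged back onto df1, looks roughly like this (a sketch; the order of the count-1 rows may differ):
df2[cols].value_counts(dropna=False).reset_index(name='sum')
#        A      B      C  sum
# 0  False   True   True    6
# 1   True  False    NaN    1
# 2  False   True  False    1
The left merge then attaches the matching 'sum' to each row of df1, and fillna turns the unmatched NaN into 0.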

Related

I need to find, for a specific row of one dataframe, the count of matching true values in the columns of the other dataframe.

For example, I have two dataframes:
df = [{'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'}, {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}]
df2 = [{'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}]
The idea is to count the values based on the rows of df that match the df2 columns.
df:
id  A      B      C
1   False  False  NA
2   True   False  NA
df2:
A     B      C   D
True  False  NA  True
Output:
id  A      B      C   Count
1   False  False  NA  0
2   True   False  NA  1
I tried something like:
for i in range(columns):
    x = action_value_counts_df.columns[i]
    if compare_column.equals(action_value_counts_df[x]):
        print(x, 'Matched')
    else:
        print(x, 'Not Matched')
This code did not help.
1. Merge df and df2 on the overlapping columns.
2. Count the matches per unique combination of the overlapping columns.
3. Merge the count result DataFrame back into df.
4. Replace NA with 0 in the 'Count' column.
import pandas as pd

df = pd.DataFrame([
    {'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'},
    {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([
    {'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])

# 1
match_df = pd.merge(left=df, right=df2, on=['A', 'B', 'C'], how='inner')
match_df = match_df.assign(Count=1)
"""
  id     A      B   C     D  Count
0  2  True  False  NA  True      1
"""
# 2
match_count_df = match_df.groupby(['A', 'B', 'C'], as_index=False).count()
match_count_df = match_count_df[['A', 'B', 'C', 'Count']]
"""
      A      B   C  Count
0  True  False  NA      1
"""
# 3
result_df = pd.merge(left=df, right=match_count_df, how='left')
"""
  id      A      B   C  Count
0  1  False  False  NA    NaN
1  2   True  False  NA    1.0
"""
# 4
result_df.loc[:, 'Count'] = result_df['Count'].fillna(0)
"""
  id      A      B   C  Count
0  1  False  False  NA    0.0
1  2   True  False  NA    1.0
"""
You can also compare the two dataframes row-wise, aligning the columns of df with those of df2 and ignoring NaN values (as they are not comparable):
df.assign(Count=df.set_index('id')
                  .apply(lambda x: (x.dropna() == df2[x.index].squeeze().dropna()).all() * 1,
                         axis=1)
                  .values)
id A B C Count
0 1 False False NaN 0
1 2 True False NaN 1

Pandas/Python: check if the values from 2 datasets are equal and change the 1s and 0s to True or False

I want to check whether the values in both datasets are equal. The datasets are not in the same order, so I need to loop through them.
Dataset 1 (contract):
Part number  H50  H51  H53
ID001        1    1    1
ID002        1    1    1
ID003        0    1    0
ID004        1    1    1
ID005        1    1    1
Dataset 2 (anx):
The part numbers are not in the same order, but to compare values the part number must be equal across the two files. If the part number matches, check whether the H column (header) matches too; if both the part number and the H number are the same, compare the value.
Part number  H50  H51  H53
ID001        1    1    1
ID003        0    0    1
ID004        0    1    1
ID002        1    0    1
ID005        1    1    1
Expected outcome:
If the value is 1 == 1 or 0 == 0 in both datasets -> change it to TRUE.
If the value is 1 in dataset1 but 0 in dataset2 -> change it to FALSE, and save all rows containing a FALSE value into an Excel file named "Not in contract".
If the value is 0 in dataset1 but 1 in dataset2 -> change it to FALSE.
Example expected outcome:
Part number  H50    H51    H53
ID001        TRUE   TRUE   TRUE
ID002        TRUE   FALSE  TRUE
ID003        TRUE   FALSE  FALSE
ID004        FALSE  TRUE   TRUE
ID005        TRUE   TRUE   TRUE
# align the two files on 'Part number'; shared H columns get _x/_y suffixes
df_merged = df1.merge(df2, on='Part number')
a = df_merged[df_merged.columns[df_merged.columns.str.contains('_x')]]
b = df_merged[df_merged.columns[df_merged.columns.str.contains('_y')]]
# elementwise equality of the aligned values
out = pd.concat([df_merged['Part number'],
                 pd.DataFrame(a.values == b.values, columns=df1.columns[1:4])], axis=1)
out
Part number H50 H51 H53
0 ID001 True True True
1 ID002 True False True
2 ID003 True False False
3 ID004 False True True
4 ID005 True True True
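The question also asks to save every row containing a FALSE into an Excel file named "Not in contract"; the answer above does not cover that part. A minimal sketch, assuming out is the result above and an Excel writer such as openpyxl is installed:
# keep rows where not every H column is True
mask = ~out[['H50', 'H51', 'H53']].all(axis=1)
out[mask].to_excel('Not in contract.xlsx', index=False)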

Python: Compare 2 pandas dataframe with unequal number of rows

I need to compare two pandas dataframes with an unequal number of rows and generate a new df with True for matching records and False for non-matching and missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'], 'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'], 'x': [4, 11], 'y': [5, 5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a row to False if there is any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
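To see why this works: the left merge reindexes df2 onto df1's dates, so the row missing from df2 becomes NaN and compares as False (reproduced from the inputs above):
df1[['date']].merge(df2, on='date', how='left')
#          date     x    y
# 0  2022-11-01   4.0  5.0
# 1  2022-11-02  11.0  5.0
# 2  2022-11-03   NaN  NaN
After df1.eq(...), the 'date' cells are still True in every row; the (df3.T & df3.all(axis=1)).T step is what turns an entire row False once any cell in it is False.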

How to change a 'True' boolean to 'False' in case there is only one 'True' boolean between two 'False' booleans in a dataframe column

I have an unknown number of dataframes. The number and positions of 'True' booleans in the column "label_number_hours" are unknown, and there can be any number of consecutive 'True' booleans between two 'False' booleans. I want to change a 'True' boolean to 'False' in this column when it is the only 'True' between two 'False' values; for example, False - True - False should become False - False - False.
This is an example of one of dataframe I have:
df =
label_number_hours some_other_column
0 True 0.174998
1 False 0.235088
2 True 0.076127
3 True 0.817929
4 True 0.781144
5 False 0.904597
6 True 0.703006
7 False 0.923654
8 True 0.261100
9 True 0.803631
10 False 0.149026
This is the dataframe which I am looking for:
df =
label_number_hours some_other_column
0 True 0.174998
1 False 0.235088
2 True 0.076127
3 True 0.817929
4 True 0.781144
5 False 0.904597
6 False 0.703006
7 False 0.923654
8 True 0.261100
9 True 0.803631
10 False 0.149026
This is the code:
falses_idx, = np.where(~df["label_number_hours"])
if falses_idx.size > 0:
    df.iloc[falses_idx[0]:falses_idx[-1], df.columns.get_loc("label_number_hours")] = False
This is the result:
label_number_hours some_other_column
0 True 0.174998
1 False 0.235088
2 False 0.076127
3 False 0.817929
4 False 0.781144
5 False 0.904597
6 False 0.703006
7 False 0.923654
8 False 0.261100
9 False 0.803631
10 False 0.149026
I really need your help.
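No answer was posted for this one; a minimal sketch of one possible approach, assuming df is the frame above and label_number_hours has bool dtype: a True is "isolated" when it has a False immediately before and after it. Using shift with fill_value=True at the edges keeps leading and trailing singletons True, which matches the expected output above (index 0 stays True).
s = df["label_number_hours"]
# True values with a False directly before AND directly after
isolated = s & ~s.shift(1, fill_value=True) & ~s.shift(-1, fill_value=True)
df.loc[isolated, "label_number_hours"] = False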

pandas create a column based on values in another column which selected as conditions

I have the following df,
id match_type amount negative_amount
1 exact 10 False
1 exact 20 False
1 name 30 False
1 name 40 False
1 amount 15 True
1 amount 15 True
2 exact 0 False
2 exact 0 False
I want to create a boolean column 0_amount_sum that indicates whether the amount sum is <= 0 for each id within a particular match_type; e.g. the following is the result df:
id match_type amount 0_amount_sum negative_amount
1 exact 10 False False
1 exact 20 False False
1 name 30 False False
1 name 40 False False
1 amount 15 True True
1 amount 15 True True
2 exact 0 True False
2 exact 0 True False
For id=1 and match_type=exact, the amount sum is 30, so 0_amount_sum is False. The code is as follows:
df = df.loc[df.match_type == 'exact']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
        'id')['amount_n'].transform(lambda x: sum(x) <= 0))

df = df.loc[df.match_type == 'name']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
        'id')['amount_n'].transform(lambda x: sum(x) <= 0))

df = df.loc[df.match_type == 'amount']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
        'id')['amount_n'].transform(lambda x: sum(x) <= 0))
I am wondering if there is a better/more efficient way to do this, especially when the values of match_type are unknown in advance, so that the code automatically enumerates all possible values and does the calculation accordingly.
I believe you need to group by two Series (columns) instead of filtering:
df['0_amount_sum_'] = ((df.amount * np.where(df.negative_amount, -1, 1))
                           .groupby([df['id'], df['match_type']])
                           .transform('sum')
                           .le(0))
id match_type amount negative_amount 0_amount_sum_
0 1 exact 10 False False
1 1 exact 20 False False
2 1 name 30 False False
3 1 name 40 False False
4 1 amount 15 True True
5 1 amount 15 True True
6 2 exact 0 False True
7 2 exact 0 False True
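For intuition, these are the signed per-group sums that .le(0) is applied to (derived from the sample df above; a sketch, not part of the original answer):
signed = df.amount * np.where(df.negative_amount, -1, 1)
signed.groupby([df['id'], df['match_type']]).sum()
# id  match_type
# 1   amount       -30
#     exact         30
#     name          70
# 2   exact          0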
