Pandas/Python: check if the values from 2 datasets are equal and change the 1/0 to True or False - python-3.x

I want to check whether the values in two datasets are equal. The datasets are not in the same row order, so the rows need to be matched up first.
Dataset 1 (contract):
Part number  H50  H51  H53
ID001          1    1    1
ID002          1    1    1
ID003          0    1    0
ID004          1    1    1
ID005          1    1    1
Dataset 2 (anx):
The part numbers are not in the same order, so rows must first be matched on part number across the two files. Once the part numbers match, match the H (header) columns too, and only then compare the values.
Part number  H50  H51  H53
ID001          1    1    1
ID003          0    0    1
ID004          0    1    1
ID002          1    0    1
ID005          1    1    1
Expected outcome:
If the value is 1 == 1 or 0 == 0 in both datasets -> change it to TRUE.
If the value is 1 in dataset 1 but 0 in dataset 2 -> change it to FALSE, and save all rows that contain a FALSE value into an Excel file named "Not in contract".
If the value is 0 in dataset 1 but 1 in dataset 2 -> change it to FALSE.
Example of the expected outcome:
Part number  H50    H51    H53
ID001        TRUE   TRUE   TRUE
ID002        TRUE   FALSE  TRUE
ID003        TRUE   FALSE  FALSE
ID004        FALSE  TRUE   TRUE
ID005        TRUE   TRUE   TRUE

# Merging on 'Part number' aligns the rows; shared column names
# get the _x (df1) and _y (df2) suffixes.
df_merged = df1.merge(df2, on='Part number')
a = df_merged[df_merged.columns[df_merged.columns.str.contains('_x')]]
b = df_merged[df_merged.columns[df_merged.columns.str.contains('_y')]]
# Elementwise comparison of the two blocks, relabelled with the H headers.
out = pd.concat([df_merged['Part number'],
                 pd.DataFrame(a.values == b.values, columns=df1.columns[1:4])],
                axis=1)
out
out
Part number H50 H51 H53
0 ID001 True True True
1 ID002 True False True
2 ID003 True False False
3 ID004 False True True
4 ID005 True True True
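The question also asks to save every row containing a FALSE into an Excel file named "Not in contract". A minimal, self-contained sketch of the whole flow, with the two frames typed in by hand from the tables above (the to_excel call is left commented out since it needs an engine such as openpyxl installed):

```python
import pandas as pd

df1 = pd.DataFrame({'Part number': ['ID001', 'ID002', 'ID003', 'ID004', 'ID005'],
                    'H50': [1, 1, 0, 1, 1],
                    'H51': [1, 1, 1, 1, 1],
                    'H53': [1, 1, 0, 1, 1]})
df2 = pd.DataFrame({'Part number': ['ID001', 'ID003', 'ID004', 'ID002', 'ID005'],
                    'H50': [1, 0, 0, 1, 1],
                    'H51': [1, 0, 1, 0, 1],
                    'H53': [1, 1, 1, 1, 1]})

# Align df2 to df1's row order via merge, then compare the H columns.
df_merged = df1.merge(df2, on='Part number', suffixes=('_x', '_y'))
a = df_merged.filter(like='_x')
b = df_merged.filter(like='_y')
out = pd.concat([df_merged['Part number'],
                 pd.DataFrame(a.values == b.values, columns=df1.columns[1:])],
                axis=1)

# Rows with at least one mismatch go to the "Not in contract" workbook.
mismatched = out[~out[df1.columns[1:]].all(axis=1)]
# mismatched.to_excel('Not in contract.xlsx', index=False)
```

This reproduces the boolean table shown above; `mismatched` holds the ID002, ID003 and ID004 rows.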

Related

Update a pandas dataframe

I have a pandas dataframe with multiple columns, and I have to update a column with True or False based on a condition. For example, the column names are price and result: if the price column has 'promotion' as its value, the result column should be updated to True, otherwise False.
Please help me with this.
Given this df:
price result
0 promotion 0
1 1 0
2 4 0
3 3 0
You can do so:
import numpy as np

df['result'] = np.where(df['price'] == 'promotion', True, False)
Output:
price result
0 promotion True
1 1 False
2 4 False
3 3 False
Let's suppose the dataframe looks like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion False
4 3 False
5 promotion False
You can create two boolean arrays: the first has True at the indices where the result column should be set to True, and the second has True at the indices where it should be set to False.
Here is the code:
index_true = (df['price'] == 'promotion')
index_false = (df['price'] != 'promotion')
df.loc[index_true, 'result'] = True
df.loc[index_false, 'result'] = False
The resultant dataframe will look like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion True
4 3 False
5 promotion True
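Since the comparison itself already yields booleans, the two complementary masks above can be collapsed into a single vectorized expression. A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'price': [0, 1, 2, 'promotion', 3, 'promotion'],
                   'result': [False] * 6})

# The elementwise comparison is already the boolean column we want;
# no np.where and no pair of masks needed.
df['result'] = df['price'].eq('promotion')
```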

pandas create a column based on values in another column which selected as conditions

I have the following df,
id match_type amount negative_amount
1 exact 10 False
1 exact 20 False
1 name 30 False
1 name 40 False
1 amount 15 True
1 amount 15 True
2 exact 0 False
2 exact 0 False
I want to create a boolean column 0_amount_sum that indicates whether the amount sum is <= 0 for each id within each match_type; e.g., the following is the result df:
id match_type amount 0_amount_sum negative_amount
1 exact 10 False False
1 exact 20 False False
1 name 30 False False
1 name 40 False False
1 amount 15 True True
1 amount 15 True True
2 exact 0 True False
2 exact 0 True False
For id=1 and match_type=exact, the amount sum is 30, so 0_amount_sum is False. My current code is as follows:
df = df.loc[df.match_type == 'exact']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type == 'name']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type == 'amount']
df['0_amount_sum_'] = (df.assign(
    amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
    'id')['amount_n'].transform(lambda x: sum(x) <= 0))
I am wondering if there is a better/more efficient way to do this, especially when the values of match_type are not known in advance, so that the code can enumerate all possible values automatically and compute accordingly.
I believe you need to group by two Series (columns) instead of filtering:
df['0_amount_sum_'] = ((df.amount * np.where(df.negative_amount, -1, 1))
                       .groupby([df['id'], df['match_type']])
                       .transform('sum')
                       .le(0))
id match_type amount negative_amount 0_amount_sum_
0 1 exact 10 False False
1 1 exact 20 False False
2 1 name 30 False False
3 1 name 40 False False
4 1 amount 15 True True
5 1 amount 15 True True
6 2 exact 0 False True
7 2 exact 0 False True
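As a runnable check of the grouped approach, with the sample frame typed in from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2],
    'match_type': ['exact', 'exact', 'name', 'name',
                   'amount', 'amount', 'exact', 'exact'],
    'amount': [10, 20, 30, 40, 15, 15, 0, 0],
    'negative_amount': [False, False, False, False, True, True, False, False]})

# Sign the amounts, then sum within each (id, match_type) group in one pass.
signed = df.amount * np.where(df.negative_amount, -1, 1)
df['0_amount_sum_'] = (signed.groupby([df['id'], df['match_type']])
                             .transform('sum')
                             .le(0))
```

For (1, amount) the signed sum is -30 and for (2, exact) it is 0, so only those groups come out True, matching the output above.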

pandas assign columns values depend on another column in the df

I have the following df,
id a_id b_id
1 25 50
1 25 50
2 26 51
2 26 51
3 25 52
3 28 52
3 28 52
I have the following code to set a_id and b_id values to -1, based on how many rows each has for each id value in the df: if an a_id or b_id value covers exactly the same rows/sub-df as a specific id value does, those rows of a_id or b_id get -1:
cluster_ids = df.loc[df['id'] > -1]['id'].unique()
types = ['a_id', 'b_id']
for cluster_id in cluster_ids:
    rows = df.loc[df['id'] == cluster_id]
    for type in types:
        ids = rows[type].values
        match_rows = df.loc[df[type] == ids[0]]
        if match_rows.equals(rows):
            df.loc[match_rows.index, type] = -1
so the result df will look like,
id a_id b_id
1 25 -1
1 25 -1
2 -1 -1
2 -1 -1
3 25 -1
3 28 -1
3 28 -1
I am wondering if there is a more efficient way to do it.
one_value_for_each_id = df.groupby('id').transform(lambda x: len(set(x)) == 1)
a_id b_id
0 True True
1 True True
2 True True
3 True True
4 False True
5 False True
6 False True
one_id_for_each_value = pd.DataFrame({
    col: df.groupby(col).id.transform(lambda x: len(set(x)) == 1)
    for col in ['a_id', 'b_id']
})
a_id b_id
0 False True
1 False True
2 True True
3 True True
4 False True
5 True True
6 True True
one_to_one_relationship = one_id_for_each_value & one_value_for_each_id
# Set all values that satisfy the one-to-one relationship to `-1`
df.loc[one_to_one_relationship.a_id, 'a_id'] = -1
df.loc[one_to_one_relationship.b_id, 'b_id'] = -1
a_id b_id
0 25 -1
1 25 -1
2 -1 -1
3 -1 -1
4 25 -1
5 28 -1
6 28 -1
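Putting the two masks together end to end, a self-contained sketch with the question's frame typed in by hand:

```python
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 2, 3, 3, 3],
                   'a_id': [25, 25, 26, 26, 25, 28, 28],
                   'b_id': [50, 50, 51, 51, 52, 52, 52]})

# A column maps one-to-one with `id` when each id sees a single value
# AND each value belongs to a single id.
one_value_for_each_id = df.groupby('id').transform(lambda x: len(set(x)) == 1)
one_id_for_each_value = pd.DataFrame({
    col: df.groupby(col).id.transform(lambda x: len(set(x)) == 1)
    for col in ['a_id', 'b_id']
})
one_to_one = one_value_for_each_id & one_id_for_each_value

# Set all values that satisfy the one-to-one relationship to -1.
df.loc[one_to_one['a_id'], 'a_id'] = -1
df.loc[one_to_one['b_id'], 'b_id'] = -1
```

This reproduces the result shown above: only a_id=26 (exactly the id=2 rows) and every b_id value qualify.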

How to delete the entire row if any of its value is 0 in pandas

In the example below I only want to retain rows 1 and 2.
I want to delete every row that has a 0 anywhere across its columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values per row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
Alternative solution with any, checking for at least one True per row on the changed mask df == 0 and inverting it with ~:
df = df.loc[~(df==0).any(axis=1)]
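Both variants can be verified on the sample frame (typed in from the question, keeping its original row labels):

```python
import pandas as pd

df = pd.DataFrame({'kt':    [1, 2, 3, 0, 5],
                   'b':     [1, 2, 3, 4, 5],
                   'tt':    [1, 2, 0, 0, 5],
                   'mky':   [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]},
                  index=[1, 2, 3, 4, 5])

# Keep only rows where every column is nonzero ...
kept = df.loc[(df != 0).all(axis=1)]
# ... equivalently, drop rows where any column is zero.
kept_alt = df.loc[~(df == 0).any(axis=1)]
```

Both expressions keep exactly rows 1 and 2.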

delete rows based on first N columns

I have a datafame:
import pandas as pd

df = pd.DataFrame({'date': ['2017-12-31', '2018-02-01', '2018-03-01'],
                   'type': ['Asset', 'Asset', 'Asset'],
                   'Amount': [1, 0, 0],
                   'Amount1': [1, 0, 0],
                   'Ted': [1, 0, 0]})
df
I want to delete rows where the first three columns are 0. I don't want to use the column names, as they change. In this case, I want to delete the 2nd and 3rd rows.
Use boolean indexing:
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
#alternative solution with inverting mask by ~
#df = df[~df.iloc[:, :3].eq(0).all(axis=1)]
print (df)
Amount Amount1 Ted date type
0 1 1 1 2017-12-31 Asset
Detail:
First select N columns by iloc:
print (df.iloc[:, :3])
Amount Amount1 Ted
0 1 1 1
1 0 0 0
2 0 0 0
Compare by ne (!=):
print (df.iloc[:, :3].ne(0))
Amount Amount1 Ted
0 True True True
1 False False False
2 False False False
Get rows with at least one True per row by any:
print (df.iloc[:, :3].ne(0).any(axis=1))
0 True
1 False
2 False
dtype: bool
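One caveat: the output above reflects older pandas, which sorted dict keys so that Amount, Amount1 and Ted came first; modern pandas preserves dict insertion order. A sketch that fixes the column order explicitly so the positional slice `iloc[:, :3]` really targets the three numeric columns:

```python
import pandas as pd

# Columns listed explicitly so the first three are the numeric ones,
# matching the order the answer's output assumes.
df = pd.DataFrame({'Amount': [1, 0, 0],
                   'Amount1': [1, 0, 0],
                   'Ted': [1, 0, 0],
                   'date': ['2017-12-31', '2018-02-01', '2018-03-01'],
                   'type': ['Asset', 'Asset', 'Asset']})

# Keep rows where at least one of the first three columns is nonzero.
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
```

Only the 2017-12-31 row survives.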
