How to create new columns in a dataframe based on conditional matches in another dataframe?

Situation
I have two dataframes df1 that holds some information about cars:
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000]
}
and df2 that holds media types corresponding to the cars in df1:
images = {'Brand': ['Honda Civic','Honda Civic','Honda Civic','Toyota Corolla','Toyota Corolla','Audi A4'],
'MediaType': ['A','B','C','A','B','C']
}
Expected result
As a result, I want to create an overview in df1 that tells whether each media type is available for the car or not:
result = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000],
'MediaTypeA' : [True,True,False,False],
'MediaTypeB' : [True,True,False,False],
'MediaTypeC' : [True,False,False,True]
}
How can I realize this?
I could already check whether a Brand from df1 exists in df2 at all, which tells me whether there is any media type available:
df1['check'] = df1['Brand'].isin(df2['Brand'])
but I am not sure how to glue it with the check for the special media types.

Use get_dummies for the indicator columns, collapse to a unique index with groupby(level=0).max (the older max(level=0) shortcut was removed in pandas 2.0), add to the first DataFrame by DataFrame.join, and last replace missing values:
df11 = pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool).groupby(level=0).max()
df = df1.join(df11, on='Brand').fillna(False)
print (df)
Brand Price A B C
0 Honda Civic 22000 True True True
1 Toyota Corolla 25000 True True False
2 Ford Focus 27000 False False False
3 Audi A4 35000 False False True
If some brands in df1 may be missing from df2's index, use DataFrame.reindex with fill_value=False:
df22 = pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool).groupby(level=0).max()
df = df1.join(df22.reindex(df1['Brand'].unique(), fill_value=False), on='Brand')
print (df)
Brand Price A B C
0 Honda Civic 22000 True True True
1 Toyota Corolla 25000 True True False
2 Ford Focus 27000 False False False
3 Audi A4 35000 False False True
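Putting the pieces together, here is a self-contained sketch of the full solution. It uses groupby(level=0).max() (the replacement for the removed max(level=0)) and adds add_prefix so the columns match the MediaTypeA/B/C names from the question; the add_prefix step is my addition, not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
                    'Price': [22000, 25000, 27000, 35000]})
df2 = pd.DataFrame({'Brand': ['Honda Civic', 'Honda Civic', 'Honda Civic',
                              'Toyota Corolla', 'Toyota Corolla', 'Audi A4'],
                    'MediaType': ['A', 'B', 'C', 'A', 'B', 'C']})

# one boolean indicator column per media type, collapsed to one row per Brand
dummies = (pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool)
             .groupby(level=0).max()
             .add_prefix('MediaType'))

# attach to df1; Brands absent from df2 (Ford Focus) get NaN, replaced by False
res = df1.join(dummies, on='Brand').fillna(False)
print(res)
```

Note that fillna(False) may emit a downcasting FutureWarning on recent pandas versions, but the result is the same.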

Related

Python: Compare 2 pandas dataframe with unequal number of rows

I need to compare two pandas dataframes with an unequal number of rows and generate a new df with True for matching records and False for non-matching and missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'date':['2022-11-01', '2022-11-02', '2022-11-03'],'x':[4,12,11],'y':[5,5,3]})
df2 = pd.DataFrame({'date':['2022-11-01', '2022-11-02'],'x':[4,11],'y':[5,5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# to set a whole row to False if any cell in it is False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
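For reference, the whole comparison as a self-contained sketch (the same merge-align and row-wise collapse as the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'],
                    'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'],
                    'x': [4, 11], 'y': [5, 5]})

# align df2 to df1's dates so the shapes match before the cell-wise comparison;
# dates missing from df2 become NaN rows, which compare as False
aligned = df1[['date']].merge(df2, on='date', how='left')
df3 = df1.eq(aligned)

# set a whole row to False if any cell in it is False
df3 = (df3.T & df3.all(axis=1)).T
print(df3)
```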

update columns based on id pandas

df_2:
order_id date amount name interval is_sent
123 2020-01-02 3 white today false
456 NaT 2 blue weekly false
789 2020-10-11 0 red monthly false
135 2020-6-01 3 orange weekly false
I am merging two dataframes and locating rows where the date is greater than the previous value, or where a date has appeared that was previously missing:
df_1['date'] = pd.to_datetime(df_1['date'])
df_2['date'] = pd.to_datetime(df_2['date'])
res = df_1.merge(df_2, on='order_id', suffixes=['_orig', ''])
m = res['date'].gt(res['date_orig']) | (res['date_orig'].isnull() & res['date'].notnull())
changes_df = res.loc[m, ['order_id', 'date', 'amount', 'name', 'interval', 'is_sent']]
After locating all the matching rows, I set changes_df['is_sent'] to True:
changes_df['is_sent'] = True
After the above is run, changes_df is:
order_id date amount name interval is_sent
123 2020-01-03 3 white today true
456 2020-12-01 2 blue weekly true
135 2020-6-02 3 orange weekly true
I want to then update only the values in df_2['date'] and df_2['is_sent'] to equal changes_df['date'] and changes_df['is_sent']
Any insight is greatly appreciated.
Let us try update with set_index:
cf = changes_df[['order_id','date','is_sent']].set_index('order_id')
df_2 = df_2.set_index('order_id')
df_2.update(cf)
df_2.reset_index(inplace=True)
df_2
order_id date amount name interval is_sent
0 123 2020-01-03 3 white today True
1 456 2020-12-01 2 blue weekly True
2 789 2020-10-11 0 red monthly False
3 135 2020-6-02 3 orange weekly True
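A self-contained sketch of the update step, with a minimal stand-in for changes_df (the column values below are illustrative, reconstructed from the tables in the question):

```python
import pandas as pd

df_2 = pd.DataFrame({'order_id': [123, 456, 789, 135],
                     'date': pd.to_datetime(['2020-01-02', None, '2020-10-11', '2020-06-01']),
                     'is_sent': [False, False, False, False]})
changes_df = pd.DataFrame({'order_id': [123, 456, 135],
                           'date': pd.to_datetime(['2020-01-03', '2020-12-01', '2020-06-02']),
                           'is_sent': [True, True, True]})

# align both frames on order_id, then overwrite matching cells in place;
# rows absent from changes_df (789) are left untouched
cf = changes_df[['order_id', 'date', 'is_sent']].set_index('order_id')
df_2 = df_2.set_index('order_id')
df_2.update(cf)
df_2 = df_2.reset_index()
print(df_2)
```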
This is my solution:
df3 = df2.combine_first(cap_df1).reindex(df.index)

Update a pandas dataframe

I have a pandas dataframe with multiple columns, and I have to update a column with True or False based on a condition. For example, the column names are price and result: if the price column contains the value 'promotion', then the result column should be set to True, otherwise False.
Please help me with this.
Given this df:
price result
0 promotion 0
1 1 0
2 4 0
3 3 0
You can do so:
df['result'] = np.where(df['price'] == 'promotion', True, False)
Output:
price result
0 promotion True
1 1 False
2 4 False
3 3 False
Lets suppose the dataframe looks like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion False
4 3 False
5 promotion False
You can create two boolean masks: the first has True at the indices where the result column should be set to True, and the second has True at the indices where it should be set to False.
Here is the code:
index_true = (df['price'] == 'promotion')
index_false = (df['price'] != 'promotion')
df.loc[index_true, 'result'] = True
df.loc[index_false, 'result'] = False
The resultant dataframe will look like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion True
4 3 False
5 promotion True
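Both answers work; note that the elementwise comparison already produces the booleans, so either approach reduces to a direct assignment. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'price': [0, 1, 2, 'promotion', 3, 'promotion'],
                   'result': [False] * 6})

# the comparison returns a boolean Series directly; no np.where or masks needed
df['result'] = df['price'].eq('promotion')
print(df)
```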

find index of row element in pandas

If you have a df:
apple banana carrot
a 1 2 3
b 2 3 1
c 0 0 1
To find the row index where a column cell equals 0, you can use df[df['apple']==0].index,
but can you transpose this to find the column labels of row c where it is 0?
Basically I need to drop the columns where c==0, and I would like to do this in one line by row rather than column by column.
If you want to test row c and get all columns where it is 0:
c = df.columns[df.loc['c'] == 0]
print (c)
Index(['apple', 'banana'], dtype='object')
If you want to test all rows:
c1 = df.columns[df.eq(0).any()]
print (c1)
Index(['apple', 'banana'], dtype='object')
If you need to remove columns with 0 in any row:
df = df.loc[:, df.ne(0).all()]
print (df)
carrot
a 3
b 1
c 1
Detail/explanation:
First compare all values of DataFrame by ne (!=):
print (df.ne(0))
apple banana carrot
a True True True
b True True True
c False False True
Then get the columns where all row values are True by all:
print (df.ne(0).all())
apple False
banana False
carrot True
dtype: bool
Last, filter by DataFrame.loc:
print (df.loc[:, df.ne(0).all()])
carrot
a 3
b 1
c 1
If you need to test only row c, the solution is similar: first select row c by loc and omit all:
df = df.loc[:, df.loc['c'].ne(0)]
Yes you can: df.T[df.T['c']==0].index returns the column labels where row c is 0.
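As a runnable sketch of the row-c variant (the one-liner the question asked for):

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 2, 0], 'banana': [2, 3, 0], 'carrot': [3, 1, 1]},
                  index=['a', 'b', 'c'])

# column labels where row 'c' is 0
zero_cols = df.columns[df.loc['c'] == 0]

# drop those columns in one step
out = df.loc[:, df.loc['c'].ne(0)]
print(out)
```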

delete rows based on first N columns

I have a dataframe:
import pandas as pd
df= pd.DataFrame({'date':['2017-12-31','2018-02-01','2018-03-01'],'type':['Asset','Asset','Asset'],'Amount':[1,0,0],'Amount1':[1,0,0],'Ted':[1,0,0]})
df
I want to delete rows where the first three columns are all 0. I don't want to use the column names, as they change. In this case, I want to delete the 2nd and 3rd rows.
Use boolean indexing:
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
#alternative solution with inverting mask by ~
#df = df[~df.iloc[:, :3].eq(0).all(axis=1)]
print (df)
Amount Amount1 Ted date type
0 1 1 1 2017-12-31 Asset
Detail:
First select N columns by iloc:
print (df.iloc[:, :3])
Amount Amount1 Ted
0 1 1 1
1 0 0 0
2 0 0 0
Compare by ne (!=):
print (df.iloc[:, :3].ne(0))
Amount Amount1 Ted
0 True True True
1 False False False
2 False False False
Get all rows with at least one True per row by any:
print (df.iloc[:, :3].ne(0).any(axis=1))
0 True
1 False
2 False
dtype: bool
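The full flow as a self-contained sketch. Note that the answer's output assumes the three numeric columns come first (older pandas versions sorted dict keys alphabetically when constructing the frame; modern pandas preserves insertion order), so the frame below is built with that ordering:

```python
import pandas as pd

df = pd.DataFrame({'Amount': [1, 0, 0], 'Amount1': [1, 0, 0], 'Ted': [1, 0, 0],
                   'date': ['2017-12-31', '2018-02-01', '2018-03-01'],
                   'type': ['Asset', 'Asset', 'Asset']})

# keep rows where at least one of the first three columns is non-zero
out = df[df.iloc[:, :3].ne(0).any(axis=1)]
print(out)
```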
