DataFrame merge for on specific columns - python-3.x

I have a basic question on dataframe merge. After I merge two dataframe , is there a way to pick only few columns in the result.
For Example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2'])
RESULT :
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
None
Is there a way I can chose only column 'C' from 'right' dataframe and 'A' column from left dataframe? For example, I would like my result to be like:
A key1 key2 C
0 A0 K0 K0 C0
1 A2 K1 K0 C1
2 A2 K1 K0 C2
None

Sure, first filter necessary columns + columns used for join:
result = pd.merge(left[['A','key1', 'key2']],
right[['C','key1', 'key2']],
on=['key1', 'key2'])
Or:
keys = ['key1', 'key2']
result = pd.merge(left[['A'] + keys], right[['C'] + keys], on=keys)

mergeDF = pd.merge(left['key1','key2','A'], right[['key1','key2','C']], on=['key1', 'key2'])

Related

Pandas Dataframe array entries as rows [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 6 months ago.
Use pd.DataFrame.explode() method: How to unnest (explode) a column in a pandas DataFrame, into multiple rows
I have a pandas.Dataframe structure with arrays as entries which I would like to disaggregate each of the entries into a long format.
Below is the code to reproduce what I am looking for. StackOverflow is asking me to put more detain into this draft because it is mostly code, but it is mostly code because it allows the reader to reproduce the issue more clearly.
import pandas as pd
import numpy as np
date = '08-30-2022'
ids = ['s1', 's2']
g1 = ['b1', 'b2']
g2 = ['b1', 'b3', 'b4']
g_ls = [g1, g2]
v1 = [2.0, 2.5]
v2 = [3.2, np.nan, 3.7]
v_ls = [v1, v2]
dict = {
'date': [date] * len(ids),
'ids': ids,
'group': g_ls,
'values': v_ls
}
df_in = pd.DataFrame.from_dict(dict)
dict_out = {
'date': [date] * 5,
'ids': ['s1', 's1', 's2', 's2', 's2'],
'group': ['b1', 'b2', 'b1', 'b3', 'b4'],
'values': [2.0, 2.5, 3.2, np.nan, 3.7]
}
desired_df = pd.DataFrame.from_dict(dict_out)
Have:
date ids group values
0 08-30-2022 s1 [b1, b2] [2.0, 2.5]
1 08-30-2022 s2 [b1, b3, b4] [3.2, nan, 3.7]
Want:
date ids group values
0 08-30-2022 s1 b1 2.0
1 08-30-2022 s1 b2 2.5
2 08-30-2022 s2 b1 3.2
3 08-30-2022 s2 b3 NaN
4 08-30-2022 s2 b4 3.7
Try with
df = df_in.explode(['group','values'])
Out[173]:
date ids group values
0 08-30-2022 s1 b1 2.0
0 08-30-2022 s1 b2 2.5
1 08-30-2022 s2 b1 3.2
1 08-30-2022 s2 b3 NaN
1 08-30-2022 s2 b4 3.7

Check if dictionaries are equal in df

I have a df, in which a column contains dictionaries:
a b c d
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'}
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
and another dict:
di = {0.0: 'a', 1.0: 'b'}
I want to add a new column with 'yes' in it when d=di, and 'no' or 'NaN' or just empty when it's not. I tried the below but it doesnt work:
df.loc[df['d'] == di, 'e'] = 'yes'
The result would be:
a b c d e
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'} yes
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
Anyone being able to help me here?
Thank you in advance!
You can try
df['new'] = df['d'].eq(di).map({True:'yes',False:''})

How to reorganize/restructure values in a dataframe with no column header by refering to a master dataframe in python?

Master Dataframe:
B
D
E
b1
d1
e1
b2
d2
e2
b3
d3
d4
d5
Dataframe with no column name:
b1
d3
e1
d2
b2
e2
e1
d5
e1
How do i convert the dataframe above to something like in the table below (with column names) by refering to master dataframe?
B
D
E
b1
d3
e1
d2
b2
e2
e1
d5
e1
Thank you in advance for your help!
One way would be to make a mapping dict, then reindex each row:
# Mapping dict
d = {}
for k, v in df.to_dict("list").items():
d.update(**dict.fromkeys(set(v) - {np.nan}, k))
# or pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()
def reorganize(ser):
data = [i for i in ser if pd.notna(i)]
ind = [d.get(i, i) for i in data]
return pd.Series(data, index=ind)
df2.apply(reorganize, axis=1)
Output:
B D E
0 b1 NaN NaN
1 NaN d3 e1
2 NaN d2 NaN
3 b2 NaN e2
4 NaN NaN e1
5 NaN d5 e1
It's not a beautiful answer, but I think I was able to do it by using .loc. I don't think you need to use Master Dataframe.
import pandas as pd
df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
'col2': ['', 'e1', '', 'e2', '', 'e1']},
columns=['col1', 'col2'])
df
# col1 col2
# 0 b1
# 1 d3 e1
# 2 d2
# 3 b2 e2
# 4 e1
# 5 d5 e1
df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
for col in df.columns:
i = row[col]
j = i[0] if i != '' else ''
if j != '':
df_reshaped.loc[index, j] = i
df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
# B D E
# 0 b1 NaN NaN
# 1 NaN d3 e1
# 2 NaN d2 NaN
# 3 b2 NaN e2
# 4 NaN NaN e1
# 5 NaN d5 e1

How to subset a DataFrame based on similar column names

How to subset similar columns in pandas based on keywords like A B C D. Now I have taken this as example is there any better way if new columns were given and logic should work.
df
A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 D3 D4
1 a x 1 a x 3 c 7 d s 4
2 b 5 2 b 5 4 d s c 7 d
3 c 7 3 c 7 1 a x 1 a x
4 d s 4 d s b 5 2 b s 7
You can use pandas.Index.groupby
groups = df.columns.groupby(df.columns.str[0])
#{'A': ['A1', 'A2', 'A3'],
# 'B': ['B1', 'B2', 'B3'],
# 'C': ['C1', 'C2'],
# 'D': ['D1', 'D2', 'D3', 'D4']}
Then you can access data this way:
df[groups['B']]
# B1 B2 B3
#0 1 a x
#1 2 b 5
#2 3 c 7
#3 4 d s
Keep in mind groups is a dict, so you can use any dict method too.
Another approach can be to use df.columns in conjuction with str.contains
a_col_lst = df.columns[df.columns.str.contains('A')]
b_col_lst = df.columns[df.columns.str.contains('B')]
df_A = df.loc[a_col_lst]
df_B = df.loc[b_col_lst]
You can apply regex as well within columns.str.contains
You could use filter along with a regex pattern, e.g.
df_A = df.filter(regex=(r'^A.*'))
You could also use select along with startswith:
df_A = df.select(lambda col: col.startswith('A'), axis=1)

get distinct columns dataframe

Hello how can i do to only the lines where val is different in the 2 dataframes.
Notice that i can have id1 or id2 or both as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)
First merge by all columns with left_on and right_on parameters, then filter out both rows and remove missing values by reshape by stack with unstack:
df=pd.merge(F1, F2, left_on=['id1','id2','VAL2'],
right_on=['id1','id2','VAL1'], how="outer", indicator=True)
df=(df[df['_merge'] !='both']
.set_index(['id1','id2'])
.drop('_merge', 1)
.stack()
.unstack()
.reset_index())
print (df)
id1 id2 VAL2 VAL1
0 X02 Y4 3 1
1 X22 Y1 1 3
F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\
.query("VAL1!=VAL2")

Resources