get distinct columns dataframe - python-3.x

Hello how can i do to only the lines where val is different in the 2 dataframes.
Notice that i can have id1 or id2 or both as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)

First merge by all columns with left_on and right_on parameters, then filter out both rows and remove missing values by reshape by stack with unstack:
df=pd.merge(F1, F2, left_on=['id1','id2','VAL2'],
right_on=['id1','id2','VAL1'], how="outer", indicator=True)
df=(df[df['_merge'] !='both']
.set_index(['id1','id2'])
.drop('_merge', 1)
.stack()
.unstack()
.reset_index())
print (df)
id1 id2 VAL2 VAL1
0 X02 Y4 3 1
1 X22 Y1 1 3

F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\
.query("VAL1!=VAL2")

Related

Pandas Dataframe merge on two different key to get original data

Question title might be confusing but here is the example of what I intends to perform.
Below is one the main dataframe with request data
d = {'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4'],'B':[-1,5,6,7000],'ExtD':['CA','CB','CC','CD']}
df = pd.DataFrame(data=d)
df
Now, Response might be based on ID or ID2 column and looks like this -
d = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3]}
df2 = pd.DataFrame(data=d)
df2
where RetID could be ID or ID2 from the request along with additional data C. Once response is received I need to merge it back with original dataframe to get data ExtD.
the solution I have come up with is to do -
df2 = df2.merge(df[['ID','ExtD',]],'left',left_on=['RetID'],right_on=['ID'])
df2 = df2.merge(df[['ID2','ExtD']],'left',left_on=['RetID'],right_on=['ID2'],suffixes = ('_d1','_d2'))
df2.rename({'ExtD_d1':'ExtD'},axis=1,inplace=True)
df2.loc[df2['ExtD'].isnull(),'ExtD'] = df2['ExtD_d2']
df2.drop({'ID2','ExtD_d2'},axis=1,inplace=True)
so expected output is,
res = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD']}
df2= pd.DataFrame(data=res)
df2
EDIT2: updated requirement tweak.
res = {'RetID':['A1','A2','B1','B2'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD'],'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4']}
Is there an efficient way to do this ? There might be more than 2 IDs - ID, ID2, ID3 and more than one column to join from the reqest dataframe. TIA.
EDIT: Fixed the typo.
Use melt to transform your first dataframe then merge with the second:
tmp = df.melt('ExtD', value_vars=['ID', 'ID2'], value_name='RetID')
df2 = df2.merge(tmp[['ExtD', 'RetID']])
>>> df2
RetID C ExtD
0 A1 1.3 CA
1 A2 5.4 CB
2 B1 4.5 CA
3 B2 1.3 CB
>>> tmp
ExtD variable RetID
0 CA ID A1
1 CB ID A2
2 CC ID A3
3 CD ID A4
4 CA ID2 B1
5 CB ID2 B2
6 CC ID2 B3
7 CD ID2 B4
Update
What if I need to merge ID and ID2 columns as well?
df2 = df2.merge(df[['ID', 'ID2', 'ExtD']], on='ExtD')
>>> df2
RetID C ExtD ID ID2
0 A1 1.3 CA A1 B1
1 A2 5.4 CB A2 B2
2 B3 4.5 CC A3 B3
3 B4 1.3 CD A4 B4

pd dataframe from lists and dictionary using series

I have few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out, I seem to be missing something:
one simple example bellow:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using series I would do like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected
for dict would do
df = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is like this the first df would be overwritten by the second call of pd.Dataframe
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict in 2 separate lists and just use Series over 4 lists, but I would think there is a better way to do this, out of 2 lists and 1 dict as above to have directly one df with 4 columns.
thanks for the help
you can also use pd.concat to concat two dataframe.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column seperately via dict.keys() and dict.values() instead of using dict.items()
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack zipped values of list generated from d.items() and pass to itertools.zip_longest for add missing values for match by maximum length of list:
#dict is python code word, so used d for variable
d={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
fillvalue=np.nan),
columns=['col1','col2','col3','col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

How to add column name to cell in pandas dataframe?

How do I take a normal data frame, like the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
and produce a dataframe where the column name is added to the cell in the frame, like the following:
d = {'col1': ['col1=1', 'col1=2'], 'col2': ['col2=3', 'col2=4']}
df = pd.DataFrame(data=d)
df
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
Any help is appreciated.
Make a new DataFrame containing the col*= strings, then add it to the original df with its values converted to strings. You get the desired result because addition concatenates strings:
>>> pd.DataFrame({col:str(col)+'=' for col in df}, index=df.index) + df.astype(str)
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can use apply to set column name in cells and then join them with '=' and the values.
df.apply(lambda x: x.index+'=', axis=1)+df.astype(str)
Out[168]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can try this
df.ne(0).mul(df.columns)+'='+df.astype(str)
Out[1118]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4

pandas merge with condition

I have two dataframe and want to merge it based on max of another column
df1:
C2
A
B
C
df2:
C1 C2 val
X A 100
Y A 50.5
Z A 60
E B 90
F B 45
G C 100
I tried,
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
I get the error, AttributeError: 'numpy.float64' object has no attribute 'head'
The val column has only numbers. How should I modify this and Why do I encounter this error ?
The expected output is:
df3:
C2 C1 val
A X 100
B E 90
C G 100
Thanks in advance.
I think you need merge by left join:
df3 = df2.merge(df1, on='C2', how='left')
And then groupby with idxmax for indices of max values per groups and select rows by loc:
df3 = df3.loc[df3.groupby('C2')['val'].idxmax()]
Or use sort_values with drop_duplicates:
df3 = df3.sort_values(['C2', 'val']).drop_duplicates('C2', keep='last')
print (df3)
C1 C2 val
0 X A 100.0
3 E B 90.0
5 G C 100.0
Why do I encounter this error ?
Problem is you get scalar - max value of column val:
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
print (df3)
100.0
So if use print (df3.head()) it failed.

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A','B'],
'COL2' : [1,2,1],
'COL3' : ['X','Y','X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example, but I need a calculation that could handle many such instances (and not just inserting the row in interest manually).
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it can be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB=[]
for combination in itertools.product(COL1SET, COL3SET):
COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n,col in enumerate(new_col_list):
UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :]
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y

Resources