get distinct columns dataframe

get distinct columns dataframe - python-3.x

Hello how can i do to only the lines where val is different in the 2 dataframes.
Notice that i can have id1 or id2 or both as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)

First merge by all columns with left_on and right_on parameters, then filter out both rows and remove missing values by reshape by stack with unstack:
df=pd.merge(F1, F2, left_on=['id1','id2','VAL2'],
right_on=['id1','id2','VAL1'], how="outer", indicator=True)
df=(df[df['_merge'] !='both']
.set_index(['id1','id2'])
.drop('_merge', 1)
.stack()
.unstack()
.reset_index())
print (df)
id1 id2 VAL2 VAL1
0 X02 Y4 3 1
1 X22 Y1 1 3

F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\
.query("VAL1!=VAL2")

Related

Pandas Dataframe merge on two different key to get original data

Question title might be confusing but here is the example of what I intends to perform.
Below is one the main dataframe with request data
d = {'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4'],'B':[-1,5,6,7000],'ExtD':['CA','CB','CC','CD']}
df = pd.DataFrame(data=d)
df
Now, Response might be based on ID or ID2 column and looks like this -
d = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3]}
df2 = pd.DataFrame(data=d)
df2
where RetID could be ID or ID2 from the request along with additional data C. Once response is received I need to merge it back with original dataframe to get data ExtD.
the solution I have come up with is to do -
df2 = df2.merge(df[['ID','ExtD',]],'left',left_on=['RetID'],right_on=['ID'])
df2 = df2.merge(df[['ID2','ExtD']],'left',left_on=['RetID'],right_on=['ID2'],suffixes = ('_d1','_d2'))
df2.rename({'ExtD_d1':'ExtD'},axis=1,inplace=True)
df2.loc[df2['ExtD'].isnull(),'ExtD'] = df2['ExtD_d2']
df2.drop({'ID2','ExtD_d2'},axis=1,inplace=True)
so expected output is,
res = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD']}
df2= pd.DataFrame(data=res)
df2
EDIT2: updated requirement tweak.
res = {'RetID':['A1','A2','B1','B2'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD'],'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4']}
Is there an efficient way to do this ? There might be more than 2 IDs - ID, ID2, ID3 and more than one column to join from the reqest dataframe. TIA.
EDIT: Fixed the typo.

Use melt to transform your first dataframe then merge with the second:
tmp = df.melt('ExtD', value_vars=['ID', 'ID2'], value_name='RetID')
df2 = df2.merge(tmp[['ExtD', 'RetID']])
>>> df2
RetID C ExtD
0 A1 1.3 CA
1 A2 5.4 CB
2 B1 4.5 CA
3 B2 1.3 CB
>>> tmp
ExtD variable RetID
0 CA ID A1
1 CB ID A2
2 CC ID A3
3 CD ID A4
4 CA ID2 B1
5 CB ID2 B2
6 CC ID2 B3
7 CD ID2 B4
Update
What if I need to merge ID and ID2 columns as well?
df2 = df2.merge(df[['ID', 'ID2', 'ExtD']], on='ExtD')
>>> df2
RetID C ExtD ID ID2
0 A1 1.3 CA A1 B1
1 A2 5.4 CB A2 B2
2 B3 4.5 CC A3 B3
3 B4 1.3 CD A4 B4

pd dataframe from lists and dictionary using series

I have few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out, I seem to be missing something:
one simple example bellow:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using series I would do like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected
for dict would do
df = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is like this the first df would be overwritten by the second call of pd.Dataframe
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict in 2 separate lists and just use Series over 4 lists, but I would think there is a better way to do this, out of 2 lists and 1 dict as above to have directly one df with 4 columns.
thanks for the help

you can also use pd.concat to concat two dataframe.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)

Why not build each column seperately via dict.keys() and dict.values() instead of using dict.items()
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

You can unpack zipped values of list generated from d.items() and pass to itertools.zip_longest for add missing values for match by maximum length of list:
#dict is python code word, so used d for variable
d={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
fillvalue=np.nan),
columns=['col1','col2','col3','col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

How to add column name to cell in pandas dataframe?

How do I take a normal data frame, like the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
and produce a dataframe where the column name is added to the cell in the frame, like the following:
d = {'col1': ['col1=1', 'col1=2'], 'col2': ['col2=3', 'col2=4']}
df = pd.DataFrame(data=d)
df
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
Any help is appreciated.

Make a new DataFrame containing the col*= strings, then add it to the original df with its values converted to strings. You get the desired result because addition concatenates strings:
>>> pd.DataFrame({col:str(col)+'=' for col in df}, index=df.index) + df.astype(str)
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4

You can use apply to set column name in cells and then join them with '=' and the values.
df.apply(lambda x: x.index+'=', axis=1)+df.astype(str)
Out[168]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4

You can try this
df.ne(0).mul(df.columns)+'='+df.astype(str)
Out[1118]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4

pandas merge with condition

I have two dataframe and want to merge it based on max of another column
df1:
C2
A
B
C
df2:
C1 C2 val
X A 100
Y A 50.5
Z A 60
E B 90
F B 45
G C 100
I tried,
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
I get the error, AttributeError: 'numpy.float64' object has no attribute 'head'
The val column has only numbers. How should I modify this and Why do I encounter this error ?
The expected output is:
df3:
C2 C1 val
A X 100
B E 90
C G 100
Thanks in advance.

I think you need merge by left join:
df3 = df2.merge(df1, on='C2', how='left')
And then groupby with idxmax for indices of max values per groups and select rows by loc:
df3 = df3.loc[df3.groupby('C2')['val'].idxmax()]
Or use sort_values with drop_duplicates:
df3 = df3.sort_values(['C2', 'val']).drop_duplicates('C2', keep='last')
print (df3)
C1 C2 val
0 X A 100.0
3 E B 90.0
5 G C 100.0
Why do I encounter this error ?
Problem is you get scalar - max value of column val:
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
print (df3)
100.0
So if use print (df3.head()) it failed.

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A','B'],
'COL2' : [1,2,1],
'COL3' : ['X','Y','X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example, but I need a calculation that could handle many such instances (and not just inserting the row in interest manually).
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it can be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB=[]
for combination in itertools.product(COL1SET, COL3SET):
COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n,col in enumerate(new_col_list):
UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF

Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :]
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

get distinct columns dataframe - python-3.x

F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\ .query("VAL1!=VAL2")

Related

Pandas Dataframe merge on two different key to get original data

pd dataframe from lists and dictionary using series

How to add column name to cell in pandas dataframe?

pandas merge with condition

Pandas add empty rows as filler

Categories

Resources