Filling NaNs when using pd.merge - python-3.x

I have two data frames and I want to merge them on common columns as seen below. There is also a new column in the second data frame.
dummy_data1 = {'id': ['1', '2', '3', '4'],'name': ['A', 'C', 'E', 'G'],
'year':['2012','2012','2012','2012']}
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'name', 'year'])
dummy_data2 = {
'id': ['1', '2', '3', '7',],
'name': ['A', 'C', 'E', 'P'],
'ADDRESS': ['X', 'Y', 'Z', 'P'],'year':['2013','2013','2013','2013']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'name','ADDRESS','year'])
When I merge these two data frames with
df_merge = pd.merge(df1, df2, on=['name','id','year'],how='outer')
I get NaNs for some rows because of the newly added column, as expected (the result was posted as an image).
My question is about the NaNs: is there a way to fill each NaN with the value for the same id when that value is available in the other data frame? So index 0 would get 'X' instead of NaN, index 1 would get 'Y', and so forth. I just want to assume that 'ADDRESS' doesn't change between years.
Thanks!

I would suggest pd.merge_ordered followed by a backward fill.
merge_ordered is designed for ordered data, so I would advise sorting the data before using it. In your case, it is already sorted.
pd.merge_ordered(df1,df2).bfill()
id name year ADDRESS
0 1 A 2012 X
1 1 A 2013 X
2 2 C 2012 Y
3 2 C 2013 Y
4 3 E 2012 Z
5 3 E 2013 Z
6 4 G 2012 P
7 7 P 2013 P
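One thing to watch in the output above: the backward fill also gives id 4 the address 'P', which belongs to id 7, because bfill takes the next available value regardless of id. If that is not wanted, one possible alternative is to look each address up by id. This is only a sketch and not part of the answer above; the name address_by_id is made up for illustration, and it assumes each id has a single address, as the question states.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4'],
                    'name': ['A', 'C', 'E', 'G'],
                    'year': ['2012', '2012', '2012', '2012']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '7'],
                    'name': ['A', 'C', 'E', 'P'],
                    'ADDRESS': ['X', 'Y', 'Z', 'P'],
                    'year': ['2013', '2013', '2013', '2013']})

df_merge = pd.merge(df1, df2, on=['id', 'name', 'year'], how='outer')

# Hypothetical helper: one ADDRESS per id, taken from the frame that has the column.
address_by_id = df2.drop_duplicates('id').set_index('id')['ADDRESS']

# Fill only the missing addresses, using the id of each row as the lookup key.
df_merge['ADDRESS'] = df_merge['ADDRESS'].fillna(df_merge['id'].map(address_by_id))
print(df_merge)
```

With this approach id 4 keeps NaN, since no address for it exists in either frame.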

Related

Concatenate Each cell in column A with Column B in Python DataFrame

I need help concatenating each value of one column with each value of another column of a dataframe.
Input:
Output
Use itertools.product in list comprehension:
from itertools import product
L = [''.join(x) for x in product(df['Col1'], df['Col2'])]
#alternative
L = [a + b for a, b in product(df['Col1'], df['Col2'])]
df = pd.DataFrame({'Col3':L})
print (df)
Col3
0 AE
1 AF
2 AG
3 BE
4 BF
5 BG
6 CE
7 CF
8 CG
Or cross join solution with helper column a:
df1 = df.assign(a=1)
df1 = df1.merge(df1, on='a')
df = (df1['Col1_x'] + df1['Col2_y']).to_frame('Col3')
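If your pandas version is 1.2 or later, the helper column is probably not needed, because merge accepts how='cross'. A minimal sketch follows; the input frame is an assumption reconstructed from the printed output above, since the original input was only posted as an image.

```python
import pandas as pd

# Assumed input, inferred from the Col3 output shown earlier.
df = pd.DataFrame({'Col1': ['A', 'B', 'C'], 'Col2': ['E', 'F', 'G']})

# Cross join: every value of Col1 paired with every value of Col2.
crossed = df[['Col1']].merge(df[['Col2']], how='cross')

# Concatenate the two string columns into a single-column frame.
result = (crossed['Col1'] + crossed['Col2']).to_frame('Col3')
print(result)
```

Selecting one column on each side before merging avoids the _x/_y suffixes that appear when a frame is merged with itself.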
Remark: it's easier to help if you post the code that creates the input rather than images, for example:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D'], 'col2': ['E', 'F', 'G', 'H']})
Solution: the least-effort approach is the itertools library.
from itertools import product
lst1 = ['A', 'B', 'C', 'D']
lst2 = ['E', 'F', 'G', 'H']
reslst = list(product(lst1, lst2))
or, starting from the dataframe columns:
reslst = list(product(df['col1'].values, df['col2'].values))
print(reslst)
Note: the result is a list of n**2 tuples (every pairing of the two columns), so it cannot be assigned back as a column of the original n-row dataframe.

fill a new column in a pandas dataframe from the value of another dataframe [duplicate]

This question already has an answer here:
Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe
(1 answer)
Closed 4 years ago.
I have two dataframes :
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one from the values of col1 and col2 that can be found in the second one. That is, this new dataframe:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1], 'col4': [12, 23, 45, 45, 23, float('nan')]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How can I do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in the question "Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe", but it is not the same question.
Here, not only is there no single ID column, since the key is split across col1 and col2, but above all the key, although unique in the second dataframe, is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit 2: In addition, some (col1, col2) pairs of df1 may not be present in df2, in which case NaN is expected in col4, and some (col1, col2) pairs of df2 may not be needed in df1. To illustrate these cases, I added some rows to both df1 and df2 to show how it could look in the worst-case scenario.
You could also use map like
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
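For what it's worth, a left merge should also handle this case, even though the key is not unique in the first dataframe: each row of df1 simply picks up the matching row of df2, and unmatched rows get NaN. A minimal sketch using the data from the question (the names lookup and result are illustrative only):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'a', 'a', 'b', 'h'],
                    'col2': ['c', 'c', 'd', 'd', 'c', 'i'],
                    'col3': [1, 2, 3, 4, 5, 1]})
df2 = pd.DataFrame({'col1': ['a', 'b', 'a', 'f'],
                    'col2': ['c', 'c', 'd', 'k'],
                    'col3': [12, 23, 45, 78]})

# Rename the value column so it does not collide with df1's col3.
lookup = df2.rename(columns={'col3': 'col4'})

# how='left' keeps every row of df1, in order; pairs missing from df2 get NaN.
result = df1.merge(lookup, on=['col1', 'col2'], how='left')
print(result)
```

This relies on the (col1, col2) pairs being unique in df2, as stated in the question; duplicates there would multiply rows in the result.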

Looking for an analogue to pd.DataFrame.drop_duplicates() where order does not matter

I would like to use something similar to dropping the duplicates of a DataFrame, but I would like the order of the columns not to matter. What I mean is that the function should consider a row consisting of the entries 'a', 'b' to be identical to a row consisting of the entries 'b', 'a'. For example, given
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])
0 1
0 a b
1 c d
2 a b
3 b a
I would like to obtain:
0 1
0 a b
1 c d
where the preference is for efficiency, as I run this on a huge dataset within a groupby operation.
Call np.sort first to sort the values within each row, and then drop duplicates.
import numpy as np
df[:] = np.sort(df.values, axis=1)
df.drop_duplicates()
0 1
0 a b
1 c d
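Note that assigning to df[:] overwrites the original rows with their sorted versions. If the original values need to stay untouched, a possible variant (a sketch; sorted_rows and result are illustrative names) is to sort a copy and use it only to build a mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])

# Sort the values within each row on a copy, keeping df's index for alignment.
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# Keep the first occurrence of each order-insensitive row; df itself is unchanged.
result = df[~sorted_rows.duplicated()]
print(result)
```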

Pandas Replace All But Middle Values per Category of a Level with Blank

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I know that I can access the values of each level like so:
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
Next, I'd like to replace all values in each of the outer levels with blank ('') save for the middle or n/2+1 values.
So that:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
becomes:
Index(['x', '', 'y', '', 'z', 'x', 'y', 'z', ''], dtype='object', name='B')
and
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
becomes:
Index(['', '', 'a', '', '', '', 'b', '', ''], dtype='object', name='A')
Ultimately, I will attempt to use these as secondary and tertiary y-axis labels in a Matplotlib horizontal bar chart, something like the one in the image originally attached here (though some of my labels may be shifted up).
Finally took the time to figure this out...
#First, get the values of the index level.
A=table.index.get_level_values(0)
#Next, convert the values to a data frame.
ndf = pd.DataFrame({'A2':A.values})
#Next, get the count of rows per group.
ndf['A2Count']=ndf.groupby('A2')['A2'].transform(lambda x: x.count())
#Next, get the position based on the logic in the question.
ndf['A2Pos']=ndf['A2Count'].apply(lambda x: x/2 if x%2==0 else (x+1)/2)
#Next, order the rows per group.
ndf['A2GpOrdr']=ndf.groupby('A2').cumcount()+1
#And finally, create the column to use for plotting this level's axis label.
ndf['A2New']=ndf.apply(lambda x: x['A2'] if x['A2GpOrdr']==x['A2Pos'] else "",axis=1)
ndf
A2 A2Count A2Pos A2GpOrdr A2New
0 a 5 3.0 1
1 a 5 3.0 2
2 a 5 3.0 3 a
3 a 5 3.0 4
4 a 5 3.0 5
5 b 4 2.0 1
6 b 4 2.0 2 b
7 b 4 2.0 3
8 b 4 2.0 4
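The steps above handle level 'A'. For level 'B' the same label can appear under different parents ('x' under both 'a' and 'b'), so the grouping probably needs to include the outer levels as well. Below is a rough generalisation, offered as a sketch rather than a tested recipe for every index shape; middle_only is a hypothetical helper name, and it identifies the level by integer position.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')

def middle_only(index, level):
    """Blank out every label of `level` except the middle one of each group."""
    frame = index.to_frame(index=False)
    keys = list(frame.columns[:level + 1])    # this level plus all outer levels
    grouped = frame.groupby(keys, sort=False)
    order = grouped.cumcount() + 1            # 1-based position within the group
    size = order + grouped.cumcount(ascending=False)  # group size, repeated per row
    middle = (size + 1) // 2                  # n/2 for even n, (n+1)/2 for odd n
    return frame[keys[-1]].where(order == middle, '').tolist()

labels_a = middle_only(table.index, 0)
labels_b = middle_only(table.index, 1)
print(labels_a)
print(labels_b)
```

The integer division keeps the positions as integers, so no float comparison is involved.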

Pandas Get All Values from Multiindex levels

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (or level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1) in such a way that it preserves spacing so that I can use those lists as well. I'm ultimately trying to use them to create something like this via plotting:
Here's the question from which this one stemmed.
Thanks in advance!
You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an integer position or a label for the level.
Note that for the higher levels, the values are repeated to match the length of the index at the lowest level; the display simply hides these repeats.
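To make the last two points concrete, here is a small self-contained sketch on a toy index (not the table from the question) that compares lookup by label with lookup by position, and the repeated values with the unique ones stored in levels:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('a', 'x', 'a'), ('a', 'x', 'b'), ('b', 'y', 'a')],
    names=['A', 'B', 'C'])

by_label = index.get_level_values('A')   # level selected by name
by_position = index.get_level_values(0)  # same level selected by position

# One value per row of the index, repeats included.
repeated = list(by_label)

# The unique labels of the level, without repeats.
unique = list(index.levels[0])

print(repeated, unique)
```

For plotting one label per row, the repeated form is usually the one you want, since its length matches the number of rows.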

Resources