How to create new rows for entries that do not exist in pandas

I have the following dataframe:
import pandas as pd
foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                    'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'],
                    'value': [10, 20, 30, 40]})
For every combination of cat and br, I want to add the missing ch values with value 0.
My final dataframe should look like this:
foo_final = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b'],
                          'br': [1, 2, 2, 3, 1, 1, 2, 3, 3],
                          'ch': ['A', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'B'],
                          'value': [10, 20, 30, 40, 0, 0, 0, 0, 0]})

Use DataFrame.set_index for a MultiIndex and then DataFrame.unstack with DataFrame.stack:
foo = foo.set_index(['cat', 'br', 'ch']).unstack(fill_value=0).stack().reset_index()
print(foo)
  cat  br ch  value
0   a   1  A     10
1   a   1  B      0
2   a   1  C      0
3   a   2  A     20
4   a   2  B     30
5   a   2  C      0
6   b   3  A      0
7   b   3  B      0
8   b   3  C     40
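An alternative sketch (my own, not from the answer above) builds the grid of existing (cat, br) pairs crossed with every observed ch explicitly, then left-joins the values back; this assumes pandas >= 1.2 for merge(how='cross'):

```python
import pandas as pd

foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                    'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'],
                    'value': [10, 20, 30, 40]})

# every existing (cat, br) pair crossed with every observed ch value
pairs = foo[['cat', 'br']].drop_duplicates()
chs = pd.DataFrame({'ch': sorted(foo['ch'].unique())})
grid = pairs.merge(chs, how='cross')

# left-join the original values onto the grid; missing entries become 0
out = grid.merge(foo, on=['cat', 'br', 'ch'], how='left')
out['value'] = out['value'].fillna(0).astype(int)
```

Unlike unstack/stack, this never materializes the full cat x br x ch product, only the (cat, br) pairs that actually occur.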

Related

fill a new column in a pandas dataframe from the value of another dataframe [duplicate]

I have two dataframes:
df1 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'],
                         'col2': ['c', 'c', 'd', 'd', 'c', 'i'],
                         'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
df2 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'],
                         'col2': ['c', 'c', 'd', 'k'],
                         'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one according to the values of col1 and col2 that can be found in the second one. That is, this new one:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'],
                   'col2': ['c', 'c', 'd', 'd', 'c', 'i'],
                   'col3': [1, 2, 3, 4, 5, 1],
                   'col4': [12, 23, 45, 45, 23, float('nan')]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How can I do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in this subject: Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe, but it is not the same question.
Here, not only does the ID not exist as such, since it is split across col1 and col2, but above all, although unique in the second dataframe, it is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit2: In addition, (col1, col2) pairs in df1 may not be present in df2 (in this case NaN is expected in col4), and (col1, col2) pairs in df2 may not be needed in df1. To illustrate these cases, I added some rows to both df1 and df2 to show the worst-case scenario.
You could also use map like
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
  col1 col2  col3  col4
0    a    c     1  12.0
1    b    c     2  23.0
2    a    d     3  45.0
3    a    d     4  45.0
4    b    c     5  23.0
5    h    i     1   NaN
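Despite the question's doubts, a plain left merge also works here: the (col1, col2) keys only need to be unique in df2, and non-unique keys on the left side are fine. A sketch (my own, not from the thread):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'a', 'a', 'b', 'h'],
                    'col2': ['c', 'c', 'd', 'd', 'c', 'i'],
                    'col3': [1, 2, 3, 4, 5, 1]})
df2 = pd.DataFrame({'col1': ['a', 'b', 'a', 'f'],
                    'col2': ['c', 'c', 'd', 'k'],
                    'col3': [12, 23, 45, 78]})

# a left merge keeps every df1 row; keys absent from df2 yield NaN in col4
out = df1.merge(df2.rename(columns={'col3': 'col4'}),
                on=['col1', 'col2'], how='left')
```

Each df1 row picks up the single matching df2 value, repeated as often as the key repeats on the left.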

Change values in pandas dataframe based on values in certain columns

How can I convert this first dataframe to the one below it? Based on different scenarios of the first three columns matching, I want to change the values in the rest of the columns.
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
  col1 col2 col3 s1 s2 s3 s4
0  foo  foo  bar  a  b  c  d
1  bar  foo  bar  a  b  c  d
2  spa  foo  bar  a  b  c  d
If col1 = col2, I want to change all a's to 2, all b's and c's to 1, and all d's to 0. This is row 1 in my example df.
If col1 = col3, I want to change all a's to 0, all b's and c's to 1, and all d's to 2. This is row 2 in my example df.
If col1 != col2/col3, I want to delete the row and add 1 to a counter so I have a total of deleted rows. This is row 3 in my example df.
So my final dataframe would look like this, with counter = 1:
df = pd.DataFrame([['foo', 'foo', 'bar', '2', '1', '1', '0'],
                   ['bar', 'foo', 'bar', '0', '1', '1', '2']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
  col1 col2 col3 s1 s2 s3 s4
0  foo  foo  bar  2  1  1  0
1  bar  foo  bar  0  1  1  2
I was reading that using df.iterrows is slow, so there must be a way to do this on the whole df at once, but my original idea was:
for idx, row in df.iterrows():
    if row["col1"] == row["col2"]:
        df.replace(to_replace=['a'], value='2', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='0', inplace=True)
    elif row["col1"] == row["col3"]:
        df.replace(to_replace=['a'], value='0', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='2', inplace=True)
    else:
        pass  # delete row, add 1 to counter
The original df is massive, so speed is important to me. I'm hoping it's possible to do the conversions on the whole dataframe without iterrows. Even if it's not possible, I could use help getting the syntax right for the iterrows.
You can remove rows by boolean indexing first:
m1 = df["col1"] == df["col2"]
m2 = df["col1"] == df["col3"]
m = m1 | m2
Count the removed rows by inverting the combined mask with ~ and summing:
counter = (~m).sum()
print(counter)
1
df = df[m].copy()
print(df)
  col1 col2 col3 s1 s2 s3 s4
0  foo  foo  bar  a  b  c  d
1  bar  foo  bar  a  b  c  d
and then replace using a dictionary chosen by condition:
d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
m1 = df["col1"] == df["col2"]
# replace in all columns except col1-col3
cols = df.columns.difference(['col1', 'col2', 'col3'])
df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
print(df)
  col1 col2 col3 s1 s2 s3 s4
0  foo  foo  bar  2  1  1  0
1  bar  foo  bar  0  1  1  2
Timings:
In [138]: %timeit (jez(df))
872 ms ± 6.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [139]: %timeit (hb(df))
1.33 s ± 9.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Setup:
import numpy as np
import pandas as pd

np.random.seed(456)
a = ['foo', 'bar', 'spa']
b = list('abcd')
N = 100000
df1 = pd.DataFrame(np.random.choice(a, size=(N, 3))).rename(columns=lambda x: 'col{}'.format(x+1))
df2 = pd.DataFrame(np.random.choice(b, size=(N, 20))).rename(columns=lambda x: 's{}'.format(x+1))
df = df1.join(df2)
#print (df.head())

def jez(df):
    m1 = df["col1"] == df["col2"]
    m2 = df["col1"] == df["col3"]
    m = m1 | m2
    counter = (~m).sum()
    df = df[m].copy()
    d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
    d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
    m1 = df["col1"] == df["col2"]
    cols = df.columns.difference(['col1', 'col2', 'col3'])
    df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
    df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
    return df

def hb(df):
    counter = 0
    df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
    df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
    index_drop = df[(df.col1 != df.col3) & (df.col1 != df.col2)].index
    counter = counter + len(index_drop)
    df = df.drop(index_drop)
    return df
You can use:
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
counter = 0
df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
index_drop = df[(df.col1 != df.col3) & (df.col1 != df.col2)].index
counter = counter + len(index_drop)
df = df.drop(index_drop)
print(df)
print(counter)
Output:
  col1 col2 col3 s1 s2 s3 s4
0  foo  foo  bar  2  1  1  0
1  bar  foo  bar  0  1  1  2
1 # counter

Remove exact rows and frequency of rows of a data.frame where certain column values match with column values of another data.frame in python 3

Consider the following two DataFrames created using pandas in Python 3:
a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})
I would like to remove the rows of a1 that also occur in a2 wherever the values of columns 'A' and 'B' are the same (ignoring the 'NO' column), removing each match only as many times as it occurs in a2, so that the result should be:
   A  B  NO
   4  d  d4
   5  e  d5
   4  d  d7
   2  b  d8
Is there any built-in function in pandas or any other library in python 3 to get this result?
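There is no single built-in for the frequency-aware part, but one workable sketch (my own, not an answer from the thread) numbers repeated (A, B) pairs with cumcount so that each occurrence in a2 cancels exactly one occurrence in a1; the helper column name _k is arbitrary:

```python
import pandas as pd

a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})

# number repeated (A, B) pairs in each frame: 0 for the first occurrence, 1 for the second, ...
a1k = a1.assign(_k=a1.groupby(['A', 'B']).cumcount())
a2k = a2.assign(_k=a2.groupby(['A', 'B']).cumcount())[['A', 'B', '_k']]

# rows whose (A, B, occurrence-number) has no partner in a2 survive
m = a1k.merge(a2k, on=['A', 'B', '_k'], how='left', indicator=True)
out = m[m['_merge'] == 'left_only'].drop(columns=['_k', '_merge'])
```

For example, (2, 'b') occurs three times in a1 (d2, d6, d8) but only twice in a2, so d8 is kept.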

Pandas Replace All But Middle Values per Category of a Level with Blank

Given the following pivot table:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I know that I can access the values of each level like so:
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
Next, I'd like to replace all values in each of the outer levels with blank ('') except for the middle value of each run (the n/2-th for even n, the (n+1)/2-th for odd n, counting from 1).
So that:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
becomes:
Index(['x', '', 'y', '', 'z', 'x', 'y', 'z', ''], dtype='object', name='B')
and
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
becomes:
Index(['', '', 'a', '', '', '', 'b', '', ''], dtype='object', name='A')
Ultimately, I will attempt to use these as secondary and tertiary y-axis labels in a Matplotlib horizontal bar chart (though some of my labels may be shifted up).
Finally took the time to figure this out...
#First, get the values of the index level.
A=table.index.get_level_values(0)
#Next, convert the values to a data frame.
ndf = pd.DataFrame({'A2':A.values})
#Next, get the count of rows per group.
ndf['A2Count']=ndf.groupby('A2')['A2'].transform(lambda x: x.count())
#Next, get the position based on the logic in the question.
ndf['A2Pos']=ndf['A2Count'].apply(lambda x: x/2 if x%2==0 else (x+1)/2)
#Next, order the rows per group.
ndf['A2GpOrdr']=ndf.groupby('A2').cumcount()+1
#And finally, create the column to use for plotting this level's axis label.
ndf['A2New']=ndf.apply(lambda x: x['A2'] if x['A2GpOrdr']==x['A2Pos'] else "",axis=1)
ndf
  A2  A2Count  A2Pos  A2GpOrdr A2New
0  a        5    3.0         1
1  a        5    3.0         2
2  a        5    3.0         3     a
3  a        5    3.0         4
4  a        5    3.0         5
5  b        4    2.0         1
6  b        4    2.0         2     b
7  b        4    2.0         3
8  b        4    2.0         4
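The helper-column recipe above can be condensed into a few vectorized lines; this sketch (my own condensation, not from the thread) expresses the same middle rule 0-based as (n - 1) // 2:

```python
import pandas as pd

# the outer level's values, in index order
A = pd.Series(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])

pos = A.groupby(A).cumcount()                 # 0-based position within each run
size = A.groupby(A).transform('size')         # length of each run
labels = A.where(pos == (size - 1) // 2, '')  # keep only the middle element
```

For a run of 5 this keeps position 2 (the 3rd value) and for a run of 4 position 1 (the 2nd value), matching the (x+1)/2 and x/2 logic in the answer.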

Pandas Get All Values from Multiindex levels

Given the following pivot table:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (or level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1) in such a way that it preserves spacing so that I can use those lists as well; I'm ultimately trying to use them to plot a chart with grouped axis labels.
Here's the question from which this one stemmed.
Thanks in advance!
You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an int (the level position) or a label (the level name).
Note that for the higher levels the values are repeated so that every level matches the length of the index at the lowest level; the printed output simply hides the repeats.
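Since the goal is a plain list for plotting, note that get_level_values returns an Index and Index.tolist() converts it; a minimal sketch with a smaller hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': ['x', 'y', 'z'],
                   'D': [1, 2, 3]})
table = df.pivot_table(index=['A', 'B'], aggfunc='sum')

# get_level_values returns an Index; tolist() gives a plain Python list,
# with the higher level's values repeated once per row of the index
labels = table.index.get_level_values('A').tolist()
```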
