Pandas Replace All But Middle Values per Category of a Level with Blank - python-3.x

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I know that I can access the values of each level like so:
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
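Next, I'd like to replace all values in each of the outer levels with blank ('') except for the middle value of each group: the (n+1)/2-th value when a group has an odd number n of entries, or the n/2-th value when n is even.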
So that:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
becomes:
Index(['x', '', 'y', '', 'z', 'x', 'y', 'z', ''], dtype='object', name='B')
and
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
becomes:
Index(['', '', 'a', '', '', '', 'b', '', ''], dtype='object', name='A')
Ultimately, I will attempt to use these as secondary and tertiary y-axis labels in a Matplotlib horizontal bar chart (though some of my labels may be shifted up).

Finally took the time to figure this out...
#First, get the values of the index level.
A=table.index.get_level_values(0)
#Next, convert the values to a data frame.
ndf = pd.DataFrame({'A2':A.values})
#Next, get the count of rows per group.
ndf['A2Count']=ndf.groupby('A2')['A2'].transform(lambda x: x.count())
#Next, get the position based on the logic in the question.
ndf['A2Pos']=ndf['A2Count'].apply(lambda x: x/2 if x%2==0 else (x+1)/2)
#Next, order the rows per group.
ndf['A2GpOrdr']=ndf.groupby('A2').cumcount()+1
#And finally, create the column to use for plotting this level's axis label.
ndf['A2New']=ndf.apply(lambda x: x['A2'] if x['A2GpOrdr']==x['A2Pos'] else "",axis=1)
ndf
  A2  A2Count  A2Pos  A2GpOrdr A2New
0  a        5    3.0         1
1  a        5    3.0         2
2  a        5    3.0         3     a
3  a        5    3.0         4
4  a        5    3.0         5
5  b        4    2.0         1
6  b        4    2.0         2     b
7  b        4    2.0         3
8  b        4    2.0         4
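The steps above cover level A only. Grouping on the label alone would not carry over to level B, where 'x' occurs under both 'a' and 'b'. Here is a minimal sketch of one way to generalise, assuming a group is defined by all levels up to and including the target level. `middle_labels` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')


def middle_labels(index, level):
    """Hypothetical helper: blank every label of `level` except the middle one per group."""
    frame = index.to_frame(index=False)
    # Assumption: a group is identified by all levels up to and including `level`.
    keys = list(frame.columns[:level + 1])
    target = keys[-1]
    grouped = frame.groupby(keys, sort=False)
    size = grouped[target].transform('count')   # rows per group
    order = grouped.cumcount() + 1              # 1-based position within the group
    middle = (size + 1) // 2                    # n/2 for even n, (n+1)/2 for odd n
    return frame[target].where(order == middle, '').tolist()


labels_a = middle_labels(table.index, 0)
labels_b = middle_labels(table.index, 1)
```

The two lists have the same length as the table, so they could be passed to a plotting call as tick labels alongside the level C values.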

Related

Create a new column in a dataframe, based on Groupby and values in a separate column

I have a df like so:
df = pd.DataFrame({'Info': ['A','B','C', 'D', 'E'], 'Section':['1','1', '2', '2', '3']})
I want to be able to create a new column, like 'Unique_Info', like so:
df = pd.DataFrame({'Info': ['A','B','C', 'D', 'E'], 'Section':['1','1', '2', '2', '3'],
'Unique_Info':[['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E']]})
So a list is created with all unique values from the Info column, belonging to that section. So Section=1, hence ['A', 'B'].
I assume groupby is the most convenient way, and I've used the following:
df['Unique_Info'] = df.groupby('Section').agg({'Info':'unique'})
Any ideas where I'm going wrong?
df.groupby().agg returns a result indexed by the group keys (here the Section values), not by the original row labels, so it cannot be assigned straight back to the dataframe. You should use map to align it with your rows:
s = df.groupby('Section')['Info'].agg('unique')
df['Unique_Info'] = df['Section'].map(s)
Output:
  Info Section Unique_Info
0    A       1      [A, B]
1    B       1      [A, B]
2    C       2      [C, D]
3    D       2      [C, D]
4    E       3         [E]
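To make the index mismatch concrete, the sketch below compares the labels of the aggregated series with the row labels of the dataframe before mapping. The variable names `section_labels`, `row_labels` and `unique_info` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Info': ['A', 'B', 'C', 'D', 'E'],
                   'Section': ['1', '1', '2', '2', '3']})

# One entry per Section: the index holds Section values, not row numbers.
s = df.groupby('Section')['Info'].agg('unique')
section_labels = s.index.tolist()
row_labels = df.index.tolist()

# map looks up each row's Section in s, producing one value per original row.
df['Unique_Info'] = df['Section'].map(s)
unique_info = [list(v) for v in df['Unique_Info']]
```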
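Alternatively, aggregate with groupby().agg and join the result back with df.merge (the original Info column comes back as Info_x):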
In [1531]: grp = df.groupby('Section')['Info'].agg(list).reset_index()
In [1535]: df.merge(grp, on='Section').rename(columns={'Info_y': 'unique'})
Out[1535]:
  Info_x Section  unique
0      A       1  [A, B]
1      B       1  [A, B]
2      C       2  [C, D]
3      D       2  [C, D]
4      E       3     [E]

How to create new rows for entries that do not exist, in pandas

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'], 'br': [1,2,2,3], 'ch': ['A', 'A', 'B', 'C'],
'value': [10,20,30,40]})
For every cat and br, I want to add the ch that is missing with value 0
My final dataframe should look like this:
foo_final = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b'],
'br': [1,2,2,3, 1, 1, 2, 3, 3],
'ch': ['A', 'A', 'B','C','B', 'C', 'C', 'A', 'B'],
'value': [10,20,30,40, 0,0, 0,0,0]})
Use DataFrame.set_index to build a MultiIndex, then DataFrame.unstack (with fill_value=0) followed by DataFrame.stack:
foo = foo.set_index(['cat','br','ch']).unstack(fill_value=0).stack().reset_index()
print (foo)
  cat  br ch  value
0   a   1  A     10
1   a   1  B      0
2   a   1  C      0
3   a   2  A     20
4   a   2  B     30
5   a   2  C      0
6   b   3  A      0
7   b   3  B      0
8   b   3  C     40
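The same result can also be reached without reshaping. A hedged sketch follows, assuming pandas 1.2 or later for `how='cross'`: build every combination of the existing (cat, br) pairs with the known ch values, then left-merge the original rows and fill the gaps with 0. The names `pairs`, `chs`, `full` and `out` are illustrative.

```python
import pandas as pd

foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'], 'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'], 'value': [10, 20, 30, 40]})

# Existing (cat, br) pairs and the set of ch values seen anywhere in the data.
pairs = foo[['cat', 'br']].drop_duplicates()
chs = foo[['ch']].drop_duplicates()

# Cartesian product of pairs and ch values (how='cross' needs pandas >= 1.2).
full = pairs.merge(chs, how='cross')

# Bring in the known values; combinations absent from foo get value 0.
out = (full.merge(foo, on=['cat', 'br', 'ch'], how='left')
           .fillna({'value': 0})
           .astype({'value': int}))
```

Unlike the unstack/stack route, this keeps the steps explicit, which may help if the set of ch values needs to come from somewhere other than the data itself.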

Filling NAs when using pd.merge

I have two data frames and I want to merge them on common columns as seen below. There is also a new column in the second data frame.
dummy_data1 = {'id': ['1', '2', '3', '4'],'name': ['A', 'C', 'E', 'G'],
'year':['2012','2012','2012','2012']}
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'name', 'year'])
dummy_data2 = {
'id': ['1', '2', '3', '7',],
'name': ['A', 'C', 'E', 'P'],
'ADDRESS': ['X', 'Y', 'Z', 'P'],'year':['2013','2013','2013','2013']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'name','ADDRESS','year'])
When I merge these two data frames with
df_merge = pd.merge(df1, df2, on=['name','id','year'],how='outer')
I get NaNs for some rows because of the newly added column, as expected.
My question is about those NaNs: is there a way to repeat the data for a NaN if the value for that id is available in the other data frame? So index 0 would get 'X' instead of NaN, index 1 would get 'Y', and so forth. I just want to assume that 'ADDRESS' doesn't change between years.
Thanks!
I would suggest pd.merge_ordered followed by a backward fill.
merge_ordered is meant for sorted data, so I would advise sorting the data before using it. In your case, it already is.
pd.merge_ordered(df1,df2).bfill()
  id name  year ADDRESS
0  1    A  2012       X
1  1    A  2013       X
2  2    C  2012       Y
3  2    C  2013       Y
4  3    E  2012       Z
5  3    E  2013       Z
6  4    G  2012       P
7  7    P  2013       P
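Note that a plain backward fill crosses id boundaries: in the output above, row 6 (id 4) receives 'P' from id 7. If that is not wanted, one option is to fill per id instead. The sketch below assumes each id has a single ADDRESS in df2, and `address_by_id` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4'],
                    'name': ['A', 'C', 'E', 'G'],
                    'year': ['2012', '2012', '2012', '2012']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '7'],
                    'name': ['A', 'C', 'E', 'P'],
                    'ADDRESS': ['X', 'Y', 'Z', 'P'],
                    'year': ['2013', '2013', '2013', '2013']})

merged = pd.merge(df1, df2, on=['name', 'id', 'year'], how='outer')

# Lookup table: one ADDRESS per id (assumes the address does not change over time).
address_by_id = df2.drop_duplicates('id').set_index('id')['ADDRESS']

# Fill only the missing addresses, using the row's own id as the key.
merged['ADDRESS'] = merged['ADDRESS'].fillna(merged['id'].map(address_by_id))
```

With this approach id 4, which has no address in either frame, stays NaN rather than inheriting a neighbour's value.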

Remove exact rows and frequency of rows of a data.frame where certain column values match with column values of another data.frame in python 3

Consider the following two data.frames created using pandas in python 3:
a1 = pd.DataFrame(({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
'A': [1, 2, 3, 4, 5, 2, 4, 2],
'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']}))
a2 = pd.DataFrame(({'NO': ['d9', 'd10', 'd11', 'd12'],
'A': [1, 2, 3, 2],
'B': ['a', 'b', 'c', 'b']}))
I would like to remove the rows of a1 whose values in columns 'A' and 'B' match rows of a2 (ignoring the 'NO' column), removing only as many copies as appear in a2, so that the result should be:
A B NO
4 d d4
5 e d5
4 d d7
2 b d8
Is there any built-in function in pandas or any other library in python 3 to get this result?
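There does not appear to be a single built-in call for this, but it can be sketched with groupby().cumcount() and an indicator merge. The idea, offered as one possible approach rather than a definitive answer, is to number each repeated (A, B) pair within its own frame so that the n-th copy in a1 can only be cancelled by the n-th copy in a2. The column name `occ` is made up for this example.

```python
import pandas as pd

a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})

keys = ['A', 'B']

# Occurrence counter per (A, B) pair: 0 for the first copy, 1 for the second, ...
left = a1.assign(occ=a1.groupby(keys).cumcount())
right = a2[keys].assign(occ=a2.groupby(keys).cumcount())

# Rows of a1 with no same-numbered partner in a2 are marked 'left_only'.
merged = left.merge(right, on=keys + ['occ'], how='left', indicator=True)
result = (merged.loc[merged['_merge'] == 'left_only', ['A', 'B', 'NO']]
                .reset_index(drop=True))
```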

Pandas Get All Values from Multiindex levels

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (or level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1) in such a way that it preserves spacing, so that I can use those lists as plot labels as well.
Here's the question from which this one stemmed.
Thanks in advance!
You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an int position or a label for the level.
Note that for the higher levels, the values are repeated to match the index length at the lowest level; the displayed table merely hides the repeats.
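A short check of both points, rebuilding the pivot table from the question. The variable names `by_position`, `by_label` and `labels_a` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')

# The level can be addressed by integer position or by name.
by_position = table.index.get_level_values(2)
by_label = table.index.get_level_values('C')

# Outer levels come back with one entry per row, repeats included.
labels_a = table.index.get_level_values('A').tolist()
```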

Resources