Pandas Get All Values from Multiindex levels - python-3.x

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (or level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1) in such a way that it preserves spacing so that I can use those lists as well. I'm ultimately trying to use them to create something like this via plotting:
Here's the question from which this one stemmed.
Thanks in advance!

You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an integer level position or a level name.
Note that for the outer levels the values are repeated to match the length of the innermost level; the repeats are only hidden when the index is displayed.
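Since the question asks for plain lists to use for plotting, here is a minimal sketch of turning each level into a list with tolist(). It rebuilds the table from the question (passing values='D' explicitly, which is an addition here) and uses integer level positions; the variable names a_labels, b_labels and c_labels are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], values='D', aggfunc='sum')

# Levels can be addressed by position (0, 1, 2) as well as by name.
a_labels = table.index.get_level_values(0).tolist()
b_labels = table.index.get_level_values(1).tolist()
c_labels = table.index.get_level_values(2).tolist()

# Every list has one entry per row of the table.
print(a_labels)
print(b_labels)
print(c_labels)
```

All three lists have the same length as the table, so they line up row by row with table['D'].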

Related

Concatenate Each cell in column A with Column B in Python DataFrame

Need help in concatenating each row of a column with other column of a dataframe
Input:
Output
Use itertools.product in list comprehension:
from itertools import product
L = [''.join(x) for x in product(df['Col1'], df['Col2'])]
#alternative
L = [a + b for a, b in product(df['Col1'], df['Col2'])]
df = pd.DataFrame({'Col3':L})
print (df)
Col3
0 AE
1 AF
2 AG
3 BE
4 BF
5 BG
6 CE
7 CF
8 CG
Or cross join solution with helper column a:
df1 = df.assign(a=1)
df1 = df1.merge(df1, on='a')
df = (df1['Col1_x'] + df1['Col2_y']).to_frame('Col3')
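As a possible alternative to the helper column, pandas 1.2 and later accept how='cross' in merge. The sketch below assumes the input (shown only as an image in the question) has Col1 = A, B, C and Col2 = E, F, G, which is inferred from the printed output above; the names pairs and out are illustrative.

```python
import pandas as pd

# Assumed input, reconstructed from the expected output.
df = pd.DataFrame({'Col1': ['A', 'B', 'C'],
                   'Col2': ['E', 'F', 'G']})

# Cross join: every Col1 value is paired with every Col2 value.
pairs = df[['Col1']].merge(df[['Col2']], how='cross')

# Concatenate the two string columns into the result column.
out = (pairs['Col1'] + pairs['Col2']).to_frame('Col3')
print(out)
```

Selecting a single column on each side before merging avoids the _x/_y suffixes that the helper-column version has to deal with.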
Remark: it's easier to help if you copy the code for creating the input rather than images such as:
import pandas as pd
df=pd.DataFrame({'col1': ['A', 'B', 'C', 'D'], 'col2': ['E', 'F', 'G', 'H']})
Solution: least effort is the itertools library
from itertools import product
lst1 = ['A', 'B', 'C', 'D']
lst2 = ['E', 'F', 'G', 'H']
reslst = list(product(lst1, lst2))
or as dataframe series:
reslst = list(product(df['col1'].values, df['col2'].values))
print(reslst)
Note: for n input rows the result is a list of n**2 pairs, so it is longer than the original dataframe and cannot be assigned back to it as a column.

How to create new rows for entries that do not exist, in pandas

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'], 'br': [1,2,2,3], 'ch': ['A', 'A', 'B', 'C'],
'value': [10,20,30,40]})
For every cat and br, I want to add the ch that is missing with value 0
My final dataframe should look like this:
foo_final = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b'],
'br': [1,2,2,3, 1, 1, 2, 3, 3],
'ch': ['A', 'A', 'B','C','B', 'C', 'C', 'A', 'B'],
'value': [10,20,30,40, 0,0, 0,0,0]})
Use DataFrame.set_index to build a MultiIndex, then DataFrame.unstack followed by DataFrame.stack:
foo = foo.set_index(['cat','br','ch']).unstack(fill_value=0).stack().reset_index()
print (foo)
  cat  br ch  value
0   a   1  A     10
1   a   1  B      0
2   a   1  C      0
3   a   2  A     20
4   a   2  B     30
5   a   2  C      0
6   b   3  A      0
7   b   3  B      0
8   b   3  C     40
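To make the one-liner easier to follow, here is the same chain split into steps, as a sketch. The intermediate names indexed, wide and filled are illustrative. Depending on the pandas version, stack() may emit a FutureWarning about its new implementation; the result is the same here because no missing values remain after unstack(fill_value=0).

```python
import pandas as pd

foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                    'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'],
                    'value': [10, 20, 30, 40]})

# 1. Move the three key columns into a MultiIndex.
indexed = foo.set_index(['cat', 'br', 'ch'])

# 2. Pivot 'ch' into columns; combinations that did not exist become 0.
wide = indexed.unstack(fill_value=0)

# 3. Pivot 'ch' back into rows, now with every (cat, br, ch) combination.
filled = wide.stack().reset_index()
print(filled)
```

Only the (cat, br) pairs that already exist in foo are kept, so no rows are invented for pairs such as cat 'a' with br 3.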

Filling NAs when using pd.merge

I have two data frames and I want to merge them on common columns as seen below. There is also a new column in the second data frame.
dummy_data1 = {'id': ['1', '2', '3', '4'],'name': ['A', 'C', 'E', 'G'],
'year':['2012','2012','2012','2012']}
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'name', 'year'])
dummy_data2 = {
'id': ['1', '2', '3', '7',],
'name': ['A', 'C', 'E', 'P'],
'ADDRESS': ['X', 'Y', 'Z', 'P'],'year':['2013','2013','2013','2013']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'name','ADDRESS','year'])
when I merge these two data frames with
df_merge = pd.merge(df1, df2, on=['name','id','year'],how='outer')
I get NaNs for some rows because of the newly added column, as expected.
My question is about the NaNs: is there a way to repeat the data in place of a NaN if the data for that id is available in the other data frame? So for index 0 it would bring 'X' instead of NaN, for index 1 'Y', and so forth. I just want to assume that 'ADDRESS' doesn't change between years.
Thanks!
I would suggest pandas merge_ordered followed by a backward fill.
merge_ordered works on sorted data, so I would advise sorting the data before using it. In your case, it already is.
pd.merge_ordered(df1,df2).bfill()
  id name  year ADDRESS
0  1    A  2012       X
1  1    A  2013       X
2  2    C  2012       Y
3  2    C  2013       Y
4  3    E  2012       Z
5  3    E  2013       Z
6  4    G  2012       P
7  7    P  2013       P
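Note that in the output above a plain bfill also fills row 6 (id 4) with 'P', which actually belongs to id 7. If the address should only be copied within the same id, one option is to look it up per id instead. This is a sketch under the question's assumption that each id has a single address; the name address_by_id is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4'],
                    'name': ['A', 'C', 'E', 'G'],
                    'year': ['2012', '2012', '2012', '2012']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '7'],
                    'name': ['A', 'C', 'E', 'P'],
                    'ADDRESS': ['X', 'Y', 'Z', 'P'],
                    'year': ['2013', '2013', '2013', '2013']})

merged = pd.merge(df1, df2, on=['id', 'name', 'year'], how='outer')

# One address per id, taken from the frame that has the column.
address_by_id = df2.drop_duplicates('id').set_index('id')['ADDRESS']

# Fill only the gaps, and only with the address of the same id.
merged['ADDRESS'] = merged['ADDRESS'].fillna(merged['id'].map(address_by_id))
print(merged)
```

With this variant id 4, which never appears in df2, keeps NaN rather than inheriting a neighbour's address.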

Remove exact rows and frequency of rows of a data.frame where certain column values match with column values of another data.frame in python 3

Consider the following two data.frames created using pandas in python 3:
a1 = pd.DataFrame(({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
'A': [1, 2, 3, 4, 5, 2, 4, 2],
'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']}))
a2 = pd.DataFrame(({'NO': ['d9', 'd10', 'd11', 'd12'],
'A': [1, 2, 3, 2],
'B': ['a', 'b', 'c', 'b']}))
I would like to remove the rows of a1 that also appear in a2, matching on the values of columns 'A' and 'B' only (ignoring the 'NO' column) and removing only as many copies as a2 contains, so that the result should be:
A B NO
4 d d4
5 e d5
4 d d7
2 b d8
Is there any built-in function in pandas or any other library in python 3 to get this result?
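There is no single built-in for this as far as I know, but one possible approach is an anti-join that also matches on an occurrence counter, so each row of a2 cancels at most one row of a1. This is only a sketch: the helper column occ and the names left, right and result are illustrative, and indicator=True is the standard merge option that adds the _merge column.

```python
import pandas as pd

a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})

key = ['A', 'B']

# Number repeated (A, B) pairs 0, 1, 2, ... within each frame.
left = a1.assign(occ=a1.groupby(key).cumcount())
right = a2.assign(occ=a2.groupby(key).cumcount())[key + ['occ']]

# Left merge with an indicator; rows found only in a1 are the ones to keep.
merged = left.merge(right, on=key + ['occ'], how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only', ['A', 'B', 'NO']]
result = result.reset_index(drop=True)
print(result)
```

Because (2, 'b') occurs three times in a1 and twice in a2, the first two copies (d2, d6) are dropped and the third (d8) survives.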

Pandas Replace All But Middle Values per Category of a Level with Blank

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I know that I can access the values of each level like so:
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
Next, I'd like to replace all values in each of the outer levels with blank ('') save for the middle or n/2+1 values.
So that:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
becomes:
Index(['x', '', 'y', '', 'z', 'x', 'y', 'z', ''], dtype='object', name='B')
and
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
becomes:
Index(['', '', 'a', '', '', '', 'b', '', ''], dtype='object', name='A')
Ultimately, I will attempt to use these as secondary and tertiary y-axis labels in a Matplotlib horizontal bar chart, something like this (though some of my labels may be shifted up):
Finally took the time to figure this out...
#First, get the values of the index level.
A=table.index.get_level_values(0)
#Next, convert the values to a data frame.
ndf = pd.DataFrame({'A2':A.values})
#Next, get the count of rows per group.
ndf['A2Count']=ndf.groupby('A2')['A2'].transform('count')
#Next, get the position based on the logic in the question.
ndf['A2Pos']=ndf['A2Count'].apply(lambda x: x/2 if x%2==0 else (x+1)/2)
#Next, order the rows per group.
ndf['A2GpOrdr']=ndf.groupby('A2').cumcount()+1
#And finally, create the column to use for plotting this level's axis label.
ndf['A2New']=ndf.apply(lambda x: x['A2'] if x['A2GpOrdr']==x['A2Pos'] else "",axis=1)
ndf
  A2  A2Count  A2Pos  A2GpOrdr A2New
0  a        5    3.0         1
1  a        5    3.0         2
2  a        5    3.0         3     a
3  a        5    3.0         4
4  a        5    3.0         5
5  b        4    2.0         1
6  b        4    2.0         2     b
7  b        4    2.0         3
8  b        4    2.0         4
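The same logic can be wrapped in a small reusable function, sketched below. middle_only is a hypothetical helper, not a pandas function. It groups consecutive runs of equal labels rather than grouping by value, so it also handles level 'B', where 'x', 'y' and 'z' repeat under both 'a' and 'b'. It assumes two neighbouring groups never carry the same label back to back, since they would be treated as one run.

```python
import pandas as pd

def middle_only(labels):
    """Blank every label except the middle one of each consecutive run."""
    s = pd.Series(list(labels))
    # A new run starts wherever the label differs from the previous one.
    run_id = (s != s.shift()).cumsum()
    size = s.groupby(run_id).transform('size')
    order = s.groupby(run_id).cumcount() + 1
    # n/2 for even run lengths, (n+1)/2 for odd ones, as in the answer above.
    middle = (size + 1) // 2
    return s.where(order == middle, '').tolist()

a_new = middle_only(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])
b_new = middle_only(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'])
print(a_new)
print(b_new)
```

In practice the inputs would come from table.index.get_level_values(0) and table.index.get_level_values(1).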
