Pandas pivot_table: Defining Columns - python-3.x

I'm new to pandas and python in general. I'm pulling data from an Access database and creating a pivot table.
PTable = TRep.pivot_table(values=['Students'],
                          index=['GradeLevel', 'Class'],
                          columns=['Grade'],
                          aggfunc='count', fill_value=0, margins=True,
                          dropna=True, margins_name='Grand Total')
Grade will always be A, B, C, D, or F, and I want the resulting pivot table to always show columns for those 5 grades, even if there are 0 students with a given grade.
Currently, if the list of students pulled from Access does not contain a student receiving a C (for example), the resulting pivot table will have the C column omitted.
Is there a way to define constant columns in a pivot table?

What I have tried:
This is my sample data (note: this sample uses grade E where the question has F):
GradeLevel Class Student Grade
0 I 1 AAA A
1 I 2 BBB B
2 I 2 CCC D
3 I 3 DDD E
4 I 4 EEE A
5 II 1 FFF B
6 II 2 GGG A
7 II 3 HHH B
8 II 4 KKK D
9 II 1 LLL D
10 II 2 MMM E
11 III 1 NNN E
12 III 2 OOO A
13 III 2 PPP A
14 III 3 QQQ A
Change the Grade column to the category dtype:
df["Grade"] = df["Grade"].astype('category')
Set the categories of the Grade column:
df["Grade"] = df["Grade"].cat.set_categories(["A", "B", "C", "D", "E"])
Pivot the data:
df.pivot_table(values = ["Student"], index = ["GradeLevel", "Class"],
columns = ["Grade"], aggfunc='count', fill_value=0,
margins=True, dropna=False, margins_name='Grand Total')
Result:
Student
Grade A B C D E Grand Total
GradeLevel Class
I 1 1 0 0 0 0 1.0
2 0 1 0 1 0 2.0
3 0 0 0 0 1 1.0
4 1 0 0 0 0 1.0
II 1 0 1 0 1 0 2.0
2 1 0 0 0 1 2.0
3 0 1 0 0 0 1.0
4 0 0 0 1 0 1.0
III 1 0 0 0 0 1 1.0
2 2 0 0 0 0 2.0
3 1 0 0 0 0 1.0
4 0 0 0 0 0 NaN
Grand Total 6 3 0 3 3 15.0
But the pivot table still shows a NaN value: with dropna=False, every GradeLevel/Class combination appears in the index, and GradeLevel III has no Class 4. To remove that NaN row:
(df.pivot_table(values = ["Student"], index = ["GradeLevel", "Class"],
columns = ["Grade"], aggfunc='count', fill_value=0,
margins=True, dropna=False, margins_name='Grand Total')).dropna()
Result:
Student
Grade A B C D E Grand Total
GradeLevel Class
I 1 1 0 0 0 0 1.0
2 0 1 0 1 0 2.0
3 0 0 0 0 1 1.0
4 1 0 0 0 0 1.0
II 1 0 1 0 1 0 2.0
2 1 0 0 0 1 2.0
3 0 1 0 0 0 1.0
4 0 0 0 1 0 1.0
III 1 0 0 0 0 1 1.0
2 2 0 0 0 0 2.0
3 1 0 0 0 0 1.0
Grand Total 6 3 0 3 3 15.0
Hope it is useful...

I suppose you can always apply some 'touch-up' to the df once it has been created. For example, you can add a column and fill it with NaN, i.e. df['C'] = np.nan
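A minimal sketch of that touch-up idea (using the names from the question, and assuming the default MultiIndex columns that pivot_table produces, with the grade level named 'Grade'): reindex the grade level of the pivoted columns so all five grades always appear, with missing ones filled with 0.
# Force all five grade columns to exist while keeping the margin column.
# 'PTable' and the grade list come from the question.
wanted = ['A', 'B', 'C', 'D', 'F', 'Grand Total']
PTable = PTable.reindex(columns=wanted, level='Grade', fill_value=0)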

Simply convert the grades column to categorical, specifying all possible values.
TRep['Grade'] = pd.Categorical(TRep['Grade'], ['A', 'B', 'C', 'D', 'F'])
Then pass dropna=False to pivot_table and it'll include all the columns.
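Putting both steps together with the call from the question (a sketch; TRep and the column names are assumed from the question):
import pandas as pd

# Declare all five grades up front so empty grade columns survive the pivot.
TRep['Grade'] = pd.Categorical(TRep['Grade'], ['A', 'B', 'C', 'D', 'F'])
PTable = TRep.pivot_table(values=['Students'], index=['GradeLevel', 'Class'],
                          columns=['Grade'], aggfunc='count', fill_value=0,
                          margins=True, dropna=False,
                          margins_name='Grand Total')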

Related

Returning column header corresponding to matched value - follow up

Sheet 1:
Name  Gender
w     1
e     1
r     2
t     4
y     6
u     2
i     NoMatch
q     1
w     1
e     1
r     2
Sheet 2:
Name  Male 1  Female 2  other 3  other 4  other 5  Donotknow 6
w     1       0         0        0        0        0
a     0       0         0        0        0        1
q     1       0         0        0        0        0
r     0       1         0        0        0        0
e     1       0         0        0        0        0
t     0       0         0        1        1        0
y     0       0         0        0        0        1
u     0       1         0        0        0        0
I am using this formula in Sheet 1 under Gender:
=IFERROR(INDEX({1,2,3,4,5,6},MATCH(1,INDEX(Sheet2!$B$2:$G$9,MATCH(A2,Sheet2!$A$2:$A$9,0),0),0)),"No Match")
If I have 2 matches (for example, for t, the columns 'other 4' and 'other 5' both contain 1), I would like the output to be 4;5 in the same cell.
How can I modify the formula to reflect that?
I am using an older version of Excel, so the FILTER function is not available.

Vectorized way of using the previous row value based on the condition

I have a pandas dataframe as below, and I want to apply the following condition: if column 'A' is 1, then update column 'F' with the previous value of 'F'. This can be done by iterating row by row, but that is not efficient; I want a vectorized method.
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'F': [2, 0, 0, 0, 0, 1, 1, 1, 1]})
df
A C D F
0 1 1 1 2
1 1 1 1 0
2 1 1 1 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
My desired output:
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
I tried the code below, but it does not work: when I use shift, it does not pick up the already-updated previous row.
df['F'] = df.groupby(['A'])['F'].shift(1)
df
A C D F
0 1 1 1 NaN
1 1 1 1 2.0
2 1 1 1 0.0
3 0 0 0 NaN
4 0 0 0 0.0
5 0 0 0 0.0
6 1 1 1 0.0
7 0 1 1 1.0
8 0 1 1 1.0
transform('first')
df.F.groupby(df.A.rsub(1).cumsum()).transform('first')
0 2
1 2
2 2
3 0
4 0
5 1
6 1
7 1
8 1
Name: F, dtype: int64
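For intuition (my illustration, not part of the original answer): 1 - A is 0 on rows where A == 1, so the cumulative sum only increases on A == 0 rows; each run of 1s therefore shares a group with the row that precedes it, and 'first' broadcasts that row's F value across the run.
# The grouper is flat across each run of 1s.
print(df.A.rsub(1).cumsum().tolist())
# [0, 0, 0, 1, 2, 3, 3, 4, 5]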
Assign it back to column 'F':
df.assign(F=df.F.groupby(df.A.rsub(1).cumsum()).transform('first'))
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
We also know how to do it without groupby:
# Mark the first row of each run of 1s in A; keep F there, forward-fill it
# through the run, then restore the original F on the rows where A != 1.
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
df['F'] = df['F'].where(where).ffill().mask(df['A'].ne(1), df['F'])
print(df)
A C D F
0 1 1 1 2.0
1 1 1 1 2.0
2 1 1 1 2.0
3 0 0 0 0.0
4 0 0 0 0.0
5 0 0 0 1.0
6 1 1 1 1.0
7 0 1 1 1.0
8 0 1 1 1.0
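One small follow-up (not in the original answer): the where/mask route introduces NaNs along the way, which is why F comes back as float; if the integer dtype matters, a final cast restores it.
df['F'] = df['F'].astype(int)  # safe here because no NaNs remain after ffill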

Combine columns in pandas to create a new column

Hello, I am working with a pandas dataframe, and I want to create a column that combines multiple columns with a condition applied to them; I am looking for a smart way to do it.
Suppose the dataframe looks like this:
A B C D
1 0 0 0
0 1 0 0
0 0 1 0
1 0 1 0
1 1 1 0
0 0 1 1
My output column should be as below
A B C D Output_col
1 0 0 0 A
0 1 0 0 B
0 0 1 0 C
1 0 1 0 A_C
1 1 1 0 A_B_C
0 0 1 1 C_D
I can certainly achieve this using the code below, but then I have to repeat it for every column:
test['Output_col'] = test.A.apply(lambda x: 'A' if x > 0 else 0)
I was wondering if there is a way to achieve this without applying it to every column, since I may have a very large number of columns.
Thanks in advance !!
Use DataFrame.apply with '_'.join: select the column names via x.index, filtered with boolean indexing on Series.eq (note that axis=1 is used, so each x is a row):
test['Output_col'] = test.apply(lambda x: '_'.join(x.index[x.eq(1)]), axis=1)
print(test)
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 1 1 0 A_B_C
5 0 0 1 1 C_D
To apply it to only a list of columns:
my_list_columns=['enter element of your list']
test['Output_col']=test[my_list_columns].apply(lambda x: '_'.join(x.index[x.eq(1)]),axis=1)
print(test)
And for the case where all columns are 0:
my_list_columns=['A','B','C','D']
df['Output_col']=df[my_list_columns].apply(lambda x: '_'.join(x.index[x.eq(1)]) if x.eq(1).any() else 'no_value',axis=1)
print(df)
A B C D Output_col
0 1 0 0 0 A
1 0 0 0 0 no_value
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 0 1 0 A_C
5 0 0 1 1 C_D
Edit: for a subset of columns (using method 2, the dot approach below):
cols = ['A', 'B']
df1 = df[cols]
s = df1.columns + '-'
df['Output_col'] = df1.dot(s).str[:-1]
Out[54]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0
3 1 0 1 0 A
4 1 1 1 0 A-B
5 0 0 1 1
Try this combination of str.replace and dot (the regex inserts a separator between every pair of adjacent characters, so it assumes single-character column names; regex=True is needed on newer pandas):
df['Output_col'] = df.dot(df.columns).str.replace(r'(?<!^)(?!$)', '-', regex=True)
Out[32]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A-C
4 1 1 1 0 A-B-C
5 0 0 1 1 C-D
If you feel uneasy with the regex pattern, you may try this way, without using str.replace:
s = df.columns + '-'
df['Output_col'] = df.dot(s).str[:-1]
Out[50]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A-C
4 1 1 1 0 A-B-C
5 0 0 1 1 C-D
This builds off a solution provided by #Jezrael: link
df['Output_col'] = df.dot(df.columns.str.cat(['_']*len(df.columns),sep='')).str.strip('_')
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 1 1 0 A_B_C
5 0 0 1 1 C_D
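For readers wondering why the dot trick works (a small illustration of my own, not from the answers above): with 0/1 entries, the matrix product multiplies each column-name string by 0 or 1 and concatenates the survivors, which is exactly the joined names of the 1-columns.
import pandas as pd

# Tiny demo: 1 * 'A_' is 'A_' and 0 * 'A_' is '', so the row-wise
# "sum of products" concatenates only the names of the 1-columns.
demo = pd.DataFrame({'A': [1, 0], 'B': [1, 1]})
print(demo.dot(demo.columns + '_').str[:-1])
# 0    A_B
# 1      B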

python dataframe counter on a column

Column x in my dataframe contains only 0 and 1. I want to create a variable y that counts zeros and resets whenever a 1 appears in x. I'm getting the error "The truth value of a Series is ambiguous."
count = 1
countList = [0]
for x in df['x']:
    if df['x'] == 0:   # compares the whole Series, not the loop variable x,
                       # which is what raises the "truth value is ambiguous" error
        count = count + 1
        df['y'] = count
    else:
        df['y'] = 1
        count = 1
First, don't loop in pandas, because it is slow, if a vectorized solution exists. I think you need to count consecutive 0 values:
df = pd.DataFrame({'x':[1,0,0,1,1,0,1,0,0,0,1,1,0,0,0,0,1]})
a = df['x'].eq(0)
b = a.cumsum()
df['y'] = (b-b.mask(a).ffill().fillna(0).astype(int))
print (df)
x y
0 1 0
1 0 1
2 0 2
3 1 0
4 1 0
5 0 1
6 1 0
7 0 1
8 0 2
9 0 3
10 1 0
11 1 0
12 0 1
13 0 2
14 0 3
15 0 4
16 1 0
Detail + explanation:
# compare with zero
a = df['x'].eq(0)
# cumulative sum of the mask
b = a.cumsum()
# replace Trues with NaNs
c = b.mask(a)
# forward-fill the NaNs
d = b.mask(a).ffill()
# fill remaining NaNs with 0 and cast to integers
e = b.mask(a).ffill().fillna(0).astype(int)
# subtract from the cumulative-sum Series
y = b - e
df = pd.concat([df['x'], a, b, c, d, e, y], axis=1, keys=('x','a','b','c','d','e', 'y'))
print (df)
    x      a   b     c     d   e  y
0   1  False   0   0.0   0.0   0  0
1   0   True   1   NaN   0.0   0  1
2   0   True   2   NaN   0.0   0  2
3   1  False   2   2.0   2.0   2  0
4   1  False   2   2.0   2.0   2  0
5   0   True   3   NaN   2.0   2  1
6   1  False   3   3.0   3.0   3  0
7   0   True   4   NaN   3.0   3  1
8   0   True   5   NaN   3.0   3  2
9   0   True   6   NaN   3.0   3  3
10  1  False   6   6.0   6.0   6  0
11  1  False   6   6.0   6.0   6  0
12  0   True   7   NaN   6.0   6  1
13  0   True   8   NaN   6.0   6  2
14  0   True   9   NaN   6.0   6  3
15  0   True  10   NaN   6.0   6  4
16  1  False  10  10.0  10.0  10  0
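For comparison, here is an equivalent two-liner (my sketch, same logic): treat every 1 as a run boundary and cumulatively sum the zero-mask within each run.
a = df['x'].eq(0)                                         # True on the zeros to count
df['y'] = a.astype(int).groupby((~a).cumsum()).cumsum()   # the count restarts after each 1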

Sorting dataframe and creating new columns based on the rank of element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4): if an element in the name column has the highest Value within its id, the Rank1 column should be 1, else 0; if it has the second highest Value, the Rank2 column should be 1, else 0.
Same for Rank3 and Rank4.
How could I do that?
Thanks, Zep
Use:
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For count per groups use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For the indicator columns use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort
df.join(
pd.get_dummies(
df.groupby('id').Value.apply(np.argsort).rsub(4)
).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
More dynamic (no hard-coded group size):
df.join(
pd.get_dummies(
df.groupby('id').Value.apply(lambda x: len(x) - np.argsort(x))
).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
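One caveat worth adding (my note, not from the original answers): np.argsort returns the indices that would sort the values, which is not in general the per-row rank; the two happen to agree on this sample data, but can differ on other orderings, and the snippets above also assume import numpy as np. A rank-based grouper is robust to any row order, for example:
# Rank Value within each id, highest value -> rank 1, then one-hot the ranks.
ranks = df.groupby('id')['Value'].rank(ascending=False).astype(int)
df.join(pd.get_dummies(ranks).add_prefix('Rank'))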
