How to get a cross tabulation for categorical data in pandas (Python)?

I have the following code, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame(dtype="category")
df["Gender"] = np.random.randint(2, size=100)
df["Q1"] = np.random.randint(3, size=100)
df["Q2"] = np.random.randint(3, size=100)
df["Q3"] = np.random.randint(3, size=100)
df[["Gender", "Q1", "Q2", "Q3"]] = df[["Gender", "Q1", "Q2", "Q3"]].astype('category')
pd.pivot_table(data=df, index=["Gender"])
I want a pivot table with percentages over gender for all the other columns.
How can I achieve this?
The above code gives an error saying:
No numeric types to aggregate
I don't have any numerical columns. I just want to find the frequency of each category under male and female, and the percentage of each over the male and female totals respectively.

As suggested by your question, you can use pd.crosstab to build the cross tabulation you need.
You just need a quick preprocessing step: melt the Q columns into rows (see details below):
df = df.melt(id_vars='Gender',
             value_vars=['Q1', 'Q2', 'Q3'],
             var_name='Question', value_name='Answer')
Then you can use pd.crosstab and calculate percentages as needed (here each row, i.e. each Question, is normalized, so the values are the share of each (Gender, Answer) combination within that question):
pd.crosstab(df.Question, columns=[df.Gender, df.Answer]).apply(lambda row: row/row.sum(), axis=1)
Gender 0 1
Answer 0 1 2 0 1 2
Question
Q1 0.13 0.18 0.18 0.13 0.19 0.19
Q2 0.09 0.21 0.19 0.22 0.13 0.16
Q3 0.19 0.10 0.20 0.16 0.18 0.17
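For reference, pd.crosstab can do the row normalization itself through its normalize argument; normalize='index' divides each row by its row total, which matches the apply above:
pd.crosstab(df.Question, columns=[df.Gender, df.Answer], normalize='index')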
Details
df.head()  # before the melt
Gender Q1 Q2 Q3
0 1 0 2 0
1 1 0 0 1
2 0 2 0 2
3 0 0 2 0
4 0 1 1 1
df.head()  # after the melt above
Gender Question Answer
0 1 Q1 0
1 1 Q1 0
2 0 Q1 2
3 0 Q1 0
4 0 Q1 1

Related

Pandas dict application over several columns, handling non-matching values

I have a dataframe like:
import pandas as pd

df = pd.DataFrame({
    "age": [20, 22, 20, 23],
    "name": ["A", "B", "C", "A"],
    "addres": ["add1", "add2", "add3", "add4"],
    "job": ["C", "E", "C", "D"],
    "score": [0.44, 0.43, 0.25, 0.36]
})
categors = ["name","addres","job"]
df:
age name addres job score
0 20 A add1 C 0.44
1 22 B add2 E 0.43
2 20 C add3 C 0.25
3 23 A add4 D 0.36
and I have a dict like this:
mapping_dict = {
    "name": {"A": 0, "B": 1},
    "addres": {"add1": 0, "add2": 1, "add3": 2},
    "job": {"A": 0, "B": 1, "C": 2, "D": 3}
}
I would like to apply each inner dict to its matching column, so I can do this:
df[categors].replace(mapping_dict, inplace=True)
or
df[categors] = df[categors].replace(mapping_dict)
It makes no difference; both produce the same problematic result:
name addres job
0 0 0 2
1 1 1 E
2 C 2 2
3 0 add4 3
The problem is that non-matching values (like add4 in column addres, C in column name, or E in column job) are left untouched; no argument of .replace handles them. I need those values to be mapped to -1.
So, to achieve this, we can make a loop:
for column in categors:
    df[column] = df[column].map(mapping_dict[column])
df
age name addres job score
0 20 0.0 0.0 2.0 0.44
1 22 1.0 1.0 NaN 0.43
2 20 NaN 2.0 2.0 0.25
3 23 0.0 NaN 3.0 0.36
and then handle the NaN values with .fillna(-1), or better:
for column in categors:
    l = lambda x: mapping_dict[column].get(x, -1)
    df[column] = df[column].apply(l)
df
age name addres job score
0 20 0 0 2 0.44
1 22 1 1 -1 0.43
2 20 -1 2 2 0.25
3 23 0 -1 3 0.36
So, I know how to do this; the loops above prove it.
My problems are:
I don't think pandas was made for looping over columns, but for vectorized function application over columns.
My real dataframe is large enough to need a vectorized function, and mapping_dict is large enough too.
So, if I could use apply with axis=1 and somehow get the column name (something like pandas.Series.column_name), I could do something like:
df[categors].apply(lambda x: mapping_dict[x.column_name].get(x, -1), axis=1)
By now, I think a class inheriting all of pandas.DataFrame's properties just to add x.column_name is the "cannon to kill a mosquito" solution.
So, do you know any fast, one-line solution for this?
Add a "catch-all" to the end of the dict for each column. This was anything not matched becomes -1:
mapping_dict = {
    "name": {"A": 0, "B": 1, ".*": -1},
    "addres": {"add1": 0, "add2": 1, "add3": 2, ".*": -1},
    "job": {"A": 0, "B": 1, "C": 2, "D": 3, ".*": -1}
}
Then just include the regex parameter in your replace.
df.replace(mapping_dict, inplace=True, regex=True)
age name addres job score
0 20 0 0 2 0.44
1 22 1 1 -1 0.43
2 20 -1 2 2 0.25
3 23 0 -1 3 0.36
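A vectorized one-liner without regex is also possible, and it answers the "how do I see the column name" part of the question directly: a column-wise apply receives each column as a Series, and Series.name is exactly the column label. A minimal sketch, assuming unmatched values should become -1 as above:
df[categors] = df[categors].apply(
    lambda col: col.map(mapping_dict[col.name]).fillna(-1).astype(int)
)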

How to find the index of a row for a particular value in a particular column and then create a new column with that starting point?

(The original post included the example dataframe as an image.)
What I'm trying to do with my dataframe...
Locate the first 0 value in a certain column (G in the example photo).
Create a new column (Time) with the value 0 lining up on the same row with the same 0 value in column (G).
Then, for each row after the 0, the Time column should increase by (1/60) until the end of the data,
and for each row before the 0 it should decrease by (1/60) back to the beginning of the data.
What is the best method to achieve this?
Any advice would be appreciated. Thank you.
Pretty straightforward:
identify the index of the row that contains the value you are looking for
then construct an array that starts negative, is zero at that index row, and ends at the value for the end of the series
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": np.full(25, 0), "G": [i if i > 0 else 0 for i in range(10, -15, -1)]})
# find the index of the first row where G is zero
idx = df[df["G"] == 0].index[0]
intv = round(1/60, 3)
# construct a numpy array from a range of values (start, stop, increment)
df["Time"] = np.arange(-idx*intv, (len(df)-idx)*intv, intv)
df.loc[idx, "Time"] = 0  # just remove any rounding error at the zero point
print(df.to_string(index=False))
output
Time G
-0.170 10
-0.153 9
-0.136 8
-0.119 7
-0.102 6
-0.085 5
-0.068 4
-0.051 3
-0.034 2
-0.017 1
0.000 0
0.017 0
0.034 0
0.051 0
0.068 0
0.085 0
0.102 0
0.119 0
0.136 0
0.153 0
0.170 0
0.187 0
0.204 0
0.221 0
0.238 0
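If the dataframe has a default RangeIndex (an assumption; the question's real index is not shown), Time can also be computed directly from the positional offset of each row, which avoids both the rounded intv and the fix-up at the zero row:
# signed offset of every row from the zero row, in steps of 1/60
df["Time"] = (df.index - idx) / 60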

Creating a new column into a dataframe based on conditions

For the dataframe df:
dummy_data1 = {'category': ['White', 'Black', 'Hispanic', 'White'],
               'Pop': ['75', '85', '90', '100'],
               'White_ratio': [0.6, 0.4, 0.7, 0.35],
               'Black_ratio': [0.3, 0.2, 0.1, 0.45],
               'Hispanic_ratio': [0.1, 0.4, 0.2, 0.20]}
df = pd.DataFrame(dummy_data1, columns=['category', 'Pop', 'White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column, 'pop_n', to this dataframe by first checking the category and then multiplying the value in 'Pop' by the corresponding ratio column. For the first row, the category is 'White', so it should multiply 75 by 0.60 and put 45 in the pop_n column.
I thought about writing something like:
df['pop_n'] = (df['Pop']*df['White_ratio']).where(df['category']=='White')
This works, but only for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup:
First we use filter to get the columns with 'ratio' in the name, then split on the underscore and keep only the first word.
Finally we use lookup to match the category values to these columns.
df['Pop'] = df['Pop'].astype(int)  # 'Pop' holds strings in the sample data
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
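Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, the same row-wise pick can be sketched with Index.get_indexer and NumPy fancy indexing:
import numpy as np
ratios = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
# for each row, the position of its category among the ratio columns
cols = ratios.columns.get_indexer(df['category'])
df['pop_n'] = ratios.to_numpy()[np.arange(len(df)), cols] * df['Pop'].astype(int)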
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename).set_index('category').stack()
factors = stack[[x[0] == x[1] for x in stack.index]].reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

Python Life Expectancy

I'm trying to use pandas to calculate life expectancy with a recursive equation.
Multiplying or dividing column by column is not difficult to do.
My data is:
   A     b
1  0.99  1000
2  0.95  (= 0.99*1000 = 990)
3  0.93  (= 0.95*990)
Column A is fully populated; column b has only the initial 1000.
Each new b is the previous A times the previous b, i.e. b2 = A1*b1.
I tried the shift function but got a result for b2 only and the rest were zeros. Any help please? Thanks, Mazin.
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])
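Why this works: the recurrence b[n] = A[n-1] * b[n-1] telescopes to b[n] = b[0] * A[0] * ... * A[n-1], which is exactly a shifted cumulative product. A minimal runnable check:
import pandas as pd
df = pd.DataFrame({"A": [0.99, 0.95, 0.93], "b": [1000.0, None, None]})
df["b"] = (df.A.cumprod() * df.b.iloc[0]).shift().fillna(df.b.iloc[0])
# b is now [1000.0, 990.0, 940.5]; e.g. the last row is 1000 * 0.99 * 0.95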

Sorting pivot table (multi index)

I'm trying to sort a pivot table's values in descending order after putting two "row labels" (Excel term) on the pivot.
sample data:
import numpy as np
import pandas as pd

x = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                  'col2': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'col3': [1, .67, 0.5, 2, .65, .75, 2.25, 2.5, .5, 2, 2.75]})
print(x)
col1 col2 col3
0 a 1 1.00
1 a 1 0.67
2 b 1 0.50
3 c 1 2.00
4 c 1 0.65
5 a 2 0.75
6 b 2 2.25
7 c 2 2.50
8 a 3 0.50
9 b 3 2.00
10 c 3 2.75
To create the pivot, I'm using the following function:
pt = pd.pivot_table(x, index = ['col1', 'col2'], values = 'col3', aggfunc = np.sum)
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 1 0.50
2 2.25
3 2.00
c 1 2.65
2 2.50
3 2.75
In words, pt is sorted first by col1, then by col2 within col1. This is great, but I would like to sort by col3 (the values) while keeping the col1 groups intact (col2 can end up in any order, shuffled around).
The target output would look something like this (col3 in descending order within each col1 group, col2 in whatever order that produces):
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
I have tried the code below, but this just sorts the entire pivot table values and loses the grouping (I'm looking for sorting within the group).
pt.sort_values(by = 'col3', ascending = False)
For guidance, a similar question was asked (and answered) here, but I was unable to get a successful result with the provided answer:
Pandas: Sort pivot table
The error I get from that answer is ValueError: all keys need to be the same shape
You need reset_index on the DataFrame, then sort_values by col1 and col3, and finally set_index to rebuild the MultiIndex:
pt = (pt.reset_index()
        .sort_values(['col1', 'col3'], ascending=[True, False])
        .set_index(['col1', 'col2']))
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
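On pandas 0.23 and later, sort_values can mix index level names with column labels, so the reset_index/set_index round-trip can be skipped; a sketch under that version assumption:
# 'col1' is resolved as an index level, 'col3' as a column
pt = pt.sort_values(['col1', 'col3'], ascending=[True, False])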