Python how to count boolean in dataframe and compute percentage - python-3.x

I would like to do 2 manipulations with pandas but I cannot manage to do them.
I have a dataframe that looks like this:
df = pd.DataFrame({'Name': {0: 'Eric', 1: 'Mattieu',2: 'Eric',3: 'Mattieu', 4: 'Mattieu',5: 'Eric',6:'Mattieu',7:'Franck',8:'Franck',9:'Jack',10:'Jack'},
'Value': {0: False, 1:False,2:True,3:False, 4:True,5: True,6: False,7:True,8:True,9:False,10:False},
})
df=df.sort_values(["Name"])
df
output:
Name Value
0 Eric False
2 Eric True
5 Eric True
7 Franck True
8 Franck True
9 Jack False
10 Jack False
1 Mattieu False
3 Mattieu False
4 Mattieu True
6 Mattieu False
Manipulation 1: I would like to have the number of True, the number of False, the total count of values and the mean of True values for each Name, like this:
Name Nbr True Nbr False Total Value Mean (True/(False+True))
0 Eric 2 1 3 0.75
1 Franck 2 0 2 1.00
2 Jack 0 2 2 0.00
3 Mattieu 1 3 4 0.25
Manipulation 2: I would like to get the mean of the column "Mean (True/(False+True))" grouped by "Total Value", like this:
Group by Total Value Mean of grouped Total Value
0 2 0.50
1 3 0.75
2 4 0.25
Thanks in advance for your help

The first one can be done with crosstab:
s1 = pd.crosstab(df['Name'], df['Value'], margins=True).drop('All').assign(Mean = lambda x : x[True]/x['All'])
Out[266]:
Value False True All Mean
Name
Eric 1 2 3 0.666667
Franck 0 2 2 1.000000
Jack 2 0 2 0.000000
Mattieu 3 1 4 0.250000
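For reference, a hedged groupby-based sketch that builds the same summary with the column names from the question (assumes pandas >= 0.25 for named aggregation; the intermediate names n_true/total are only illustrative):
summary = (df.groupby('Name')['Value']
             .agg(n_true='sum', total='count')      # True counts as 1 when summed
             .assign(n_false=lambda x: x['total'] - x['n_true'],
                     mean=lambda x: x['n_true'] / x['total'])
             .reset_index()
             .rename(columns={'n_true': 'Nbr True', 'n_false': 'Nbr False',
                              'total': 'Total Value',
                              'mean': 'Mean (True/(False+True))'}))
print(summary)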
The second one can be done with groupby on the first result:
s2 = s1.groupby('All').apply(lambda x : sum(x[True])/sum(x['All'])).reset_index(name='Mean of ALL')
Out[274]:
All Mean of ALL
0 2 0.500000
1 3 0.666667
2 4 0.250000
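If you instead want the plain (unweighted) mean of the per-Name ratio, a small sketch based on the s1 frame above gives the same numbers for this data (in general it can differ from the weighted version):
s2_alt = (s1.groupby('All')['Mean'].mean()
            .reset_index(name='Mean of grouped Total Value')
            .rename(columns={'All': 'Group by Total Value'}))
print(s2_alt)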

Related

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d within those 4 columns. Currently I'm pulling the difference between each of the columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of nan):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just use cummax:
col = ['a','b','c','d']
s = df[col].cummax(axis=1)
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
dtype: bool
You can also try comparing the positions of 4 and 1 inside apply:
cols = ['a', 'b', 'c', 'd']
def get_index(lst, num):
    return lst.index(num) if num in lst else -1
df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
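If apply turns out to be slow on a large frame, here is a rough vectorized sketch of the same first-position comparison (the column name Check_vec is only illustrative):
import numpy as np
vals = df[cols]
# argmax gives the first matching column position; rows with no 4 (or no 1) get -1
pos4 = np.where(vals.eq(4).any(axis=1), vals.eq(4).values.argmax(axis=1), -1)
pos1 = np.where(vals.eq(1).any(axis=1), vals.eq(1).values.argmax(axis=1), -1)
df['Check_vec'] = pos4 > pos1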

pandas compare 1 row value with every other row value and create a matrix

DF in hand
Steps I want to perform:
compare A001 data with A002, A003,...A00N
for every value that matches raise a counter by 1
do not increment the count if NA
repeat for row A002 with all other rows
create a matrix using the index with total count of matching values
DF creation:
data = {'name':['A001', 'A002', 'A003',
'A004','A005','A006','A007','A008'],
'Q1':[2,1,1,1,2,1,1,5],
'Q2':[4,4,4,2,4,2,5,4],
'Q3':[2,2,3,2,2,3,2,2],
'Q4':[5,3,5,2,3,2,4,5],
'Q5':[2,2,3,2,2,2,2,2]}
df = pd.DataFrame(data)
df.at[7, 'Q3'] = None
desired output
thanks in advance.
IIUC,
df = pd.DataFrame({'name':['A001', 'A002', 'A003', 'A004','A005','A006','A007','A008'],
'Q1':[2,1,1,1,2,1,1,5],
'Q2':[4,4,4,2,4,2,5,4],
'Q3':[2,2,3,2,2,3,2,2],
'Q4':[5,3,5,2,3,2,4,5],
'Q5':[2,2,3,2,2,2,2,2]})
dfm = df.merge(df, how='cross').set_index(['name_x','name_y'])
dfm.columns = dfm.columns.str.split('_', expand=True)
df_out = dfm.stack(0).apply(pd.to_numeric, errors='coerce').diff(axis=1).eq(0).sum(axis=1).groupby(level=[0,1]).sum().unstack()
output:
name_y A001 A002 A003 A004 A005 A006 A007 A008
name_x
A001 5 3 2 2 4 1 2 4
A002 3 5 2 3 4 2 3 3
A003 2 2 5 1 1 2 1 2
A004 2 3 1 5 2 4 3 2
A005 4 4 1 2 5 1 2 3
A006 1 2 2 4 1 5 2 1
A007 2 3 1 3 2 2 5 2
A008 4 3 2 2 3 1 2 5
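A rough numpy alternative for the same pairwise match count, using the reconstructed df above (every row is broadcast against every other row; NaN never compares equal, which also covers the "do not increment if NA" rule):
import numpy as np
vals = df.set_index('name').to_numpy(dtype=float)
counts = (vals[:, None, :] == vals[None, :, :]).sum(axis=2)
matrix = pd.DataFrame(counts, index=df['name'], columns=df['name'])
print(matrix)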

Match pandas column values and headers across dataframes

I have 3 files that I am reading into dataframes (https://pastebin.com/v7BnSH3s)
map_df: Maps data_file headers to codes_df headers
Field Name Code Name
Gender gender_codes
Race race_codes
Ethnicity ethnicity_codes
code_df: Valid codes
gender_codes race_codes ethnicity_codes
1 1 1
2 2 2
3 3 3
4 4 4
NaN NaN 5
NaN NaN 6
NaN NaN 7
data_df: the actual data that needs to be checked against the codes
Name Gender Race Ethnicity
Alex 99 1 7
Cindy 2 4 5
Tom 1 99 1
Problem:
I need to confirm that each value in every column of data_df is a valid code. If not, I need to write the Name, the invalid value and the column header label as a new column. So my example data_df would yield the following dataframe for the gender_codes check:
result_df:
Name Value Column
Alex 99 Gender
Background:
My actual data file has over 100 columns.
A code column can map to multiple columns in the data_df.
I am currently not using the map_df other than to know which columns map to
which codes. However, if I can incorporate this into my script, that would be
ideal.
What I've tried:
I am currently sending each code column to a list, removing the nan string, performing the lookup with loc and isin, then setting up the result_df...
# code column to list
gender_codes = codes_df["gender_codes"].tolist()
# remove nan string
gender_codes = [gender_codes
                for gender_codes in gender_codes
                if str(gender_codes) != "nan"]
# check each value against code list
result_df = data_df.loc[(~data_df.Gender.isin(gender_codes))]
result_df = result_df.filter(items = ["Name","Gender"])
result_df.rename(columns = {"Gender":"Value"}, inplace = True)
result_df['Column'] = 'Gender'
This works but obviously is extremely primitive and won't scale with my dataset. I'm hoping to find an iterative and pythonic approach to this problem.
EDIT:
Modified Dataset with np.nan
https://pastebin.com/v7BnSH3s
Boolean indexing
I'd reformat your data into different forms
m = dict(map_df.itertuples(index=False))
c = code_df.T.stack().groupby(level=0).apply(set)
ddf = data_df.melt('Name', var_name='Column', value_name='Value')
ddf[[val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]]
Name Column Value
0 Alex Gender 99
5 Tom Race 99
Details
m # Just a dictionary with the same content as `map_df`
{'Gender': 'gender_codes',
'Race': 'race_codes',
'Ethnicity': 'ethnicity_codes'}
c # Series of membership sets
ethnicity_codes {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
gender_codes {1.0, 2.0, 3.0, 4.0}
race_codes {1.0, 2.0, 3.0, 4.0}
dtype: object
ddf # Melted dataframe to help match the final output
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
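As a small follow-up, the same mask can be packaged with the column order the question asked for (Name, Value, Column); a sketch:
mask = [val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]
result_df = ddf.loc[mask, ['Name', 'Value', 'Column']].reset_index(drop=True)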
You will need to preprocess your dataframes and define a validation function. Something like below:
1. Preprocessing
# call melt() to convert columns to rows
mcodes = codes_df.melt(
    value_vars=list(codes_df.columns),
    var_name='Code Name',
    value_name='Valid Code').dropna()
mdata = data_df.melt(
    id_vars='Name',
    value_vars=list(data_df.columns[1:]),
    var_name='Column',
    value_name='Value')
validation_df = mcodes.merge(map_df, on='Code Name')
Out:
mcodes:
Code Name Valid Code
0 gender_codes 1
1 gender_codes 2
7 race_codes 1
8 race_codes 2
9 race_codes 3
10 race_codes 4
14 ethnicity_codes 1
15 ethnicity_codes 2
16 ethnicity_codes 3
17 ethnicity_codes 4
18 ethnicity_codes 5
19 ethnicity_codes 6
20 ethnicity_codes 7
mdata:
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
validation_df:
Code Name Valid Code Field Name
0 gender_codes 1 Gender
1 gender_codes 2 Gender
2 race_codes 1 Race
3 race_codes 2 Race
4 race_codes 3 Race
5 race_codes 4 Race
6 ethnicity_codes 1 Ethnicity
7 ethnicity_codes 2 Ethnicity
8 ethnicity_codes 3 Ethnicity
9 ethnicity_codes 4 Ethnicity
10 ethnicity_codes 5 Ethnicity
11 ethnicity_codes 6 Ethnicity
12 ethnicity_codes 7 Ethnicity
2. Validation Function
def isValid(row):
    valid_list = validation_df[validation_df['Field Name'] == row.Column]['Valid Code'].tolist()
    return row.Value in valid_list
3. Validation
mdata['isValid'] = mdata.apply(isValid, axis=1)
result = mdata[mdata.isValid == False]
Out:
result:
Name Column Value isValid
0 Alex Gender 99 False
5 Tom Race 99 False
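If the row-wise apply becomes a bottleneck, a hedged merge-based sketch of the same check (it assumes the mdata and validation_df frames built above; both key columns are cast to float so the dtypes line up):
valid = validation_df.rename(columns={'Field Name': 'Column', 'Valid Code': 'Value'})
checked = (mdata.astype({'Value': float})
                .merge(valid.astype({'Value': float})[['Column', 'Value']],
                       on=['Column', 'Value'], how='left', indicator=True))
result = checked.loc[checked['_merge'] == 'left_only', ['Name', 'Column', 'Value']]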
A more compact variant builds the mapping dict directly and stacks the masked frame:
m, df1 = dict(map_df.values), data_df.set_index('Name')
df1[df1.apply(lambda x:~x.isin(code_df[m[x.name]]))].stack().reset_index()
Out:
Name level_1 0
0 Alex Gender 99.0
1 Tom Race 99.0

How to delete the entire row if any of its value is 0 in pandas

In the below example I only want to retain rows 1 and 2.
I want to delete all the rows which have a 0 anywhere across the columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes the row only if all of its corresponding columns are 0
You are really close; you need DataFrame.all to check that all values per row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
Alternative solution with any, checking for at least one True per row on the changed mask df == 0 and inverting with ~:
df = df.loc[~(df==0).any(axis=1)]
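Either way, a self-contained reproduction (data transcribed from the question's table) looks like this:
import pandas as pd

df = pd.DataFrame({'kt':    [1, 2, 3, 0, 5],
                   'b':     [1, 2, 3, 4, 5],
                   'tt':    [1, 2, 0, 0, 5],
                   'mky':   [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]},
                  index=[1, 2, 3, 4, 5])
print(df.loc[(df != 0).all(axis=1)])   # keeps only rows 1 and 2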

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at python.
I am trying to get a cumulative sum for each client to see the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to be reset when we have a 0. The reset also needs to happen when we have a new client. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume I kind of need to put them together.
Adapting the code of 'Cumsum reset at NaN' to reset at 0 works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply, first identifying the contiguous runs within each group with cumsum. Then use groupby.cumcount to number the rows within each run and add 1.
Finally, multiply by the original column to act as an AND, cancelling the count wherever the flag is 0.
df['d'] = df.groupby('a')['c'] \
.apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way would be to apply a function over series.expanding on the groupby object, which evaluates the series from the first index up to the current one.
reduce then applies a function of two arguments cumulatively to the items of each expanding window, reducing it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
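For completeness, a fully vectorized sketch with no Python-level lambda (same df; the column name d_fast is only illustrative): build a run id that changes whenever c changes value, then count within each (client, run) pair and zero out the rows where c is 0.
run_id = df['c'].ne(df['c'].shift()).cumsum()
df['d_fast'] = df.groupby(['a', run_id]).cumcount().add(1).mul(df['c'])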
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return (x)
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
