Groupby count in pandas of two categorical columns


I have df as shown below.
df
ID Type Status Age
1 2 P 23
2 1 P 28
3 1 F 33
4 3 P 48
5 1 F 23
6 2 P 28
7 2 F 23
8 3 P 38
From the above, I would like to count how often each Status occurs within each Type, including combinations that never occur.
Expected output:
Type Status Frequency
1 F 2
1 P 1
2 F 1
2 P 2
3 F 0
3 P 2
I tried the code below:
df.groupby('Type').agg('Status': 'size').\
sort_values(ascending = False).reset_index()

I think you want value_counts:
df.groupby('Type').Status.value_counts().reset_index(name='Frequency')
Output:
Type Status Frequency
0 1 F 2
1 1 P 1
2 2 P 2
3 2 F 1
4 3 P 2
Or replace reset_index with unstack to get the missing groups:
df.groupby('Type').Status.value_counts().unstack(fill_value=0)
Output:
Status F P
Type
1 2 1
2 1 2
3 0 2
Note: df.groupby('Type').Status.value_counts() is roughly equivalent to df.groupby(['Type', 'Status']).size(), except that value_counts sorts by count within each group.
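To get the long format from the question, including the zero row for Type 3 / Status F, one option is to reindex the grouped counts on the full product of both keys. This is a minimal sketch that rebuilds the sample frame from the question; the variable names counts, full_index and out are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Type": [2, 1, 1, 3, 1, 2, 2, 3],
    "Status": ["P", "P", "F", "P", "F", "P", "F", "P"],
    "Age": [23, 28, 33, 48, 23, 28, 23, 38],
})

# Count the (Type, Status) pairs that actually occur
counts = df.groupby(["Type", "Status"]).size()

# Build every possible (Type, Status) combination from the observed levels
full_index = pd.MultiIndex.from_product(
    counts.index.levels, names=counts.index.names
)

# Missing combinations get a count of 0
out = counts.reindex(full_index, fill_value=0).reset_index(name="Frequency")
print(out)
```

The reindex step is what adds the missing (3, F) pair; without it, the result only contains combinations present in the data.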

Let us try crosstab
pd.crosstab(df.Type, df.Status)
Out[268]:
Status F P
Type
1 2 1
2 1 2
3 0 2
pd.crosstab(df.Type, df.Status).stack().reset_index(name = 'freq')
Out[273]:
Type Status freq
0 1 F 2
1 1 P 1
2 2 F 1
3 2 P 2
4 3 F 0
5 3 P 2

Related

Complete DataFrame with missing steps python

I have a pandas data frame that is missing some rows. It has the following format:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
...
Every observation should pass through every step. For example, id == 1 has steps 1 and 3 but is missing step 2 (which I don't want). id == 2 has steps 1 and 2 and no step 3, which is fine because there is no gap. id == 5 has step 2 but not step 1, so a row is missing there.
I need to add rows to complete the steps, keeping id, var1 and var2 the same as in a neighbouring row.
I would like to obtain this df:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
1 2 a h
5 1 f m
...
It would be awesome if anyone could help with a smooth solution

You can try pivoting the table, then applying ffill and bfill:
(df.pivot(index='id', columns='step')
.groupby(level=0, axis=1)
.apply(lambda x: x.ffill().bfill())
.stack()
.reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 e l
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 d k
6 3 1 c g
7 3 2 e l
8 3 3 d k
9 5 1 c g
10 5 2 f m
11 5 3 d k
12 6 1 g n
13 6 2 f m
14 6 3 d k
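If the goal is exactly the output from the question (each id filled only up to its own last step, with values taken from the same id), a per-id reindex may be closer. This is a minimal sketch under that reading of the question; complete_steps and result are hypothetical names, and the sample frame is rebuilt from the rows shown in the question.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "id":   [1, 2, 3, 1, 2, 5, 6],
    "step": [1, 1, 1, 3, 2, 2, 1],
    "var1": ["a", "b", "c", "d", "e", "f", "g"],
    "var2": ["h", "i", "g", "k", "l", "m", "n"],
})

def complete_steps(group):
    """Return one id's rows with every step from 1 to its max step present."""
    last_step = int(group["step"].max())
    full_steps = pd.RangeIndex(1, last_step + 1, name="step")
    filled = group.drop(columns="id").set_index("step").reindex(full_steps)
    # Fill gaps from the previous step, then leading gaps from the next step
    return filled.ffill().bfill().reset_index()

parts = []
for id_value, group in df.groupby("id"):
    completed = complete_steps(group)
    completed.insert(0, "id", id_value)
    parts.append(completed)

result = pd.concat(parts, ignore_index=True)
print(result)
```

Unlike the pivot approach, this never borrows var1/var2 from a different id, and it does not add steps beyond the last one observed for an id.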

pandas transform one row into multiple rows

I have a dataframe as below.
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (a string) into its own row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.

Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l

A beginner's approach: another way of doing the same thing, using pd.DataFrame.stack:
df['list'] = df['list'].map(lambda x : str(x).split(', '))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns = {0 : 'list'}, inplace = True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Check whether values exist in multiple columns and count the number of columns

I have the df below:
B C D E
2 2 4 11
11 0 5 3
12 10 1 11
5 9 7 15
First, I want the unique values from the whole df, like below:
[0,1,2,3,4,5,7,9,10,11,12,15]
Then I want this final output:
value number_of_columns_containing_value
0 1
1 1
2 2
3 1
4 1
5 1
7 1
9 1
10 1
11 2
12 1
15 1
That is, for each value, I want the number of columns in which it appears.

Using plain Python you can do something like this:
# your input df as a list of lists (one inner list per column)
df = [[2,11,12,5], [2,0,10,9], [4,5,1,7], [11,3,11,15]]
# remove duplicates in each list
dfU = [list(set(l)) for l in df]
# sort each list (not required for this approach)
for l in dfU:
    l.sort()
# the requested unique list, sorted to match the expected output
flatList = [item for sublist in df for item in sublist]
uniqueList = sorted(set(flatList))
print(uniqueList)
# output as a list of [value, number of columns containing it]
output = []
for num in uniqueList:
    cnt = 0
    for idx in range(len(dfU)):
        if dfU[idx].count(num) > 0:
            cnt += 1
    output.append([num, cnt])
print(output)
Side note: list.count scans the whole list on every call, so for large inputs it would be better to do a single linear scan along all sorted columns, or to keep each column as a set and test membership.

Use DataFrame.melt to reshape, remove duplicates across both columns, then count with GroupBy.size and use Series.reset_index to get a DataFrame:
df1 = (df.melt(value_name='value')
.drop_duplicates()
.groupby('value')
.size()
.reset_index(name='count'))
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
Details:
print (df.melt(value_name='value'))
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
14 E 11
15 E 15
The duplicate 11 in column E (index 14) is removed:
print (df.melt(value_name='value').drop_duplicates())
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
15 E 15
If you want a pure Python solution:
from collections import Counter
L = sorted(Counter([y for x in df.T.values for y in set(x)]).items())
df1 = pd.DataFrame(L, columns=['value','count'])
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1

Pandas Dataframe show Count with Group by and Aggregate

I have this data
ID Value1 Value2 Type Type2
1 3 1 A X
2 2 2 A X
3 5 3 B Y
4 2 4 B Z
5 6 8 C Z
6 7 9 C Z
7 8 0 C L
8 3 2 D M
9 4 3 D M
10 6 5 D M
11 8 7 D M
Right now I am able to generate this output using this code:
pandabook.groupby(['Type','Type2'],as_index=False)[['Value1', 'Value2']].agg({'Value1': 'sum','Value2': 'sum'})
ID Value 1 Value2 Type Type2
1 5 3 A X
2 5 3 B Y
3 2 4 B Z
4 13 17 C Z
5 8 0 C L
6 21 17 D M
I want to show the aggregated count (number of rows) for each group as well.
How can I achieve this output?

Add a new entry to the dictionary with the size function, and remove as_index=False to prevent:
ValueError: cannot insert Type, already exists
Finally, rename the column and call reset_index:
df = pandabook.groupby(['Type','Type2']).agg({'Value1': 'sum','Value2': 'sum', 'Type':'size'})
df = df.rename(columns={'Type':'Count'}).reset_index()
print (df)
Type Type2 Value1 Value2 Count
0 A X 5 3 2
1 B Y 5 3 1
2 B Z 2 4 1
3 C L 8 0 1
4 C Z 13 17 2
5 D M 21 17 4
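As an alternative sketch, named aggregation (available since pandas 0.25) produces the same table without aggregating a grouping column and renaming it afterwards. It is worth checking against your installed version, since newer pandas releases may not accept a grouping key such as 'Type' inside the agg dictionary. The frame below is rebuilt from the question's data.

```python
import pandas as pd

# Sample data copied from the question
pandabook = pd.DataFrame({
    "ID": range(1, 12),
    "Value1": [3, 2, 5, 2, 6, 7, 8, 3, 4, 6, 8],
    "Value2": [1, 2, 3, 4, 8, 9, 0, 2, 3, 5, 7],
    "Type": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D"],
    "Type2": ["X", "X", "Y", "Z", "Z", "Z", "L", "M", "M", "M", "M"],
})

# Each keyword is an output column: (input column, aggregation function)
df = pandabook.groupby(["Type", "Type2"], as_index=False).agg(
    Value1=("Value1", "sum"),
    Value2=("Value2", "sum"),
    Count=("Value1", "size"),  # any non-key column works for counting rows
)
print(df)
```

Here Count is computed from Value1 only because size needs some column to count over; it is the number of rows per (Type, Type2) group.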

How to apply a function to a data frame column to create an iterated column

I have IDs with system event times. I grouped the event times by ID (individual systems) and made a new column whose value is 1 if eventtimes.diff() is greater than 1 day, else 0. Now that I have the flag, I am trying to write a function to apply with groupby('ID') so that a new column starts at 1 and keeps returning 1 for each row until the flag shows 1; then the new column goes up by 1, to 2, and keeps returning 2 until the flag shows 1 again.
I will apply this along with groupby('ID') since I need the new column to start over again at 1 for each ID.
I have tried the following:
def try(x):
y = 1
if row['flag']==0:
y = y
else:
y += y+1
df['NewCol'] = df.groupby('ID')['flag'].apply(try)
I have tried differing variations of the above to no avail. Thanks in advance for any help you may provide.
Also, feel free to let me know if I messed up posting the question. Not sure if my title is great either.

Use boolean indexing for filtering + cumcount + reindex, which is much faster than a loopy apply:
I think you need to count only the 1s per group; rows where the flag is not 1 get 1 in the output:
df = pd.DataFrame({
'ID': ['a','a','a','a','b','b','b','b','b'],
'flag': [0,0,1,1,0,0,1,1,1]
})
df['new'] = (df[df['flag'] == 1].groupby('ID')['flag']
.cumcount()
.add(1)
.reindex(df.index, fill_value=1))
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
Detail:
#filter by condition
print (df[df['flag'] == 1])
ID flag
2 a 1
3 a 1
6 b 1
7 b 1
8 b 1
#count per group
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount())
2 0
3 1
6 0
7 1
8 2
dtype: int64
#add 1 for count from 1
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1))
2 1
3 2
6 1
7 2
8 3
dtype: int64
If you need to count the 0s instead, with -1 filled in for rows where the flag is not 0:
df['new'] = (df[df['flag'] == 0].groupby('ID')['flag']
.cumcount()
.add(1)
.reindex(df.index, fill_value=-1))
print (df)
ID flag new
0 a 0 1
1 a 0 2
2 a 1 -1
3 a 1 -1
4 b 0 1
5 b 0 2
6 b 1 -1
7 b 1 -1
8 b 1 -1
Another, two-step solution:
df['new'] = df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1)
df['new'] = df['new'].fillna(1).astype(int)
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
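If the counter should instead go up by one at every row where the flag is 1 (which is how the question describes it), a grouped cumulative sum is a possible reading. This is a minimal sketch under that assumption, reusing the sample frame from above; the column name NewCol is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a','a','a','a','b','b','b','b','b'],
    'flag': [0,0,1,1,0,0,1,1,1]
})

# Running total of flags within each ID, shifted so every ID starts at 1
df['NewCol'] = df.groupby('ID')['flag'].cumsum().add(1)
print(df)
```

With this reading, the first flagged row of each ID already moves to 2, whereas the cumcount approach above keeps it at 1.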
