Pandas DataFrame: show count with groupby and aggregate (python-3.x)

I have this data:
ID  Value1  Value2  Type  Type2
1   3       1       A     X
2   2       2       A     X
3   5       3       B     Y
4   2       4       B     Z
5   6       8       C     Z
6   7       9       C     Z
7   8       0       C     L
8   3       2       D     M
9   4       3       D     M
10  6       5       D     M
11  8       7       D     M
Right now I am able to generate this output using this code:
pandabook.groupby(['Type','Type2'], as_index=False)['Value1', 'Value2'].agg({'Value1': 'sum', 'Value2': 'sum'})
ID  Value1  Value2  Type  Type2
1   5       3       A     X
2   5       3       B     Y
3   2       4       B     Z
4   13      17      C     Z
5   8       0       C     L
6   21      17      D     M
I want to show the aggregated count as well, as shown in this example.
How can I achieve this output?

Add a new key to the dictionary with the size function, and remove as_index=False to prevent:
ValueError: cannot insert Type, already exists
Finally, rename the new column and call reset_index:
df = pandabook.groupby(['Type','Type2']).agg({'Value1': 'sum','Value2': 'sum', 'Type':'size'})
df = df.rename(columns={'Type':'Count'}).reset_index()
print (df)
  Type Type2  Value1  Value2  Count
0    A     X       5       3      2
1    B     Y       5       3      1
2    B     Z       2       4      1
3    C     L       8       0      1
4    C     Z      13      17      2
5    D     M      21      17      4
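As a side note, on pandas 0.25+ the same result can be written with named aggregation, which avoids the rename step. A minimal sketch against the question's pandabook frame (the size of any aggregated column equals the group size, so Value1 doubles as the count):
df1 = (pandabook.groupby(['Type', 'Type2'])
                .agg(Value1=('Value1', 'sum'),
                     Value2=('Value2', 'sum'),
                     Count=('Value1', 'size'))
                .reset_index())
print (df1)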

Related

pandas fill 0s with mean based on rows that match a condition in another column

I have a dataframe like the one below, in which I need to replace the 0s with the mean of the rows whose parent_key matches the self_key.
Input DataFrame: df = pd.DataFrame({'self_key':['a','b','c','d','e','e','e','f','f','f'], 'parent_key':[np.nan,'a','b','b','c','c','c','d','d','d'], 'value':[0,0,0,0,4,6,14,12,8,22], 'level':[1,2,3,3,4,4,4,4,4,4]})
Row 3 has a self_key of 'd', so I would need to replace its 0 in the 'value' column with the mean of rows 7, 8, and 9, giving the correct value of 14. Since the lower levels feed into the higher levels, I would need to work from the lowest level to the highest to fill out the dataframe. But when I run the code below, it doesn't work and I get the error "ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional". How can I fill in the 0s with the means, from the lowest level to the highest?
df['value']=np.where((df['value']==0) & (df['level']==3), df['value'].groupby(df.where(df['parent_key']==df['self_key'])).transform('mean'), df['value'])
Input
  self_key parent_key  value  level
0        a        NaN      0      1
1        b          a      0      2
2        c          b      0      3
3        d          b      0      3
4        e          c      4      4
5        e          c      6      4
6        e          c     14      4
7        f          d     12      4
8        f          d      8      4
9        f          d     22      4
My approach is to repeat the above code three times, changing the level from 3 to 2 to 1, but it's not working even for level 3.
Expected Output:
  self_key parent_key  value  level
0        a        NaN     11      1
1        b          a     11      2
2        c          b      8      3
3        d          b     14      3
4        e          c      4      4
5        e          c      6      4
6        e          c     14      4
7        f          d     12      4
8        f          d      8      4
9        f          d     22      4
If I understand your problem correctly, you are trying to compute the mean in a bottom-up fashion by filtering the dataframe on certain keys. If so, the following should solve it:
# work bottom-up: at each level, fill the zero rows with the mean of the
# rows whose parent_key equals that row's self_key
for l in range(df["level"].max() - 1, 0, -1):
    df_sub = df[(df["level"] == l) & (df["value"] == 0)]
    self_keys = df_sub["self_key"].tolist()
    for k in self_keys:
        df.loc[df_sub[df_sub["self_key"] == k].index, "value"] = df[df["parent_key"] == k]["value"].mean()
[Out]:
  self_key parent_key  value  level
0        a        NaN     11      1
1        b          a     11      2
2        c          b      8      3
3        d          b     14      3
4        e          c      4      4
5        e          c      6      4
6        e          c     14      4
7        f          d     12      4
8        f          d      8      4
9        f          d     22      4
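A more vectorized variant of the same bottom-up idea: recompute each parent's mean once per level with groupby, then map it onto the zero rows via self_key. A sketch under the same assumptions as the loop above:
# recompute parent means at every level, then map them onto the zero rows
for lvl in range(df['level'].max() - 1, 0, -1):
    parent_means = df.groupby('parent_key')['value'].mean()
    mask = (df['level'] == lvl) & (df['value'] == 0)
    df.loc[mask, 'value'] = df.loc[mask, 'self_key'].map(parent_means)
This avoids the inner Python loop over keys; each level is one groupby and one map.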

Complete DataFrame with missing steps python

I have a pandas dataframe that is missing some rows. It has the following format:
id  step  var1  var2
1   1     a     h
2   1     b     i
3   1     c     g
1   3     d     k
2   2     e     l
5   2     f     m
6   1     g     n
...
An observation should pass through every step: id == 1 has steps 1 and 3 but misses step 2 (which I don't want). id == 2 has steps 1 and 2 and no step 3, which is fine because there is no gap. id == 5 has step 2 but not step 1, so I am missing a line there.
I need to add rows to complete the steps, keeping var1, var2, and id the same.
I would like to obtain this df:
id  step  var1  var2
1   1     a     h
2   1     b     i
3   1     c     g
1   3     d     k
2   2     e     l
5   2     f     m
6   1     g     n
1   2     a     h
5   1     f     m
...
It would be awesome if anyone could help with a smooth solution.
You can try pivoting the table, then ffill and bfill:
(df.pivot(index='id', columns='step')
   .groupby(level=0, axis=1)
   .apply(lambda x: x.ffill().bfill())
   .stack()
   .reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 e l
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 d k
6 3 1 c g
7 3 2 e l
8 3 3 d k
9 5 1 c g
10 5 2 f m
11 5 3 d k
12 6 1 g n
13 6 2 f m
14 6 3 d k
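Note that this fills across ids (id 1, step 2 receives e l from id 2). If you want each id to keep its own var1/var2, as in the expected output in the question, here is a hedged per-group sketch, assuming steps should run from 1 up to each id's own maximum:
# reindex each id from step 1 to its own max step, then fill var1/var2
# from the same id's existing rows
out = (df.groupby('id', group_keys=False)
         .apply(lambda g: g.set_index('step')
                           .reindex(range(1, g['step'].max() + 1))
                           .ffill().bfill()
                           .reset_index())
         .reset_index(drop=True))
out['id'] = out['id'].astype(int)  # reindex can upcast id to float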

pandas transform one row into multiple rows

I have a dataframe as below:
ID  list
1   a, b, c
2   a, s
3   NA
5   f, j, l
I need to break each item in the list column (a string) into an independent row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
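If you also want the column renamed to item and a fresh 0..n index, as in the expected output, the chain extends naturally (the item name is the question's, not a pandas default):
out = (df.assign(list=df["list"].str.split(", "))
         .explode("list")
         .rename(columns={"list": "item"})
         .reset_index(drop=True))
print (out)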
A beginner's approach: just another way of doing the same thing, using pd.DataFrame.stack.
df['list'] = df['list'].map(lambda x : str(x).split(','))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns = {0 : 'list'}, inplace = True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l
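One caveat with this approach: str(np.nan).split(',') produces the string 'nan', which is why the output above shows nan as text rather than a real missing value. A guarded variant of the split, if you want to keep NaN intact:
# split only real strings; wrap anything else (e.g. NaN) in a list as-is
df['list'] = df['list'].map(lambda x: x.split(',') if isinstance(x, str) else [x])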

Check whether values from a given list exist in multiple columns, and count the number of columns

I have the below df:
B   C   D   E
2   2   4   11
11  0   5   3
12  10  1   11
5   9   7   15
First I want the unique values from the whole df, like below:
[0, 1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 15]
Then I want this final output:
value  number of columns it appears in
0      1
1      1
2      2
3      1
4      1
5      2
7      1
9      1
10     1
11     2
12     1
15     1
That means: for each value, how many columns it appears in. That is the output I want.
Using plain Python, you can do something like this:
# your input df as a list of lists
df = [[2,11,12,5], [2,0,10,9], [4,5,1,7], [11,3,11,15]]
#remove duplicates in each list
dfU = [list(set(l)) for l in df]
# sort each list (not required for this approach)
for l in dfU:
    l.sort()
# the requested unique list
flatList = [item for sublist in df for item in sublist]
uniqueList = list(set(flatList))
print(uniqueList)
# output as a list of lists
output = []
for num in uniqueList:
    cnt = 0
    for idx in range(len(dfU)):
        if dfU[idx].count(num) > 0:
            cnt += 1
    output.append([num, cnt])
print(output)
Side note: the count function is computationally expensive, so it would be better to do a linear scan along all sorted columns.
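Building on that side note, set membership tests are O(1), so the inner count can be avoided entirely. A small sketch over the same dfU from above:
# one set per column, then a membership test per (value, column) pair
dfU_sets = [set(col) for col in dfU]
output = [[num, sum(num in s for s in dfU_sets)] for num in sorted(uniqueList)]
print(output)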
Use DataFrame.melt to reshape, remove duplicates across both columns, and count with GroupBy.size, finishing with Series.reset_index to get a DataFrame:
df1 = (df.melt(value_name='value')
         .drop_duplicates()
         .groupby('value')
         .size()
         .reset_index(name='count'))
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
Details:
print (df.melt(value_name='value'))
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
14 E 11
15 E 15
The duplicate 11 at index 14 is removed:
print (df.melt(value_name='value').drop_duplicates())
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
15 E 15
If you want a pure Python solution:
from collections import Counter
L = sorted(Counter([y for x in df.T.values for y in set(x)]).items())
df1 = pd.DataFrame(L, columns=['value','count'])
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
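The groupby step can also be replaced with Series.value_counts if you prefer; a sketch that should produce the same table, sorted by value:
df1 = (df.melt(value_name='value')
         .drop_duplicates()['value']
         .value_counts()
         .sort_index()
         .rename_axis('value')
         .reset_index(name='count'))
print (df1)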

Column name and index of max value

I currently have a pandas dataframe in which values between 0 and 1 are stored. I am looking for a function that gives me the top 5 values of each column, together with the column name and the index associated with each value.
Sample input: a data frame with column names a:z, index 1:23, and entries between 0 and 1.
Sample output: an array of the 5 highest entries in each column, each with its column name and index.
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like this (for example, for the first column):
[[8, b, A], [8, c, A], [6, i, A], [5, h, A], [4, g, A]]
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and the 10 - 5 = 5 largest entries:
v = df.values
i = df.index.values
k = len(v) - 5
# argpartition along axis 0 moves the positions of the 5 largest values in
# each column to the end; translate those positions back to index labels
pd.DataFrame(
    i[v.argpartition(k, 0)[-k:]],
    np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
# sort one column and take its five largest entries
print(your_dataframe['A'].sort_values(ascending=False)[:5])
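To get the exact [[value, index, column]] triples from the question's edit, Series.nlargest per column is another compact option; a sketch over the sample df:
# five largest entries per column, as [value, index label, column name]
result = {c: [[v, i, c] for i, v in df[c].nlargest(5).items()] for c in df.columns}
print(result['A'])
[[8, 'b', 'A'], [8, 'c', 'A'], [6, 'i', 'A'], [5, 'h', 'A'], [4, 'g', 'A']]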
