Should I stack, pivot, or groupby? - python-3.x

I'm still learning how to work with dataframes and can't manage this yet. I have a dataframe like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.

This is a job for wide_to_long:
ydf=pd.wide_to_long(df,'D',i=['A','B','C'],j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
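Note that wide_to_long keeps only the numeric suffix in the DA column. If you want the literal D1/D2/D3 labels from the desired output, one simple follow-up is to prefix them back:
ydf['DA'] = 'D' + ydf['DA'].astype(str)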

Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack
df.set_index(['A','B','C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
And you can do housekeeping by renaming the column headers, etc.
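For example, a minimal renaming sketch for the stack solution (level_3 and 0 are the default names pandas assigns after reset_index):
df.set_index(['A','B','C']).stack().reset_index().rename(columns={'level_3': 'DA', 0: 'D'})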

Related

How can I unpivot or stack a pandas dataframe in the way that I asked?

I have a Python pandas dataframe.
For example, here is my data:
id A_1 A_2 B_1 B_2
0 j2 1 5 10 8
1 j3 2 6 11 9
2 j4 3 7 12 10
I want it to look like this:
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
...
Can you help me, please? Thank you so much!
Use wide_to_long with DataFrame.sort_values:
df = (pd.wide_to_long(df, ['A','B'], i='id', j='Other', sep='_')
.sort_values('id')
.reset_index())
print (df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
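For reference, here is a self-contained sketch that rebuilds the sample frame from the table above and reproduces this result:
import pandas as pd

df = pd.DataFrame({'id': ['j2', 'j3', 'j4'],
                   'A_1': [1, 2, 3], 'A_2': [5, 6, 7],
                   'B_1': [10, 11, 12], 'B_2': [8, 9, 10]})

df = (pd.wide_to_long(df, ['A', 'B'], i='id', j='Other', sep='_')
        .sort_values('id')
        .reset_index())
print(df)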
We can also use DataFrame.melt + Series.str.split to prepare the data, and then DataFrame.pivot_table:
df2 = df.melt('id')
df2[['columns', 'Other']] = df2['variable'].str.split('_', expand=True)
new_df = (df2.pivot_table(columns='columns', index=['id', 'Other'], values='value')
             .reset_index()
             .rename_axis(columns=None))
print(new_df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
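A variant of the same idea (a sketch, not part of the original answer) replaces pivot_table with set_index + unstack, which sidesteps pivot_table's default mean aggregation:
new_df = (df2.set_index(['id', 'Other', 'columns'])['value']
             .unstack()
             .reset_index()
             .rename_axis(columns=None))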

Dataframe: find duplicate values in a column based on other columns, and then add a label to it

Given the following data frame:
import pandas as pd
d = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                  'values': ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'b']})
d
ID values
0 1 a
1 1 b
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 2 b
The data I want to get is:
ID values count label(values + ID)
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
Thank you so much!
It seems you need transform with count, plus cumcount:
d['count'] = d.groupby(['ID', 'values'])['values'].transform('count')
d['label'] = (d['values'] + d['ID'].astype(str)
              + d.groupby(['ID', 'values']).cumcount().add(1).astype(str))
d
Out[511]:
ID values count label
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
You want to group by ID and values. Within each group, you are interested in two things: the number of members in the group (count) and the occurrence within the group (order):
d['order'] = d.groupby(['ID', 'values']).cumcount() + 1
d['count'] = d.groupby(['ID', 'values'])['values'].transform('count')
You can then build the label by concatenating the string values of values, ID, and order with sum along the rows:
d['label'] = d[['values', 'ID', 'order']].astype(str).sum(axis=1)
Which leads to:
ID values order count label
0 1 a 1 3 a11
1 1 b 1 1 b11
2 1 a 2 3 a12
3 1 a 3 3 a13
4 2 a 1 2 a21
5 2 a 2 2 a22
6 2 b 1 2 b21
7 2 b 2 2 b22

Sum of all rows based on specific column values

I have a df like this:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
I want to sum all the rows whose Parameters value is "Apple", "Banana" or "Pear".
Output:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
6 Total 7 11 13 11 13
My Effort:
df.loc['Total'] = df.sum()  # works, but it sums all rows and I only want specific ones
I also tried selecting by index (1, 2 and 5 in this example), but in my original df the index can vary from time to time, so I rejected that solution.
I saw various answers on SO, but none of them solved my problem.
The first idea is to create an index from the Parameters column, select the rows to sum, and finally convert the index back to a column:
L = ["Apple" , "Banana" , "Pear"]
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()
df = df.reset_index()
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
Or add a new row that sums the rows filtered by membership with Series.isin, and then overwrite the Parameters value of the added row with Total:
last = len(df)
df.loc[last] = df[df['Parameters'].isin(L)].sum()
df.loc[last, 'Parameters'] = 'Total'
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
Another similar solution filters out the first column, sums the rest, and prepends the Total label as a one-element list:
df.loc[len(df)] = ['Total'] + df.iloc[df['Parameters'].isin(L).values, 1:].sum().tolist()
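For completeness, a self-contained sketch of the first approach, with the sample frame rebuilt from the table above (the Index column is treated as a plain row number and left out):
import pandas as pd

df = pd.DataFrame({'Parameters': ['Apple', 'Banana', 'Potato', 'Tomato', 'Pear'],
                   'A': [1, 2, 3, 1, 4], 'B': [2, 4, 5, 1, 5],
                   'C': [3, 5, 3, 1, 5], 'D': [4, 3, 2, 1, 4],
                   'E': [5, 5, 1, 1, 3]})

L = ['Apple', 'Banana', 'Pear']
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()
df = df.reset_index()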

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like this:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I do df.groupby('A'), how do I turn each group into a sub-dataframe? For A=1 it should look like:
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
By using get_group:
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
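To pull out the other groups, pass their keys to get_group in the same way:
g.get_group(2)
g.get_group(3)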
You are close; you need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
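If you only need to process each sub-DataFrame in turn rather than keep them around, you can also iterate over the groupby object directly:
for name, group in df.groupby('A'):
    print(name)
    print(group)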

pandas moving aggregate string

import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), sep=r'\s+')
I want to create a column that shows the cumulative state of the column state, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically, I want the cumulative concatenation of a string column. What is the best way to do it?
So long as the dtype is str, you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we group by the 'id' column and then apply a lambda that returns the cumsum. This performs a cumulative concatenation of the string values and returns a Series with its index aligned to the original df, so you can add it as a column.
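If you prefer to avoid calling cumsum on strings, a plain-Python accumulation inside transform is one alternative sketch (assuming state is read in as strings, as above):
from itertools import accumulate

df['result'] = df.groupby('id')['state'].transform(lambda s: list(accumulate(s)))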
