How can I unpivot or stack a pandas dataframe in the way that I asked?

I have a python pandas dataframe.
For example here is my data:
id A_1 A_2 B_1 B_2
0 j2 1 5 10 8
1 j3 2 6 11 9
2 j4 3 7 12 10
I want it to look like this:
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
... (and so on for the remaining id/suffix pairs)
Can you help me, please? Thank you so much!

Use wide_to_long with DataFrame.sort_values:
import pandas as pd

df = (pd.wide_to_long(df, ['A', 'B'], i='id', j='Other', sep='_')
        .sort_values('id')
        .reset_index())
print(df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
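For reference, the snippet can be reproduced end-to-end by rebuilding the question's sample data (a minimal sketch; the frame below mirrors the table at the top of the question):

```python
import pandas as pd

# the question's sample frame
df = pd.DataFrame({'id': ['j2', 'j3', 'j4'],
                   'A_1': [1, 2, 3], 'A_2': [5, 6, 7],
                   'B_1': [10, 11, 12], 'B_2': [8, 9, 10]})

# wide_to_long strips the 'A_'/'B_' stubs and moves the
# numeric suffix into the new 'Other' column
out = (pd.wide_to_long(df, ['A', 'B'], i='id', j='Other', sep='_')
         .sort_values('id')
         .reset_index())
print(out)
```

Since pandas 0.23, `sort_values` also accepts index level names, which is why sorting on 'id' works before `reset_index`.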

We can also use DataFrame.melt + Series.str.split, then reshape with DataFrame.pivot_table:
df2 = df.melt('id')
df2[['columns', 'Other']] = df2['variable'].str.split('_', expand=True)
new_df = (df2.pivot_table(columns='columns', index=['id', 'Other'], values='value')
             .reset_index()
             .rename_axis(columns=None))
print(new_df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10

Align value counts of two dataframes side by side

I have two dataframes: df1 (data for the current day) and df2 (data for the previous day).
Both dataframes have 40 columns, and all columns are object dtype.
How do I compare the top 3 value_counts for both dataframes, ideally side by side, like the following:
df1 df2
Column a Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
Column b Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
The main idea is to check for data anomalies between the data for the two days.
I only know that for each column per dataframe, I must do something like this -
df1.Column.value_counts().head(3)
But this doesn't show the combined results the way I want. Please help!
Assuming both DataFrames share the same column names: first use a lambda with Series.value_counts to take the top 3 per column and reset to a default index in both DataFrames, then join them with concat, and add DataFrame.stack for the expected order:
import numpy as np
import pandas as pd

np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c')

df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df = pd.concat([df11, df22], axis=1, keys=('df1', 'df2')).stack().sort_index(level=1)
print(df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
Or use DataFrame.compare:
df = (df11.compare(df22, keep_equal=True)
          .rename(columns={'self': 'df1', 'other': 'df2'})
          .stack(0)
          .sort_index(level=1))
print(df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
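As a self-contained illustration of the DataFrame.compare mechanics used here (the small frames a and b below are made-up stand-ins for df11/df22):

```python
import pandas as pd

a = pd.DataFrame({'c0': [7, 6, 6], 'c1': [8, 7, 7]})
b = pd.DataFrame({'c0': [8, 8, 6], 'c1': [9, 7, 6]})

# keep_equal=True shows matching values instead of NaN; compare()
# returns a column MultiIndex of (column, 'self'/'other'), so
# stack(0) moves the column names into the index
cmp = (a.compare(b, keep_equal=True)
        .rename(columns={'self': 'df1', 'other': 'df2'})
        .stack(0)
        .sort_index(level=1))
print(cmp)
```

Note that compare() drops any row or column that is entirely equal unless keep_shape=True is also passed, so the two inputs should differ somewhere in every row for this layout to match the concat version.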
EDIT: To add the categories, use f-strings to join the indices and values of each Series in a list comprehension:
np.random.seed(2022)
df1 = 'Cat1' + pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c').astype(str)
df2 = 'Cat2' + pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c').astype(str)

df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df = pd.concat([df11, df22], axis=1, keys=('df1', 'df2')).stack().sort_index(level=1)
print(df)
df1 df2
0 c0 Cat18 - 7 Cat29 - 8
1 c0 Cat11 - 6 Cat24 - 8
2 c0 Cat19 - 6 Cat23 - 6
0 c1 Cat17 - 8 Cat24 - 9
1 c1 Cat10 - 7 Cat26 - 7
2 c1 Cat14 - 7 Cat20 - 7
0 c2 Cat13 - 9 Cat28 - 7
1 c2 Cat11 - 7 Cat25 - 7
2 c2 Cat19 - 7 Cat26 - 6
0 c3 Cat15 - 9 Cat20 - 7
1 c3 Cat18 - 7 Cat24 - 7
2 c3 Cat13 - 7 Cat27 - 6
0 c4 Cat12 - 11 Cat25 - 14
1 c4 Cat13 - 7 Cat20 - 8
2 c4 Cat15 - 7 Cat26 - 7

How can I add previous column values to get a new value in Excel?

I am working on a graph and need the data in the format below. I have the data in COL A and need to calculate the COL B values as shown in the picture.
What is the formula for obtaining this in Excel?
You can do this with cumsum and shift:
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'COL A': np.arange(11)})
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()
Output:
COL A COL B
0 0 0
1 1 0
2 2 1
3 3 3
4 4 6
5 5 10
6 6 15
7 7 21
8 8 28
9 9 36
10 10 45
Or use a simple Excel technique: since COL A holds the sequence 0, 1, 2, …, the formula =(A3*A2)/2 gives the COL B value on row 3 (fill it down; this exploits the identity that the sum of 0..n-1 is n*(n-1)/2).
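The two answers agree because, for COL A holding 0, 1, 2, …, the running sum of the previous values is the triangular number n*(n-1)/2. A quick pandas sanity check of that equivalence (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL A': np.arange(11)})
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()

# the closed form n*(n-1)//2 matches the shifted cumulative sum
assert (df['COL B'] == df['COL A'] * (df['COL A'] - 1) // 2).all()
print(df)
```

Keep in mind the Excel formula relies on COL A being this exact sequence; the shift+cumsum version works for arbitrary values in COL A.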

Should I stack, pivot, or groupby?

I'm still learning how to play with dataframe and still can't make this... I got a dataframe like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.
This is a job for wide_to_long:
ydf = pd.wide_to_long(df, 'D', i=['A', 'B', 'C'], j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
Note that j='DA' captures only the numeric suffix, so DA is 1/2/3 here rather than D1/D2/D3.
Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack:
df.set_index(['A','B','C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
And you can do the housekeeping by renaming the column headers, etc.
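That housekeeping is a single rename; a sketch on the question's sample row:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3],
                   'D1': [5], 'D2': [6], 'D3': [7]})

# stack() leaves the new index level named 'level_3' and the values
# column named 0 after reset_index, so rename both to the headers
# the question asks for
out = (df.set_index(['A', 'B', 'C'])
         .stack()
         .reset_index()
         .rename(columns={'level_3': 'DA', 0: 'D'}))
print(out)
```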

DataFrame: find duplicate values in a column based on other columns, and then add a label

Given the following data frame:
import pandas as pd
d = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                  'values': ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'b']})
d
ID values
0 1 a
1 1 b
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 2 b
The data I want to get is:
ID values count label(values + ID)
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
Thank you so much!
Seems like you need transform with 'count', plus cumcount:
d['count'] = d.groupby(['ID', 'values'])['values'].transform('count')
d['label'] = d['values'] + d['ID'].astype(str) + d.groupby(['ID', 'values']).cumcount().add(1).astype(str)
d
Out[511]:
ID values count label
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
You want to group by ID and values. Within each group, you are interested in two things: the number of members in the group (count) and the occurrence within the group (order):
df['order'] = df.groupby(['ID', 'values']).cumcount() + 1
df['count'] = df.groupby(['ID', 'values'])['order'].transform('count')
You can then concatenate their string representations with sum:
df['label'] = df[['values', 'ID', 'order']].astype(str).sum(axis=1)
Which leads to:
ID values order count label
0 1 a 1 3 a11
1 1 b 1 1 b11
2 1 a 2 3 a12
3 1 a 3 3 a13
4 2 a 1 2 a21
5 2 a 2 2 a22
6 2 b 1 2 b21
7 2 b 2 2 b22
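Put together on the question's data, the steps above can be sketched as:

```python
import pandas as pd

d = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                  'values': ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'b']})

# build the groupby once and reuse it for both derived columns
g = d.groupby(['ID', 'values'])
d['count'] = g['values'].transform('count')   # group size per (ID, values)
d['order'] = g.cumcount() + 1                 # occurrence within the group

# string concatenation across the three columns yields e.g. 'a11'
d['label'] = d[['values', 'ID', 'order']].astype(str).sum(axis=1)
print(d)
```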

pandas moving aggregate string

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), delimiter= '\t')
I want to create a column show the cumulative state of column state, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically the cum concatenation of string columns. What is the best way to do it?
So long as the dtype is str then you can do the following:
In [17]:
df['result'] = df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we group by the 'id' column and then apply a lambda that returns the cumulative sum. Because string addition concatenates, this performs a cumulative concatenation of the string values and returns a Series with its index aligned to the original df, so you can add it as a column.
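Relying on cumsum over an object-dtype column works, but an explicit alternative is itertools.accumulate, which concatenates step by step; a sketch on a comma-separated, shortened copy of the question's data:

```python
from io import StringIO
from itertools import accumulate

import pandas as pd

df = pd.read_csv(StringIO('''id,months,state
1,1,C
1,2,3
1,3,6
1,4,9
2,1,C
2,2,C'''))

# accumulate's default operator is +, which concatenates strings,
# so each group builds its running concatenation independently
df['result'] = (df.groupby('id')['state']
                  .transform(lambda s: list(accumulate(s.astype(str)))))
print(df)
```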
