Align value counts of two dataframes side by side - python-3.x

I have two dataframes: df1 (data for the current day) and df2 (data for the previous day).
Both dataframes have 40 columns, and all columns are of object dtype.
How do I compare the top 3 value_counts for both dataframes, ideally so that the results sit side by side, like the following:
              df1              df2
Column a      Value count 1    Value count 1
              Value count 2    Value count 2
              Value count 3    Value count 3
Column b      Value count 1    Value count 1
              Value count 2    Value count 2
              Value count 3    Value count 3
The main idea is to check for data anomalies between the data for the two days.
So far I only know that, for each column of each dataframe, I can do something like this:
df1.Column.value_counts().head(3)
But this doesn't show the combined results the way I want. Please help!

If both DataFrames share the same column names, you can compare them like this: first use a lambda with Series.value_counts to take the top 3 per column and reset to a default index in both DataFrames, then join them with concat; for the expected order, add DataFrame.stack:
import numpy as np
import pandas as pd

np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c')

# top 3 counts per column, reset to a default 0..2 index so both frames align
df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))

df = pd.concat([df11, df22], axis=1, keys=('df1', 'df2')).stack().sort_index(level=1)
print(df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
Or use DataFrame.compare:
df = (df11.compare(df22, keep_equal=True)
          .rename(columns={'self': 'df1', 'other': 'df2'})
          .stack(0)
          .sort_index(level=1))
print(df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
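Since the goal is spotting anomalies, one variant worth noting (my addition, not part of the original answer): if you omit keep_equal=True, compare keeps only the positions where the top-3 counts differ, showing equal cells as NaN and dropping columns that are fully identical:
diff = (df11.compare(df22)
            .rename(columns={'self': 'df1', 'other': 'df2'})
            .stack(0)
            .sort_index(level=1))
print(diff)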
EDIT: To include the categories themselves, use an f-string in a list comprehension to join each Series entry's index and value:
np.random.seed(2022)
df1 = 'Cat1' + pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c').astype(str)
df2 = 'Cat2' + pd.DataFrame(np.random.randint(10, size=(50, 5))).add_prefix('c').astype(str)
df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df = pd.concat([df11, df22], axis=1, keys=('df1', 'df2')).stack().sort_index(level=1)
print(df)
df1 df2
0 c0 Cat18 - 7 Cat29 - 8
1 c0 Cat11 - 6 Cat24 - 8
2 c0 Cat19 - 6 Cat23 - 6
0 c1 Cat17 - 8 Cat24 - 9
1 c1 Cat10 - 7 Cat26 - 7
2 c1 Cat14 - 7 Cat20 - 7
0 c2 Cat13 - 9 Cat28 - 7
1 c2 Cat11 - 7 Cat25 - 7
2 c2 Cat19 - 7 Cat26 - 6
0 c3 Cat15 - 9 Cat20 - 7
1 c3 Cat18 - 7 Cat24 - 7
2 c3 Cat13 - 7 Cat27 - 6
0 c4 Cat12 - 11 Cat25 - 14
1 c4 Cat13 - 7 Cat20 - 8
2 c4 Cat15 - 7 Cat26 - 7

Related

Python Dataframe rename column compared to value

With Pandas I'm trying to rename the unnamed columns in a dataframe using the values in the first row of data.
My dataframe:
id  store  unnamed: 1  unnamed: 2  windows  unnamed: 3  unnamed: 4
0   B1     B2          B3          B1       B2          B3
1   2      c           12          15       15          14
2   4      d           35          14       14          87
My wanted result:
id  store_B1  store_B2  store_B3  windows_B1  windows_B2  windows_B3
0   B1        B2        B3        B1          B2          B3
1   2         c         12        15          15          14
2   4         d         35        14          14          87
I don't know how I can match the column name with the value in my data. Thanks for your help. Regards
You can use df.columns.where to make unnamed: columns NaN, then convert it to a Series and use ffill:
df.columns = (
    pd.Series(df.columns.where(~df.columns.str.startswith('unnamed:'))).ffill()
    + np.where(~df.columns.isin(['id', 'col2']),
               ('_' + df.iloc[0].astype(str)).tolist(),
               '')
)
Output:
>>> df
id store_B1 store_B2 store_B3 windows_B1 windows_B2 windows_B3
0 0 B1 B2 B3 B1 B2 B3
1 1 2 c 12 15 15 14
2 2 4 d 35 14 14 87
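For readability, the same logic can be unrolled step by step (my own decomposition, starting again from the original unnamed columns; I simplify the exclusion list to just 'id', since 'col2' does not appear in this dataframe):
import numpy as np
import pandas as pd

cols = pd.Series(df.columns)
# 1. blank out the 'unnamed:' placeholders, then forward-fill the real names
base = cols.where(~cols.str.startswith('unnamed:')).ffill()
# 2. append '_<first-row value>' to every column except 'id'
suffix = np.where(base.ne('id'), ('_' + df.iloc[0].astype(str)).to_numpy(), '')
df.columns = base + suffix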

How can I add previous column values to get a new value in Excel?

I am working on a graph and need data in the format below. I have the data in COL A, and I need to calculate the COL B values shown below.
What is the formula for obtaining this in Excel?
You can do this in pandas with cumsum and shift:
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'COL A': np.arange(11)})
# running total of all preceding COL A values
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()
Output:
COL A COL B
0 0 0
1 1 0
2 2 1
3 3 3
4 4 6
5 5 10
6 6 15
7 7 21
8 8 28
9 9 36
10 10 45
Or use a simple Excel technique: because COL A is the consecutive sequence 0, 1, 2, ..., you can use the formula =(A3*A2)/2 in COL B (the current value times the previous value, halved).
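A quick sanity check of that closed form (my addition; it only works because COL A holds consecutive integers starting at 0, so for arbitrary data the shift/cumsum version above is the general answer):
# n*(n-1)/2 mirrors the Excel formula =(A3*A2)/2
df['closed form'] = df['COL A'] * (df['COL A'] - 1) // 2
assert df['COL B'].equals(df['closed form'])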

Should I stack, pivot, or groupby?

I'm still learning how to work with dataframes and can't manage this. I have a dataframe like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.
This is wide_to_long:
ydf = pd.wide_to_long(df, 'D', i=['A', 'B', 'C'], j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
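Note that wide_to_long strips the 'D' stub, so DA comes out as 1, 2, 3 rather than the wanted D1, D2, D3. If you need the original labels back, one small fix (my addition, not part of the answer):
ydf['DA'] = 'D' + ydf['DA'].astype(str)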
Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack:
df.set_index(['A', 'B', 'C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
And you can do housekeeping by renaming the column headers, for example:
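# rename the default level_3 / 0 headers shown in the output above
out = (df.set_index(['A', 'B', 'C'])
         .stack()
         .reset_index()
         .rename(columns={'level_3': 'DA', 0: 'D'}))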

How can I unpivot or stack a pandas dataframe in the way that I asked?

I have a python pandas dataframe.
For example here is my data:
id A_1 A_2 B_1 B_2
0 j2 1 5 10 8
1 j3 2 6 11 9
2 j4 3 7 12 10
I want it to look like this:
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2
Can you help me please. Thank you so much!
Use wide_to_long with DataFrame.sort_values:
df = (pd.wide_to_long(df, ['A', 'B'], i='id', j='Other', sep='_')
        .sort_values('id')
        .reset_index())
print(df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
We can also use DataFrame.melt + Series.str.split and then perform a DataFrame.pivot_table:
df2 = df.melt('id')
df2[['columns', 'Other']] = df2['variable'].str.split('_', expand=True)
new_df = (df2.pivot_table(columns='columns', index=['id', 'Other'], values='value')
             .reset_index()
             .rename_axis(columns=None))
print(new_df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10

How to create an ID that links rows based on multiple fields

I have a requirement to create a GROUP_ID based on the information in two other fields. All rows sharing the same ID_1 value must get the same GROUP_ID, and likewise all rows sharing the same ID_2 value must get the same GROUP_ID. The GROUP_IDs need not be contiguous.
ID_1 ID_2 GROUP_ID
X1 10 1
X1 20 1
Y1 30 2
Y2 30 2
A1 100 3
A1 200 3
B1 200 3
B1 200 3
B1 300 3
B1 300 3
C1 300 3
C1 400 3
I am using PySpark and tried to solve this in Spark SQL using window functions (see below), but I am unable to produce the desired output. Please let me know if there is an efficient way to solve this; my dataset has >100M rows.
RowNum ID_1 ID_2 ID_1_1 ID_2_1 GROUP_ID
1 X1 10 1 1 1
2 X1 20 1 1 1
3 Y1 30 3 3 3
4 Y2 30 4 3 3
5 A1 100 5 5 5
6 A1 200 5 5 5
7 B1 200 7 5 5
8 B1 200 7 5 5
9 B1 300 7 7 5
10 B1 300 7 7 5
11 C1 300 11 7 7
12 C1 400 11 11 7
Where
ID_1_1 = First(ROWNUM) over (Partition by ID_1 order by RowNum)
ID_2_1 = First(ID_1_1) over (Partition by ID_2 order by ID_1_1)
Group_ID = First(ID_2_1) over (Partition by ID_1_1 order by ID_2_1)
Using the above approach, rows 11 and 12 get a GROUP_ID of 7 instead of 5.
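No answer is recorded here, but note that the requirement is effectively connected components over a bipartite graph whose nodes are the ID_1 and ID_2 values; a fixed number of window-function passes cannot propagate group membership transitively, which is why rows 11 and 12 come out wrong. Below is a small-scale sketch only (my addition, using networkx; for >100M rows in PySpark, the analogous tool would be GraphFrames' connectedComponents):
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'ID_1': ['X1', 'X1', 'Y1', 'Y2', 'A1', 'A1', 'B1', 'B1', 'B1', 'B1', 'C1', 'C1'],
    'ID_2': [10, 20, 30, 30, 100, 200, 200, 200, 300, 300, 300, 400],
})

# Edge between each ID_1 and ID_2 value; prefix ID_2 so the two namespaces cannot collide
G = nx.Graph()
G.add_edges_from(zip(df['ID_1'], 'id2_' + df['ID_2'].astype(str)))

# Every connected component becomes one group (labels are arbitrary, not contiguous)
mapping = {node: gid
           for gid, comp in enumerate(nx.connected_components(G), start=1)
           for node in comp}
df['GROUP_ID'] = df['ID_1'].map(mapping)
print(df)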
