Can I apply groupby on a Pandas dataframe and calculate the mean across all the columns? - python-3.x

Suppose I have a dataframe that looks like this:
Category Col1 Col2 Col3 Col4 Col5
Footwear 35 55 67 87 94
Apparels 56 65 54 84 77
Footwear 87 85 56 95 35
Handbags 83 62 724 51 62
Handbags 61 512 21 58 78
Apparels 50 62 172 77 5
Now, I want to find the mean and standard deviation for the unique categories, but not for each column separately; rather, one mean and one std for each category. So I want an output like this:
Category mean stdev
Footwear xxx aaa
Apparels yyy bbb
Handbags zzz ccc
I cannot just calculate the mean and std across the columns first (using mean with axis=1) and then groupby the categories; that would yield incorrect results, since the std of per-row values is not the std of all the values in a category.
So my dilemma is that I want to perform a groupby while aggregating across rows and columns at the same time.
I have a feeling that a user-defined function could do it, applied through a lambda aggregation along with groupby, but I couldn't get it to work. Am I even on the right track? Thanks!

If I understand you correctly, let's try using melt and groupby with agg:
df1 = pd.melt(df, id_vars='Category').groupby('Category').agg(mean=('value', 'mean'),
                                                               std=('value', 'std'))
print(df1)
mean std
Category
Apparels 70.2 41.983595
Footwear 69.6 23.291391
Handbags 171.2 241.295946
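For completeness, an equivalent way to get the same result (a minimal sketch that re-creates the sample frame from the question) is to stack the value columns instead of melting, so every value ends up in one long series keyed by Category:
import pandas as pd

df = pd.DataFrame({
    'Category': ['Footwear', 'Apparels', 'Footwear', 'Handbags', 'Handbags', 'Apparels'],
    'Col1': [35, 56, 87, 83, 61, 50],
    'Col2': [55, 65, 85, 62, 512, 62],
    'Col3': [67, 54, 56, 724, 21, 172],
    'Col4': [87, 84, 95, 51, 58, 77],
    'Col5': [94, 77, 35, 62, 78, 5],
})

# One row per (Category, column) value, then one mean/std per category.
out = (df.set_index('Category')
         .stack()
         .groupby(level='Category')
         .agg(['mean', 'std']))
print(out)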

Related

How to check if a value in a column is found in a list in a column, with Spark SQL?

I have a delta table A as shown below.
point  cluster  points_in_cluster
37     1        [37,32]
45     2        [45,67,84]
67     2        [45,67,84]
84     2        [45,67,84]
32     1        [37,32]
Also I have a table B as shown below.
id   point
101  37
102  67
103  84
I want a query like the following. Here, IN obviously doesn't work for a list, so what would be the right syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result I should have a table like the following
id   point
101  37
101  32
102  45
102  67
102  84
103  45
103  67
103  84
Based on your data sample, I'd do an equi-join on the point column and then an explode on points_in_cluster:
from pyspark.sql import functions as F

# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
Otherwise, use array_contains:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
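If you prefer to stay in the DataFrame API, the same array_contains predicate can be used as a join condition. This is only a sketch under the naming assumed above (df_A, df_B); the aliases a and b are introduced here for clarity:
from pyspark.sql import functions as F

a = df_A.alias("a")
b = df_B.alias("b")

# Inner join on array membership, then keep the id and the matched cluster points.
result = b.join(
    a,
    on=F.expr("array_contains(a.points_in_cluster, b.point)"),
).select("b.id", "a.point")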

Find duplicated rows based on a condition on selected columns in a pandas DataFrame

I have a large database converted into a dataframe, and it is difficult to identify the following manually.
The dataframe has columns named from_bus and to_bus, which identify an element regardless of order; for example, for element 0, L_ABAN_MACA_0_1, the pair (109, 140) is the same as (140, 109).
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1  109       140     0.444450
1  L_AGOY_BAÑO_1_1  69        66      0.476683
2  L_AGOY_BAÑO_1_2  69        66      0.476683
3  L_ALAN_INGA_1_1  189       188     0.452790
4  L_ALAN_INGA_1_2  188       189     0.500450
So I want to identify the duplicate pairs and replace them with a single row, whose x_ohm_per_km value is the sum of the duplicated values, as follows:
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1  109       140     0.444450
1  L_AGOY_BAÑO_1_1  69        66      0.953366
3  L_ALAN_INGA_1_1  189       188     0.953240
Let us try groupby on from_bus and to_bus after sorting the values in these columns along axis=1, then agg to aggregate the result, and optionally reindex to restore the original column order:
import numpy as np

c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)
df.groupby(c, sort=False, as_index=False)\
  .agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
  .reindex(df.columns, axis=1)
Alternative approach:
d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
name from_bus to_bus x_ohm_per_km
0 L_ABAN_MACA_0_1 109 140 0.444450
1 L_AGOY_BAÑO_1_1 66 69 0.953366
2 L_ALAN_INGA_1_1 188 189 0.953240
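Another way to make the pair order-independent, as a hedged sketch rather than the answer's own method, is to group by a frozenset of the two bus columns (df is assumed to be the question's frame):
key = df[['from_bus', 'to_bus']].apply(frozenset, axis=1)

# 'first' keeps the original orientation of each pair; the sums match the expected output.
out = (df.groupby(key, sort=False)
         .agg({'name': 'first', 'from_bus': 'first',
               'to_bus': 'first', 'x_ohm_per_km': 'sum'})
         .reset_index(drop=True))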

How to filter a data frame from each first non-NaN value until the next and sum values from the corresponding column?

I am struggling with the following data frame:
Activity Duration (mins)
BREAK/REST 120
AVAILABILITY 57
WORK 13
DRIVING 10
WORK 31
DRIVING 100
DRIVING 81
DRIVING 106
BREAK/REST 89
BREAK/REST 4
I am trying to find the total duration for consecutive identical activities. The following is the output I am trying to achieve:
Activity Duration (mins)
BREAK/REST 120
AVAILABILITY 57
WORK 13
DRIVING 10
WORK 31
DRIVING 287
BREAK/REST 93
I am doing something like this:
import pandas as pd
df = pd.read_excel('reformed_data.xlsx')
df['Activity'].mask((df['Activity'].shift()==df['Activity']), inplace=True)
I am stuck at this point and don't know how to proceed. Please help! :(
IIUC, we need shift + cumsum to create the group key:
s = df.groupby(df.Activity.ne(df.Activity.shift()).cumsum())\
      .agg({'Activity': 'first', 'Duration(mins)': 'sum'})
s
Out[185]:
Activity Duration(mins)
Activity
1 BREAK/REST 120
2 AVAILABILITY 57
3 WORK 13
4 DRIVING 10
5 WORK 31
6 DRIVING 287
7 BREAK/REST 93
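If you want the result with a plain 0-based index instead of the group key, a reset_index(drop=True) at the end should do it. This is just a sketch reusing the column spelling from the answer's frame ('Duration(mins)'); in the original file it may be 'Duration (mins)':
s = (df.groupby(df.Activity.ne(df.Activity.shift()).cumsum())
       .agg({'Activity': 'first', 'Duration(mins)': 'sum'})
       .reset_index(drop=True))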

Find the minimum value of specific columns in a row in MS Excel

My table in Excel looks something like this:
abcd 67 94 52 89 24
efgh 23 45 93 54 34
ijkl 64 83 23 45 92
mnop 34 45 10 66 53
This is a student database containing marks obtained in various subjects. I need to calculate the percentage in each row such that, out of the 5 subjects, the first subject is always included along with the 3 other subjects that have the highest marks.
Example: abcd 67 94 52 89 24 75.5%
Here 75.5% = (67+94+52+89)/4 = 302/4 = 75.5, where 24, being the lowest, has been excluded, and 67 has to be taken even if it were the lowest.
What I require is the lowest of all the columns in that particular row (excluding the first column, of course), so that I can sum all the marks, subtract this lowest mark, and finally use it to calculate the percentage.
Any help/suggestion would be appreciated. Thank You.
You'll need to adjust this for your columns, but if you sum the entire range, subtract the minimum of the range that excludes the first subject, and divide by the count of the range minus one, you will get the average.
This formula uses the five values in columns B through F (67, 94, 52, 89, 24) and results in 75.5:
=(SUM(B3:F3)-MIN(C3:F3))/(COUNT(B3:F3)-1)

MS Excel: how can I make Max() more efficient?

I have a set of data that looks like this:
ID Value MaxByID
0 32 80
0 80 80
0 4 80
0 68 80
0 6 80
1 32 68
1 54 68
1 56 68
1 68 68
1 44 68
2 54 92
2 52 92
2 92 92
4 68 68
4 52 68
5 74 74
5 22 74
6 52 94
6 52 94
6 46 94
6 94 94
6 56 94
6 14 94
I am using {=MAX(IF(A$2:A$100=A2,B$2:B$100))} to calculate the MaxByID column. However, the dataset has >100k rows, with mostly unique IDs: this seems to be a really inefficient way to do this, as each cell in C:C has to iterate through every cell in A:A.
The ID field is numeric and can be sorted- is there a way of more intelligently finding the MaxByID?
You may be able to use a pivot table to find the maximum for each unique ID: see this link for an example.
Once you have that table, VLOOKUP should enable you to quickly find MaxByID for each ID.
Once you have sorted by ID, you could add columns to get the start row number and count for each unique ID. These two numbers give the size and position of each ID's range, so you can then use MAX(OFFSET(StartValueCell, StartThisUnique-1, 0, CountThisUnique, 1)) to get the max.
This might be faster:
{=IF(A2=A1,C1,MAX(($A$2:$A$24=A2)*($B$2:$B$24)))}
Since your data appears to be sorted, you could see if the ID matches the row above and simply copy the max down.
