Average of a column by unique pair of other columns - pandas-groupby

I have a pandas dataframe as below :
df = pd.DataFrame({'start': {0: 365, 1: 365, 2: 365, 3: 365, 4: 356, 5: 261, 6: 240, 7: 238},
'end': {0: 240, 1: 261, 2: 356, 3: 238, 4: 365, 5: 365, 6: 365, 7: 365},
'value': {0: 585, 1: 567, 2: 191, 3: 186, 4: 196, 5: 545, 6: 564, 7: 184}})
Here's what the dataframe looks like,
start end value
1 365 240 585
2 365 261 567
3 365 356 191
4 365 238 186
5 356 365 196
6 261 365 545
7 240 365 564
8 238 365 184
There are four unique start-end pairs (ignoring order), and I want a dataframe with the average of value for each unique pair. The output dataframe would look like this:
start end value
1 365 240 574.5
2 365 261 556
3 365 356 193.5
4 365 238 185
I know I can get the number of occurrences of each unique pair using groupby and size, but I cannot think of a way to average the value column for each unique pair. Can the Grouper function from pandas be used for this problem?

IIUC, you want to sort start and end within each row and then groupby-average on these two columns (this assumes numpy is imported as np):
df[["start", "end"]] = -np.sort(-df.iloc[:, :2], axis=1)
df.groupby(["start", "end"]).value.mean().reset_index()
# out:
start end value
0 365 238 185.0
1 365 240 574.5
2 365 261 556.0
3 365 356 193.5
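The same idea can be sketched without overwriting the original start and end columns. This is a minimal sketch, not the answerer's code: the names pairs, keys and result are made up for illustration, and it assumes the pair order does not matter, as the question implies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": [365, 365, 365, 365, 356, 261, 240, 238],
    "end":   [240, 261, 356, 238, 365, 365, 365, 365],
    "value": [585, 567, 191, 186, 196, 545, 564, 184],
})

# Order each (start, end) pair so the larger value comes first;
# this treats (365, 240) and (240, 365) as the same pair.
pairs = df[["start", "end"]].to_numpy()
keys = pd.DataFrame({
    "start": pairs.max(axis=1),
    "end": pairs.min(axis=1),
})

# Group on the normalized keys; the original df is left unchanged.
result = (
    df["value"]
    .groupby([keys["start"], keys["end"]])
    .mean()
    .reset_index()
)
print(result)
```

Grouping a Series by external key Series keeps df intact, which may matter if the original direction of each pair is needed later.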


In Excel, how can I sort by the first value when there are multiple values in the cell?

I have an automatically generated spreadsheet in Excel. The values of one column are:
1, 184
10, 18, 90
102, 207
11, 13
2
20, 50
204
3, 120
(all comma separated values in a single column, as below)
What I need to do is sort by the first number, so that the above would be:
1, 184
2
3, 120
10, 18, 90
11, 13
20, 50
102, 207
204
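How can I do this in Excel?
In Excel 365 (current channel), sort by the number before the first comma, falling back to the whole cell when it contains no comma: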
=SORTBY(A1:A8, NUMBERVALUE(TEXTBEFORE(A1:A8,",",,,,A1:A8)))

adding reversed columns to dataframe [duplicate]

Trying to add a reversed column to a data frame, but it just adds in normal order. For me, it looks like it is just following the index of the dataframe. Is it possible to reorder the index?
df_reversed = df['Buy'].iloc[::-1]
Data["newColumn"] = df_reversed
[Images: the current output; df_reversed; the desired output]
A slight modification of @Chicodelarose's answer: you can reverse just the values and get the result you want as follows:
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
df["calories_reversed"] = df["calories"].values[::-1]
print(df)
Output will be:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
You need to call reset_index before assigning the values to the new column so that they are added to the data frame in reverse order:
Example:
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
df["calories_reversed"] = df["calories"][::-1].reset_index(drop=True)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
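Both answers work for the same reason: pandas aligns a Series on its index labels when it is assigned to a column, so a reversed Series that keeps its original labels is put back in the original order. A small sketch of that behaviour, reusing the example data above (the column names aligned and by_position are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

reversed_series = df["calories"].iloc[::-1]
# The reversed Series still carries the original labels, in reverse order.
print(reversed_series.index.tolist())

# Assignment aligns on index labels, so the reversal is undone.
df["aligned"] = reversed_series
# Converting to a NumPy array discards the labels, so values are assigned by position.
df["by_position"] = reversed_series.to_numpy()
print(df)
```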

Python Pandas DF Pivot and Groupby

I need to iterate through my dataframe rows and pivot the single column bounding_box_y into 8 columns each time the value in text_y column changes.
[Images: original data frame; desired data frame]
Can anyone help with some code that does NOT hardcode values into the code? The entire dataframe is over 6000 rows. I need to pivot the one column into 8 each time the value in another column changes.
Thanks!
Please try to include your data as runnable code, so others can easily copy/paste and experiment. In your case you can get it with df.head(16).to_dict('list'). I used the following:
df = pd.DataFrame({
'boundingBox_y': [183, 120, 305, 120, 305, 161, 182, 161, 318, 120, 381, 120, 382, 162, 318, 161],
'text_y': (['FORM'] * 8) + (['ABC'] * 8),
'confidence': ([0.987] * 8) + ([0.976] * 8)
})
Then you can pivot your dataframe but you need to add a new column to hold the pivoted column names.
# rename the current values column
df.rename({'boundingBox_y': 'value'}, axis=1, inplace=True)
# create a column that numbers the rows within each group; it holds the new column headers and can be pivoted
df['boundingBox_y'] = df.groupby(['confidence', 'text_y']).cumcount()
# pivot your df
df = df.pivot(index=['confidence', 'text_y'],
columns='boundingBox_y', values='value')
Output
boundingBox_y 0 1 2 3 4 5 6 7
confidence text_y
0.976 ABC 318 120 381 120 382 162 318 161
0.987 FORM 183 120 305 120 305 161 182 161
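The approach above groups on the values of confidence and text_y, so it relies on each combination appearing only once. If the same text_y value could come back later in the 6000 rows, one possible variation is to number the runs of consecutive equal values and pivot on that instead. The following is only a sketch under that assumption; block, pos and wide are hypothetical names, and passing a list as index to pivot needs pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "boundingBox_y": [183, 120, 305, 120, 305, 161, 182, 161,
                      318, 120, 381, 120, 382, 162, 318, 161],
    "text_y": ["FORM"] * 8 + ["ABC"] * 8,
})

# Hypothetical helper: a counter that increases whenever text_y changes.
block = (df["text_y"] != df["text_y"].shift()).cumsum().rename("block")
# Position of each row inside its block; these become the new column names.
pos = df.groupby(block).cumcount().rename("pos")

wide = (
    df.assign(block=block, pos=pos)
      .pivot(index=["block", "text_y"], columns="pos", values="boundingBox_y")
      .reset_index()
)
print(wide)
```

Because block follows row order, the pivoted rows stay in the order in which the text values first appear, rather than being sorted by value.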

Groupby without an aggregation function and sort that data

I have customer ID and date of purchase. I need to sort the date of purchase for each customer ID separately.
I need a groupby operation but without an aggregation, and sort date of purchase for each customer.
I tried it this way:
new_data = data.groupby('custID').sort_values('purchase_date')
AttributeError: Cannot access callable attribute 'sort_values' of
'DataFrameGroupBy' objects, try using the 'apply' method
Expected result is like:
custID purchase_date
100 23/01/2019
100 29/01/2019
100 03/04/2019
120 02/05/2018
120 09/03/2019
120 11/05/2019
# import the pandas library
import pandas as pd
data = {
'purchase_date': ['23/01/2019', '19/01/2019', '12/01/2019', '23/01/2019', '11/01/2019', '23/01/2019', '06/05/2019', '05/05/2019', '05/01/2019', '02/07/2019',],
'custID': [100, 160, 100, 110, 160, 110, 110, 110, 110, 160]
}
df = pd.DataFrame(data)
sortedData = df.groupby('custID').apply(
lambda x: x.sort_values(by = 'purchase_date', ascending = True))
sortedData=sortedData.reset_index(drop=True, inplace=False)
OUTPUT:
print(sortedData)
Index custID purchase_date
0 100 12/01/2019
1 100 23/01/2019
2 110 05/01/2019
3 110 05/05/2019
4 110 06/05/2019
5 110 23/01/2019
6 110 23/01/2019
7 160 02/07/2019
8 160 11/01/2019
9 160 19/01/2019
print(sortedData.to_string(index=False))
custID purchase_date
100 12/01/2019
100 23/01/2019
110 05/01/2019
110 05/05/2019
110 06/05/2019
110 23/01/2019
110 23/01/2019
160 02/07/2019
160 11/01/2019
160 19/01/2019
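Note that in the output above the dates are compared as strings, so for customer 110 the value 23/01/2019 lands after 06/05/2019. If chronological order is what is wanted, one option is to parse the column first and sort on both columns at once. This is a sketch that assumes the dates are in day/month/year format; sorted_df is a made-up name.

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": ["23/01/2019", "19/01/2019", "12/01/2019", "23/01/2019",
                      "11/01/2019", "23/01/2019", "06/05/2019", "05/05/2019",
                      "05/01/2019", "02/07/2019"],
    "custID": [100, 160, 100, 110, 160, 110, 110, 110, 110, 160],
})

# Parse the strings as day/month/year so they compare chronologically.
df["purchase_date"] = pd.to_datetime(df["purchase_date"], format="%d/%m/%Y")

# Sorting by both columns orders the dates within each customer; no groupby needed.
sorted_df = df.sort_values(["custID", "purchase_date"]).reset_index(drop=True)

# Optional: format back to day/month/year strings for display.
sorted_df["purchase_date"] = sorted_df["purchase_date"].dt.strftime("%d/%m/%Y")
print(sorted_df.to_string(index=False))
```

Sorting on ["custID", "purchase_date"] gives the same per-customer ordering as a groupby followed by a per-group sort, without the apply call.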

pandas remove rows with multiple criteria

Consider the following pandas Data Frame:
df = pd.DataFrame({
'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
'cid': [1, 1, 2, 2, 1, 1, 2, 2],
'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})
When printed looks like this:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 1 101 272.5 535.9 880.9
5 1051 1 102 750.0 838.8 340.3
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
I need to remove rows where 'case_id' is in one list and 'cid' is in another list. For simplicity, let's use lists with a single value each: cases = [1051] and ids = [1]. In this scenario I want the new Data Frame to have 6 rows of data. It should look like this, because the two rows matching my criteria should be removed:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 2 101 333.4 526.7 890.7
5 1051 2 102 104.2 119.4 422.1
I've tried a few different things like:
df2 = df[(df.case_id != subcase) & (df.cid != commit_id)]
But this returns the inverse of what I was expecting:
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
I've also tried using .query(): df.query('(case_id != 1051) & (cid != 1)')
but got the same (2) rows of results.
Any help and/or explanations would be greatly appreciated.
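Your condition keeps only the rows where both case_id and cid differ from the given values, which is not the same as removing the rows where both match. Instead, select the rows that meet the criteria and drop them using .drop().
Use the following: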
df.drop(df.loc[(df['case_id'].isin(cases)) & (df['cid'].isin(ids))].index)
Output:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
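The same result can also be obtained with a boolean mask instead of .drop(). The sketch below uses a reduced version of the frame (the fx, fy and fz columns are left out for brevity) and the made-up names to_remove, df2 and df3; it also shows why the original attempt fails, since negating "A and B" gives "not A or not B", not "not A and not B".

```python
import pandas as pd

df = pd.DataFrame({
    "case_id": [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    "elm_id": [101, 102, 101, 102, 101, 102, 101, 102],
    "cid": [1, 1, 2, 2, 1, 1, 2, 2],
})
cases = [1051]
ids = [1]

# Rows to remove: case_id in cases AND cid in ids.
to_remove = df["case_id"].isin(cases) & df["cid"].isin(ids)

# Keep everything else by negating the whole condition.
df2 = df[~to_remove]

# De Morgan: not (A and B) is (not A) or (not B), so this is equivalent.
df3 = df[~df["case_id"].isin(cases) | ~df["cid"].isin(ids)]
print(df2)
```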
