Sort value in a column based on condition of another column in a dataframe - python-3.x

I have a dataframe that looks like this
Company Company Code Product Code Rating
Monster MNTR MNTR/Headphone1 3.2
Monster MNTR MNTR/Headphone2 3.9
Monster MNTR MNTR/Headphone3 NaN
Monster MNTR MNTR/Earbuds1 3.5
Bose BOSE BOSE/Headphone1 4.0
Bose BOSE BOSE/Earbuds1 NaN
Bose BOSE BOSE/Earbuds2 2.8
Apple APLE APLE/Headphone1 4.5
Sony SONY SONY/Headphone1 3.5
Sony SONY SONY/Headphone2 4.8
Sony SONY SONY/Earbuds1 3.0
Beats BEAT BEAT/Headphone1 3.5
Beats BEAT BEAT/Headphone2 3.7
If a company has any product with a Rating >= 4.0, I want to bring all of that company's products to the top, with companies ordered by their best Rating, while keeping each company's products together in their original Product Code order. Like Sony, Apple and Bose.
If none of a company's products is rated 4.0 or above, I want those companies grouped at the bottom, sorted alphabetically by Company Code. Like Beats and Monster.
Company Company Code Product Code Rating
Sony SONY SONY/Headphone1 3.5
Sony SONY SONY/Headphone2 4.8
Sony SONY SONY/Earbuds1 3.0
Apple APLE APLE/Headphone1 4.5
Bose BOSE BOSE/Headphone1 4.0
Bose BOSE BOSE/Earbuds1 NaN
Bose BOSE BOSE/Earbuds2 2.8
Beats BEAT BEAT/Headphone1 3.5
Beats BEAT BEAT/Headphone2 3.7
Monster MNTR MNTR/Headphone1 3.2
Monster MNTR MNTR/Headphone2 3.9
Monster MNTR MNTR/Headphone3 NaN
Monster MNTR MNTR/Earbuds1 3.5
I thought about dividing the dataframe into two parts, upper and lower, then using concat to join them back. For example,
condition = df['Rating'] >= 4.0
df_upper = df.loc[condition]
df_lower = df.loc[~condition]
.
.
.
df_merge = pd.concat([df_upper, df_lower], ignore_index=True)
But I have no idea where to apply groupby and sort. Thank you for helping out.

For sorting, use an ordered categorical built with pd.Categorical: derive the category order from the Company Code of the filtered rows, then sort with DataFrame.sort_values:
condition = df['Rating'] >= 4.0
cats1 = df.loc[condition].sort_values('Rating', ascending=False)['Company Code'].unique()
cats2 = df.loc[~condition, 'Company Code'].sort_values().unique()
cats = pd.Index(cats1).union(pd.Index(cats2), sort=False)
print (cats)
Index(['SONY', 'APLE', 'BOSE', 'BEAT', 'MNTR'], dtype='object')
df['Company Code'] = pd.Categorical(df['Company Code'], ordered=True, categories=cats)
df = df.sort_values('Company Code')
print (df)
Company Company Code Product Code Rating
8 Sony SONY SONY/Headphone1 3.5
9 Sony SONY SONY/Headphone2 4.8
10 Sony SONY SONY/Earbuds1 3.0
7 Apple APLE APLE/Headphone1 4.5
4 Bose BOSE BOSE/Headphone1 4.0
5 Bose BOSE BOSE/Earbuds1 NaN
6 Bose BOSE BOSE/Earbuds2 2.8
11 Beats BEAT BEAT/Headphone1 3.5
12 Beats BEAT BEAT/Headphone2 3.7
0 Monster MNTR MNTR/Headphone1 3.2
1 Monster MNTR MNTR/Headphone2 3.9
2 Monster MNTR MNTR/Headphone3 NaN
3 Monster MNTR MNTR/Earbuds1 3.5
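Put together as a self-contained sketch with the sample data recreated inline (kind='stable' is added here so each company's products are guaranteed to keep their original order; the answer above relies on the default sort):

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['Monster'] * 4 + ['Bose'] * 3 + ['Apple'] + ['Sony'] * 3 + ['Beats'] * 2,
    'Company Code': ['MNTR'] * 4 + ['BOSE'] * 3 + ['APLE'] + ['SONY'] * 3 + ['BEAT'] * 2,
    'Product Code': ['MNTR/Headphone1', 'MNTR/Headphone2', 'MNTR/Headphone3',
                     'MNTR/Earbuds1', 'BOSE/Headphone1', 'BOSE/Earbuds1',
                     'BOSE/Earbuds2', 'APLE/Headphone1', 'SONY/Headphone1',
                     'SONY/Headphone2', 'SONY/Earbuds1', 'BEAT/Headphone1',
                     'BEAT/Headphone2'],
    'Rating': [3.2, 3.9, None, 3.5, 4.0, None, 2.8, 4.5, 3.5, 4.8, 3.0, 3.5, 3.7],
})

condition = df['Rating'] >= 4.0
# companies with a product rated >= 4.0, ordered by that best rating
cats1 = df.loc[condition].sort_values('Rating', ascending=False)['Company Code'].unique()
# the remaining companies, alphabetically
cats2 = df.loc[~condition, 'Company Code'].sort_values().unique()
cats = pd.Index(cats1).union(pd.Index(cats2), sort=False)

df['Company Code'] = pd.Categorical(df['Company Code'], ordered=True, categories=cats)
out = df.sort_values('Company Code', kind='stable')
```

Note that if no rating reaches 4.0 at all, cats1 is empty and the union degrades to the plain alphabetical order of cats2, which matches the fallback the question asks for.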

Related

how to get quartiles and classify a value according to this quartile range

I have this df:
d = pd.DataFrame({'Name':['Andres','Lars','Paul','Mike'],
'target':['A','A','B','C'],
'number':[10,12.3,11,6]})
And I want classify each number in a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
.quantile([0.25,0.5,0.75,1]).unstack()
.reset_index()
.rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
)
But as you can see, the four quartiles are all equal, because the code above calculates them per row, and with only one number per row all quartiles coincide.
If I run instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output(showing only first 2 rows)
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
You can see all quartiles hold the values computed over all values in the number column. Besides that, we now have a column named Rank that classifies each number according to its quartile, e.g. in the first row, 10 is within the 1st quartile.
Here's one way that builds on the quantiles you've created, by turning them into a DataFrame and joining it to d. The "Rank" column is assigned with the rank method:
out = (d.join(d['number'].quantile([0.25, 0.5, 0.75, 1])
               .set_axis([f'{i}Q' for i in range(1, 5)], axis=0)
               .to_frame().T
               .pipe(lambda x: x.loc[x.index.repeat(len(d))])
               .reset_index(drop=True))
         .assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0
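If you instead want Rank to be the quartile bucket (1 to 4) rather than a dense rank (an interpretation of the question, not part of the answer above), np.searchsorted against the quantile values is one option:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})

q = d['number'].quantile([0.25, 0.5, 0.75, 1])
# side='left' puts a value equal to a quantile boundary into the lower bucket
d['Rank'] = np.searchsorted(q.values, d['number'].values, side='left') + 1
```

Here 6 falls below Q1 (9.0) and gets bucket 1, while 10 lies between Q1 and Q2 and gets bucket 2.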

How to plot grouped boxplot by gnuplot

I wonder how to use gnuplot to plot this figure:
There are two problems I have:
the ytics are powers of ten (…, 10^1, 10^2, 10^3, …). How do I handle such a case?
I know gnuplot supports boxplot, but how do I regroup the boxplots according to some label?
Since I don't have the original data for the figure, I make up some data by myself.
There are three companies, A, B, and C, selling different fruits at four prices each.
Apple prices of company A: 1.2 1.3 1.4 1.1
Banana prices of company A: 2.2 2.1 2.4 2.5
Orange prices of company A: 3.1 3.3 3.4 3.5
Apple prices of company B: 1.2 1.3 1.4 1.1
Banana prices of company B: 2.2 2.1 2.4 2.5
Orange prices of company B: 3.1 3.3 3.4 3.5
Apple prices of company C: 2.2 1.3 1.4 2.1
Banana prices of company C: 3.2 3.1 3.4 2.5
Orange prices of company C: 2.1 3.3 1.4 2.5
I wonder how to plot those numbers by gnuplot.
Your question is not very detailed and your own coding attempt is missing, hence, there is a lot of uncertainty. I guess there is no simple single command to get your grouped boxplots.
There are for sure several ways to realize your graph, e.g. with multiplot.
The assumption for the example below is that all files have the data organized in columns, with an equal number of columns and the same fruits in the same order. Otherwise the code must be adapted. It all depends on the degree of "automation" you would like to have. Vertical separation lines can be drawn via headless arrows (check help arrow).
So, see the following example as a starting point.
Data:
'Company A.dat'
Apples Bananas Oranges
1.2 2.2 3.1
1.3 2.1 3.3
1.4 2.4 3.4
1.1 2.5 3.5
'Company B.dat'
Apples Bananas Oranges
1.2 2.2 3.1
1.3 2.1 3.3
1.4 2.4 3.4
1.1 2.5 3.5
'Company C.dat'
Apples Bananas Oranges
2.2 3.2 2.1
1.3 3.1 3.3
1.4 3.4 1.4
2.1 2.5 2.5
Code:
### grouped boxplots
reset session
FILES = 'A B C'
File(n) = sprintf("Company %s.dat",word(FILES,n))
myXtic(n) = sprintf("Company %s",word(FILES,n))
set xlabel "Fruit prices"
set ylabel "Price"
set yrange [0:5]
set grid y
set key noautotitle
set style fill solid 0.3
N = words(FILES) # number of files
COLS = 3 # number of columns in file
PosX = 0 # x-position of boxplot
plot for [n=1:N] for [COL=1:COLS] PosX=PosX+1 File(n) u (PosX):COL w boxplot lc COL, \
for [COL=1:COLS] File(1) u (NaN):COL w boxes lc COL ti columnhead, \
for [n=1:N] File(1) u ((n-1)*COLS+COLS/2+1):(NaN):xtic(myXtic(n))
### end of code
Result: (figure omitted: grouped boxplots, one group of three fruit boxes per company)

Looping through a pandas DataFrame accessing previous elements

I have a DataFrame with two columns, FirstColumn & SecondColumn.
How do I create a new column, containing the correlation coeff. row by row for the two columns 5 periods back?
For example, the 5th row would be the R2 value of the two columns 5 periods back, the 6th row would be the corr coeff. value of the columns ranging from row 1-6 etc etc.
Additionally, what method is the most efficient when looping through a DataFrame, having to access previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0
You can do:
df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
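A simpler route to the same column (a sketch, not from the answer above) is the pairwise Series-level rolling correlation, which avoids slicing the MultiIndex result:

```python
import pandas as pd

df = pd.DataFrame({
    'FirstColumn':  [2, 3, 4, 5, 6, 7, 2, 3, 5, 3, 2, 4, 2, 4, 2, 5, 3],
    'SecondColumn': [1.0, 3.0, 4.0, 5.0, 2.0, 6.0, 2.0, 3.0, 9.0,
                     2.0, 3.0, 2.0, 2.0, 2.0, 4.0, 3.0, 1.0],
})

# Series.rolling(...).corr(other) computes the pairwise rolling correlation directly
df['corr'] = df['FirstColumn'].rolling(5, min_periods=1).corr(df['SecondColumn'])
```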
You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:
for i in range(5):
    df['FirstCol_lag'+str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g.
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach is to avoid an explicit loop and do it this way, but if the data is very large, materializing the lag columns may not be feasible. You could also try iterrows() and benchmark which is faster if you really care; for that you'd have to offset the row index manually, and it takes more code.
Still, you'll have to be careful about handling NaNs, because the shift leaves nulls in the first n rows of your dataframe.

Pandas pivot with multiple items per column, how to avoid aggregating them?

Follow up to this question, in particular this comment.
Consider following dataframe:
df = pd.DataFrame({
'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0],
})
Which looks like this:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
Using a pivot_table() is a nice way to reshape this data that will allow querying it by Person and see all of their belongings in a single row, making it really easy to answer queries such as "How to find the Value of Persons Car, if they have a House valued more than 400.0?"
A pivot_table() can be easily built for this data set with:
df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
)
Which will look like:
Belonging Bike Car House
Person
Adam NaN 10.0 300.0
Cesar NaN 12.0 NaN
Diana 2.0 15.0 450.0
Erika NaN 11.0 600.0
But this gets limited when a Person has more than one of the same type of Belonging, for example two Cars, two Houses or two Bikes.
Consider the updated data:
df = pd.DataFrame({
'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika', 'Diana', 'Adam'],
'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car', 'Car', 'House'],
'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})
Which looks like:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
8 Diana Car 21.0
9 Adam House 180.0
Now that same pivot_table() will return the average of Diana's two cars, or Adam's two houses:
Belonging Bike Car House
Person
Adam NaN 10.0 240.0
Cesar NaN 12.0 NaN
Diana 2.0 18.0 450.0
Erika NaN 11.0 600.0
So we can pass pivot_table() an aggfunc='sum' or aggfunc=np.sum to get the sum rather than the average, which will give us 480.0 and 36.0 and is probably a better representation of the total value a Person owns in Belongings of a certain type. But we're missing details.
We can use aggfunc=list which will preserve them:
df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
aggfunc=list,
)
Belonging Bike Car House
Person
Adam NaN [10.0] [300.0, 180.0]
Cesar NaN [12.0] NaN
Diana [2.0] [15.0, 21.0] [450.0]
Erika NaN [11.0] [600.0]
This keeps the detail about multiple Belongings per Person, but on the other hand is quite inconvenient in that it is using Python lists rather than native Pandas types and columns, so it makes some queries such as the total Values in Houses difficult to answer.
Using aggfunc=np.sum, we could simply use df_pivot['House'].sum() to get the total of 1530.0. But with aggfunc=list, even questions such as the one above (the Cars of Persons with a House worth more than 400.0) become harder to answer.
What's a better way to reshape this data that will:
Allow easy querying a Person's Belongings in a single row, like the pivot_table() does;
Preserve the details of Persons who have multiple Belongings of a certain type;
Use native Pandas columns and data types that make it possible to use Pandas methods for querying and summarizing the data.
I thought of updating the Belonging descriptions to include a counter, such as "House 1", "Car 2", etc. Perhaps sorting so that the most valuable one comes first (to help answer questions such as "has a house worth more than 400.0" looking at "House 1" only.)
Or perhaps using a pd.MultiIndex to still be able to access all "House" columns together.
But unsure how to actually reshape the data in such a way.
Or are there better suggestions on how to reshape it (other than adding a count per belonging) that would preserve the features described above? How would you reshape it and how would you answer all these queries I mentioned above?
Perhaps something like this: given your pivot table in the following dataframe:
pv = df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
aggfunc=list,
)
then apply pd.Series to all columns.
For proper naming of the columns, compute the maximum list length in each column, then use set_axis to rename:
new_pv = pd.DataFrame(index=pv.index)
for col in pv:
    n = int(pv[col].str.len().max())
    new_pv = pd.concat([new_pv,
                        pv[col].apply(pd.Series)
                               .set_axis([f'{col}_{i}' for i in range(n)], axis=1)],
                       axis=1)
# Bike_0 Car_0 Car_1 House_0 House_1
# Person
# Adam NaN 10.0 NaN 300.0 180.0
# Cesar NaN 12.0 NaN NaN NaN
# Diana 2.0 15.0 21.0 450.0 NaN
# Erika NaN 11.0 NaN 600.0 NaN
counting of houses:
new_pv.filter(like='House').count(1)
# Person
# Adam 2
# Cesar 0
# Diana 1
# Erika 1
# dtype: int64
sum of all house's values:
new_pv.filter(like='House').sum().sum()
# 1530.0
Using groupby, you could achieve something like this.
df_new = df.groupby(['Person', 'Belonging']).agg(['sum', 'count', 'min', 'max'])
which would give:
Value
sum count min max
Person Belonging
Adam Car 10.0 1 10.0 10.0
House 480.0 2 180.0 300.0
Cesar Car 12.0 1 12.0 12.0
Diana Bike 2.0 1 2.0 2.0
Car 36.0 2 15.0 21.0
House 450.0 1 450.0 450.0
Erika Car 11.0 1 11.0 11.0
House 600.0 1 600.0 600.0
You could define your own functions in the .agg method to provide more suitable descriptions also.
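For instance, named aggregation (available since pandas 0.25) gives flat, self-describing column names; the names total, cheapest, and priciest here are just illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana',
               'Erika', 'Erika', 'Diana', 'Adam'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike',
                  'House', 'Car', 'Car', 'House'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})

# named aggregation: each keyword becomes a flat output column
summary = df.groupby(['Person', 'Belonging'])['Value'].agg(
    total='sum', count='count', cheapest='min', priciest='max')
```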
Edit
Alternatively, you could try
df['Belonging'] = df["Belonging"] + "_" + df.groupby(['Person','Belonging']).cumcount().add(1).astype(str)
Person Belonging Value
0 Adam House_1 300.0
1 Adam Car_1 10.0
2 Cesar Car_1 12.0
3 Diana House_1 450.0
4 Diana Car_1 15.0
5 Diana Bike_1 2.0
6 Erika House_1 600.0
7 Erika Car_1 11.0
8 Diana Car_2 21.0
9 Adam House_2 180.0
Then you can just use pivot (keyword arguments, since positional arguments to pivot are deprecated in recent pandas):
df.pivot(index='Person', columns='Belonging')
Value
Belonging Bike_1 Car_1 Car_2 House_1 House_2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 15.0 21.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
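The cumcount-plus-pivot approach end to end as a runnable sketch (keyword arguments are used for pivot, since positional arguments are deprecated in recent pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana',
               'Erika', 'Erika', 'Diana', 'Adam'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike',
                  'House', 'Car', 'Car', 'House'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})

# number repeated belongings per person: House_1, House_2, ...
df['Belonging'] = (df['Belonging'] + '_'
                   + df.groupby(['Person', 'Belonging']).cumcount().add(1).astype(str))

wide = df.pivot(index='Person', columns='Belonging', values='Value')
```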
I ended up working out a solution to this one, inspired by the excellent answers by @SpghttCd and @Josmoor98, but with a couple of differences:
Using a MultiIndex, so I have a really easy way to get all Houses or all Cars.
Sorting values, so looking at the first House or Car can be used to tell who has one worth more than X.
Code for the pivot table:
df_pivot = (df
.assign(BelongingNo=df
.sort_values(by='Value', ascending=False)
.groupby(['Person', 'Belonging'])
.cumcount() + 1
)
.pivot_table(
values='Value',
index='Person',
columns=['Belonging', 'BelongingNo'],
)
)
Resulting DataFrame:
Belonging Bike Car House
BelongingNo 1 1 2 1 2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 21.0 15.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
Queries are pretty straightforward.
For example, finding the Value of Person's Cars, if they have a House valued more than 400.0:
df_pivot.loc[
df_pivot[('House', 1)] > 400.0,
'Car'
]
Result:
BelongingNo 1 2
Person
Diana 21.0 15.0
Erika 11.0 NaN
The average Car price for them:
df_pivot.loc[
df_pivot[('House', 1)] > 400.0,
'Car'
].stack().mean()
Result: 15.6666
Here, using stack() is a powerful way to flatten the second level of the MultiIndex, after having used the top-level to select a Belonging column.
Same is useful to get the total Value of all Houses:
df_pivot['House'].sum()
Results in the expected 1530.0.
Finally, looking at all Belongings of a single Person:
df_pivot.loc['Adam'].dropna()
Returns the expected two Houses and the one Car, with their respective Values.
I tried doing this with the lists in the dataframe, so that they get converted to ndarrays.
pd_df_pivot = df_pivot.copy(deep=True)
for row in range(df_pivot.shape[0]):
    for col in range(df_pivot.shape[1]):
        if type(df_pivot.iloc[row, col]) is list:
            pd_df_pivot.iloc[row, col] = np.array(df_pivot.iloc[row, col])
        else:
            pd_df_pivot.iloc[row, col] = df_pivot.iloc[row, col]

Replace missing values based on another column

I am trying to replace the missing values in a dataframe based on filtering of another column, "Country"
>>> data.head()
Country Advanced skiers, freeriders Snow parks
0 Greece NaN NaN
1 Switzerland 5.0 5.0
2 USA NaN NaN
3 Norway NaN NaN
4 Norway 3.0 4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by the country and then calculating the mean of each column. When I print out the resulting array, it comes up with the expected values. However, when I put it into the .fillna() method, the data appears unchanged
I've tried @DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
Can anyone explain what I am doing wrong, and how to fix the code?
You can use transform, which returns a new DataFrame the same size as the original, filled with the aggregated values:
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
Advanced skiers, freeriders Snow parks
0 0.0 0.0
1 5.0 5.0
2 0.0 0.0
3 3.0 4.0
4 3.0 4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
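The whole answer condensed into a runnable sketch with the sample data (chaining a second fillna(0) for countries that have no data at all):

```python
import pandas as pd

data = pd.DataFrame({
    'Country': ['Greece', 'Switzerland', 'USA', 'Norway', 'Norway'],
    'Advanced skiers, freeriders': [None, 5.0, None, None, 3.0],
    'Snow parks': [None, 5.0, None, None, 4.0],
})

cols = data.columns.difference(['Country'])
# per-country mean fills first; countries with no data at all fall back to 0
data[cols] = (data[cols]
              .fillna(data.groupby('Country')[cols].transform('mean'))
              .fillna(0))
```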
