How to plot grouped boxplots with gnuplot

I wonder how to use gnuplot to plot this figure:
I have two problems:
The ytics are ..., 10^2, 10^1, 10^2, 10^3, ... How do I handle such a case?
I know gnuplot supports boxplots, but how do I regroup the boxplots according to some label?
Since I don't have the original data for the figure, I made up some data myself.
There are three companies, A, B, and C, each selling three fruits, with four prices per fruit.
Apple prices of company A: 1.2 1.3 1.4 1.1
Banana prices of company A: 2.2 2.1 2.4 2.5
Orange prices of company A: 3.1 3.3 3.4 3.5
Apple prices of company B: 1.2 1.3 1.4 1.1
Banana prices of company B: 2.2 2.1 2.4 2.5
Orange prices of company B: 3.1 3.3 3.4 3.5
Apple prices of company C: 2.2 1.3 1.4 2.1
Banana prices of company C: 3.2 3.1 3.4 2.5
Orange prices of company C: 2.1 3.3 1.4 2.5
I wonder how to plot those numbers by gnuplot.

Your question is not very detailed and your own coding attempt is missing, hence, there is a lot of uncertainty. I guess there is no simple single command to get your grouped boxplots.
There are for sure several ways to realize your graph, e.g. with multiplot.
The assumption for the example below is that all files have the data organized in columns and equal number of columns and same fruits in the same order. Otherwise the code must be adapted. It all depends on the degree of "automation" you would like to have. Vertical separation lines can be drawn via headless arrows (check help arrow).
So, see the following example as a starting point.
Data:
'Company A.dat'
Apples Bananas Oranges
1.2 2.2 3.1
1.3 2.1 3.3
1.4 2.4 3.4
1.1 2.5 3.5
'Company B.dat'
Apples Bananas Oranges
1.2 2.2 3.1
1.3 2.1 3.3
1.4 2.4 3.4
1.1 2.5 3.5
'Company C.dat'
Apples Bananas Oranges
2.2 3.2 2.1
1.3 3.1 3.3
1.4 3.4 1.4
2.1 2.5 2.5
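If the prices start out as per-fruit lists, as in the question, they first have to be turned into the column layout shown above. A minimal Python sketch of that step follows; the `PRICES` dictionary and the helper names `to_table` and `write_company_files` are made up for illustration and are not part of gnuplot or of the original answer.

```python
from pathlib import Path

# Made-up prices from the question, keyed by company and then by fruit.
PRICES = {
    "A": {"Apples": [1.2, 1.3, 1.4, 1.1],
          "Bananas": [2.2, 2.1, 2.4, 2.5],
          "Oranges": [3.1, 3.3, 3.4, 3.5]},
    "B": {"Apples": [1.2, 1.3, 1.4, 1.1],
          "Bananas": [2.2, 2.1, 2.4, 2.5],
          "Oranges": [3.1, 3.3, 3.4, 3.5]},
    "C": {"Apples": [2.2, 1.3, 1.4, 2.1],
          "Bananas": [3.2, 3.1, 3.4, 2.5],
          "Oranges": [2.1, 3.3, 1.4, 2.5]},
}


def to_table(fruits):
    """Return the file content: a header line, then one column per fruit."""
    names = list(fruits)
    lines = [" ".join(names)]
    # zip(*...) transposes the per-fruit lists into per-row tuples.
    for row in zip(*(fruits[name] for name in names)):
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


def write_company_files(prices, directory="."):
    """Write one 'Company X.dat' file per company and return the paths."""
    paths = []
    for company, fruits in prices.items():
        path = Path(directory) / f"Company {company}.dat"
        path.write_text(to_table(fruits))
        paths.append(path)
    return paths
```

Calling `write_company_files(PRICES)` should create the three files in the current directory, with the same fruit order in every file, which is what the gnuplot code assumes.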
Code:
### grouped boxplots
reset session
FILES = 'A B C'
File(n) = sprintf("Company %s.dat",word(FILES,n))
myXtic(n) = sprintf("Company %s",word(FILES,n))
set xlabel "Fruit prices"
set ylabel "Price"
set yrange [0:5]
set grid y
set key noautotitle
set style fill solid 0.3
N = words(FILES) # number of files
COLS = 3 # number of columns in file
PosX = 0 # x-position of boxplot
plot for [n=1:N] for [COL=1:COLS] PosX=PosX+1 File(n) u (PosX):COL w boxplot lc COL, \
for [COL=1:COLS] File(1) u (NaN):COL w boxes lc COL ti columnhead, \
for [n=1:N] File(1) u ((n-1)*COLS+COLS/2+1):(NaN):xtic(myXtic(n))
### end of code
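The x-positions used in the plot command can be checked by hand. The Python sketch below only reproduces the arithmetic (it does not plot anything); the function names are made up for illustration, and `//` stands in for gnuplot's integer division in `COLS/2`.

```python
def box_positions(n_files, n_cols):
    """x-position of every box, grouped per file: PosX counts 1, 2, 3, ..."""
    return [[(n - 1) * n_cols + col for col in range(1, n_cols + 1)]
            for n in range(1, n_files + 1)]


def tick_positions(n_files, n_cols):
    """x-position of each group label, i.e. (n-1)*COLS + COLS/2 + 1."""
    # gnuplot divides two integers as integers, hence // here.
    return [(n - 1) * n_cols + n_cols // 2 + 1 for n in range(1, n_files + 1)]


def separator_positions(n_files, n_cols):
    """Halfway between the last box of one group and the first of the next."""
    return [n * n_cols + 0.5 for n in range(1, n_files)]


print(box_positions(3, 3))        # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(tick_positions(3, 3))       # [2, 5, 8]
print(separator_positions(3, 3))  # [3.5, 6.5]
```

With three columns the label sits exactly under the middle box. For an even number of columns the integer division puts it half a unit right of the group center, so the formula would probably need adjusting, e.g. to `(n-1)*COLS + (COLS+1)/2.`. The separator values are where the headless arrows mentioned above could go.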
Result:

Sort value in a column based on condition of another column in a dataframe

I have a dataframe that looks like this
Company Company Code Product Code Rating
Monster MNTR MNTR/Headphone1 3.2
Monster MNTR MNTR/Headphone2 3.9
Monster MNTR MNTR/Headphone3 NaN
Monster MNTR MNTR/Earbuds1 3.5
Bose BOSE BOSE/Headphone1 4.0
Bose BOSE BOSE/Earbuds1 NaN
Bose BOSE BOSE/Earbuds2 2.8
Apple APLE APLE/Headphone1 4.5
Sony SONY SONY/Headphone1 3.5
Sony SONY SONY/Headphone2 4.8
Sony SONY SONY/Earbuds1 3.0
Beats BEAT BEAT/Headphone1 3.5
Beats BEAT BEAT/Headphone2 3.7
If any product of a company has a Rating >= 4.0, I want to group by Company Code and bring all products of that company to the top, ordering those companies by that Rating while keeping each company's products together and in their original Product Code order. Like Sony, Apple and Bose.
If none of a company's products is rated 4.0 or above, I would group by Company Code and sort those companies alphabetically by Company Code. Like Beats and Monster.
Company Company Code Product Code Rating
Sony SONY SONY/Headphone1 3.5
Sony SONY SONY/Headphone2 4.8
Sony SONY SONY/Earbuds1 3.0
Apple APLE APLE/Headphone1 4.5
Bose BOSE BOSE/Headphone1 4.0
Bose BOSE BOSE/Earbuds1 NaN
Bose BOSE BOSE/Earbuds2 2.8
Beats BEAT BEAT/Headphone1 3.5
Beats BEAT BEAT/Headphone2 3.7
Monster MNTR MNTR/Headphone1 3.2
Monster MNTR MNTR/Headphone2 3.9
Monster MNTR MNTR/Headphone3 NaN
Monster MNTR MNTR/Earbuds1 3.5
I thought about dividing the dataframe into two parts - upper and lower, then use concat to join them back. For example,
condition = df['Rating'] >= 4.0
df_upper = df.loc[condition]
df_lower = df.loc[~condition]
.
.
.
df_merge = pd.concat([df_upper, df_lower], ignore_index=True)
But I have no idea where to apply groupby and sort. Thank you for helping out.
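The sort uses an ordered Categorical: the categories are the Company Code values of the filtered rows (ordered by Rating, descending) followed by the remaining codes in alphabetical order, and DataFrame.sort_values then sorts by that categorical column: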
condition = df['Rating'] >= 4.0
cats1 = df.loc[condition].sort_values('Rating', ascending=False)['Company Code'].unique()
cats2 = df.loc[~condition, 'Company Code'].sort_values().unique()
cats = pd.Index(cats1).union(pd.Index(cats2), sort=False)
print (cats)
Index(['SONY', 'APLE', 'BOSE', 'BEAT', 'MNTR'], dtype='object')
df['Company Code'] = pd.Categorical(df['Company Code'], ordered=True, categories=cats)
df = df.sort_values('Company Code')
print (df)
Company Company Code Product Code Rating
8 Sony SONY SONY/Headphone1 3.5
9 Sony SONY SONY/Headphone2 4.8
10 Sony SONY SONY/Earbuds1 3.0
7 Apple APLE APLE/Headphone1 4.5
4 Bose BOSE BOSE/Headphone1 4.0
5 Bose BOSE BOSE/Earbuds1 NaN
6 Bose BOSE BOSE/Earbuds2 2.8
11 Beats BEAT BEAT/Headphone1 3.5
12 Beats BEAT BEAT/Headphone2 3.7
0 Monster MNTR MNTR/Headphone1 3.2
1 Monster MNTR MNTR/Headphone2 3.9
2 Monster MNTR MNTR/Headphone3 NaN
3 Monster MNTR MNTR/Earbuds1 3.5
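The same idea can be wrapped in a function that leaves the Company Code column untouched. This is only a sketch: the name `order_companies` and its `threshold` parameter are made up for illustration. It maps each code to its position in the desired order and sorts with a stable algorithm (mergesort), so the original product order inside each company is kept explicitly rather than relying on the default sort.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "Company": ["Monster"] * 4 + ["Bose"] * 3 + ["Apple"] + ["Sony"] * 3 + ["Beats"] * 2,
    "Company Code": ["MNTR"] * 4 + ["BOSE"] * 3 + ["APLE"] + ["SONY"] * 3 + ["BEAT"] * 2,
    "Product Code": [
        "MNTR/Headphone1", "MNTR/Headphone2", "MNTR/Headphone3", "MNTR/Earbuds1",
        "BOSE/Headphone1", "BOSE/Earbuds1", "BOSE/Earbuds2",
        "APLE/Headphone1",
        "SONY/Headphone1", "SONY/Headphone2", "SONY/Earbuds1",
        "BEAT/Headphone1", "BEAT/Headphone2",
    ],
    "Rating": [3.2, 3.9, nan, 3.5, 4.0, nan, 2.8, 4.5, 3.5, 4.8, 3.0, 3.5, 3.7],
})


def order_companies(frame, threshold=4.0):
    """Companies with a rating >= threshold first (best rating first),
    the remaining companies alphabetically; row order inside a company is kept."""
    hit = frame["Rating"] >= threshold  # NaN compares as False
    top = list(frame.loc[hit]
                    .sort_values("Rating", ascending=False)["Company Code"]
                    .unique())
    rest = sorted(set(frame["Company Code"]) - set(top))
    position = {code: i for i, code in enumerate(top + rest)}
    key = frame["Company Code"].map(position)
    # mergesort is stable, so ties keep their original order.
    return frame.loc[key.sort_values(kind="mergesort").index]


result = order_companies(df)
print(result)
```

For the sample data this should give the same row order as the output above (Sony, Apple, Bose, Beats, Monster), while Company Code keeps its original dtype.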

how to get quartiles and classify a value according to this quartile range

I have this df:
d = pd.DataFrame({'Name':['Andres','Lars','Paul','Mike'],
'target':['A','A','B','C'],
'number':[10,12.3,11,6]})
I want to classify each number into a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
.quantile([0.25,0.5,0.75,1]).unstack()
.reset_index()
.rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
)
But as you can see, the 4 quartiles are all equal, because the code above calculates them per row: with only one number per row, all quartiles are the same.
If I run this instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output (showing only the first 2 rows):
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
As you can see, all quartile columns hold the values computed over all values in the number column. Besides that, there is now a column named Rank that classifies the number according to its quartile, e.g. in the first row 10 is within the 1st quartile.
Here's one way that builds on the quantiles you've created by turning them into a DataFrame and joining it to d. It also assigns a "Rank" column using the rank method:
out = (d.join(d['number'].quantile([0.25,0.5,0.75,1])
.set_axis([f'{i}Q' for i in range(1,5)], axis=0)
.to_frame().T
.pipe(lambda x: x.loc[x.index.repeat(len(d))])
.reset_index(drop=True))
.assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0
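Note that a dense rank only coincides with the quartile number here because there are four distinct values. As a hedged alternative, not part of the answer above, `pd.qcut` can bin each number into one of four quantile-based groups, so the label stays within 1 to 4 for any number of rows. The sketch below assumes that "Rank" is meant to be the quartile bin.

```python
import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})

# Quartile boundaries over the whole column, as in the question.
quartiles = d['number'].quantile([0.25, 0.5, 0.75, 1.0])

out = d.copy()
for i, value in enumerate(quartiles, start=1):
    out[f'{i}Q'] = value  # the same scalar is broadcast to every row

# qcut assigns each number to one of 4 equal-probability bins, labelled 1..4.
out['Rank'] = pd.qcut(out['number'], q=4, labels=[1, 2, 3, 4]).astype(int)
print(out)
```

For this sample, 10 falls between the first and second quartile boundaries (9.0 and 10.5), so it lands in bin 2, the same value the dense rank produces.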

Two row header to one row header for a data frame in Pandas

I have a data set with a two-level (MultiIndex) header. I would like to merge it into a single header row, replacing each empty column name in the first row with the previous non-NaN column name in the same row.
The structure of the dataframe is shown below.
The first two rows are the header.
id One Two
response X1 Y1 Z1 X2 Y2
0 0 1.1 1.2 1.4 1.11 1.22
1 1 1.1 1.2 1.3 1.11 1.22
2 2 1.1 1.2 1.1 1.11 1.22
I want to change the data frame above into the one below:
id One 1.X1 One 2.Y1 One 3.Z1 Two 1.X2 Two 2.Y2
0 0 1.1 1.2 1.4 1.11 1.22
1 1 1.1 1.2 1.3 1.11 1.22
2 2 1.1 1.2 1.1 1.11 1.22
Actual data frame has more than 100 columns.
I hope someone can help me here.
Thank you so much.
Mary Abin.
If your columns are indeed a MultiIndex, i.e.
print(df.columns)
MultiIndex([( 'id', 'response'),
('One', 'X1'),
('One', 'Y1'),
('One', 'Z1'),
('Two', 'X2'),
('Two', 'Y2')],
)
then we can pass them into a new data frame and use a cumulative count on the first level before flattening the columns.
s = pd.DataFrame.from_records(df.columns)
s['col'] = (s.groupby(0).cumcount()+1).astype(str) + '.'
#skip the first row and re-order columns to match your desired order.
df.columns = ['id'] + s.iloc[1:, [0,2,1]].astype(str).agg(' '.join,1).tolist()
print(df)
id One 1. X1 One 2. Y1 One 3. Z1 Two 1. X2 Two 2. Y2
0 0 1.1 1.2 1.4 1.11 1.22
1 1 1.1 1.2 1.3 1.11 1.22
2 2 1.1 1.2 1.1 1.11 1.22
print(s)
0 1 col
0 id response 1.
1 One X1 1.
2 One Y1 2.
3 One Z1 3.
4 Two X2 1.
5 Two Y2 2.
Alternatively, if only the second-level names are needed, drop the first level of the column index:
df.columns = df.columns.droplevel(0)
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.droplevel.html
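For comparison, here is a self-contained sketch that builds the requested labels with a plain loop over the column tuples. The helper name `flatten_columns` and its `keep` parameter are made up for illustration, and the sketch assumes the columns really are a two-level MultiIndex as printed above.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('id', 'response'),
    ('One', 'X1'), ('One', 'Y1'), ('One', 'Z1'),
    ('Two', 'X2'), ('Two', 'Y2'),
])
df = pd.DataFrame(
    [[0, 1.1, 1.2, 1.4, 1.11, 1.22],
     [1, 1.1, 1.2, 1.3, 1.11, 1.22],
     [2, 1.1, 1.2, 1.1, 1.11, 1.22]],
    columns=columns,
)


def flatten_columns(cols, keep=('id',)):
    """Turn (top, sub) pairs into 'top N.sub', numbering within each top label.
    Top-level labels listed in `keep` are used on their own."""
    counts = {}
    flat = []
    for top, sub in cols:
        if top in keep:
            flat.append(top)
            continue
        counts[top] = counts.get(top, 0) + 1
        flat.append(f'{top} {counts[top]}.{sub}')
    return flat


df.columns = flatten_columns(df.columns)
print(list(df.columns))
# ['id', 'One 1.X1', 'One 2.Y1', 'One 3.Z1', 'Two 1.X2', 'Two 2.Y2']
```

Since it only iterates over the column labels, the number of columns (100 or more) should not matter.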

2D plots from several input data files

My code produces 1000 snapshot_XXXX.dat files (XXXX = 0001, 0002, ...). They are two-column data files, each capturing the state of the system I am running at a specific time. I would like to combine them in the order they were created to build a 2D plot (or heatmap) that shows how the quantity I am following evolves over time.
How can I do this using gnuplot?
Assuming you want the time axis going from bottom to top, you could try the following:
n=4 # Number of snapshots
set palette defined (0 "white", 1 "red")
unset key
set style fill solid
set ylabel "Snapshot/Time"
set yrange [0.5:n+0.5]
set ytics 1
# This function gives the name of the snapshot file
snapshot(i) = sprintf("snapshot_%04d.dat", i)
# Plot all snapshot files.
# - "with boxes" fakes the heat map
# - "linecolor palette" takes the third column in the "using"
# instruction which is the second column in the datafiles
# Plot from top to bottom because each set of boxes overlays the previous ones.
plot for [i=1:n] snapshot(n+1-i) using 1:(n+1.5-i):2 with boxes linecolor palette
This example data
snapshot_0001.dat snapshot_0002.dat snapshot_0003.dat snapshot_0004.dat
1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1.5 0.0 1.5 0.0 1.5 0.0 1.5 0.0
2.0 0.5 2.0 0.7 2.0 0.7 2.0 0.7
2.5 1.0 2.5 1.5 2.5 1.5 2.5 1.5
3.0 0.5 3.0 0.7 3.0 1.1 3.0 1.5
3.5 0.0 3.5 0.0 3.5 0.7 3.5 1.1
4.0 0.0 4.0 0.0 4.0 0.0 4.0 0.7
4.5 0.0 4.5 0.0 4.5 0.0 4.5 0.0
5.0 0.0 5.0 0.0 5.0 0.0 5.0 0.0
results in this image (tested with Gnuplot 5.0):
You can change the order of the plots if you want to go from top to bottom. If you want to go from left to right, maybe this can help (not tested).
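Another option, sketched here in Python rather than gnuplot and not taken from the answer above, is to merge the snapshots into a single matrix file first, with one row per snapshot and one column per x value. The function names are made up for illustration, and the sketch assumes every snapshot uses the same x grid.

```python
def snapshot_name(i):
    """Same naming scheme as the gnuplot function above."""
    return f"snapshot_{i:04d}.dat"


def parse_snapshot(text):
    """Parse the two-column content of one snapshot into (x, value) pairs."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        rows.append((float(parts[0]), float(parts[1])))
    return rows


def stack_snapshots(texts):
    """Return (x_values, matrix) with one matrix row per snapshot, in order."""
    x_values, matrix = None, []
    for number, text in enumerate(texts, start=1):
        rows = parse_snapshot(text)
        xs = [x for x, _ in rows]
        if x_values is None:
            x_values = xs
        elif xs != x_values:
            raise ValueError(f"snapshot {number} uses a different x grid")
        matrix.append([value for _, value in rows])
    return x_values, matrix


def matrix_text(matrix):
    """Format the matrix as whitespace-separated rows, one line per snapshot."""
    return "\n".join(" ".join(str(v) for v in row) for row in matrix) + "\n"
```

The snapshot contents would be read with something like `open(snapshot_name(i)).read()` for i = 1..n, and the result of `matrix_text` written to a file, say `evolution.dat`. That file should then be plottable with `plot 'evolution.dat' matrix with image`, where the axes show column and row indices rather than the original x values.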

gnuplot: yerrorbars with linecolor variable

I want to draw yerrorbars with different colors. I am able to draw points with different colors using the following code:
reset
plot "-" using 1:2:3 with points linecolor variable
# x y linecolor
-4.0 -3.8 1
-3.0 -2.9 1
-2.0 -2.1 2
-1.0 -1.2 1
1.0 1.1 1
2.0 2.2 2
3.0 3.3 3
4.0 4.5 3
end
But I am not sure how to extend this to yerrorbars. When I try the following code, the errorbars are colored only with the default color. How do I color the errorbars with a specific color?
reset
plot "-" using 1:2:($1-$2) with yerrorbars linecolor variable
# x y linecolor
-4.0 -3.8 1
-3.0 -2.9 1
-2.0 -2.1 2
-1.0 -1.2 1
1.0 1.1 1
2.0 2.2 2
3.0 3.3 3
4.0 4.5 3
end
I found a way to do this by separating the data and then plotting it. But if there is a way without separating the data it would be a nicer solution.
reset
plot "-" using 1:2:($1-$2) with yerrorbars lc 1, \
"-" using 1:2:($1-$2) with yerrorbars lc 2, \
"-" using 1:2:($1-$2) with yerrorbars lc 3
# x y
-4.0 -3.8
-3.0 -2.9
-1.0 -1.2
1.0 1.1
end
-2.0 -2.1
2.0 2.2
end
3.0 3.3
4.0 4.5
end
using specifies which columns will be the input for the command. So since your third column is linecolor, and yerrorbars linecolor expects the fourth column to be the line color, you need to specify using 1:2:($1-$2):3. So, this is the corrected version of your example:
reset
plot "-" using 1:2:($1-$2):3 with yerrorbars linecolor variable
# x y linecolor
-4.0 -3.8 1
-3.0 -2.9 1
-2.0 -2.1 2
-1.0 -1.2 1
1.0 1.1 1
2.0 2.2 2
3.0 3.3 3
4.0 4.5 3
end
The problem is, that the third column ($1 - $2) is used to plot the yerrorbar (the ydelta more specifically). The documentation:
3 columns: x y ydelta
You'll need to add another column for the linecolor. If you want to make up something fancy, you could do something like:
plot "/tmp/test.foo" using 1:2:($1-$2):(int($1)+1) with yerrorbars linecolor variable
(e.g. use the integer part of the first column and add 1).
Or you can also use ternary operators if you want to choose between two colors:
plot "-" using 1:2:($1-$2):($1 > 1 ? 1 : 3) with yerrorbars linecolor variable
(e.g. choose linecolor 1 if the value in the first column is greater than 1, linecolor 3 otherwise)
