Line plot for over 1 million datapoints - python-3.x

I am having a hard time producing the line plot I want. I have a dataset containing 23 columns: 21 columns are the percentage amount paid at points from 0 to 2 with a step size of 0.1, one column is the user id of that particular customer, and the last column is the customer segment he belongs to. For all the customers in my dataset I want to plot the payment pattern, with 0-2 in steps of 0.1 on the x-axis and the percentage paid on the y-axis, coloring each customer's line by the segment he belongs to. My dataset looks like the following:
Id       paid_0.0 paid_0.1 paid_0.2 paid_0.3 paid_0.4 Segment
AC005839 0.30 0.38 0.45 0.53 0.61 Best
AC005842 0.30 0.30 0.52 0.52 0.52 Best
AC005843 0.30 0.38 0.45 0.53 0.61 Best
AC005851 0.24 0.31 0.35 0.35 0.51 Medium
AC005852 0.30 0.38 0.45 0.53 0.61 Best
AC005853 0.30 0.38 0.45 0.53 0.61 Best
AC005856 0.30 0.38 0.45 0.53 0.61 Best
AC005858 0.30 0.38 0.45 0.53 0.54 Best
AC005859 0.33 0.43 0.54 0.65 0.65 Best
I am trying to generate a plot as below:
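A minimal sketch of one way to draw such a plot with pandas and matplotlib (the column names are taken from the sample above; the segment-to-color mapping is my own assumption):

```python
# Sketch: one line per customer, colored by segment.
# Segment colors and the small inline DataFrame are assumptions for illustration.
import matplotlib
matplotlib.use("Agg")  # headless backend, safe in scripts
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Id": ["AC005839", "AC005851"],
    "paid_0.0": [0.30, 0.24],
    "paid_0.1": [0.38, 0.31],
    "paid_0.2": [0.45, 0.35],
    "Segment": ["Best", "Medium"],
})

pay_cols = [c for c in df.columns if c.startswith("paid_")]
x = [float(c.split("_")[1]) for c in pay_cols]   # 0.0, 0.1, 0.2, ...
colors = {"Best": "tab:green", "Medium": "tab:orange", "Worst": "tab:red"}

fig, ax = plt.subplots()
for _, row in df.iterrows():
    ax.plot(x, row[pay_cols].astype(float),
            color=colors.get(row["Segment"], "gray"),
            alpha=0.1, linewidth=0.5)   # low alpha keeps dense overplotting readable
ax.set_xlabel("payment step")
ax.set_ylabel("% paid")
fig.savefig("payment_patterns.png", dpi=150)
```

With over a million customers, one `ax.plot` call per row will be slow; a `matplotlib.collections.LineCollection` per segment, downsampling, or datashader scales much better.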

Related

How to scale dataset with huge difference in stdev for DNN training?

I'm trying to train a DNN model on a dataset whose features have hugely different standard deviations. The following scalers were tested but none of them worked: MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer. The reason they didn't work is that the resulting models achieved high predictive performance on the validation sets but had little predictive power on external test sets. The dataset has more than 10,000 rows and 200 columns. Here is part of the summary statistics of the dataset.
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
mean 11.31 -1.04 11.31 0.21 0.55 359.01 337.64 358.58 131.70 0.01 0.09
std 2.72 1.42 2.72 0.24 0.20 139.86 131.40 139.67 52.25 0.14 0.47
min 2.00 -10.98 2.00 0.00 0.02 59.11 50.04 59.07 26.00 0.00 0.00
5% 5.24 -4.07 5.24 0.01 0.19 190.25 178.15 190.10 70.00 0.00 0.00
25% 10.79 -1.35 10.79 0.05 0.41 269.73 254.14 269.16 98.00 0.00 0.00
50% 12.15 -0.64 12.15 0.13 0.58 335.47 316.23 335.15 122.00 0.00 0.00
75% 12.99 -0.21 12.99 0.27 0.72 419.42 394.30 419.01 154.00 0.00 0.00
95% 14.17 0.64 14.17 0.73 0.85 594.71 560.37 594.10 220.00 0.00 1.00
max 19.28 2.00 19.28 5.69 0.95 2924.47 2642.23 2922.13 1168.00 6.00 16.00
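Part of the problem may be distribution shift between training data and the external test sets rather than the scaler itself. Still, for heavy-tailed columns like Var6-Var9 above (mean ~359, max ~2924), compressing the tail with a log transform before standardizing is often worth trying. A sketch with numpy only (the lognormal toy column is an assumption standing in for the real data):

```python
# Sketch: log-transform a heavy-tailed feature before standardizing.
# The synthetic lognormal column is a stand-in for columns like Var6.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=5.8, sigma=0.4, size=10_000)  # long right tail

x_log = np.log1p(x)                                   # compress the tail first
x_scaled = (x_log - x_log.mean()) / x_log.std()       # then standardize
```

In a pipeline, the same idea is `sklearn.preprocessing.FunctionTransformer(np.log1p)` followed by `StandardScaler`, applied only to the skewed columns via `ColumnTransformer`; `QuantileTransformer` is another option for extreme tails.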

How to log a table of metrics into mlflow

I am trying to see if mlflow is the right place to store my metrics for model tracking. According to the docs, log_metric takes a key and a value, and log_metrics takes a dict of key-value pairs. I am wondering how to log something like the table below into mlflow so it can be visualized meaningfully.
precision recall f1-score support
class1 0.89 0.98 0.93 174
class2 0.96 0.90 0.93 30
class3 0.96 0.90 0.93 30
class4 1.00 1.00 1.00 7
class5 0.93 1.00 0.96 13
class6 1.00 0.73 0.85 15
class7 0.95 0.97 0.96 39
class8 0.80 0.67 0.73 6
class9 0.97 0.86 0.91 37
class10 0.95 0.81 0.88 26
class11 0.50 1.00 0.67 5
class12 0.93 0.89 0.91 28
class13 0.73 0.84 0.78 19
class14 1.00 1.00 1.00 6
class15 0.45 0.83 0.59 6
class16 0.97 0.98 0.97 245
class17 0.93 0.86 0.89 206
accuracy 0.92 892
macro avg 0.88 0.90 0.88 892
weighted avg 0.93 0.92 0.92 892
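One common pattern (a sketch, not the only way) is to flatten the per-class table into individual metric keys and keep the full report as a JSON artifact. The dict below mirrors what sklearn's `classification_report(..., output_dict=True)` returns; the mlflow calls are left commented because they need an active run:

```python
# Sketch: flatten a classification report into per-class metric keys so each
# class/score pair shows up as its own metric in the MLflow UI.
report = {
    "class1": {"precision": 0.89, "recall": 0.98, "f1-score": 0.93, "support": 174},
    "class2": {"precision": 0.96, "recall": 0.90, "f1-score": 0.93, "support": 30},
}

flat = {
    f"{cls}_{metric}": value
    for cls, scores in report.items()
    for metric, value in scores.items()
    if metric != "support"          # support is a count, not a metric to track
}

# import mlflow
# with mlflow.start_run():
#     mlflow.log_metrics(flat)                               # one metric per pair
#     mlflow.log_dict(report, "classification_report.json")  # full table as artifact
```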

get the value from another values if value is nan [duplicate]

I am trying to create a column that contains, for each row, the minimum over a few columns. For example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0, 2):
    df['Minimum'] = df.loc[0, 'B' + str(i)].min()
This is a one-liner; you just need to use the axis argument of min to tell it to work across the columns rather than down:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 3
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
A universal and flexible approach that works for my tasks is the following:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x.iloc[0], x.iloc[1], x.iloc[2]), axis=1)
The target column 'Minimum' is assigned the result of the lambda function applied to the selected columns ['B0', 'B1', 'B2']. Inside the function, elements are accessed through the row alias by position (via .iloc when there is more than one element). Be sure to specify axis=1, which makes the calculation row-by-row.
This is very convenient when you need to make more complex calculations.
However, such a solution is likely slower than the vectorized min.
As for selecting the columns, besides the 'for' method, I can suggest using a filter like this:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
Literally, a filter is applied to the list of DF columns through a lambda function that checks for the occurrence of the letter 'B'.
After that, the first example can be written as follows:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
df['Minimum'] = df[cols_to_use].apply(lambda x: min(x), axis=1)
although after pre-selecting the columns, the vectorized form is preferable:
df['Minimum'] = df[cols_to_use].min(axis=1)

Plotting Heatmap with different column/line widths

I am simulating something and want to determine the influence of two parameters. Therefore I vary them both, record the result for each pair of parameter values, and get a table like:
0 1000 2000 3000 4000 5000 ....
0 13.2 14.8 19.9 25.5 27.3 ...
1000 21.3 25.9 32.3 etc.
2000 etc.
3000
4000
....
To visualize them, I use gnuplot to create a heatmap, which works perfectly fine, showing me colors and height:
reset
set terminal qt
set title "Test"
unset key
set tic scale 0
set palette rgbformula 7,5,15
set cbrange [0:100]
set cblabel "Transmission"
set pm3d at s interpolate 1,1
unset surf
set xlabel "U_{Lense} [V]"
set ylabel "E_{Start} [eV]"
set datafile separator "\t"
splot "UT500test.csv" matrix rowheaders columnheaders
Now I want to look in more detail at some areas of my heatmap, varying my parameters in steps of 100, not 1000 as shown in the table above. But because the simulation takes quite a long time, I only do this for some areas, so my table looks like this:
0 1000 2000 2100 2200 2300 2400 ... 2900 3000 4000 ...
...
Now I want to show this in the heatmap, too. But every time I tried, all the bins on the heatmap came out the same width, no matter whether they are 100 or 1000 apart. I want the ones 100 apart to be only 1/10 the width of the 1000-wide ones. Is there a possibility to do this?
The extra steps with stats are not necessary.
You can access the true coordinates directly as a nonuniform matrix:
set offset 100,100,100,100
plot $Data matrix nonuniform using 1:2:3 with points pt 5 lc palette
The missing piece is to fill in the full area rather than plotting single points. You can do this using pm3d:
set pm3d corners2color mean
set view map
splot $Data matrix nonuniform with pm3d
The colors do not match the previous plot because pm3d considers all 4 corners of each box when assigning a color. I told it to take the mean value (that's the default), but many other variants are possible. You could smooth the coloring further with set pm3d interpolate 3,3.
You could do something with the plotting style boxxyerror. It's pretty straightforward, except for getting the x-coordinates into an array to be used later during plotting. Maybe there are smarter solutions.
Script:
### heatmap with irregular spacing
reset session
unset key
$Data <<EOD
0.00 0.00 1000 2000 2100 2200 2300 2400 3000 4000
1000 0.75 0.75 0.43 0.34 0.61 0.74 0.66 0.97 0.58
1100 0.82 0.90 0.18 0.12 0.87 0.15 0.01 0.57 0.97
1200 0.10 0.15 0.68 0.73 0.55 0.07 0.98 0.89 0.01
1300 0.67 0.38 0.41 0.85 0.37 0.45 0.49 0.21 0.98
1400 0.76 0.53 0.68 0.09 0.22 0.40 0.59 0.33 0.08
2000 0.37 0.32 0.30 NaN 0.33 NaN 0.73 0.94 0.96
3000 0.07 0.61 0.37 0.54 0.32 0.28 0.62 0.51 0.48
4000 0.79 0.98 0.78 0.06 0.16 0.45 0.83 0.50 0.10
5000 0.49 0.95 0.29 0.59 0.55 0.88 0.29 0.47 0.93
EOD
stats $Data nooutput
BoxHalfWidth=50
# put first row into array
array ArrayX[STATS_columns]
set table $Dummy
plot for [i=1:STATS_columns] $Data u (ArrayX[i]=column(i)) every ::0::0 with table
unset table
plot for [i=2:STATS_columns] $Data u (ArrayX[i]):1:(BoxHalfWidth):(BoxHalfWidth):i every ::1 with boxxyerror fs solid 1.0 palette
### end of script
Result:
Edit:
With a little bit more effort you can as well generate a plot which covers the whole area.
In contrast to the simpler code from @Ethan, the rectangles are centered on the datapoint coordinates and have the color of the actual datapoint z-value. Furthermore, the datapoint (2200,2000) is also plotted. The borders of the rectangles lie halfway between matrix points. The outer rectangles have dimensions equal to the x and y distance to the next inner matrix point.
Revision: (simplified version, works for gnuplot>=5.0.1)
The following solution works for gnuplot 5.0.1, but not for 5.0.0 (I haven't found out why yet).
There will be a warning: warning: matrix contains missing or undefined values which can be ignored.
I noticed that there seems to be a bug(?!) with the matrix column index, but you can fix it with:
colIdxFix(n) = (r0=r1,r1=column(-1),r0==r1?c=c+1:c=1) # fix for missing column index in a matrix
plot r1=c=0 $Data nonuniform matrix u 1:2:(colIdxFix(0)) ....
Script: (works with gnuplot>=5.0.1)
### heatmap with irregular spacing with filled area
# compatible with gnuplot>=5.0.1
reset session
$Data <<EOD
0.00 0.00 1000 2000 2100 2200 2300 2400 3000 4000
1000 0.75 0.75 0.43 0.34 0.61 0.74 0.66 0.97 0.58
1100 0.82 0.90 0.18 0.12 0.87 0.15 0.01 0.57 0.97
1200 0.10 0.15 0.68 0.73 0.55 0.07 0.98 0.89 0.01
1300 0.67 0.38 0.41 0.85 0.37 0.45 0.49 0.21 0.98
1400 0.76 0.53 0.68 0.09 0.22 0.40 0.59 0.33 0.08
2000 0.37 0.32 0.30 NaN 0.33 NaN 0.73 0.94 0.96
3000 0.07 0.61 0.37 0.54 0.32 0.28 0.62 0.51 0.48
4000 0.79 0.98 0.78 0.06 0.16 0.45 0.83 0.50 0.10
5000 0.49 0.95 0.29 0.59 0.55 0.88 0.29 0.47 0.93
EOD
# get irregular x- and y-values into string
Xs = Ys = ""
stats $Data matrix u ($1==0 ? Ys=Ys.sprintf(" %g",$3) : 0, \
$2==0 ? Xs=Xs.sprintf(" %g",$3) : 0) nooutput
# box extension d in dn (negative) and dp (positive) direction
d(vs,n0,n1) = abs(real(word(vs,n0+1))-real(word(vs,n1+1)))/2.
dn(vs,n) = (n==1 ? (n0=1,n1=2) : (n0=n,n1=n-1), -d(vs,n0,n1))
dp(vs,n) = (Ns=words(vs)-1, n>=Ns ? (n0=Ns-1,n1=Ns) : (n0=n,n1=n+1), d(vs,n0,n1))
unset key
set offset 1,1,1,1
set style fill solid 1.0
colIdxFix(n) = (r0=r1,r1=column(-1),r0==r1?c=c+1:c=1) # fix for missing column index in a matrix (bug?!)
plot r1=c=0 $Data nonuniform matrix u 1:2:($1+dn(Xs,colIdxFix(0))):($1+dp(Xs,c)): \
($2+dn(Ys,int(column(-1))+1)):($2+dp(Ys,int(column(-1))+1)):3 w boxxy palette
### end of script
Result:
Edit2: (I leave this here for gnuplot 5.0.0)
Just for fun, here is the "retro-version" for gnuplot 5.0:
gnuplot 5.0 does not support arrays. It does support datablocks, but apparently indexing like $Datablock[1] does not work. So the workaround is to put the matrix X,Y coordinates into the strings CoordsX and CoordsY and retrieve the coordinates with word(). Unless there is another limitation with strings and word(), the following worked with gnuplot 5.0 and gave the same result as above.
Script:
### heatmap with irregular spacing with filled area
# compatible with gnuplot 5.0
reset session
unset key
$Data <<EOD
0.00 0.00 1000 2000 2100 2200 2300 2400 3000 4000
1000 0.75 0.75 0.43 0.34 0.61 0.74 0.66 0.97 0.58
1100 0.82 0.90 0.18 0.12 0.87 0.15 0.01 0.57 0.97
1200 0.10 0.15 0.68 0.73 0.55 0.07 0.98 0.89 0.01
1300 0.67 0.38 0.41 0.85 0.37 0.45 0.49 0.21 0.98
1400 0.76 0.53 0.68 0.09 0.22 0.40 0.59 0.33 0.08
2000 0.37 0.32 0.30 NaN 0.33 NaN 0.73 0.94 0.96
3000 0.07 0.61 0.37 0.54 0.32 0.28 0.62 0.51 0.48
4000 0.79 0.98 0.78 0.06 0.16 0.45 0.83 0.50 0.10
5000 0.49 0.95 0.29 0.59 0.55 0.88 0.29 0.47 0.93
EOD
stats $Data nooutput
ColCount = int(STATS_columns-1)
RowCount = int(STATS_records-1)
# put first row and column into arrays
CoordsX = ""
set table $Dummy
set xrange[0:1] # to avoid warnings
do for [i=2:ColCount+1] {
plot $Data u (Value=column(i)) every ::0::0 with table
CoordsX = CoordsX.sprintf("%g",Value)." "
}
unset table
CoordsY = ""
set table $Dummy
do for [i=1:RowCount] {
plot $Data u (Value=$1) every ::i::i with table
CoordsY= CoordsY.sprintf("%g",Value)." "
}
unset table
dx(i) = (word(CoordsX,i)-word(CoordsX,i-1))*0.5
dy(i) = (word(CoordsY,i)-word(CoordsY,i-1))*0.5
ndx(i,j) = word(CoordsX,i) - (i-1<1 ? dx(i+1) : dx(i))
pdx(i,j) = word(CoordsX,i) + (i+1>ColCount ? dx(i) : dx(i+1))
ndy(i,j) = word(CoordsY,j) - (j-1<1 ? dy(j+1) : dy(j))
pdy(i,j) = word(CoordsY,j) + (j+1>RowCount ? dy(j) : dy(j+1))
set xrange[ndx(1,1):pdx(ColCount,1)]
set yrange[ndy(1,1):pdy(1,RowCount)]
set tic out
plot for [i=2:ColCount+1] $Data u (real(word(CoordsX,i-1))):1:(ndx(i-1,int($0))):(pdx(i-1,int($0))): \
(ndy(i-1,int($0+1))):(pdy(i-1,int($0+1))):i every ::1 with boxxyerror fs solid 1.0 palette
### end of script

How to handle data file, control fit extending with x-axis and fit logistic

I have a long.dat file as following.
#x1 y1 sd1 x2 y2 sd2 x3 y3 sd3
2.50 9.04 0.03 2.51 16.08 0.04 2.50 26.96 0.07
2.25 9.06 0.05 1.84 16.01 0.16 1.91 26.94 0.21
1.11 9.12 0.19 1.06 15.90 0.14 1.30 26.41 0.10
0.71 9.97 0.18 0.86 16.47 0.33 0.92 28.59 0.92
0.60 11.36 0.24 0.77 17.31 0.18 0.73 33.55 1.40
0.56 12.44 0.55 0.72 18.25 0.25 0.65 37.82 2.16
0.50 14.23 0.37 0.71 18.73 0.49 0.57 44.75 2.69
0.43 16.93 1.20 0.63 20.55 0.64 0.51 52.11 1.01
0.38 19.18 1.12 0.57 22.27 0.94 0.47 58.01 2.17
0.32 24.83 2.26 0.52 25.04 0.53 0.42 65.92 2.62
0.30 28.87 1.39 0.46 29.75 2.41 0.38 71.60 1.81
0.25 34.23 2.07 0.41 37.92 1.49 0.34 75.81 0.68
0.21 39.52 0.53 0.37 43.33 1.81 0.32 77.12 0.68
0.16 44.10 1.81 0.32 47.22 0.57 0.28 79.87 2.03
0.13 49.73 1.19 0.28 49.36 0.99 0.22 85.93 1.32
0.13 49.73 1.19 0.22 53.94 0.98 0.19 89.10 2.14
0.13 49.73 1.19 0.18 57.28 1.56 0.16 96.48 1.28
0.13 49.73 1.19 0.14 63.66 1.90 0.14 100.09 1.46
0.13 49.73 1.19 0.12 67.92 0.64 0.12 103.90 0.48
0.13 49.73 1.19 0.12 67.92 0.64 0.12 103.90 0.48
I tried to fit my data with a second-order polynomial. I am having problems with:
(1) My x1,y1,sd1 data columns are shorter than x2,y2,sd2, so I had to pad x1,y1,sd1 by repeating the values at x1 = 0.13; otherwise the text file does "something" resulting in wrong plotting. Is there any way to avoid this rather than appending the same values?
(2) In my plot, the fit f8(x) extends past its last value at about x = 7.5 to match f12(x) at about x = 8.25. If I set my x-range to [0:100], all the fits extend to x = 100. How can I control this?
Here is the code:
set key left
f8(x) = a8*x*x+b8*x+c8
fit f8(x) 'long.dat' u (1/$1):($2/800**3) via a8,b8,c8
plot f8(x), 'long.dat' u (1/$1):($2/800**3): ($3/800**3) w errorbars not
f10(x) = a10*x*x+b10*x+c10
fit f10(x) 'long.dat' u (1/$4):($5/1000**3) via a10,b10,c10
replot f10(x), 'long.dat' u (1/$4):($5/1000**3): ($6/1000**3) w errorbars not
f12(x) = a12*x*x+b12*x+c12
fit f12(x) 'long.dat' u (1/$7):($8/1200**3) via a12,b12,c12
replot f12(x), '' u (1/$7):($8/1200**3): ($9/1200**3) w errorbars not
(3) I tried to use a logistic fit g(x) = a/(1+b*exp(-k*x)) on the x1,y1 data set but it failed badly! The code is here:
set key left
g(x) = a/(1+b*exp(-k*x))
fit g(x) 'long.dat' u (1/$1):($2/800**3) via a,b,k
plot g(x), 'long.dat' u (1/$1):($2/800**3): ($3/800**3) w errorbars not
Any comment or suggestion would be highly appreciated! Many thanks for going through this big post, and thanks in advance for any feedback!
1) You can use the NaN keyword for the missing points: gnuplot will ignore them.
2) If what you want to plot is a function, it is by definition defined for every x, so it will extend all over.
What you might want to do is store the fitted points in a file, something like:
set table "func.txt"
plot [0.5:7.5] f(x)
unset table
and then plot the file rather than the function. You might want to adjust the samples setting to tune the result: type "help samples".
Some more suggestions besides @bibi's answer:
How should gnuplot know that, at a certain row, the first number it encounters belongs to column 4? For this you can use e.g. a comma as column delimiter:
0.16, 44.10, 1.81, 0.32, 47.22, 0.57, 0.28, 79.87, 2.03
0.13, 49.73, 1.19, 0.28, 49.36, 0.99, 0.22, 85.93, 1.32
, , , 0.22, 53.94, 0.98, 0.19, 89.10, 2.14
And tell gnuplot about it:
set datafile separator ','
All functions are drawn with the same xrange. You can use different limits for a function by returning 1/0 when outside the desired range:
f(x) = a*x**2 + b*x + c
f_p(x, min, max) = (x >= min && x <= max) ? f(x) : 1/0
plot f_p(x, 0.5, 7.5)
You can use stats to extract the limits:
stats 'long.dat' using (1/$1) name 'A_' nooutput
plot f_p(x, A_min, A_max)
For fitting, gnuplot uses 1 as the starting value for any parameter you haven't assigned an explicit value. You can imagine that with a=1 you're not very close to your values of about 1e-7. For nonlinear fitting there is no unique solution independent of the starting values, so it's all about finding good starting values and a proper model function.
With the starting values a=1e-7; b = 50; k = 1 you get a solution, but the fit isn't very good.
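The same starting-value sensitivity can be illustrated outside gnuplot, e.g. with scipy's curve_fit, where p0 plays the role of assigning a, b, k before fit (the data here are synthetic, not the question's long.dat):

```python
# Sketch: nonlinear least squares depends on the starting values.
# curve_fit's p0 is the analogue of pre-assigning a, b, k in gnuplot.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b, k):
    return a / (1.0 + b * np.exp(-k * x))

rng = np.random.default_rng(1)
x = np.linspace(0.5, 8.0, 40)
y = logistic(x, 50.0, 30.0, 1.2) + rng.normal(0.0, 0.5, x.size)  # noisy sigmoid

# A starting guess on the right scale converges to the true parameters;
# the all-ones default (a=b=k=1) is far from the data scale and may not.
popt, _ = curve_fit(logistic, x, y, p0=(60.0, 10.0, 1.0))
```

For the question's data, where y is scaled by 1/800**3 to around 1e-7, the same logic suggests starting near a=1e-7 rather than the default 1.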

Resources