Get the value from other values if the value is NaN [duplicate] - python-3.x

I am trying to create a column which contains, for each row, the minimum across a few columns. For example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0, 2):
    df['Minimum'] = df.loc[0, 'B' + str(i)].min()

This is a one-liner; you just need to use the axis argument of min to tell it to work across the columns rather than down:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 3  # covers B0, B1 and B2, as in the question
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
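For reference, here is a self-contained version of that approach (my own snippet, rebuilding just the first two rows of the question's sample to verify the result):
import pandas as pd

# first two rows of the question's B columns
df = pd.DataFrame({'B0': [0.46, 0.39],
                   'B1': [0.76, 0.58],
                   'B2': [0.42, 0.83]})
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
print(df['Minimum'].tolist())  # [0.42, 0.39], matching the expected output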

For my tasks, a universal and flexible approach is the following:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x['B0'], x['B1'], x['B2']), axis=1)
The target column 'Minimum' is assigned the result of the lambda function applied to the selected columns ['B0', 'B1', 'B2']. Inside the function, you access the elements through the function's argument and its index labels (when there is more than one element). Be sure to specify axis=1, which makes apply work row by row.
This is very convenient when you need to make complex calculations.
However, I assume that such a solution may be inferior in speed.
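As an illustration of the "complex calculations" case (a hypothetical example of mine, not from the question): the same apply pattern handles row-wise logic that min(axis=1) cannot express, such as a conditional pick:
# hypothetical rule: keep B0 when it is small, otherwise average B1 and B2
df['Pick'] = df[['B0', 'B1', 'B2']].apply(
    lambda x: x['B0'] if x['B0'] < 0.5 else x[['B1', 'B2']].mean(),
    axis=1,
)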
As for the selection of columns, in addition to the 'for' method, I can suggest using a filter like this:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
Literally, a filter is applied to the list of DF columns through a lambda function that checks for the occurrence of the letter 'B'.
After that, the first example can be written as follows:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
df['Minimum'] = df[cols_to_use].apply(lambda x: min(x), axis=1)
Although after pre-selecting the columns, this would be preferable:
df['Minimum'] = df[cols_to_use].min(axis=1)
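As a side note (my addition, not from the original answer), pandas also has a built-in shortcut for this kind of substring selection: filter(like='B') keeps every column whose label contains 'B', just like the lambda above:
df['Minimum'] = df.filter(like='B').min(axis=1)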

Related

How to scale dataset with huge difference in stdev for DNN training?

I'm trying to train a DNN model on a dataset with huge differences in stdev. The following scalers were tested but none of them worked: MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer. The reason they didn't work is that the resulting models achieved high predictive performance on the validation sets but had little predictivity on external test sets. The dataset has more than 10,000 rows and 200 columns. Here is part of the statistics of the dataset.
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
mean 11.31 -1.04 11.31 0.21 0.55 359.01 337.64 358.58 131.70 0.01 0.09
std 2.72 1.42 2.72 0.24 0.20 139.86 131.40 139.67 52.25 0.14 0.47
min 2.00 -10.98 2.00 0.00 0.02 59.11 50.04 59.07 26.00 0.00 0.00
5% 5.24 -4.07 5.24 0.01 0.19 190.25 178.15 190.10 70.00 0.00 0.00
25% 10.79 -1.35 10.79 0.05 0.41 269.73 254.14 269.16 98.00 0.00 0.00
50% 12.15 -0.64 12.15 0.13 0.58 335.47 316.23 335.15 122.00 0.00 0.00
75% 12.99 -0.21 12.99 0.27 0.72 419.42 394.30 419.01 154.00 0.00 0.00
95% 14.17 0.64 14.17 0.73 0.85 594.71 560.37 594.10 220.00 0.00 1.00
max 19.28 2.00 19.28 5.69 0.95 2924.47 2642.23 2922.13 1168.00 6.00 16.00
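One thing worth trying (a hedged suggestion of mine, not from the thread): several of these features have heavy right tails (e.g. Var6 maxes at 2924.47 against a median of 335.47), so a log transform before scaling may behave better than scaling alone. A minimal scikit-learn sketch, where X_heavy is an assumed name for an array holding only the nonnegative, heavy-tailed columns (Var2 contains negatives and would need separate handling):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log1p compresses the heavy right tails before the usual zero-mean/unit-variance scaling
log_scaler = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())
X_scaled = log_scaler.fit_transform(X_heavy)  # X_heavy: assumed name for the selected columns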

How to groupby a dataframe based on one column and transpose based on another column

I have a dataframe with values as:
col_1  Timestamp   data_1  data_2
aaa    22/12/2001  0.21    0.2
abb    22/12/2001  0.20    0
acc    22/12/2001  0.12    0.19
aaa    23/12/2001  0.23    0.21
abb    23/12/2001  0.32    0.18
acc    23/12/2001  0.52    0.20
I need to group the dataframe based on the timestamp and add columns w.r.t the col_1 column for data_1 and data_2 such as:
Timestamp   aaa_data_1  abb_data_1  acc_data_1  aaa_data_2  abb_data_2  acc_data_2
22/12/2001  0.21        0.20        0.12        0.2         0           0.19
23/12/2001  0.23        0.32        0.52        0.21        0.18        0.20
I am able to group by based on timestamp but not finding a way to update/add the columns.
And with df.pivot(index='Timestamp', columns='col_1'), I get
Timestamp   aaa_data_1  abb_data_1  acc_data_1  aaa_data_2  abb_data_2  acc_data_2
22/12/2001  0.12        0.19
22/12/2001  0.20        0
22/12/2001  0.21        0.2
23/12/2001  0.52        0.20
23/12/2001  0.32        0.18
23/12/2001  0.23        0.21
A pivot plus a column rename are all you need:
result = df.pivot(index='Timestamp', columns='col_1')
result.columns = [f'{col_1}_{data}' for data, col_1 in result.columns]
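After the rename, Timestamp is the index of result; if you want it back as a regular column, add a reset_index():
result = result.reset_index()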
#CodeDifferent's answer suffices, since your data does not need aggregation; an alternative option is the dev version of pivot_wider from pyjanitor (a wrapper around pandas functions):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
df.pivot_wider(index='Timestamp',
               names_from='col_1',
               levels_order=['col_1', None],
               names_sep='_')
Timestamp aaa_data_1 abb_data_1 acc_data_1 aaa_data_2 abb_data_2 acc_data_2
0 22/12/2001 0.21 0.20 0.12 0.20 0.00 0.19
1 23/12/2001 0.23 0.32 0.52 0.21 0.18 0.20
This will fail if there are duplicates in the combination of index and names_from; in that case you can use the pivot_table, which takes care of duplicates:
(df.pivot_table(index='Timestamp', columns='col_1')
   .swaplevel(axis=1)
   .pipe(lambda df: df.set_axis(df.columns.map('_'.join), axis=1))
)
aaa_data_1 abb_data_1 acc_data_1 aaa_data_2 abb_data_2 acc_data_2
Timestamp
22/12/2001 0.21 0.20 0.12 0.20 0.00 0.19
23/12/2001 0.23 0.32 0.52 0.21 0.18 0.20
Or with a helper method from pyjanitor, for a bit cleaner method-chaining syntax:
(df.pivot_table(index='Timestamp', columns='col_1')
   .swaplevel(axis=1)
   .collapse_levels()
)

Line plot for over 1 million datapoint

I am having a hard time producing a desired line plot. I have a dataset containing 23 columns: 21 columns are the %age amount paid from 0 to 2 with a step size of 0.1, one column is the user id of that particular customer, and the last column is the customer segment that they belong to. I want to plot, for all the customers in my dataset, the payment pattern: 0-2 with 0.1 step size on the x-axis, the values for %age paid on the y-axis, and each customer's line colored by the segment they belong to. My dataset looks like the following:
Id        paid_0.0  paid_0.1  paid_0.2  paid_0.3  paid_0.4  Segment
AC005839 0.30 0.38 0.45 0.53 0.61 Best
AC005842 0.30 0.30 0.52 0.52 0.52 Best
AC005843 0.30 0.38 0.45 0.53 0.61 Best
AC005851 0.24 0.31 0.35 0.35 0.51 Medium
AC005852 0.30 0.38 0.45 0.53 0.61 Best
AC005853 0.30 0.38 0.45 0.53 0.61 Best
AC005856 0.30 0.38 0.45 0.53 0.61 Best
AC005858 0.30 0.38 0.45 0.53 0.54 Best
AC005859 0.33 0.43 0.54 0.65 0.65 Best
I am trying to generate a plot as below (the example image from the original post is not reproduced here).
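A hedged sketch (my code, not from the thread) of one way to build such a plot, assuming the frame looks like the sample above and using heavy transparency so a million overlapping lines stay readable:
import matplotlib.pyplot as plt

paid_cols = [c for c in df.columns if c.startswith('paid_')]  # paid_0.0 ... paid_2.0
x = [float(c.split('_', 1)[1]) for c in paid_cols]            # 0.0, 0.1, ..., 2.0
palette = {'Best': 'tab:green', 'Medium': 'tab:orange'}       # extend for any other segments
for _, row in df.iterrows():
    plt.plot(x, row[paid_cols].astype(float),
             color=palette.get(row['Segment'], 'tab:gray'),
             alpha=0.05, linewidth=0.5)
plt.xlabel('%age point (0 to 2, step 0.1)')
plt.ylabel('%age amount paid')
plt.show()
With over a million rows, iterating is slow; plotting a random sample (df.sample(10_000)) or one aggregated line per segment is the usual workaround.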

Position of Cells where the mid point of graph lies

I have the following data.
x y
0.00 0.00
0.03 1.74
0.05 2.60
0.08 3.04
0.11 3.47
0.13 3.90
0.16 4.33
0.19 4.59
0.21 4.76
0.20 3.90
0.18 3.12
0.18 2.60
0.16 2.17
0.15 1.73
0.13 1.47
0.12 1.21
0.14 2.60
0.17 3.47
0.21 3.90
0.23 4.33
0.26 4.76
0.28 5.19
0.31 5.45
0.33 5.62
0.37 5.79
0.38 5.97
0.42 6.14
0.44 6.22
0.47 6.31
0.49 6.39
0.51 6.48
I used =MAX()/2 to obtain the 50% point (half the maximum), which in this case is 3.24.
The value 3.24 does not occur among the y values, but it falls between 3.04 and 3.47.
How can I find the address of these 2 cells?
Note: The 50th percentile also hits on the other part of the graph, but I only require the first instance.
Assuming your data is in columns A and B, with the header in row 1 (first numbers in row 2), and your =MAX()/2 formula is in D2:
Use AGGREGATE to determine the first row where the Y value exceeds your half-maximum, then do it again and subtract 1 from the row.
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)
That returns row number 6, the first occurrence exceeding the value in D2.
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1
That gives you row number 5.
Use the row numbers in conjunction with INDEX and you can pull the X values.
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
That will give you the X values. If you want the corresponding Y values, simply change the INDEX lookup range from A:A to B:B.
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
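For anyone who would rather do the same lookup in pandas than in Excel, a rough equivalent might be (my sketch; df, 'x' and 'y' are assumed names for the table above):
import pandas as pd

# assumed: df holds the x/y table with a default RangeIndex
half_max = df['y'].max() / 2                 # the =MAX()/2 value, 3.24 here
first = (df['y'] > half_max).idxmax()        # index of the first y above the midpoint
bracket = df.loc[[first - 1, first]]         # the two rows that bracket the crossing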

How to handle data file, control fit extending with x-axis and fit logistic

I have a long.dat file as following.
#x1 y1 sd1 x2 y2 sd2 x3 y3 sd3
2.50 9.04 0.03 2.51 16.08 0.04 2.50 26.96 0.07
2.25 9.06 0.05 1.84 16.01 0.16 1.91 26.94 0.21
1.11 9.12 0.19 1.06 15.90 0.14 1.30 26.41 0.10
0.71 9.97 0.18 0.86 16.47 0.33 0.92 28.59 0.92
0.60 11.36 0.24 0.77 17.31 0.18 0.73 33.55 1.40
0.56 12.44 0.55 0.72 18.25 0.25 0.65 37.82 2.16
0.50 14.23 0.37 0.71 18.73 0.49 0.57 44.75 2.69
0.43 16.93 1.20 0.63 20.55 0.64 0.51 52.11 1.01
0.38 19.18 1.12 0.57 22.27 0.94 0.47 58.01 2.17
0.32 24.83 2.26 0.52 25.04 0.53 0.42 65.92 2.62
0.30 28.87 1.39 0.46 29.75 2.41 0.38 71.60 1.81
0.25 34.23 2.07 0.41 37.92 1.49 0.34 75.81 0.68
0.21 39.52 0.53 0.37 43.33 1.81 0.32 77.12 0.68
0.16 44.10 1.81 0.32 47.22 0.57 0.28 79.87 2.03
0.13 49.73 1.19 0.28 49.36 0.99 0.22 85.93 1.32
0.13 49.73 1.19 0.22 53.94 0.98 0.19 89.10 2.14
0.13 49.73 1.19 0.18 57.28 1.56 0.16 96.48 1.28
0.13 49.73 1.19 0.14 63.66 1.90 0.14 100.09 1.46
0.13 49.73 1.19 0.12 67.92 0.64 0.12 103.90 0.48
0.13 49.73 1.19 0.12 67.92 0.64 0.12 103.90 0.48
I tried to fit my data with a second-order polynomial. I am having problems with:
(1) My x1,y1,sd1 data columns are shorter than x2,y2,sd2, so I had to pad x1,y1,sd1 by repeating the last row at x1 = 0.13. Otherwise, the text file does "something" that results in wrong plotting. Is there any way to avoid this other than padding with repeated values?
(2) In my plot, the fit f8(x) extends past its last data point at about x = 7.5 to meet f12(x) at about x = 8.25. If I set my x-range to [0:100], all the fits extend to x = 100. How can I control this?
Here is the code:
set key left
f8(x) = a8*x*x+b8*x+c8
fit f8(x) 'long.dat' u (1/$1):($2/800**3) via a8,b8,c8
plot f8(x), 'long.dat' u (1/$1):($2/800**3): ($3/800**3) w errorbars not
f10(x) = a10*x*x+b10*x+c10
fit f10(x) 'long.dat' u (1/$4):($5/1000**3) via a10,b10,c10
replot f10(x), 'long.dat' u (1/$4):($5/1000**3): ($6/1000**3) w errorbars not
f12(x) = a12*x*x+b12*x+c12
fit f12(x) 'long.dat' u (1/$7):($8/1200**3) via a12,b12,c12
replot f12(x), '' u (1/$7):($8/1200**3): ($9/1200**3) w errorbars not
(3) I tried to use a logistic fit g(x) = a/(1+b*exp(-k*x)) on the x1,y1 data set, but it failed badly! The code is here:
set key left
g(x) = a/(1+b*exp(-k*x))
fit g(x) 'long.dat' u (1/$1):($2/800**3) via a,b,k
plot g(x), 'long.dat' u (1/$1):($2/800**3): ($3/800**3) w errorbars not
Any comment/suggestion would be highly appreciated! Many thanks for going through this long post, and thanks in advance for any feedback!
1) You can use the NaN keyword for the missing points: gnuplot will ignore them.
2) If what you want to plot is a function, it is by definition defined for every x, so it will extend all over.
what you might want to do is to store the fitted points on a file, something like:
set table "func.txt"
plot [0.5:7.5] f(x)
unset table
and then plot the file rather than the function. You might want to use the samples setting to tune the result: type "help samples".
Some more suggestions besides #bibi's answer:
How should gnuplot know that, at a certain row, the first number it encounters belongs to column 4? For this you can use e.g. a comma as the column delimiter:
0.16, 44.10, 1.81, 0.32, 47.22, 0.57, 0.28, 79.87, 2.03
0.13, 49.73, 1.19, 0.28, 49.36, 0.99, 0.22, 85.93, 1.32
, , , 0.22, 53.94, 0.98, 0.19, 89.10, 2.14
And tell gnuplot about it:
set datafile separator ','
All functions are drawn over the same xrange. You can give a function different limits by returning 1/0 when outside the desired range:
f(x) = a*x**2 + b*x + c
f_p(x, min, max) = (x >= min && x <= max) ? f(x) : 1/0
plot f_p(x, 0.5, 7.5)
You can use stats to extract the limits:
stats 'long.dat' using (1/$1) name 'A_' nooutput
plot f_p(x, A_min, A_max)
For fitting, gnuplot uses 1 as the starting value for any parameter you haven't assigned an explicit value. And you can imagine that, with a=1, you're not very close to your values of around 1e-7. For nonlinear fitting there is no single solution that is reached from every starting value, so it's all about finding correct starting values and a proper model function.
With the starting values a = 1e-7; b = 50; k = 1 (assign them before calling fit), you get a solution, but the fit isn't very good.
