How to log a table of metrics into mlflow - mlflow

I am trying to see if mlflow is the right place to store my metrics for model tracking. According to the docs, log_metric takes a single key-value pair and log_metrics takes a dict of key-values. I am wondering how to log something like the table below into mlflow so it can be visualized meaningfully.
              precision  recall  f1-score  support
class1             0.89    0.98      0.93      174
class2             0.96    0.90      0.93       30
class3             0.96    0.90      0.93       30
class4             1.00    1.00      1.00        7
class5             0.93    1.00      0.96       13
class6             1.00    0.73      0.85       15
class7             0.95    0.97      0.96       39
class8             0.80    0.67      0.73        6
class9             0.97    0.86      0.91       37
class10            0.95    0.81      0.88       26
class11            0.50    1.00      0.67        5
class12            0.93    0.89      0.91       28
class13            0.73    0.84      0.78       19
class14            1.00    1.00      1.00        6
class15            0.45    0.83      0.59        6
class16            0.97    0.98      0.97      245
class17            0.93    0.86      0.89      206
accuracy                             0.92      892
macro avg          0.88    0.90      0.88      892
weighted avg       0.93    0.92      0.92      892
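One option (an assumption about intent, not the only approach): flatten each cell of the per-class table into a scalar metric whose name encodes the row and column, then log them all at once with mlflow.log_metrics. sklearn's classification_report(..., output_dict=True) already produces the nested dict used below; the two-class sample report is abbreviated from the table above.

```python
# Sample of the report above, in the nested-dict shape that
# sklearn.metrics.classification_report(output_dict=True) returns.
report = {
    "class1": {"precision": 0.89, "recall": 0.98, "f1-score": 0.93, "support": 174},
    "class2": {"precision": 0.96, "recall": 0.90, "f1-score": 0.93, "support": 30},
}

def flatten_report(report):
    """Turn {'class1': {'precision': 0.89, ...}} into {'class1_precision': 0.89, ...}."""
    flat = {}
    for label, row in report.items():
        for metric, value in row.items():
            flat[f"{label}_{metric}"] = value
    return flat

flat = flatten_report(report)

# Logging the flattened dict (commented out so the sketch runs without MLflow):
# import mlflow
# with mlflow.start_run():
#     mlflow.log_metrics(flat)   # each table cell becomes its own named metric
```

With this layout every cell shows up as a separate, plottable metric (e.g. class1_precision); the trade-off is that the table structure itself lives only in the metric names.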

Related

Adding two columns based on the match of column values of dataframe1 and colnames of dataframe2

I have two tibbles in R like these:
portfolio
MAR PLC KIN AMN
1 Fin It Sov 567
2 Cdi Fr Mnc 782
3 Hlt De Pse 312
4 Uti It Sov 234
...
and cases
It Fr De Fin Cdi Hlt Uti
1 0.11 0.21 0.56 0.43 0.89 0.26 0.77
2 0.92 0.03 0.44 0.52 0.78 0.24 0.86
3 0.14 0.42 0.83 0.03 0.22 0.75 0.65
4 0.83 0.31 0.06 0.42 0.89 0.07 0.48
5 0.12 0.29 0.51 0.95 0.38 0.81 0.76
...
I would like to add two columns to the first tibble, conditional on the combination of portfolio$MAR and portfolio$PLC, so that the two new columns contain the values from the matching MAR and PLC columns of the second tibble. Something like this:
df_result
MAR PLC KIN AMN cases(MAR) cases(PLC)
1 Fin It Sov 567 0.43 0.11
2 Fin It Sov 567 0.52 0.92
3 Fin It Sov 567 0.03 0.14
4 Fin It Sov 567 0.42 0.83
5 Fin It Sov 567 0.95 0.12
6 Cdi Fr Mnc 782 0.89 0.21
7 Cdi Fr Mnc 782 0.78 0.03
8 Cdi Fr Mnc 782 0.22 0.42
9 Cdi Fr Mnc 782 0.89 0.31
10 Cdi Fr Mnc 782 0.38 0.29
11 Hlt De Pse 312 0.26 0.56
...
12 Uti It Sov 234 0.76 0.12
I tried with left_join, but I really don't think that is the right way to proceed.
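The desired output pairs every portfolio row with every cases row (a cross join), then looks up the cases column named by that portfolio row's MAR and PLC. The question is about R, but the lookup logic can be sketched in pandas (Python used here for illustration only; the frame contents are abbreviated from the example above):

```python
import pandas as pd

# First two rows of the portfolio tibble from the question.
portfolio = pd.DataFrame({
    "MAR": ["Fin", "Cdi"],
    "PLC": ["It", "Fr"],
    "KIN": ["Sov", "Mnc"],
    "AMN": [567, 782],
})
# First two rows / four columns of the cases tibble.
cases = pd.DataFrame({
    "It": [0.11, 0.92], "Fr": [0.21, 0.03],
    "Fin": [0.43, 0.52], "Cdi": [0.89, 0.78],
})

# Pair every portfolio row with every cases row, then pull the cases
# column whose name matches MAR (and PLC) of that portfolio row.
rows = []
for _, p in portfolio.iterrows():
    for _, c in cases.iterrows():
        rows.append({**p.to_dict(),
                     "cases(MAR)": c[p["MAR"]],
                     "cases(PLC)": c[p["PLC"]]})
df_result = pd.DataFrame(rows)
```

In the tidyverse the same expansion could be done with tidyr::crossing followed by a rowwise lookup; a plain left_join cannot express "the joining column name is itself stored in a column", which is why it felt wrong.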

How to scale dataset with huge difference in stdev for DNN training?

I'm trying to train a DNN model on a dataset whose features have huge differences in standard deviation. I tested the following scalers, but none of them worked: MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer. By "didn't work" I mean that the models achieve high predictive performance on the validation sets but have little predictivity on external test sets. The dataset has more than 10,000 rows and 200 columns. Here is part of the summary statistics of the dataset.
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
mean 11.31 -1.04 11.31 0.21 0.55 359.01 337.64 358.58 131.70 0.01 0.09
std 2.72 1.42 2.72 0.24 0.20 139.86 131.40 139.67 52.25 0.14 0.47
min 2.00 -10.98 2.00 0.00 0.02 59.11 50.04 59.07 26.00 0.00 0.00
5% 5.24 -4.07 5.24 0.01 0.19 190.25 178.15 190.10 70.00 0.00 0.00
25% 10.79 -1.35 10.79 0.05 0.41 269.73 254.14 269.16 98.00 0.00 0.00
50% 12.15 -0.64 12.15 0.13 0.58 335.47 316.23 335.15 122.00 0.00 0.00
75% 12.99 -0.21 12.99 0.27 0.72 419.42 394.30 419.01 154.00 0.00 0.00
95% 14.17 0.64 14.17 0.73 0.85 594.71 560.37 594.10 220.00 0.00 1.00
max 19.28 2.00 19.28 5.69 0.95 2924.47 2642.23 2922.13 1168.00 6.00 16.00
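One transformer not on the list above is QuantileTransformer, which maps each feature through its empirical CDF and is therefore insensitive to the very different spreads and heavy tails visible in Var6-Var10. A minimal sketch on synthetic heavy-tailed data (the log-normal generator is an assumption standing in for the real dataset):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic stand-in: heavy-tailed columns with large std,
# roughly mimicking Var6-Var9 in the table above.
X_train = rng.lognormal(mean=5.0, sigma=1.0, size=(1000, 3))
X_test = rng.lognormal(mean=5.0, sigma=1.0, size=(200, 3))

# Map each feature to an approximately standard-normal distribution
# via its empirical quantiles.
qt = QuantileTransformer(output_distribution="normal",
                         n_quantiles=500, random_state=0)
X_train_s = qt.fit_transform(X_train)  # fit on training data only
X_test_s = qt.transform(X_test)        # reuse the same mapping at test time
```

Two caveats worth checking regardless of the scaler: the transformer must be fit on the training split only and reused on validation/test, and a large validation-vs-external-test gap often points to distribution shift between the datasets rather than to the choice of scaler.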

Support Vector Machine regression

I have the following dataset as a small part of a bigger dataset. PM2.5 is the dependent variable, while the other seven columns are the independent variables: AOD, BLH, RH, WS, Prec, Temp and SLP. I am looking to use support vector machine (SVM) multiple regression in Python to find the best-fit multiple-variable regression equation. I would appreciate your help a lot.
PM2.5 AOD BLH RH WS Prec Temp SLP
43.52 0.42 0.39 0.74 1.2 0.4 4.95 1.03
18.4 0.31 0.41 0.71 2.9 0.0 13.4 1.02
53.36 0.30 0.91 0.75 3.21 2.8 17.2 1.01
18.83 0.36 0.29 0.48 1.7 0.6 20.5 1.02
21.2 0.39 0.36 0.52 0.93 0.1 22.0 1.02
12.17 0.15 0.69 0.52 0.55 0.1 18.67 1.01
8.75 0.11 0.42 0.59 4.98 0.1 18.67 1.01
7.7 0.31 0.048 0.52 0.95 0.0 22.44 1.02
6.58 0.05 0.48 0.57 2.75 0.0 32.38 1.02
Data as an xls file is here
Thanks a lot in advance
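A minimal sketch of SVM regression with scikit-learn's SVR, using the nine sample rows from the question (the kernel and hyperparameters below are illustrative defaults, not tuned values):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The nine sample rows from the question.
data = pd.DataFrame(
    [[43.52, 0.42, 0.39, 0.74, 1.20, 0.4, 4.95, 1.03],
     [18.40, 0.31, 0.41, 0.71, 2.90, 0.0, 13.40, 1.02],
     [53.36, 0.30, 0.91, 0.75, 3.21, 2.8, 17.20, 1.01],
     [18.83, 0.36, 0.29, 0.48, 1.70, 0.6, 20.50, 1.02],
     [21.20, 0.39, 0.36, 0.52, 0.93, 0.1, 22.00, 1.02],
     [12.17, 0.15, 0.69, 0.52, 0.55, 0.1, 18.67, 1.01],
     [8.75, 0.11, 0.42, 0.59, 4.98, 0.1, 18.67, 1.01],
     [7.70, 0.31, 0.048, 0.52, 0.95, 0.0, 22.44, 1.02],
     [6.58, 0.05, 0.48, 0.57, 2.75, 0.0, 32.38, 1.02]],
    columns=["PM2.5", "AOD", "BLH", "RH", "WS", "Prec", "Temp", "SLP"])

X = data.drop(columns="PM2.5")
y = data["PM2.5"]

# SVR is sensitive to feature scale, so standardize inside a pipeline.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X, y)
pred = model.predict(X)
```

Note that an RBF-kernel SVR does not yield a single explicit regression equation; if an equation with one coefficient per variable is the goal, use kernel="linear" and read the coefficients from model[-1].coef_ and intercept from model[-1].intercept_.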

get the value from another values if value is nan [duplicate]

I am trying to create a column which contains, for each row, the minimum over a few columns. For example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0, 2):
    df['Minimum'] = df.loc[0, 'B' + str(i)].min()
This is a one-liner, you just need to use the axis argument for min to tell it to work across the columns rather than down:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 3
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
For my tasks, a universal and flexible approach is the following:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x.iloc[0], x.iloc[1], x.iloc[2]), axis=1)
The target column 'Minimum' is assigned the result of the lambda function applied to the selected columns ['B0', 'B1', 'B2']. Inside the function, access the elements through the row alias and its index (when there is more than one element); use .iloc for positional access, since plain integer indexing on a labelled Series is deprecated. Be sure to specify axis=1, which makes the calculation run row by row.
This is very convenient when you need to make complex calculations, although such a solution may be slower.
As for selecting the columns, in addition to the 'for' approach, you can use a filter like this:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
Literally, a filter is applied to the list of DF columns through a lambda function that checks for the occurrence of the letter 'B'. After that, the first example can be written as follows:
cols_to_use = list(filter(lambda f: 'B' in f, df.columns))
df['Minimum'] = df[cols_to_use].apply(lambda x: min(x), axis=1)
although, after pre-selecting the columns, this would be preferable:
df['Minimum'] = df[cols_to_use].min(axis=1)

Line plot for over 1 million datapoints

I am having a hard time producing the line plot I want. My dataset has 23 columns: 21 columns hold the percentage amount paid at steps from 0 to 2 with a step size of 0.1, one column holds the user id of that particular customer, and the last column holds the customer segment he belongs to. I want to plot, for every customer in my dataset, the payment pattern with the 0-2 steps on the x-axis and the percentage paid on the y-axis, coloring each customer's line by the segment he belongs to. My dataset looks like the following:
Id        paid_0.0  paid_0.1  paid_0.2  paid_0.3  paid_0.4  Segment
AC005839 0.30 0.38 0.45 0.53 0.61 Best
AC005842 0.30 0.30 0.52 0.52 0.52 Best
AC005843 0.30 0.38 0.45 0.53 0.61 Best
AC005851 0.24 0.31 0.35 0.35 0.51 Medium
AC005852 0.30 0.38 0.45 0.53 0.61 Best
AC005853 0.30 0.38 0.45 0.53 0.61 Best
AC005856 0.30 0.38 0.45 0.53 0.61 Best
AC005858 0.30 0.38 0.45 0.53 0.54 Best
AC005859 0.33 0.43 0.54 0.65 0.65 Best
I am trying to generate a plot as below:
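A matplotlib sketch of one way to do this, using a few rows from the excerpt above (the color mapping and the 0.0-0.4 steps are assumptions; the full data would use steps up to 2.0). For over a million customers, lowering alpha and linewidth, or plotting a random sample per segment, keeps the figure readable and fast:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Three rows from the dataset excerpt in the question.
df = pd.DataFrame({
    "Id": ["AC005839", "AC005851", "AC005859"],
    "paid_0.0": [0.30, 0.24, 0.33],
    "paid_0.1": [0.38, 0.31, 0.43],
    "paid_0.2": [0.45, 0.35, 0.54],
    "paid_0.3": [0.53, 0.35, 0.65],
    "paid_0.4": [0.61, 0.51, 0.65],
    "Segment": ["Best", "Medium", "Best"],
})

# Recover the x-axis steps from the column names.
paid_cols = [c for c in df.columns if c.startswith("paid_")]
steps = [float(c.split("_")[1]) for c in paid_cols]
colors = {"Best": "tab:blue", "Medium": "tab:orange"}  # assumed palette

fig, ax = plt.subplots()
for _, row in df.iterrows():
    ax.plot(steps, row[paid_cols].to_numpy(dtype=float),
            color=colors[row["Segment"]], alpha=0.4, linewidth=0.8)
ax.set_xlabel("payment step")
ax.set_ylabel("% paid")
```

With ~1M lines, a rasterized backend plus low alpha works, but aggregating per segment (e.g. mean line with a quantile band) or using datashader usually gives a clearer picture than a million overplotted lines.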
