convert txt tables to python dictionary - python-3.x

I have multiple text files which contain some tables , most of the tables are of these two types. I want a way to convert these tables into python dictionaries.
precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734
I have tried this in my code to convert the second table to dictionary but it is not working as expected.:
from itertools import dropwhile, takewhile
with open("idm.txt") as f:
dp = dropwhile(lambda x: not x.startswith("-"), f)
next(dp) # skip ----
names = next(dp).split() # get headers names
next(f) # skip -----
out = []
for line in takewhile(lambda x: not x.startswith("-"), f):
a, b = line.rsplit(None, 1)
out.append(dict(zip(names, a.split(None, 7) + [b])))
Expected output:
{BASE_RENT_ANNUAL: {precision:0.53,recall:0.57,f1-score:0.55,support:1408},
BASE_RENT_MONTHLY: {...}, ..
}

Not the same approach, but the following could be a beginning to your true solution
txt = ''' precision recall f1-score support
BASE_RENT_ANNUAL 0.53 0.57 0.55 1408
BASE_RENT_MONTHLY 0.65 0.54 0.59 3904
BASE_RENT_PSF 0.68 0.59 0.63 1248
RENT_INCREMENT_MONTHLY 0.63 0.44 0.52 7530
SECURITY_DEPOSIT_AMOUNT 0.88 0.89 0.88 3557
micro avg 0.69 0.58 0.63 17647
macro avg 0.67 0.61 0.63 17647
weighted avg 0.68 0.58 0.62 17647
Hard Evaluation Metrics
--------------------------------------------------
Reading predictions from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/training/test_predictions.txt...
Nb tokens in test set: 957800
Reading training data from /mnt/c/Users/Aleksandra/mlbuddy/python/bilstm/corpus/train.txt...
Nb tokens in training set: 211153
Strict mode: OFF
---------------------------------------------------------------------
Test tokens Nb tokens Nb words Nb errors Token error rate
---------------------------------------------------------------------
all 957800 5408 39333 0.0411
---------------------------------------------------------------------
unseen-I 704 19 704 1.0000
unseen-O 59870 1724 10208 0.1705
unseen-all 60574 1743 10912 0.1801
---------------------------------------------------------------------
diff-I 13952 70 13952 1.0000
diff-O 5285 121 4645 0.8789
diff-etype 0 0 0 0.0000
diff-all 19237 191 18597 0.9667
---------------------------------------------------------------------
all-unseen+diff 79811 1934 29509 0.3697
---------------------------------------------------------------------
Avg TER on unseen and diff: 0.5734'''
lst1 = [x.split() for x in txt.split('\n') if x]
lst2 = [(x[0],x[1:]) for x in lst1 if (not x[0].startswith('-') and x[0] == x[0].upper())]
dico = dict(lst2)
dico2 = {}
for k in dico:
dico2[k] = {'precision':dico[k][0],'recall':dico[k][1],'f1-score':dico[k][2],'support':dico[k][3]}
print(dico2)

Related

How to scale dataset with huge difference in stdev for DNN training?

I'm trying to train a DNN model using one dataset with huge difference in stdev. The following scalers were tested but none of them work: MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer. The reason they didn't work was that those models can achieve high predictive performance on the validation sets but they had little predictivity on external test sets. The dataset has more than 10,000 rows and 200 columns. Here are a prt of statistics of the dataset.
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
mean 11.31 -1.04 11.31 0.21 0.55 359.01 337.64 358.58 131.70 0.01 0.09
std 2.72 1.42 2.72 0.24 0.20 139.86 131.40 139.67 52.25 0.14 0.47
min 2.00 -10.98 2.00 0.00 0.02 59.11 50.04 59.07 26.00 0.00 0.00
5% 5.24 -4.07 5.24 0.01 0.19 190.25 178.15 190.10 70.00 0.00 0.00
25% 10.79 -1.35 10.79 0.05 0.41 269.73 254.14 269.16 98.00 0.00 0.00
50% 12.15 -0.64 12.15 0.13 0.58 335.47 316.23 335.15 122.00 0.00 0.00
75% 12.99 -0.21 12.99 0.27 0.72 419.42 394.30 419.01 154.00 0.00 0.00
95% 14.17 0.64 14.17 0.73 0.85 594.71 560.37 594.10 220.00 0.00 1.00
max 19.28 2.00 19.28 5.69 0.95 2924.47 2642.23 2922.13 1168.00 6.00 16.00

MLP classifier results in different models using almost identical data

I am simulating soccer predictions using scikit-learns MLP classifier. Two model trainings using almost identical data (the second one contains 42 more rows out of 5466 total) and configuration (e.g. random_state) results in the below statistics:
2020-09-19 00:00:00
-------------------------------------------MLPClassifier--------------------------------------------
Fitting 3 folds for each of 27 candidates, totalling 81 fits
GridSearchCV:
Best score : 0.5179227897048015
Best params: {'classifier__alpha': 2.4, 'classifier__hidden_layer_sizes': [3, 3], 'preprocessor__num__scaling': StandardScaler(), 'selector': SelectFromModel(estimator=RandomForestClassifier(n_estimators=10,
random_state=42),
threshold='2.1*median'), 'selector__threshold': '2.1*median'}
precision recall f1-score support
A 0.59 0.57 0.58 1550
D 0.09 0.47 0.15 244
H 0.82 0.57 0.67 3143
accuracy 0.57 4937
macro avg 0.50 0.54 0.47 4937
weighted avg 0.71 0.57 0.62 4937
2020-09-26 00:00:00
-------------------------------------------MLPClassifier--------------------------------------------
Fitting 3 folds for each of 27 candidates, totalling 81 fits
GridSearchCV:
Best score : 0.5253689104507783
Best params: {'classifier__alpha': 2.4, 'classifier__hidden_layer_sizes': [3, 3], 'preprocessor__num__scaling': StandardScaler(), 'selector': SelectFromModel(estimator=RandomForestClassifier(n_estimators=10,
random_state=42),
threshold='1.6*median'), 'selector__threshold': '1.6*median'}
precision recall f1-score support
A 0.62 0.57 0.59 1611
D 0.00 0.00 0.00 0
H 0.86 0.57 0.69 3336
accuracy 0.57 4947
macro avg 0.49 0.38 0.43 4947
weighted avg 0.78 0.57 0.66 4947
How is that possible, that one model never predicts D, while the other one does? I am trying to understand, what's going here. I am afraid, posting the whole problem/code is not possible, so I am looking for a generic answer. I have this behaviour (D's <-> no D's) throughout 38 observations.

How to log a table of metrics into mlflow

I am trying to see if mlflow is the right place to store my metrics in the model tracking. According to the doc log_metric takes either a key value or a dict of key-values. I am wondering how to log something like below into mlflow so it can be visualized meaningfully.
precision recall f1-score support
class1 0.89 0.98 0.93 174
class2 0.96 0.90 0.93 30
class3 0.96 0.90 0.93 30
class4 1.00 1.00 1.00 7
class5 0.93 1.00 0.96 13
class6 1.00 0.73 0.85 15
class7 0.95 0.97 0.96 39
class8 0.80 0.67 0.73 6
class9 0.97 0.86 0.91 37
class10 0.95 0.81 0.88 26
class11 0.50 1.00 0.67 5
class12 0.93 0.89 0.91 28
class13 0.73 0.84 0.78 19
class14 1.00 1.00 1.00 6
class15 0.45 0.83 0.59 6
class16 0.97 0.98 0.97 245
class17 0.93 0.86 0.89 206
accuracy 0.92 892
macro avg 0.88 0.90 0.88 892
weighted avg 0.93 0.92 0.92 892

get the value from another values if value is nan [duplicate]

I am trying to create a column which contains only the minimum of the one row and a few columns, for example:
A0 A1 A2 B0 B1 B2 C0 C1
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72
Here I am trying to create a column which contains the minimum for each row of columns B0, B1, B2.
The output would look like this:
A0 A1 A2 B0 B1 B2 C0 C1 Minimum
0 0.84 0.47 0.55 0.46 0.76 0.42 0.24 0.75 0.42
1 0.43 0.47 0.93 0.39 0.58 0.83 0.35 0.39 0.39
2 0.12 0.17 0.35 0.00 0.19 0.22 0.93 0.73 0.00
3 0.95 0.56 0.84 0.74 0.52 0.51 0.28 0.03 0.51
4 0.73 0.19 0.88 0.51 0.73 0.69 0.74 0.61 0.51
5 0.18 0.46 0.62 0.84 0.68 0.17 0.02 0.53 0.17
6 0.38 0.55 0.80 0.87 0.01 0.88 0.56 0.72 0.01
Here is part of the code, but it is not doing what I want it to do:
for i in range(0,2):
df['Minimum'] = df.loc[0,'B'+str(i)].min()
This is a one-liner, you just need to use the axis argument for min to tell it to work across the columns rather than down:
df['Minimum'] = df.loc[:, ['B0', 'B1', 'B2']].min(axis=1)
If you need to use this solution for different numbers of columns, you can use a for loop or list comprehension to construct the list of columns:
n_columns = 2
cols_to_use = ['B' + str(i) for i in range(n_columns)]
df['Minimum'] = df.loc[:, cols_to_use].min(axis=1)
For my tasks a universal and flexible approach is the following example:
df['Minimum'] = df[['B0', 'B1', 'B2']].apply(lambda x: min(x[0],x[1],x[2]), axis=1)
The target column 'Minimum' is assigned the result of the lambda function based on the selected DF columns['B0', 'B1', 'B2']. Access elements in a function through the function alias and his new Index(if count of elements is more then one). Be sure to specify axis=1, which indicates line-by-line calculations.
This is very convenient when you need to make complex calculations.
However, I assume that such a solution may be inferior in speed.
As for the selection of columns, in addition to the 'for' method, I can suggest using a filter like this:
calls_to_use = list(filter(lambda f:'B' in f, df.columns))
literally, a filter is applied to the list of DF columns through a lambda function that checks for the occurrence of the letter 'B'.
after that the first example can be written as follows:
calls_to_use = list(filter(lambda f:'B' in f, df.columns))
df['Minimum'] = df[calls_to_use].apply(lambda x: min(x), axis=1)
although after pre-selecting the columns, it would be preferable:
df['Minimum'] = df[calls_to_use].min(axis=1)

Is it possible to run a Cox-Proportional-Hazards-Model with an exponential distribution for the baseline hazard in `lifelines` or another package?

I consider using the lifelines package to fit a Cox-Proportional-Hazards-Model. I read that lifelines uses a nonparametric approach to fit the baseline hazard, which results in different baseline_hazards for some time points (see code example below). For my application, I need an
exponential distribution leading to a baseline hazard h0(t) = lambda which is constant across time.
So my question is: is it (in the meantime) possible to run a Cox-Proportional-Hazards-Model with an exponential distribution for the baseline hazard in lifelines or another Python package?
Example code:
from lifelines import CoxPHFitter
import pandas as pd
df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
'event': [0, 0, 0, 1, 1, 1],
'cat': [0, 1, 0, 1, 0, 1]})
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.baseline_hazard_
gives
baseline hazard
T
4.0 0.160573
5.0 0.278119
6.0 0.658032
👋lifelines author here.
So, this model is not natively in lifelines, but you can easily implement it yourself (and maybe something I'll do for a future release). This idea relies on the intersection of proportional hazard models and AFT (accelerated failure time) models. In the cox-ph model with exponential hazard (i.e. constant baseline hazard), the hazard looks like:
h(t|x) = lambda_0(t) * exp(beta * x) = lambda_0 * exp(beta * x)
In the AFT specification for an exponential distribution, the hazard looks like:
h(t|x) = exp(-beta * x - beta_0) = exp(-beta * x) * exp(-beta_0) = exp(-beta * x) * lambda_0
Note the negative sign difference!
So instead of doing a CoxPH, we can do an Exponential AFT fit (and flip the signs if we want the same interpretation as the CoxPH). We can use the custom regession model syntax to do this:
from lifelines.fitters import ParametricRegressionFitter
from autograd import numpy as np
class ExponentialAFTFitter(ParametricRegressionFitter):
# this is necessary, and should always be a non-empty list of strings.
_fitted_parameter_names = ['lambda_']
def _cumulative_hazard(self, params, T, Xs):
# params is a dictionary that maps unknown parameters to a numpy vector.
# Xs is a dictionary that maps unknown parameters to a numpy 2d array
lambda_ = np.exp(np.dot(Xs['lambda_'], params['lambda_']))
return T / lambda_
Testing this,
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi = load_rossi()
rossi['intercept'] = 1
regressors = {'lambda_': rossi.columns}
eaf = ExponentialAFTFitter().fit(rossi, "week", "arrest", regressors=regressors)
eaf.print_summary()
"""
<lifelines.ExponentialAFTFitter: fitted with 432 observations, 318 censored>
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -686.37
time fit was run = 2019-06-27 15:13:18 UTC
---
coef exp(coef) se(coef) z p -log2(p) lower 0.95 upper 0.95
lambda_ fin 0.37 1.44 0.19 1.92 0.06 4.18 -0.01 0.74
age 0.06 1.06 0.02 2.55 0.01 6.52 0.01 0.10
race -0.30 0.74 0.31 -0.99 0.32 1.63 -0.91 0.30
wexp 0.15 1.16 0.21 0.69 0.49 1.03 -0.27 0.56
mar 0.43 1.53 0.38 1.12 0.26 1.93 -0.32 1.17
paro 0.08 1.09 0.20 0.42 0.67 0.57 -0.30 0.47
prio -0.09 0.92 0.03 -3.03 <0.005 8.65 -0.14 -0.03
_intercept 4.05 57.44 0.59 6.91 <0.005 37.61 2.90 5.20
_fixed _intercept 0.00 1.00 0.00 nan nan nan 0.00 0.00
---
"""
CoxPHFitter().fit(load_rossi(), 'week', 'arrest').print_summary()
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
partial log-likelihood = -658.75
time fit was run = 2019-06-27 15:17:41 UTC
---
coef exp(coef) se(coef) z p -log2(p) lower 0.95 upper 0.95
fin -0.38 0.68 0.19 -1.98 0.05 4.40 -0.75 -0.00
age -0.06 0.94 0.02 -2.61 0.01 6.79 -0.10 -0.01
race 0.31 1.37 0.31 1.02 0.31 1.70 -0.29 0.92
wexp -0.15 0.86 0.21 -0.71 0.48 1.06 -0.57 0.27
mar -0.43 0.65 0.38 -1.14 0.26 1.97 -1.18 0.31
paro -0.08 0.92 0.20 -0.43 0.66 0.59 -0.47 0.30
prio 0.09 1.10 0.03 3.19 <0.005 9.48 0.04 0.15
---
Concordance = 0.64
Log-likelihood ratio test = 33.27 on 7 df, -log2(p)=15.37
"""
Notice the sign change! So if you want the constant baseline hazard in the model, it's exp(-4.05).

Resources