How to change prediction in H2O GBM and DRF - python-3.x

I am building classification models with H2O DRF and GBM. I want to change how the predicted probability maps to a label, so that if p0 < 0.2 then predict = 0, else predict = 1.

Currently, you need to do this manually. It would be easier if we had a threshold argument for the predict() method, so I created a JIRA ticket to make this a bit more straightforward.
See the Python example below of how to do this manually.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train a GBM (no cross-validation is configured here)
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
my_gbm.train(x=x, y=y, training_frame=train)
# Predict on a test set using default threshold
pred = my_gbm.predict(test_data=test)
Look at the pred frame:
In [16]: pred.tail()
Out[16]:
predict p0 p1
--------- -------- --------
1 0.484712 0.515288
0 0.693893 0.306107
1 0.319674 0.680326
0 0.582344 0.417656
1 0.471658 0.528342
1 0.079922 0.920078
1 0.150146 0.849854
0 0.835288 0.164712
0 0.639877 0.360123
1 0.54377 0.45623
[10 rows x 3 columns]
Here's how to manually create the predictions you want. More info on how to slice H2OFrames is available in the H2O User Guide.
# Binary column which is 1 if >=0.2 and 0 if <0.2
newpred = pred["p1"] >= 0.2
newpred.tail()
Look at the binary column:
In [23]: newpred.tail()
Out[23]:
p1
----
1
1
1
1
1
1
1
0
1
1
[10 rows x 1 column]
Now you have the predictions you want. You could also replace the "predict" column with the new predicted labels.
pred["predict"] = newpred
Now re-examine the pred frame:
In [24]: pred.tail()
Out[24]:
predict p0 p1
--------- -------- --------
1 0.484712 0.515288
1 0.693893 0.306107
1 0.319674 0.680326
1 0.582344 0.417656
1 0.471658 0.528342
1 0.079922 0.920078
1 0.150146 0.849854
0 0.835288 0.164712
1 0.639877 0.360123
1 0.54377 0.45623
[10 rows x 3 columns]
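The same thresholding can be checked outside the cluster. The following is a minimal sketch, assuming the p1 column has been pulled into Python (for example via pred["p1"].as_data_frame()); the helper name apply_threshold is hypothetical and not part of H2O. It reuses the p1 values printed above.

```python
import numpy as np

def apply_threshold(p1, threshold=0.2):
    """Return 1 where p1 >= threshold, else 0 (hypothetical helper, not an H2O API)."""
    return (np.asarray(p1) >= threshold).astype(int)

# p1 values copied from the pred.tail() output above
p1 = [0.515288, 0.306107, 0.680326, 0.417656, 0.528342,
      0.920078, 0.849854, 0.164712, 0.360123, 0.45623]

labels = apply_threshold(p1, threshold=0.2)
print(labels.tolist())
```

Only the row with p1 = 0.164712 falls below the 0.2 cutoff, which agrees with the binary column shown earlier.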

Related

Stratifying folds with StratifiedKFold in sklearn

I do not understand very well the logic behind sklearn function train_test_split and StratifiedKFold for obtaining balanced splits according to multiple "columns" and not only according to the target distribution. I know the previous sentence is a bit obscure so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset. The target distribution is governed by prob; in this case we have 20% positive examples. There is a binary categorical column cat, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
If we stratify on both target and cat with train_test_split() and study the frequencies:
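from sklearn.model_selection import train_test_split, StratifiedKFold
# frequencies in the full dataset (these prints produce the "* dataset" block of the output)
print("* dataset")
print(ds.target.value_counts() / len(ds))
print(ds[["target", "cat"]].value_counts() / len(ds))
# with train_test_split
training, valid = train_test_split(range(n_samples),
                                   test_size=20,
                                   stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training)) # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid)) # balanced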
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except Exception:
    print("! does not work")
for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain what I got with train_test_split with StratifiedKFold? I know there might be data distributions not allowing such stratifications in k-fold cross validation, but I cannot understand why train_test_split accepts two or more columns and the other method does not.
This doesn't seem readily possible currently.
Multilabel isn't exactly what you're looking for, but related. That's been asked here before, and was an Issue on sklearn's github (not sure why it got closed).
As a bit of a hack, you should be able to just combine your two columns into a new one with ordered pairs, and stratify on that?
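The hack suggested above can be sketched as follows. This is a minimal example under stated assumptions: the frame is rebuilt deterministically with the same 40/40/10/10 proportions as in the question, and the column name strata is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame with the same joint proportions as in the question
ds = pd.DataFrame({
    "target": [0] * 80 + [1] * 20,
    "cat": ["a"] * 40 + ["b"] * 40 + ["a"] * 10 + ["b"] * 10,
})

# Combine both columns into a single label per row, e.g. "0_a" or "1_b"
ds["strata"] = ds["target"].astype(str) + "_" + ds["cat"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_idx, valid_idx in skf.split(X=np.zeros(len(ds)), y=ds["strata"]):
    # Count how many rows of each combined label land in this validation fold
    counts = ds.loc[valid_idx, "strata"].value_counts().sort_index().to_dict()
    fold_counts.append(counts)
    print(counts)
```

Since StratifiedKFold only sees one label column, it treats each (target, cat) pair as its own class. This works as long as every pair has at least n_splits members.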

How to apply accuracy_score function to two columns in group by

I have the following data frame:
wn Ground_truth Prediction
A 1 1
A 1 1
A 1 0
A 1 1
B 0 1
B 1 1
B 0 0
For each group (A, B) I would like to calculate accuracy_score(Ground_truth, Prediction).
Specifically for accuracy you can actually do something simpler:
df.assign(x=df['Ground_truth']==df['Prediction']).groupby('wn')['x'].mean()
You can use the accuracy_score function from sklearn; see the sklearn documentation for details.
from sklearn.metrics import accuracy_score
ground_truth = df["Ground_truth"].values
predictions = df["Prediction"].values
accuracy = accuracy_score(ground_truth, predictions)
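The snippet above scores the whole frame at once. To get one score per group, as the question asks, accuracy_score can be called on each group in turn. A minimal sketch, with the data frame rebuilt from the table in the question:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Data frame rebuilt from the table in the question
df = pd.DataFrame({
    "wn": ["A", "A", "A", "A", "B", "B", "B"],
    "Ground_truth": [1, 1, 1, 1, 0, 1, 0],
    "Prediction": [1, 1, 0, 1, 1, 1, 0],
})

# One accuracy value per group, keyed by the group name
per_group = {
    name: accuracy_score(group["Ground_truth"], group["Prediction"])
    for name, group in df.groupby("wn")
}
print(per_group)
```

Group A has 3 of 4 predictions correct and group B has 2 of 3, so this should agree with the groupby/mean approach.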

Multivariate binary sequence prediction with CRF

This question is an extension of this one, which focuses on LSTM as opposed to CRF. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
Problem:
I would like to predict a sequence of binary signal for multiple, non-independent groups. My dataset is moderately small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
Timestamps
Group
Binary signal representing activity
Using this dataset I would like to forecast group_a_activity and group_b_activity which are both 0 or 1.
Note that the groups are believed to be cross-correlated and additional features can be extracted from timestamps -- for simplicity we can assume that there is only 1 feature we extract from the timestamps.
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame # create a sample dataframe
.from_records(np.random.randint(2, size=[data_length, 3]))
.rename(columns={0:'a', 1:'b', 2:'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last' # 'next' variables is what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
Before we get to the CRF part, I suspect that I cannot approach this problem from a multi-task learning point of view (predicting patterns for both A and B via one model) and therefore I'm going to have to predict each of them individually.
Now the CRF part. I've found some relevant examples (here is one) but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# lets focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if 'next' not in c]
current_predictors = set([re.sub(r'_\d+$','',v) for v in current_predictors])
for index, row in df.iterrows():
    # not sure if its an effective way to iterate over a DF...
    iter_features = []
    for p in current_predictors:
        pred_feature = []
        # note that 0/1 values have to be converted into booleans
        for k in range(shift_length):
            iter_pred_feature = p + '_{0:02d}'.format(k+1)
            pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
        iter_features.append(pred_feature)
    iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
    crf_labels.extend(iter_response)
    crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 0.0, # coefficient for L1 penalty
    'c2': 0.0, # coefficient for L2 penalty
    'max_iterations': 10, # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
It seems that I did manage to get it working, but I'm not sure if I've approached it correctly. I'll formulate my questions in the Questions section, but first, here is an alternative approach using keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a','b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$','',c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
Questions:
My main question is whether or not I've constructed both of my CRF models correctly. What worries me is that (1) there is not a lot of documentation out there on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested and (4) when used in a multi-tasked fashion, I'm not sure if it is valid.
I have a few extra questions as well:
Is a CRF appropriate for this problem?
How are the 2 approaches (one based on pycrfsuite and one based on keras_contrib) different and what are their advantages/disadvantages?
In a more general sense, what is the advantage of combining CRF and LSTM models into one (like the one discussed here)?
Many thanks!
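One detail in the keras_contrib data preparation above may be worth checking. The comment there describes moving axes to (samples, time stamps, channels), but np.reshape only reinterprets the flat buffer and does not reorder axes. np.transpose does. A minimal sketch with a toy array (the shapes here are made up for illustration):

```python
import numpy as np

# Toy array laid out as (channels, samples, timesteps) = (2, 4, 3)
arr = np.arange(2 * 4 * 3).reshape(2, 4, 3)

reshaped = np.reshape(arr, (4, 3, 2))      # same flat order, new shape
transposed = np.transpose(arr, (1, 2, 0))  # channel axis moved to the end

# For sample 0, timestep 0 the two channel values are arr[0, 0, 0] and arr[1, 0, 0]
print(transposed[0, 0])  # values taken from both channels
print(reshaped[0, 0])    # first two entries of the flat buffer instead
```

Both results have shape (4, 3, 2), so the model accepts either one, but only the transposed array keeps each sample's values together.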

Ensemble model in H2O with fold_column argument

I am new to H2O in python. I am trying to model my data using ensemble model following the example codes from H2O's web site. (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html)
I have applied GBM and RF as base models. And then using stacking, I tried to merge them in ensemble model. In addition, in my training data I created one additional column named 'fold' to be used in fold_column = "fold"
I applied 10-fold CV and received results from the first fold (cv1). However, the predictions from the other 9 folds are all empty. What am I missing here?
Here is my sample data:
code:
from __future__ import print_function  # must come before any other import
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init(port=23, nthreads=6)
train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)
x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'
cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()
my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
ntrees=10,
max_depth=3,
min_rows=2,
learn_rate=0.2,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column = "fold")
Then when I check cv results with
my_gbm.cross_validation_predictions():
Plus when I try the ensemble in the test set I get the warning below:
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
pred = ensemble.predict(test)
pred
/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
warnings.warn(w)
Am I missing something about fold_column?
Here is an example of how to use a custom fold column (created from a list). This is a modified version of the example Python code in the Stacked Ensemble page in the H2O User Guide.
from __future__ import print_function  # must come before any other import
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Add a fold column, generate from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)
# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
ntrees=10,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
keep_cross_validation_predictions=True,
seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
To answer your second question, about how to view the cross-validated predictions in a model: they are stored in two places. The method that you probably want to use is .cross_validation_holdout_predictions(). This method returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:
In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
1 0.248131 0.751869
1 0.288241 0.711759
1 0.407768 0.592232
1 0.507294 0.492706
0 0.6417 0.3583
1 0.253329 0.746671
1 0.289916 0.710084
1 0.524328 0.475672
1 0.252006 0.747994
[10000 rows x 3 columns]
The second method, .cross_validation_predictions(), returns a list that stores the predictions from each fold in an H2OFrame with the same number of rows as the original training frame; the rows that are not active in that fold have a value of zero. This is not usually the format that people find most useful, so I'd recommend using the other method instead.
In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list
In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10
In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
[10000 rows x 3 columns]
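To make the relationship between the two formats concrete, here is a toy illustration in pandas. It does not call H2O; the fold assignments and p1 values are made up, and it assumes the zero-padded per-fold layout described above, where each row is held out in exactly one fold.

```python
import pandas as pd

# Hypothetical fold assignments and holdout p1 values for 6 training rows
fold_id = [0, 1, 2, 0, 1, 2]
holdout_p1 = [0.68, 0.75, 0.71, 0.59, 0.49, 0.36]

# One zero-padded column per fold, mimicking the per-fold list layout
per_fold = [
    pd.Series([p if f == k else 0.0 for p, f in zip(holdout_p1, fold_id)])
    for k in range(3)
]

# Each row is non-zero in exactly one fold, so summing recovers a single column
combined = sum(per_fold)
print(combined.tolist())
```

Under that assumption, the single holdout frame carries the same information as the per-fold list, in a form that is easier to work with.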

How should we interpret the results of the H2O predict function?

I have trained and stored a random forest binary classification model. Now I'm trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is:
import h2o
import pandas as pd
import sys
localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()
model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)
new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))
predict = model.predict(new_data) # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2] # probability the prediction is a "1"
print('prediction: ', predicted, ', probability: ', probability)
When I run this code I get:
>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ------------------------------
H2O cluster uptime: 22 hours 22 mins
H2O cluster version: 3.10.5.4
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_Charles_0fqq0c
H2O cluster total nodes: 1
H2O cluster free memory: 6.790 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.1 final
-------------------------- ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
BoxRatio Thrust Velocity OnBalRun vwapGain
---------- -------- ---------- ---------- ----------
1.502 55.044 0.38 37 0.845
[1 row x 5 columns]
>>> predict = model.predict(new_data) # predict returns a data frame
drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3
predict p0 p1
------- --------- ------------------ -------------------
type enum real real
mins 0.8849431818181818 0.11505681818181818
mean 0.8849431818181818 0.11505681818181818
maxs 0.8849431818181818 0.11505681818181818
sigma 0.0 0.0
zeros 0 0
missing 0 0 0
0 1 0.8849431818181818 0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2] # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction: 1 , probability: 0.11505681818181818
>>>
I am confused by the contents of the "predict" data frame. Please tell me what the numbers in the columns labeled "p0" and "p1" mean. I hope they are probabilities, and as you can see by my code, I am trying to get the predicted classification (0 or 1) and a probability that this classification is correct. Does my code correctly do that?
Any comments will be greatly appreciated.
Charles
p0 is the probability (between 0 and 1) that class 0 is chosen.
p1 is the probability (between 0 and 1) that class 1 is chosen.
The thing to keep in mind is that the "prediction" is made by applying a threshold to p1. That threshold point is chosen depending on whether you want to reduce false positives or false negatives. It's not just 0.5.
The threshold chosen for "the prediction" is max-F1. But you can extract out p1 yourself and threshold it any way you like.
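To illustrate what a max-F1 threshold means, here is a minimal sketch in NumPy on made-up labels and probabilities. It is not H2O's implementation, and the function name max_f1_threshold is hypothetical. It tries each observed p1 value as a cutoff and keeps the one with the highest F1.

```python
import numpy as np

def max_f1_threshold(y_true, p1):
    """Return (threshold, f1) maximizing F1 when predicting 1 for p1 >= threshold."""
    y_true = np.asarray(y_true)
    p1 = np.asarray(p1)
    best_t, best_f1 = None, -1.0
    for t in np.unique(p1):
        y_pred = p1 >= t
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Made-up example: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 0, 1, 1]
p1 = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

best_t, best_f1 = max_f1_threshold(y_true, p1)
print(best_t, best_f1)
```

In this toy data the best cutoff is 0.35, well below 0.5. This shows how a row can be labeled 1 even when its p1 is small, if the chosen threshold is smaller still.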
Darren Cook asked me to post the first few lines of my training data. Here it is:
BoxRatio Thrust Velocity OnBalRun vwapGain Altitude
0 0.000 0.000 2.186 4.534 0.361 1
1 0.000 0.000 0.561 2.642 0.909 1
2 2.824 2.824 2.199 4.748 1.422 1
3 0.442 0.452 1.702 3.695 1.186 0
4 0.084 0.088 0.612 1.699 0.700 1
The response column is labeled "Altitude". Class 1 is what I want to see from new "out-of-sample" data. "1" is good, and it means that "Altitude" was reached (true positive). "0" means that "Altitude" was not reached (true negative). In the predict table above, "1" was predicted with a probability of 0.11505681818181818. This does not make sense to me.
Charles
