h2o vs scikit-learn confusion matrix - python-3.x

Is anyone able to match the sklearn confusion matrix to H2O's? They never match.
Doing something similar with Keras produces a perfect match, but in H2O the counts are always off. I've tried it every which way.
I borrowed some code from: Any difference between H2O and Scikit-Learn metrics scoring?
# In[30]:
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# In[31]:
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# In[32]:
# Generate predictions on a test set
pred = model.predict(test)
# In[33]:
from sklearn.metrics import roc_auc_score, confusion_matrix
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
#pred_df.head()
# In[36]:
y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))
# In[37]:
print(cm)
      0     1
0  1354   961
1   540  2145
# In[38]:
model.model_performance(test).confusion_matrix()
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.353664307031828:
       0       1       Error   Rate
0      964.0   1351.0  0.5836  (1351.0/2315.0)
1      274.0   2411.0  0.102   (274.0/2685.0)
Total  1238.0  3762.0  0.325   (1625.0/5000.0)
# In[39]:
h2o.cluster().shutdown()

This does the trick, thanks for the hunch, Vivek. It's still not an exact match, but it's extremely close.
perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
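For a closer side-by-side check, here is a minimal sketch (building on the model, test frame, and pred defined above) that applies the same max-F1 threshold to the p1 column before calling sklearn's confusion_matrix; if a few counts still differ, probabilities sitting exactly at the threshold are the usual suspect:
from sklearn.metrics import confusion_matrix
perf_test = model.model_performance(test)
threshold = perf_test.find_threshold_by_max_metric('f1')
# H2O's matrix at that threshold
print(perf_test.confusion_matrix(thresholds=threshold))
# sklearn's matrix, thresholding the predicted probability of class 1
y_true = test[y].as_data_frame().values.ravel()
y_hat = (pred.as_data_frame()['p1'] >= threshold).astype(int)  # assumes ties count as class 1
print(confusion_matrix(y_true, y_hat))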

I ran into the same issue. Here is what I would do to make a fair comparison:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)
Since we train the model with both a training frame and a validation frame, pred['predict'] will use the threshold that maximizes the F1 score on the validation data. To verify, one can use these lines:
threshold = perf.find_threshold_by_max_metric(metric='F1', valid=True)
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)
To get the corresponding confusion matrix from scikit-learn:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true, pred_df['predict'])
In my case I still don't understand why I get slightly different results, for example:
print(cm1)
>> [[3063 176]
[ 94 146]]
print(cm2)
>> [[3063 176]
[ 95 145]]

Related

Fewer than expected purity scores in PCA analysis

I'm trying to plot a line graph of purity scores against the captured variances in PCA. The goal is to plot purity scores against the captured variances of 89% and 99% only. In my code, when the number of components/dimensions is 2 it captures 89% of the variance, and when it is 4 it captures 99% of the variance.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans  # needed for the clustering step below
from sklearn import metrics         # needed by purity_score below
df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class", axis=1)  # feature matrix
Y10_df = df["Class"]               # target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)
scaler = StandardScaler()  # standardizing the data
df_std = scaler.fit_transform(X10_df)
pca = PCA()
pca.fit(df_std)
purity = []
n_comp = range(2, 5)
for k in n_comp:
    pca = PCA(n_components=k)
    pca.fit(df_std)
    scores_pca = pca.transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                        n_init=10, random_state=0)
    pred_y12 = kmeans_pca.fit_predict(scores_pca)
    purity13 = purity_score(Y10_df, pred_y12)
    purity.append(purity13)
The function below calculates the purity score:
def purity_score(y_true, y_pred):
    # contingency matrix of true labels vs. predicted cluster labels
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only get three purity scores. I expected four purity scores so that I could plot variance vs. purity.
Why are there only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi
This is simply because when you loop over a range, the end of the range is excluded. So range(2, 5) yields 2, 3, 4 and then exits the loop. Please read up on for loops and range in Python.
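As a sketch of the fix (reusing df_std, Y10_df, and purity_score from the question), extend the upper bound of the range so a fourth value of k is included:
purity = []
for k in range(2, 6):  # range(2, 6) yields k = 2, 3, 4, 5 -- four iterations
    pca = PCA(n_components=k)
    scores_pca = pca.fit_transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                        n_init=10, random_state=0)
    purity.append(purity_score(Y10_df, kmeans_pca.fit_predict(scores_pca)))
print(len(purity))  # 4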

Find wrongly categorized samples from validation step

I am using a Keras neural net to identify the category to which the data belongs.
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy])
Fit function
history = self.model.fit(self.X,
                         {'output': self.Y},
                         validation_split=0.3,
                         epochs=400,
                         batch_size=32)
I am interested in finding out which labels are being categorized wrongly in the validation step. It seems like a good way to understand what is happening under the hood.
You can use model.predict_classes(validation_data) to get the predicted classes for your validation data, and compare these predictions with the actual labels to find out where the model was wrong. Something like this:
predictions = model.predict_classes(validation_data)
wrong = np.where(predictions != Y_validation)
If you are interested in looking 'under the hood', I'd suggest to use
model.predict(validation_data_x)
to see the scores for each class, for each observation of the validation set.
This should shed some light on which categories the model is not so good at classifying. The way to predict the final class is
scores = model.predict(validation_data_x)
preds = np.argmax(scores, axis=1)
Be sure to use the proper axis for np.argmax (I'm assuming the class scores are along axis 1). Then compare preds with the real classes.
Also, as another exploration, if you want to see the overall accuracy on this dataset, use
model.evaluate(x=validation_data_x, y=validation_data_y)
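Since validation_split=0.3 holds out the last 30% of the arrays passed to fit (before any shuffling), a rough sketch for listing the misclassified validation samples, assuming X is the input array, Y the one-hot label array, and model the trained model from the question (self.X, self.Y, self.model there):
import numpy as np
split = int(len(X) * (1 - 0.3))         # fit() held out the last 30% as validation data
X_val, Y_val = X[split:], Y[split:]
scores = model.predict(X_val)           # per-class scores, shape (n_val, n_classes)
preds = np.argmax(scores, axis=1)       # predicted class index per sample
true = np.argmax(Y_val, axis=1)         # labels are one-hot, so argmax recovers them
wrong_idx = np.where(preds != true)[0]  # positions of misclassified validation samples
print(len(wrong_idx), 'misclassified out of', len(X_val))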
I ended up creating a metric which prints the "worst performing category id + score" on each iteration. Ideas from link
import tensorflow as tf
import numpy as np

class MaxIoU(object):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def max_iou(self, y_true, y_pred):
        # Wraps np_max_iou method and uses it as a TensorFlow op.
        # Takes numpy arrays as its arguments and returns numpy arrays as
        # its outputs.
        return tf.py_func(self.np_max_iou, [y_true, y_pred], tf.float32)

    def np_max_iou(self, y_true, y_pred):
        # Compute the confusion matrix to get the number of true positives,
        # false positives, and false negatives
        # Convert predictions and target from categorical to integer format
        target = np.argmax(y_true, axis=-1).ravel()
        predicted = np.argmax(y_pred, axis=-1).ravel()
        # Trick from torchnet for bincounting 2 arrays together
        # https://github.com/pytorch/tnt/blob/master/torchnet/meter/confusionmeter.py
        x = predicted + self.num_classes * target
        bincount_2d = np.bincount(x.astype(np.int32), minlength=self.num_classes**2)
        assert bincount_2d.size == self.num_classes**2
        conf = bincount_2d.reshape((self.num_classes, self.num_classes))
        # Compute the IoU and mean IoU from the confusion matrix
        true_positive = np.diag(conf)
        false_positive = np.sum(conf, 0) - true_positive
        false_negative = np.sum(conf, 1) - true_positive
        # Just in case we get a division by 0, ignore/hide the error and set the value to 0
        with np.errstate(divide='ignore', invalid='ignore'):
            iou = false_positive / (true_positive + false_positive + false_negative)
            iou[np.isnan(iou)] = 0
        return np.max(iou).astype(np.float32) + np.argmax(iou).astype(np.float32)
usage:
custom_metric = MaxIoU(len(catagories))
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy, custom_metric.max_iou])
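Since max_iou packs the category id and its score into a single float (np.max(iou) + np.argmax(iou)), the printed value can be decoded as below, as long as the score itself stays below 1.0 (the value shown is hypothetical):
value = 3.42                 # hypothetical metric value printed during training
category_id = int(value)     # integer part -> category index (3)
score = value - category_id  # fractional part -> that category's score (0.42)
print(category_id, round(score, 2))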

Tensorflow Extracting Classification Predictions

I have a TensorFlow NN model for classification of one-hot-encoded group labels (the groups are mutually exclusive), which ends with the following (layerActivs[-1] being the activations of the final layer):
probs = sess.run(tf.nn.softmax(layerActivs[-1]),...)
classes = sess.run(tf.round(probs))
preds = sess.run(tf.argmax(classes))
The tf.round is included to force any low probabilities to 0. If all probabilities are below 50% for an observation, this means that no class will be predicted. I.e., if there are 4 classes, we could have probs[0,:] = [0.2, 0, 0, 0.4], so classes[0,:] = [0, 0, 0, 0]; preds[0] = 0 follows.
Obviously this is ambiguous, as it is the same result that would occur if we had probs[1,:] = [0.9, 0, 0.1, 0] -> classes[1,:] = [1, 0, 0, 0] -> preds[1] = 0. This is a problem when using the TensorFlow built-in metrics class, as the functions can't distinguish between no prediction and a prediction in class 0. This is demonstrated by this code:
import numpy as np
import tensorflow as tf
import pandas as pd
''' prepare '''
classes = 6
n = 100
# simulate data
np.random.seed(42)
simY = np.random.randint(0,classes,n) # pretend actual data
simYhat = np.random.randint(0,classes,n) # pretend pred data
truth = np.sum(simY == simYhat)/n
tabulate = pd.Series(simY).value_counts()
# create placeholders
lab = tf.placeholder(shape=simY.shape, dtype=tf.int32)
prd = tf.placeholder(shape=simY.shape, dtype=tf.int32)
AM_lab = tf.placeholder(shape=simY.shape,dtype=tf.int32)
AM_prd = tf.placeholder(shape=simY.shape,dtype=tf.int32)
# create one-hot encoding objects
simYOH = tf.one_hot(lab,classes)
# create accuracy objects
acc = tf.metrics.accuracy(lab,prd) # real accuracy with tf.metrics
accOHAM = tf.metrics.accuracy(AM_lab,AM_prd) # OHE argmaxed to labels - expected to be correct
# now setup to pretend we ran a model & generated OHE predictions all unclassed
z = np.zeros(shape=(n,classes),dtype=float)
testPred = tf.constant(z)
''' run it all '''
# setup
sess = tf.Session()
sess.run([tf.global_variables_initializer(),tf.local_variables_initializer()])
# real accuracy with tf.metrics
ACC = sess.run(acc,feed_dict = {lab:simY,prd:simYhat})
# OHE argmaxed to labels - expected to be correct, but is it?
l,p = sess.run([simYOH,testPred],feed_dict={lab:simY})
p = np.argmax(p,axis=-1)
ACCOHAM = sess.run(accOHAM,feed_dict={AM_lab:simY,AM_prd:p})
sess.close()
''' print stuff '''
print('Accuracy')
print('-known truth: %0.4f'%truth)
print('-on unprocessed data: %0.4f'%ACC[1])
print('-on faked unclassed labels data (s.b. 0%%): %0.4f'%ACCOHAM[1])
print('----------\nTrue Class Freqs:\n%r'%(tabulate.sort_index()/n))
which has the output:
Accuracy
-known truth: 0.1500
-on unprocessed data: 0.1500
-on faked unclassed labels data (s.b. 0%): 0.1100
----------
True Class Freqs:
0 0.11
1 0.19
2 0.11
3 0.25
4 0.17
5 0.17
dtype: float64
Note that the frequency for class 0 is the same as the faked accuracy.
I experimented with setting a value of preds to np.nan for observations with no predictions, but tf.metrics.accuracy throws ValueError: cannot convert float NaN to integer; also tried np.inf but got OverflowError: cannot convert float infinity to integer.
How can I convert the rounded probabilities to class predictions, but appropriately handle unpredicted observations?
This has gone long enough without an answer, so I'll post my solution here as the answer. I convert class-membership probabilities to class predictions with a new function that has 3 main steps:
set any NaN probabilities to 0
set any probabilities below 1/num_classes to 0
use np.argmax() to extract predicted classes, then set any unclassed observations to a uniformly selected class
The resultant vector of integer class labels can be passed to the tf.metrics functions. My function is below:
def predFromProb(classProbs):
    '''
    Take in as input an (m x p) matrix of m observations' class probabilities in
    p classes and return an m-length vector of integer class labels (0...p-1).
    Probabilities at or below 1/p are set to 0, as are NaNs; any unclassed
    observations are randomly assigned to a class.
    '''
    numClasses = classProbs.shape[1]
    # zero out class probs that are at or below chance, or NaN
    probs = classProbs.copy()
    probs[np.isnan(probs)] = 0
    probs = probs * (probs > 1/numClasses)
    # find any un-classed observations
    unpred = ~np.any(probs, axis=1)
    # get the predicted classes
    preds = np.argmax(probs, axis=1)
    # randomly classify un-classed observations
    rnds = np.random.randint(0, numClasses, np.sum(unpred))
    preds[unpred] = rnds
    return preds
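A quick usage sketch with a hypothetical (m x p) probability matrix:
import numpy as np
probs = np.array([[0.20, 0.00, 0.00, 0.40],    # only 0.40 clears 1/4 -> class 3
                  [np.nan, 0.90, 0.05, 0.05],  # NaN zeroed, 0.90 survives -> class 1
                  [0.25, 0.25, 0.25, 0.25]])   # everything at chance -> random class
preds = predFromProb(probs)
print(preds)                                   # e.g. [3 1 2]; the last entry varies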

Ensemble model in H2O with fold_column argument

I am new to H2O in Python. I am trying to model my data with a stacked ensemble, following the example code on H2O's web site (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html).
I applied GBM and RF as base models and then merged them into an ensemble model via stacking. In addition, in my training data I created one additional column named 'fold' to be used with fold_column = "fold".
I applied 10-fold CV, and I get results from CV 1; however, the predictions from the other 9 CVs are all empty. What am I missing here?
Here is my sample data and code:
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init(port=23, nthreads=6)
train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)
x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'
cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()
my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
ntrees=10,
max_depth=3,
min_rows=2,
learn_rate=0.2,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column = "fold")
Then I check the CV results with my_gbm.cross_validation_predictions().
Also, when I try the ensemble on the test set, I get the warning below:
# Train a stacked ensemble using the GBM and GLM above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
pred = ensemble.predict(test)
pred
/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
warnings.warn(w)
Am I missing something about fold_column?
Here is an example of how to use a custom fold column (created from a list). This is a modified version of the example Python code in the Stacked Ensemble page in the H2O User Guide.
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Add a fold column, generated from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)
# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
ntrees=10,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
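To then score the stacked ensemble on a held-out frame, a sketch reusing the higgs test file from the first question above (the fold column is only referenced at training time in this example):
# Eval ensemble performance on a separate test set
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
test[y] = test[y].asfactor()
perf_stack_test = ensemble.model_performance(test)
print(perf_stack_test.auc())
pred = ensemble.predict(test)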
To answer your second question about how to view the cross-validated predictions of a model: they are stored in two places, but the method you probably want is .cross_validation_holdout_predictions(). This method returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:
In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
1 0.248131 0.751869
1 0.288241 0.711759
1 0.407768 0.592232
1 0.507294 0.492706
0 0.6417 0.3583
1 0.253329 0.746671
1 0.289916 0.710084
1 0.524328 0.475672
1 0.252006 0.747994
[10000 rows x 3 columns]
The second method, .cross_validation_predictions(), returns a list that stores the predictions from each fold in an H2OFrame with the same number of rows as the original training frame; the rows that are not active in that fold have a value of zero. This is not usually the format that people find most useful, so I'd recommend using the other method instead.
In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list
In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10
In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
[10000 rows x 3 columns]

Multivariate Dirichlet process mixtures for density estimation using pymc3

I want to extend Austin's example on Dirichlet process mixtures for density estimation to the multivariate case.
The first information I found about multivariate Gaussian mixtures with pymc3 is this issue on GitHub. The people involved said there are two different solutions, but they don't work for me. For instance, using Brandon's multivariate extension in a simple model like this:
import numpy as np
import pymc3 as pm
from mvnormal_extension import MvNormal
with pm.Model() as model:
    var_x = MvNormal('var_x', mu=3*np.zeros(2), tau=np.diag(np.ones(2)), shape=2)
    trace = pm.sample(100)
I can't obtain the expected mean around (3, 3):
pm.summary(trace)
var_x:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.220 1.161 0.116 [-1.897, 2.245]
0.165 1.024 0.102 [-2.626, 1.948]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-1.897 -0.761 0.486 1.112 2.245
-2.295 -0.426 0.178 0.681 2.634
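One detail worth double-checking in that snippet: 3*np.zeros(2) evaluates to [0., 0.], so the mu passed to MvNormal above encodes a zero prior mean, whereas 3*np.ones(2) would encode the intended (3, 3):
import numpy as np
print(3 * np.zeros(2))  # [0. 0.] -- the mu actually passed above
print(3 * np.ones(2))   # [3. 3.] -- a prior mean at (3, 3)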
The other solution can be reproduced thanks to Benavente:
import numpy as np
import pymc3 as pm
import scipy.stats
import theano
from theano import tensor
target_data = np.random.random((500, 16))
N_COMPONENTS = 5
N_SAMPLES, N_DIMS = target_data.shape
# Dirichlet prior.
ALPHA_0 = np.ones(N_COMPONENTS)
# Component means prior.
MU_0 = np.zeros(N_DIMS)
LAMB_0 = 1. * np.eye(N_DIMS)
# Components precision prior.
BETA_0, BETA_1 = 0., 1.  # Covariance stds prior uniform limits.
L_0 = 2.  # LKJ corr. shape. Larger shape -> more biased to identity.
# In order to convert the upper triangular correlation values to a
# complete correlation matrix, we need to construct an index matrix:
# Source: http://stackoverflow.com/q/29759789/1901296
N_ELEMS = N_DIMS * (N_DIMS - 1) // 2
tri_index = np.zeros([N_DIMS, N_DIMS], dtype=int)
tri_index[np.triu_indices(N_DIMS, k=1)] = np.arange(N_ELEMS)
tri_index[np.triu_indices(N_DIMS, k=1)[::-1]] = np.arange(N_ELEMS)
with pm.Model() as model:
    # Component weight prior.
    pi = pm.Dirichlet('pi', ALPHA_0, testval=np.ones(N_COMPONENTS) / N_COMPONENTS)
    #pi_potential = pm.Potential('pi_potential', tensor.switch(tensor.min(pi) < .01, -np.inf, 0))
    ###################
    # Components plate.
    ###################
    # Component means.
    mus = [pm.MvNormal('mu_{}'.format(i), MU_0, LAMB_0, shape=N_DIMS)
           for i in range(N_COMPONENTS)]
    # Component precisions.
    #lamb = diag(sigma) * corr(corr_shape) * diag(sigma)
    corr_vecs = [
        pm.LKJCorr('corr_vec_{}'.format(i), L_0, N_DIMS)
        for i in range(N_COMPONENTS)
    ]
    # Transform the correlation vector representations to matrices.
    corrs = [
        tensor.fill_diagonal(corr_vecs[i][tri_index], 1.)
        for i in range(N_COMPONENTS)
    ]
    # Stds for the correlation matrices.
    cov_stds = pm.Uniform('cov_stds', BETA_0, BETA_1, shape=(N_COMPONENTS, N_DIMS))
    # Finally re-compose the covariance matrices using diag(sigma) * corr * diag(sigma)
    # Source http://austinrochford.com/posts/2015-09-16-mvn-pymc3-lkj.html
    lambs = []
    for i in range(N_COMPONENTS):
        std_diag = tensor.diag(cov_stds[i])
        cov = std_diag.dot(corrs[i]).dot(std_diag)
        lambs.append(tensor.nlinalg.matrix_inverse(cov))
    stacked_mus = tensor.stack(mus)
    stacked_lambs = tensor.stack(lambs)
    #####################
    # Observations plate.
    #####################
    z = pm.Categorical('z', pi, shape=N_SAMPLES)

    @theano.as_op(itypes=[tensor.dmatrix, tensor.lvector, tensor.dmatrix, tensor.dtensor3],
                  otypes=[tensor.dscalar])
    def likelihood_op(values, z_values, mu_values, prec_values):
        logp = 0.
        for i in range(N_COMPONENTS):
            indices = z_values == i
            if not indices.any():
                continue
            logp += scipy.stats.multivariate_normal(
                mu_values[i], prec_values[i]).logpdf(values[indices]).sum()
        return logp

    def likelihood(values):
        return likelihood_op(values, z, stacked_mus, stacked_lambs)

    y = pm.DensityDist('y', likelihood, observed=target_data)
    step1 = pm.Metropolis(vars=mus + lambs + [pi])
    step2 = pm.ElemwiseCategoricalStep(vars=[z], values=list(range(N_COMPONENTS)))
    trace = pm.sample(100, step=[step1, step2])
In this code I have changed pm.ElemwiseCategoricalStep to pm.ElemwiseCategorical, and
logp += scipy.stats.multivariate_normal(mu_values[i], prec_values[i]).logpdf(values[indices])
to
logp += scipy.stats.multivariate_normal(mu_values[i], prec_values[i]).logpdf(values[indices]).sum()
but I get this exception:
ValueError: expected an ndarray
Apply node that caused the error: Elemwise{Composite{((i0 + i1) - (i2 + i3))}}[(0, 0)](Sum{acc_dtype=float64}.0, FromFunctionOp{likelihood_op}.0, Sum{acc_dtype=float64}.0, FromFunctionOp{likelihood_op}.0)
Toposort index: 101
Inputs types: [TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [array(-127.70516572917249), -13460.012199423296, array(-110.90354888959129), -13234.61313535326]
Outputs clients: [['output']]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
I appreciate any help.
Thanks!
