Getting results of H2O neural nets from: h2o.grid.grid_search.H2OGridSearch - python-3.x

I have been training a neural net over a grid of hyperparameters, but I am unable to get the results out because I get the following error message.
nn
Error message: 'int' object is not iterable
Code:
nn = H2OGridSearch(model=H2ODeepLearningEstimator,
                   hyper_params={
                       'activation': ["Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"],
                       'hidden': [[20, 20], [50, 50], [30, 30, 30], [25, 25, 25, 25]],  ## small network, runs faster
                       'epochs': 1000000,  ## hopefully converges earlier...
                       'rate': [0.0005, 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, 0.0040, 0.0045, 0.005],
                       'score_validation_samples': 10000,  ## sample the validation dataset (faster)
                       'stopping_rounds': 2,
                       'stopping_metric': "misclassification",  ## alternatives: "MSE","logloss","r2"
                       'stopping_tolerance': 0.01})
nn.train(train1_x, train1_y,train1)

There is a slight problem with how you are defining the grid. The hyper_params argument only accepts a dictionary of lists (the values to grid over for each hyperparameter). You are seeing the 'int' object is not iterable error because you are passing an integer instead of a list for epochs, score_validation_samples and stopping_rounds.
Arguments that you don't intend to grid over should instead be passed to the grid's train() method. I'd also recommend using a validation frame or cross-validation when doing a grid search, so that you don't have to rely on training metrics to choose the best model. See the example below.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Execute a grid search (also do 5-fold CV)
grid = H2OGridSearch(model=H2ODeepLearningEstimator, hyper_params={
    'activation': ["Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"],
    'hidden': [[20, 20], [50, 50], [30, 30, 30], [25, 25, 25, 25]]})
grid.train(x=x, y=y, training_frame=train,
           score_validation_samples=10000,
           stopping_rounds=2,
           stopping_metric="misclassification",
           stopping_tolerance=0.01,
           nfolds=5)
# Look at grid results
gridperf = grid.get_grid(sort_by='mean_per_class_error')
There are more examples of how to use grid search in the H2O Python Grid Search tutorial.
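The rule described in this answer (lists go into hyper_params, single values go to train()) can be checked mechanically before the grid is built. The following is only a sketch in plain Python: split_hyper_params is a hypothetical helper, not part of the H2O API, and it simply treats every list or tuple as something to grid over.

```python
def split_hyper_params(params):
    """Split a parameter dict into grid parameters (lists) and fixed parameters (scalars).

    Hypothetical helper for illustration only; it is not part of H2O.
    """
    grid_params, fixed_params = {}, {}
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            # A list of candidate values: something to grid over.
            grid_params[name] = list(value)
        else:
            # A single value: belongs in the call to train(), not in hyper_params.
            fixed_params[name] = value
    return grid_params, fixed_params


params = {
    "activation": ["Rectifier", "Tanh"],
    "hidden": [[20, 20], [50, 50]],
    "epochs": 1000000,
    "stopping_rounds": 2,
    "stopping_metric": "misclassification",
    "stopping_tolerance": 0.01,
}
grid_params, fixed_params = split_hyper_params(params)
print(sorted(grid_params))
print(sorted(fixed_params))
```

Under these assumptions, grid_params would be passed as hyper_params, and fixed_params could be unpacked as keyword arguments in the grid's train() call.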


How to get feature importances/feature ranking from summary plot in SHAP without crashing?

I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from how to extract the most important feature names? and How to get feature names of shap_values from TreeExplainer? specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large, but creating the vals variable runs very slowly (about 58 minutes on my laptop) and uses almost all of the computer's RAM.
After 58 minutes I get an error:
Command terminated by signal 9
which as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(list(zip(X_train.columns,np.abs(shap_values2).mean(0))),columns=['col_name','feature_importance_vals'])
so that vals isn't stored but this change doesn't reduce RAM at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
    """ Return a dataframe containing the features sorted by Shap importance
    Parameters
    ----------
    model : The tree-based model
    X : pd.Dataframe
        training set/test set/the whole dataset ... (without the label)
    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
but global_shap_importance returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance so that the features are returned in the same order as summary_plot (beeswarm plot).
How can I get the feature importance ranking into a dataframe?
I pulled this straight from the source code. Confirmed identical to the summary_plot.
def shapley_feature_ranking(shap_values, X):
    feature_order = np.argsort(np.mean(np.abs(shap_values), axis=0))
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": [
                np.mean(np.abs(shap_values), axis=0)[i] for i in feature_order
            ][::-1],
        }
    )
shapley_feature_ranking(shap_values[0], X)
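The same ranking can be sketched so that the mean absolute value is computed only once and only on a plain NumPy array. This is a minimal sketch under an assumption: the SHAP values are available as a 2-D array of shape (n_samples, n_features). For the Explanation object returned by explainer(X_train) in the question, that would presumably be its .values attribute, and reducing over the bare array rather than the wrapper object may also help with the run time and memory use described there. The name shap_importance_frame and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def shap_importance_frame(values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    values: 2-D array-like of shape (n_samples, n_features).
    feature_names: sequence of n_features names, in column order.
    """
    values = np.asarray(values, dtype=float)
    if values.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_features)")
    # Mean |SHAP| per feature, computed once.
    importance = np.abs(values).mean(axis=0)
    # argsort is ascending, so reverse it to put the largest first.
    order = np.argsort(importance)[::-1]
    return pd.DataFrame(
        {
            "features": [feature_names[i] for i in order],
            "importance": importance[order],
        }
    )


# Toy data standing in for real SHAP values (2 samples, 3 features).
toy_values = np.array([[0.1, -2.0, 0.5], [-0.3, 1.0, 0.5]])
ranking = shap_importance_frame(toy_values, ["a", "b", "c"])
print(ranking)
```

With real data the call would look something like shap_importance_frame(shap_values2.values, list(X_train.columns)), again assuming the attribute name above.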

GAN Model Summary Pytorch using TensorBoard?

Is there a way I can visualize the complete training loop for the GAN architecture in TensorBoard using PyTorch? I think it's possible using TF, but I am having a hard time figuring out how to do it with PyTorch.
You can use TensorboardX for this.
You can make use of SummaryWriter from TensorboardX to create an event file in a given directory and add summaries and events to it.
The code below is an example that you can use but you have to add in the loss values, the ground truth images and the generated images yourself. I commented where they would have to go.
import logging
import numpy as np
import torchvision.utils as vutils
from tensorboardX import SummaryWriter
log = logging.getLogger(__name__)
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000
if __name__ == "__main__":
    writer = SummaryWriter()
    gen_losses = []
    dis_losses = []
    iter_no = 0
    # loop over the batches in the environment
    # (iterate_batches and envs come from your own training code)
    for batch_v in iterate_batches(envs):
        # get the outputs here (the generated batch is called gen_output_v below)
        # compute the generator's loss here and append it to gen_losses
        # compute the discriminator's loss here and append it to dis_losses
        iter_no += 1
        # log the mean loss of the generator and the discriminator every 100 steps
        if iter_no % REPORT_EVERY_ITER == 0:
            log.info(
                "Iter %d: gen_loss=%.3e, dis_loss=%.3e",
                iter_no,
                np.mean(gen_losses),
                np.mean(dis_losses),
            )
            writer.add_scalar("gen_loss", np.mean(gen_losses), iter_no)
            writer.add_scalar("dis_loss", np.mean(dis_losses), iter_no)
            gen_losses = []
            dis_losses = []
        # save the images from both the generator and the ground truth
        # every 1000 iterations
        if iter_no % SAVE_IMAGE_EVERY_ITER == 0:
            # save the images produced by the generator
            writer.add_image(
                "fake",
                vutils.make_grid(gen_output_v.data[:64], normalize=True),
                iter_no
            )
            # add the ground truth images here
            # these will be the same throughout the cycle
            writer.add_image(
                "real",
                vutils.make_grid(batch_v.data[:64], normalize=True),
                iter_no
            )
To view the results, run the command tensorboard --logdir runs in the same directory where you ran the model training (runs contains the results from the training). A link will be shown that you can open to view plots such as the ones below. If you run TensorBoard on a remote server, add the --bind_all flag to the command so that it can be accessed from outside.
Viewing the generated images
Viewing the loss values

Plot clusters from LDA Gensim with Bokeh

I apologise in advance as I cannot reproduce the dataset I'm working with. So I am just going to describe steps and hope someone is familiar with the whole process.
I'm trying to use LDA Gensim to extract topics from a list of text documents.
from gensim.models import LdaModel
from gensim.corpora import Dictionary
I build dictionary and corpus:
dictionary = Dictionary(final_docs)
corpus = [dictionary.doc2bow(doc) for doc in final_docs]
where final_docs is a list of lists with cleaned tokens for each text like this:
final_docs = [['cat','dog','animal'],['school','university','education'],...['music','dj','pop']]
then I initiate the model like this:
# Set training parameters:
num_topics = 60
chunksize = 100
passes = 20
iterations = 400
eval_every = None
# Make an index to word dictionary
temp = dictionary[0] # This is only to "load" the dictionary.
id2word = dictionary.id2token
model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)
I can print the topics and their terms (the 10 most important), and they make sense, so the model seems to be working fine.
for idx in range(num_topics):
    print("Topic #%s:" % idx, model.print_topic(idx, 10))
BUT I struggle to plot all the documents as clusters using Bokeh. (And I really need Bokeh because I compare the same plot from different models). I know I have to reduce dimensionality to 2. And I try to do it using CountVectorizer and then T-sne:
from sklearn.feature_extraction.text import CountVectorizer
docs_vect = [' '.join(txt) for txt in final_docs]
cvectorizer = CountVectorizer(min_df=6, max_df=0.50, max_features=10000, stop_words=stop)
cvz = cvectorizer.fit_transform(docs_vect)
X_lda = model.fit_transform(cvz)
But I get this error: AttributeError: 'LdaModel' object has no attribute 'fit_transform'
I'm definitely doing something wrong with CountVectorizer. Could anyone help me out?
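The error message itself points at the cause: fit_transform is a scikit-learn style method, and the gensim LdaModel object does not provide it. What the dimensionality reduction step needs is a dense documents-by-topics matrix. The following is a sketch only, assuming each document's topic distribution is available as a list of (topic_id, probability) pairs, which is the shape gensim returns from a call such as model.get_document_topics(bow, minimum_probability=0) (that call is an assumption here, it does not appear in the question). The helper names dense_topic_matrix and dominant_topics are hypothetical.

```python
def dense_topic_matrix(doc_topics, num_topics):
    """Turn per-document [(topic_id, prob), ...] lists into dense rows.

    Topics missing from a document's list are filled with 0.0.
    """
    matrix = []
    for pairs in doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in pairs:
            row[topic_id] = float(prob)
        matrix.append(row)
    return matrix


def dominant_topics(matrix):
    """Index of the highest-probability topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Toy stand-in for the per-document output of the trained model.
doc_topics = [[(0, 0.9), (2, 0.1)], [(1, 0.7), (2, 0.3)]]
X_lda = dense_topic_matrix(doc_topics, num_topics=3)
labels = dominant_topics(X_lda)
print(X_lda, labels)
```

A matrix like X_lda could then be passed to t-SNE to obtain 2-D coordinates, with labels used to colour the points in the Bokeh scatter plot.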

How do I use Theanets LSTM RNN's on my time series data?

I have a simple dataframe consisting of one column. In that column are 10320 observations (numerical). I'm simulating Time-Series data by inserting the data into a plot with a window of 200 observations each. Here is the code for plotting.
import matplotlib.pyplot as plt
from IPython import display
fig_size = plt.rcParams["figure.figsize"]
import time
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from numpy import arange
# dframe is the one-column dataframe described above
fig, axes = plt.subplots(1, 1, figsize=(19, 5))
df = dframe.set_index(arange(0, len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe) / window)
i = 0
dframe = dframe.set_index(arange(0, len(dframe)))
while i < iterations:
    frm = window * i
    if i == iterations:
        to = len(dframe)
    else:
        to = frm + window
    df = dframe[frm:to]
    if len(df) > 100:
        df = df.set_index(arange(0, len(df)))
        plt.gca().cla()
        plt.plot(df.index, df[0])
        plt.axhline(y=std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.axhline(y=-std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.ylim(min(dframe[0]) - 0.5, max(dframe[0]))
        plt.xlim(-50, window + 50)
        display.clear_output(wait=True)
        display.display(plt.gcf())
        canvas = FigureCanvas(fig)
        canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
    i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply theanets RNN LSTM to the data to detect anomalies unsupervised. Because I am doing it unsupervised I don't think that I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far and have been googling for about 2 hours. Just hoping that you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold that, if the error is too large, the values will be identified as anomalous. If you need more information please comment and let me know. Thank you!
READING
Much as ordinary networks are built from neurons, LSTM networks are built from interconnected LSTM blocks, and they are trained via backpropagation through time.
Classical anomaly detection on time series requires predicting the series at one or more future points and measuring the error between those predictions and the true values. A prediction error above a threshold indicates an anomaly.
SOLUTION
Having said this
You have to train the network, so you need both a training set and a test set.
Use N inputs to predict M outputs (choose N and M by experimentation, picking values for which the training error is low).
Slide a window of (N+M) elements over the input data and use each such array of (N+M) items, also called a frame, to train or test the network.
Typically we use the first 90% of the series for training and the last 10% for testing.
This scheme fails if the network is poorly trained, because large prediction errors will then appear at points that are not anomalies. So make sure to train sufficiently and, most importantly, shuffle the training frames and cover all variations.
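The framing steps listed above can be sketched in plain Python, independent of any particular network library. This is an illustration under assumptions, not theanets code: the function names (make_frames, split_frames, flag_anomalies), the toy series and the threshold value are all made up, and the predictions passed to flag_anomalies would come from whatever trained model is used.

```python
import random


def make_frames(series, n_in, n_out):
    """Slide a window of n_in + n_out points over the series.

    Returns a list of (inputs, targets) pairs, one per window position.
    """
    size = n_in + n_out
    return [
        (series[i:i + n_in], series[i + n_in:i + size])
        for i in range(len(series) - size + 1)
    ]


def split_frames(frames, train_fraction=0.9, seed=0):
    """Use the start of the series for training and the rest for testing.

    Only the training frames are shuffled; the test frames keep their order.
    """
    cut = int(len(frames) * train_fraction)
    train, test = frames[:cut], frames[cut:]
    random.Random(seed).shuffle(train)
    return train, test


def flag_anomalies(predicted, actual, threshold):
    """Indices where the absolute prediction error exceeds the threshold."""
    return [
        i for i, (p, a) in enumerate(zip(predicted, actual))
        if abs(p - a) > threshold
    ]


series = list(range(20))                      # toy stand-in for the real data
frames = make_frames(series, n_in=3, n_out=1)
train, test = split_frames(frames)
anomalies = flag_anomalies([1.0, 2.0, 3.0], [1.1, 5.0, 3.0], threshold=0.5)
print(len(frames), len(train), len(test), anomalies)
```

Splitting before shuffling keeps the test frames strictly later in time than the training frames, which matches the 90%/10% split on the start and end of the series described above.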

How to find key trees/features from a trained random forest?

I am using Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method, which seems relevant, in the documentation (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example of how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Let's set up a training dataset. We'll make 100 entries, each with 19 features and
# each row classified as either 0 or 1. We'll artificially set the first 3 features of
# rows classified as "1" to fixed values, so that we know these are the "important"
# features. If we do it right, the model should point out these three as important.
# The rest of the features will just be noise.
train_data = []  # must be all floats
for _ in range(100):
    line = []
    if random.random() > 0.5:
        line.append(1.0)
        # Let's add 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for _ in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        # this is a "0" row, so fill it with noise.
        line.append(0.0)
        for _ in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)
# Create the random forest object which will include all the parameters
# for the fit. Current scikit-learn versions have no compute_importances
# argument; feature_importances_ is available after fitting.
Forest = RandomForestClassifier(n_estimators=100)
# Fit the training data to the training output and create the decision
# trees. This tells the model that the first column in our data is the classification,
# and the rest of the columns are the features.
Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])
# Now you can see the importance of each feature in Forest.feature_importances_.
# These values all add up to one. Let's call the "important" ones the ones that are above average.
important_features = []
for idx, importance in enumerate(Forest.feature_importances_):
    if importance > np.average(Forest.feature_importances_):
        important_features.append(str(idx))
print('Most important features:', ', '.join(important_features))
# We see that the model correctly detected that the first three features are the most important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now to extract a "key tree" one would first require you to define what it is and what you are expecting to do with it.
You could rank the individual trees by computing their score on a held-out test set, but I don't know what you expect to get out of that.
Do you want to prune the forest to make it faster to predict by reducing the number of trees without decreasing the aggregate forest accuracy?
Here is how I visualize the tree:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
import numpy as np
import matplotlib.pyplot as plt
# get importances from RF
importances = classifier.feature_importances_
# argsort returns indices in ascending order, so the most important feature
# ends up at the top of the horizontal bar chart
indices = np.argsort(importances)
# get the feature names from the original data set (here the first 26 columns)
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
This yields a horizontal bar chart of the feature importances.
