TensorFlow - Simple Feed Forward NN not training - python-3.x

I'm new to TensorFlow and just constructed my first, very small network! My code runs, but the accuracy stays the same all along; it doesn't change with training. My data has 15 features and 6 classes. I may add more features later if that would make things easier or better. In short, my question is:
What's a general procedure for debugging TensorFlow code?
My network architecture was chosen arbitrarily, so maybe I should change the number of neurons per layer; I'm not entirely sure.
sess1 = tf.Session()
num_predictors = len(training_predictors_tf.columns)
num_classes = len(training_classes_tf.columns)
feature_data = tf.placeholder(tf.float32, [None, num_predictors])
actual_classes = tf.placeholder(tf.float32, [None, num_classes])
weights1 = tf.Variable(tf.truncated_normal([num_predictors, 50], stddev=0.0001))
biases1 = tf.Variable(tf.ones([50]))
weights2 = tf.Variable(tf.truncated_normal([50, 45], stddev=0.0001))
biases2 = tf.Variable(tf.ones([45]))
weights3 = tf.Variable(tf.truncated_normal([45, 25], stddev=0.0001))
biases3 = tf.Variable(tf.ones([25]))
weights4 = tf.Variable(tf.truncated_normal([25, num_classes], stddev=0.0001))
biases4 = tf.Variable(tf.ones([num_classes]))
hidden_layer_1 = tf.nn.relu(tf.matmul(feature_data, weights1) + biases1)
hidden_layer_2 = tf.nn.relu(tf.matmul(hidden_layer_1, weights2) + biases2)
hidden_layer_3 = tf.nn.relu(tf.matmul(hidden_layer_2, weights3) + biases3)
out = tf.matmul(hidden_layer_3, weights4) + biases4
model = tf.nn.softmax_cross_entropy_with_logits(labels=actual_classes, logits=out)
# cost = -tf.reduce_sum(actual_classes*tf.log(model))
cross_entropy = tf.reduce_mean(model)
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
# train_step = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cross_entropy)
sess1.run(tf.global_variables_initializer())
correct_prediction = tf.equal(tf.argmax(out, 1), tf.argmax(actual_classes, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
for i in range(1, 30001):
    sess1.run(
        train_step,
        feed_dict={
            feature_data: training_predictors_tf.values,
            actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
        }
    )
    if i % 5000 == 0:
        print(i, sess1.run(
            accuracy,
            feed_dict={
                feature_data: training_predictors_tf.values,
                actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
            }
        ))
And this is my output:
5000 0.3627
10000 0.3627
15000 0.3627
20000 0.3627
25000 0.3627
30000 0.3627
EDIT: I scaled my data as explained here, into the range [-5, 0], but it still does not train the network any better :(
Snippet of unscaled data (one-hot encoding first 6 columns):
2017-06-27 0 0 0 1 0 0 20120.0 20080.0 20070.0 20090.0 ... 20050.0 20160.0 20130.0 20160.0 20040.0 20040.0 20040.0 31753.0 36927.0 41516.0
2017-06-28 0 0 1 0 0 0 20150.0 20120.0 20080.0 20150.0 ... 20060.0 20220.0 20160.0 20130.0 20130.0 20040.0 20040.0 39635.0 31753.0 36927.0
2017-06-29 0 0 0 1 0 0 20140.0 20150.0 20120.0 20140.0 ... 20090.0 20220.0 20220.0 20160.0 20100.0 20130.0 20040.0 50438.0 39635.0 31753.0
2017-06-30 0 1 0 0 0 0 20210.0 20140.0 20150.0 20130.0 ... 20150.0 20270.0 20220.0 20220.0 20050.0 20100.0 20130.0 58983.0 50438.0 39635.0
2017-07-03 0 0 0 1 0 0 20020.0 20210.0 20140.0 20210.0 ... 20140.0 20250.0 20270.0 20220.0 19850.0 20050.0 20100.0 88140.0 58983.0 50438.0
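For reference, a minimal sketch of the scaling described in the edit above, assuming the predictors live in a pandas DataFrame named training_predictors_tf (the name used in the code); the scaler would be fitted on the training set only and reused for any test data:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Map every feature column into the range [-5, 0].
scaler = MinMaxScaler(feature_range=(-5, 0))
training_predictors_tf = pd.DataFrame(
    scaler.fit_transform(training_predictors_tf),
    columns=training_predictors_tf.columns,
    index=training_predictors_tf.index)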

Debugging your network and improving it are two different things. To improve it, once you've chosen a type of classifier (for instance a neural network), use the training and validation accuracies and adjust your hyperparameters as a function of both. See, for instance, the Practical Methodology chapter of Goodfellow et al.'s book to learn how to tune hyperparameters (it's a bit long, but pure gold!).
As for debugging, that is the harder part. You usually do it by printing the values of some "key tensors" every once in a while. You clearly have a bug somewhere, or your accuracy would change at least a bit during training. A common cause is exploding gradients, which make NaNs appear very early in training (sometimes infinities, or even 0s in strange places), and that basically stops any updates in your network. I'd suggest printing your loss, and maybe the norm of your gradients; that should tell you whether this is the problem. If so, the quick-and-dirty fix is to use a smaller learning rate at the start and increase it later. The real solution is gradient clipping.
Example of how to print multiple tensor values:
if i % 5000 == 0:
    acc_val, loss_val, predictions = sess1.run(
        [accuracy, cross_entropy, tf.argmax(out, 1)],
        feed_dict={
            feature_data: training_predictors_tf.values,
            actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
        })
    print(i, acc_val, loss_val)  # You can also print predictions, but the output will be very large and harder to read; doing so would let you check whether, for instance, the model always predicts class 0.
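As for the gradient clipping mentioned above, here is a minimal sketch against the graph from the question (TF1 API; the clipping norm of 5.0 is an arbitrary placeholder, not a value from the question):
# Replace the plain minimize() call with explicit gradient computation,
# global-norm clipping, and application of the clipped gradients.
optimizer = tf.train.GradientDescentOptimizer(0.05)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
grads, variables = zip(*grads_and_vars)
clipped_grads, grad_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_step = optimizer.apply_gradients(list(zip(clipped_grads, variables)))
# grad_norm is the (pre-clipping) global gradient norm; running and printing it
# alongside the loss is a cheap way to check for exploding gradients.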

Related

Split Train dataset based on labels

I would like to know how to split my multi-class training dataset with a specific ratio, such as 80% (class_2), 15% (class_1) and 5% (class_0).
I have a balanced dataset. I originally split the pandas dataset into 80% train and 20% test via the command:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
However, I want to further specify the class ratio for the test pandas dataset as 80% (class_2), 15% (class_1) and 5% (class_0). How can this be accomplished?
Here is a snippet of my dataset:
Feat1 Feat2 Feat3 Feat4 Feat5 Label
-58.422504 37.966175 -4.8636584 1.6725544 1.9571232 0
-16.001776 12.794211 -1.1406443 1.3552929 -3.1035073 1
-35.907864 19.15079 -1.4540794 4.7229285 -1.3495653 0
-40.63919 11.879825 0.26731083 4.509876 -0.3005377 1
-82.577805 38.87009 -0.6941721 0.41522327 -3.7065275 0
-91.21994 13.109437 -7.270507 2.081625 -4.206697 0
-47.69479 17.02262 -24.102415 -0.9498974 -6.126767 2
-76.956795 17.869856 -1.6058419 4.2835464 -1.3354894 0
-52.443146 46.593403 -3.4466643 1.1810641 -1.9001787 2
-67.86523 14.28042 0.71933913 2.1071763 1.3627108 1
-47.336437 9.525495 -20.755278 6.523259 -3.422134 2
-42.978676 12.458537 0.07322929 1.3635784 0.09735282 1
-24.21139 38.562397 0.042716235 6.6496754 -1.9689865 2
-48.612396 11.766575 -0.748889 3.8106124 2.109056 1
-49.890644 14.508443 0.36204648 1.7602062 -0.42747113 1
-58.165733 18.751013 -3.8809242 5.257564 -1.4671975 0
-31.926224 8.061624 -0.9180617 3.1844578 1.3856677 1
-49.51432 13.603332 1.1162373 0.88059276 0.8680044 1
-38.187065 22.042477 -9.74126 3.464233 -1.4608487 2
-36.763634 11.885029 -0.3559528 1.2861489 -0.006563603 1
-59.474194 17.596613 -13.849893 2.5668569 -7.367901 2
-20.775812 8.021951 -5.8948507 -1.76145 -3.0236924 1
-44.744774 42.550343 -2.8213162 1.496162 -5.367485 2
-59.297913 15.10593 -15.805616 -0.8902338 -2.0228894 2
-43.05664 17.326857 -21.520315 -0.544733 -5.821276 2
-113.831566 10.970723 -1.0806333 2.6965592 -0.50331205 0
-67.71741 37.033604 -7.5146904 4.7712235 -0.88289934 0
-51.200836 20.278473 -9.158655 4.746186 -5.2653203 2
-43.760933 13.239898 -5.1588607 2.5003295 -2.2052805 0
-53.52218 12.309539 -0.24887963 4.237159 0.52248794 0
How do I correctly split the training dataset based on the label names with these specific ratios?
Thanks for your help and time!
Sampling might be better for your purpose:
import numpy as np
class_0, class_1, class_2 = np.split(df.sample(frac=1, random_state=42),
                                     [int(.05 * len(df)), int(.20 * len(df))])
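If you need each class to contribute a fixed share of the resulting subset, a per-class sampling sketch could look like the following (assuming a pandas DataFrame df with a Label column as in the snippet above; target_size is a placeholder and each class must contain enough rows):
import pandas as pd

target_size = 100                     # desired number of rows in the subset (placeholder)
shares = {2: 0.80, 1: 0.15, 0: 0.05}  # fraction of the subset drawn from each label

subset = pd.concat(
    df[df['Label'] == label].sample(n=int(share * target_size), random_state=42)
    for label, share in shares.items()
)
subset = subset.sample(frac=1, random_state=42)  # shuffle the combined rows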

Linear regression issue with categorical variables

I've built a linear regression model predicting recidivism among convicts based on the COMPAS dataset.
I have some issues regarding the categorical variables, specifically the gender variable.
It was transformed into dummy variables, dropping one of the two binary columns to prevent collinearity.
However, after training the model, it seems that the female gender gets a higher recidivism score than the male gender.
This does not look correct, since the male offenders have higher values on the independent variables than the females.
Also, the target variable (recidivism score) is lower for the female category than for the male one.
I would therefore expect females to get a lower predicted score.
I get the feeling that there's something wrong with the model.
Can someone please help me out?
See below the dataset and code:
subset of the data after dummy transformation and data cleansing:
age;priors_count;juv_fel_count;sex_Male;race_Caucasian;race_Asian;race_Hispanic;race_NativeAmerican;race_Other;v_decile_score;event;is_recid;decile_score
69;0;0;1;0;0;0;0;1;1;0;0;1
69;0;0;1;0;0;0;0;1;1;0;0;1
34;0;0;1;0;0;0;0;0;1;1;1;3
24;4;0;1;0;0;0;0;0;3;0;1;4
24;4;0;1;0;0;0;0;0;3;0;1;4
24;4;0;1;0;0;0;0;0;3;0;1;4
24;4;0;1;0;0;0;0;0;3;0;1;4
24;4;0;1;0;0;0;0;0;3;0;1;4
41;14;0;1;1;0;0;0;0;2;0;1;6
41;14;0;1;1;0;0;0;0;2;0;1;6
43;3;0;1;0;0;0;0;1;3;0;0;4
43;3;0;1;0;0;0;0;1;3;0;0;4
#model
X = df[[
'age'
,'priors_count'
,'juv_fel_count'
,'sex_Male'
,'race_Caucasian'
,'race_Asian'
,'race_Hispanic'
,'race_Native American'
,'race_Other'
,'v_decile_score'
,'event'
,'is_recid'
]]
Y = df['decile_score']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# prediction with sklearn
New_Age = 18
New_Priors_Count = 0
New_Juvenile_Count = 0
New_Sex_Male = 0
#Race Variables
# Setting all of the race dummies below to 0 means the baseline race (African-American)
New_Race_Caucasian = 0
New_Race_Asian = 0
New_Race_Hispanic = 0
New_Race_Native_American = 0
New_Race_Other = 0
#Violence & Events
New_Violent_Score = 0
New_Event_In_Custody = 0
New_Is_Recid = 0
print ('Recividism Score: \n',
regr.predict(
[[
New_Age
, New_Priors_Count
, New_Juvenile_Count
, New_Sex_Male
, New_Race_Caucasian
, New_Race_Asian
, New_Race_Hispanic
, New_Race_Native_American
, New_Race_Other
, New_Violent_Score
, New_Event_In_Custody
, New_Is_Recid
# , New_Days_In_Jail
]]
))
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
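For reference, a minimal sketch of the dummy-encoding step described above (pandas get_dummies with drop_first=True drops one level per categorical variable to avoid collinearity; the tiny raw frame here is hypothetical, not the COMPAS data):
import pandas as pd

# Hypothetical raw frame with the original categorical columns.
raw = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'race': ['Caucasian', 'Asian', 'Other']})

# drop_first=True keeps k-1 indicator columns per variable; e.g. 'Female' becomes
# the baseline for sex, so only sex_Male remains.
dummies = pd.get_dummies(raw, columns=['sex', 'race'], drop_first=True)
print(dummies.columns.tolist())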

EDITED: Network not learning the data correctly

I'm studying deep learning.
I'm making a figure classifier: circle, rectangle, triangle, pentagon, star, one-hot-encoded as label2idx = dict(rectangle=0, circle=1, pentagon=2, star=3, triangle=4).
But the results per epoch are always the same, and the network does not learn the images.
I made the layers using the ReLU function as the activation function, an affine (fully connected) transform for each layer, softmax for the last layer, and Adam to optimize the gradients.
I have 234 RGB images in total to learn from, created with the Windows Paint 2D tool; each is 128 * 128 in size, but the figures do not use the whole canvas.
The pictures look like this (image omitted).
The training result, where the left [] is the prediction and the right [] is the answer label (I picked random images to print the prediction and the answer label):
epoch: 0.49572649572649574
[ 0.3149641 -0.01454905 -0.23183 -0.2493432 0.11655246] [0 0 0 0 1]
epoch: 0.6837606837606838
[ 1.67341673 0.27887525 -1.09800398 -1.12649948 -0.39533065] [1 0 0 0 0]
epoch: 0.7094017094017094
[ 0.93106499 1.49599772 -0.98549052 -1.20471573 -0.24997779] [0 1 0 0 0]
epoch: 0.7905982905982906
[ 0.48447043 -0.05460748 -0.23526179 -0.22869489 0.05468969] [1 0 0 0 0]
...
epoch: 0.9230769230769231
[14.13835867 0.32432293 -5.01623202 -6.62469261 -3.21594355] [1 0 0 0 0]
epoch: 0.9529914529914529
[ 1.61248239 -0.47768294 -0.41580036 -0.71899219 -0.0901478 ] [1 0 0 0 0]
epoch: 0.9572649572649573
[ 5.93142154 -1.16719891 -1.3656573 -2.19785097 -1.31258801] [1 0 0 0 0]
epoch: 0.9700854700854701
[ 7.42198941 -0.85870225 -2.12027192 -2.81081263 -1.83810873] [1 0 0 0 0]
I think that the more it learns, the prediction should look like [ 0.00143 0.09357 0.352 0.3 0.253 ] for an answer of [ 1 0 0 0 0 ], which means the predicted answer index should come out close to index 0, but it does not.
Even the train accuracy sometimes goes to 1.0 (100%).
I'm loading and normalizing the images with the code below.
#data_list = data_list = glob('dataset\\training\\*\\*.jpg')
dataset['train_img'] = _load_img()
def _load_img():
    data = [np.array(Image.open(v)) for v in data_list]
    a = np.array(data)
    a = a.reshape(-1, img_size * 3)
    return a

# normalize
for v in dataset:
    dataset['train_img'] = dataset['train_img'].astype(np.float32)
    dataset['train_img'] /= dataset['train_img'].max()
    dataset['train_img'] -= dataset['train_img'].mean(axis=1).reshape(len(dataset['train_img']), 1)
EDIT
I converted the images to grayscale with Image.open(v).convert('LA')
and checked my prediction values; for example:
[-3.98576886e-04 3.41216374e-05] [1 0]
[ 0.00698861 -0.01111879] [1 0]
[-0.42003415 0.42222863] [0 1]
It is still not learning the images. I removed 3 figures to test it, so I now have only rectangles and triangles, 252 images in total (I drew more images).
And the prediction values are usually near-opposite pairs (e.g. 3.1323, -3.1323 or 3.1323, -3.1303); I cannot figure out the reason.
It is not just a numerical-accuracy issue: when I use SGD as the optimizer, the accuracy does not increase either. It stays the same.
[ 0.02090227 -0.02085848] [1 0]
epoch: 0.5873015873015873
[ 0.03058879 -0.03086193] [0 1]
epoch: 0.5873015873015873
[ 0.04006064 -0.04004988] [1 0]
[ 0.04545139 -0.04547538] [1 0]
epoch: 0.5873015873015873
[ 0.05605123 -0.05595288] [0 1]
epoch: 0.5873015873015873
[ 0.06495255 -0.06500597] [1 0]
epoch: 0.5873015873015873
Yes, your model is performing pretty well. The problem is not related to normalization (that is not even a problem here). The model actually predicted values outside of [0, 1], which means the model is really confident.
The model will not try to optimize all the way towards an exact one-hot vector like [1, 0, 0, 0, 0], because when it calculates the loss it first clips the values.
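To see why raw outputs far outside [0, 1] indicate confidence, here is a small sketch that pushes one of the logit vectors from the output above through a softmax (numpy only; the numbers are copied from the question):
import numpy as np

logits = np.array([14.13835867, 0.32432293, -5.01623202, -6.62469261, -3.21594355])

# Numerically stable softmax: subtract the max before exponentiating.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(probs.round(6))  # ~[0.999999, 0.000001, 0, 0, 0]: almost all mass on class 0, matching [1 0 0 0 0]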
Hope this helps!

How to change prediction in H2O GBM and DRF

I am building classification models with H2O DRF and GBM. I want to change the prediction threshold so that if p0 < 0.2 then predict = 0, else predict = 1.
Currently, you need to do this manually. It would be easier if we had a threshold argument for the predict() method, so I created a JIRA ticket to make this a bit more straightforward.
See the Python example below of how to do this manually.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
my_gbm.train(x=x, y=y, training_frame=train)
# Predict on a test set using default threshold
pred = my_gbm.predict(test_data=test)
Look at the pred frame:
In [16]: pred.tail()
Out[16]:
predict p0 p1
--------- -------- --------
1 0.484712 0.515288
0 0.693893 0.306107
1 0.319674 0.680326
0 0.582344 0.417656
1 0.471658 0.528342
1 0.079922 0.920078
1 0.150146 0.849854
0 0.835288 0.164712
0 0.639877 0.360123
1 0.54377 0.45623
[10 rows x 3 columns]
Here's how to manually create the predictions you want. More info on how to slice H2OFrames is available in the H2O User Guide.
# Binary column which is 1 if p1 >= 0.2 and 0 if p1 < 0.2
newpred = pred["p1"] >= 0.2
newpred.tail()
Look at the binary column:
In [23]: newpred.tail()
Out[23]:
p1
----
1
1
1
1
1
1
1
0
1
1
[10 rows x 1 column]
Now you have the predictions you want. You could also replace the "predict" column with the new predicted labels.
pred["predict"] = newpred
Now re-examine the pred frame:
In [24]: pred.tail()
Out[24]:
predict p0 p1
--------- -------- --------
1 0.484712 0.515288
1 0.693893 0.306107
1 0.319674 0.680326
1 0.582344 0.417656
1 0.471658 0.528342
1 0.079922 0.920078
1 0.150146 0.849854
0 0.835288 0.164712
1 0.639877 0.360123
1 0.54377 0.45623
[10 rows x 3 columns]
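If you need this in more than one place, the manual steps above can be wrapped in a small helper (just a sketch built from the calls shown above; predict_with_threshold is not part of the H2O API):
def predict_with_threshold(model, frame, threshold=0.2):
    """Return model.predict(frame) with the 'predict' column recomputed as
    1 when p1 >= threshold and 0 otherwise."""
    pred = model.predict(frame)
    pred["predict"] = pred["p1"] >= threshold
    return pred

# Usage with the GBM trained above:
# custom_pred = predict_with_threshold(my_gbm, test, threshold=0.2)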

Ensemble model in H2O with fold_column argument

I am new to H2O in Python. I am trying to model my data with a stacked ensemble, following the example code from H2O's web site (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html).
I applied GBM and RF as base models and then merged them into an ensemble model using stacking. In addition, in my training data I created one additional column named 'fold' to be used as fold_column = "fold".
I applied 10-fold CV and I observed that I only received results from CV fold 1; the predictions from the other 9 CV folds are all empty. What am I missing here?
Here is my sample data:
code:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
from __future__ import print_function
h2o.init(port=23, nthreads=6)
train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)
x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'
cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()
my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
ntrees=10,
max_depth=3,
min_rows=2,
learn_rate=0.2,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column = "fold")
Then I check the CV results with my_gbm.cross_validation_predictions() (output not shown).
In addition, when I try the ensemble on the test set, I get the warning below:
# Train a stacked ensemble using the GBM and GLM above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
pred = ensemble.predict(test)
pred
/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
warnings.warn(w)
Am I missing something about fold_column?
Here is an example of how to use a custom fold column (created from a list). This is a modified version of the example Python code in the Stacked Ensemble page in the H2O User Guide.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
from __future__ import print_function
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Add a fold column, generated from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)
# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
ntrees=10,
keep_cross_validation_predictions=True,
seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
To answer your second question, about how to view the cross-validated predictions of a model: they are stored in two places, but the method you probably want is .cross_validation_holdout_predictions(). This method returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:
In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
1 0.248131 0.751869
1 0.288241 0.711759
1 0.407768 0.592232
1 0.507294 0.492706
0 0.6417 0.3583
1 0.253329 0.746671
1 0.289916 0.710084
1 0.524328 0.475672
1 0.252006 0.747994
[10000 rows x 3 columns]
The second method, .cross_validation_predictions(), is a list that stores the predictions from each fold in an H2OFrame with the same number of rows as the original training frame, where the rows that are not active in that fold have a value of zero. This is usually not the format people find most useful, so I'd recommend using the other method instead.
In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list
In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10
In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
[10000 rows x 3 columns]
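If you want the holdout predictions from .cross_validation_holdout_predictions() alongside the original rows, one option is to column-bind them onto the training frame (a sketch using the my_gbm and train objects from above):
# The holdout predictions are in the original training order,
# so a plain cbind lines them up with the training rows.
cv_preds = my_gbm.cross_validation_holdout_predictions()
train_with_cv = train.cbind(cv_preds)
train_with_cv.head()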
