Strange/unexpected behavior with class_weight and LightGBM - python-3.x

I have an LGBM model that does not use the 'class_weight' parameter.
When I run the model I currently achieve a score of 0.81659.
When I apply class_weight='balanced',
the score drops substantially to 0.78134, a loss of 0.03525.
When I manually compute the weights and apply them with class_weight={0: 0.61378, 2: 0.86751, 1: 4.58652},
the score drops to 0.78129.
I attribute the minor difference between my manual calculation and 'balanced'
to rounding error, since I arbitrarily truncated the weights to 5 decimal places.
There are three labels distributed as follows: Counter({0: 32259, 2: 22824, 1: 4317})
If the labels were distributed as, say, 33%, 33%, 34%, I would expect applying class weights to have
virtually the same impact as not applying them; there would be no reason to expect much effect either way.
But the actual data is quite imbalanced.
I would expect that giving the model knowledge of this imbalance would allow it to make a more informed, i.e. better, prediction, even if 'better' is only a slight improvement.
I certainly would not expect a drop of roughly 3.5 percentage points in accuracy.
Am I not applying the weights correctly?
Main code block:
model = lgb.LGBMClassifier(learning_rate=i,
                           num_leaves=j,
                           objective='multiclass',
                           colsample_bytree=c,
                           max_bin=512,
                           n_estimators=n,
                           class_weight={0: 0.61378, 2: 0.86751, 1: 4.58652},
                           random_state=13,
                           n_jobs=-1,
                           )
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=13)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize performance
# print('Mean accuracy: %.3f' % np.mean(scores))
print(f'For {i} learning_rate, {j} num_leaves, {c} colsample_bytree, {n} n_estimators:\n'
      f'  The Mean accuracy is: {np.mean(scores):.5f}, The Standard Deviation is: {np.std(scores):.3f}')
mean_accuracy.append(np.mean(scores))
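For what it's worth, the 'balanced' option in LGBMClassifier appears to use the same heuristic as scikit-learn, n_samples / (n_classes * np.bincount(y)), so the manual weights can be cross-checked with sklearn.utils.class_weight.compute_class_weight. A minimal sketch, assuming y is the label vector behind the Counter above:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y is assumed to be the label array with Counter({0: 32259, 2: 22824, 1: 4317})
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
print(dict(zip(classes, weights)))
# -> roughly {0: 0.61378, 1: 4.58652, 2: 0.86751}, matching the manual dictionary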

Related

Threshold does not work on numpy array for accuracy metric

I am trying to implement logistic regression from scratch using numpy. I wrote a class with the following methods for a binary classification problem, scored with either BCE loss or accuracy.
def accuracy(self, true_labels, predictions):
    """
    This method implements the accuracy score, where the accuracy is the
    fraction of correct predictions our model makes.
    args:
        true_labels: vector of shape (1, m) that contains the class labels, where
                     m is the number of samples in the batch.
        predictions: vector of shape (1, m) that contains the model predictions.
    """
    counter = 0
    for y_true, y_pred in zip(true_labels, predictions):
        if y_true == y_pred:
            counter += 1
    return counter / len(true_labels)
def train(self, score='loss'):
    """
    This function trains the logistic regression model and updates the
    parameters based on the Batch-Gradient Descent algorithm.
    The function prints the training loss and validation loss on every epoch.
    args:
        X: input features with shape (num_features, m) or (num_features) for a
           singular sample, where m is the size of the dataset.
        Y: gold class labels of shape (1, m) or (1) for a singular sample.
    """
    train_scores = []
    dev_scores = []
    for i in range(self.epochs):
        # perform forward and backward propagation & get the training predictions.
        training_predictions = self.propagation(self.X_train, self.Y_train)
        # get the predictions of the validation data
        dev_predictions = self.predict(self.X_dev, self.Y_dev)
        # calculate the scores of the predictions.
        if score == 'loss':
            train_score = self.loss_function(training_predictions, self.Y_train)
            dev_score = self.loss_function(dev_predictions, self.Y_dev)
        elif score == 'accuracy':
            train_score = self.accuracy((training_predictions == +1).squeeze(), self.Y_train)
            dev_score = self.accuracy((dev_predictions == +1).squeeze(), self.Y_dev)
        train_scores.append(train_score)
        dev_scores.append(dev_score)
    plot_training_and_validation(train_scores, dev_scores, self.epochs, score=score)
After testing the code with the following input:
model = LogisticRegression(num_features=X_train.shape[0],
                           Learning_rate=0.01,
                           Lambda=0.001,
                           epochs=500,
                           X_train=X_train,
                           Y_train=Y_train,
                           X_dev=X_dev,
                           Y_dev=Y_dev,
                           normalize=False,
                           regularize=False,)
I get the following results.
However, when I swap the scoring metric to measure over time from loss to accuracy, as follows: model.train(score='accuracy'), I get the following result:
I have removed normalization and regularization to make sure I am using a simple implementation of logistic regression.
Note that I use an external method to visualize the training/validation score over time in the LogisticRegression.train() method.
The trick you are using to create your predictions before passing them into the accuracy method is wrong: you are using (dev_predictions==+1).
Your model is a logistic regression, which generates values between 0 and 1. Most of the time, the values will NOT be exactly equal to +1.
So essentially, every time you are passing a bunch of False (or 0) values to the accuracy function. I bet if you check, the proportion of labels in your datasets having the value False or 0 would be:
exactly 51.7% in the validation dataset
exactly 56.2% in the training dataset.
To fix this, use an in-between threshold such as 0.5 to generate your labels, i.e. something like dev_predictions > 0.5.
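A minimal sketch of that fix inside train(), assuming the predictions are probabilities in [0, 1]:

elif score == 'accuracy':
    # threshold the predicted probabilities at 0.5 to obtain 0/1 labels
    train_score = self.accuracy((training_predictions > 0.5).squeeze(), self.Y_train)
    dev_score = self.accuracy((dev_predictions > 0.5).squeeze(), self.Y_dev)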

Sklearn logistic regression - adjust cutoff point

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important; however, prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
classifier = LogisticRegression(penalty='l2', solver='saga', multi_class='ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, selecting the threshold is done after training. You can find the threshold that maximizes a utility function of your choice, for example:
from sklearn import metrics

preds = classifier.predict_proba(test_data)
fpr, tpr, thresholds = metrics.roc_curve(test_y, preds[:, 1])
print(thresholds)

accuracy_ls = []
for thres in thresholds:
    y_pred = np.where(preds[:, 1] > thres, 1, 0)
    # Apply desired utility function to y_pred, for example accuracy.
    accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose the threshold that maximizes your chosen utility function; in your case, the threshold that maximizes performance on the class you care about (A, i.e. the positive label 1 in y_pred).
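For instance, a short follow-up that picks the best-scoring threshold from the list above and reuses it for final predictions (a sketch; swap in whichever utility you actually care about):

import numpy as np

best_idx = int(np.argmax(accuracy_ls))
best_threshold = thresholds[best_idx]
print('Best threshold: %.3f, accuracy: %.3f' % (best_threshold, accuracy_ls[best_idx]))

# final hard predictions with the chosen cutoff
y_pred = np.where(preds[:, 1] > best_threshold, 1, 0)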

keras: unsupervised learning with external constraint

I have to train a network on unlabelled data of binary type (True/False), which sounds like unsupervised learning. This is what the normalised data look like:
array([[-0.05744527, -1.03575495, -0.1940105 , -1.15348956, -0.62664491,
-0.98484037],
[-0.05497629, -0.50935675, -0.19396862, -0.68990988, -0.10551919,
-0.72375012],
[-0.03275552, 0.31480204, -0.1834951 , 0.23724946, 0.15504367,
0.29810553],
...,
[-0.05744527, -0.68482282, -0.1940105 , -0.87534175, -0.23580062,
-0.98484037],
[-0.05744527, -1.50366446, -0.1940105 , -1.52435329, -1.14777063,
-0.98484037],
[-0.05744527, -1.26970971, -0.1940105 , -1.33892142, -0.88720777,
-0.98484037]])
However, I do have a constraint on the total number of True labels in my data. This means I can't build a classical custom loss function in Keras taking (y_true, y_pred) arguments as usual: my external constraint is only on the predicted totals of True and False, not on the individual labels.
My question is whether there is a somewhat "standard" approach to this kind of problem, and how it can be implemented in Keras.
POSSIBLE SOLUTION
Should I assign y_true randomly as 0/1, have a network return y_pred as 1/0 with a sigmoid activation function, and then define my loss function as
sum_y_true = 500  # arbitrary constant known a priori

def loss_function(y_true, y_pred):
    loss = np.abs(y_pred.sum() - sum_y_true)
    return loss
In the end, I went with the following solution, which worked.
1) Define batches in your dataframe df with a batch_id column, so that in each batch Y_train is your identical "batch ground truth" (in my case, the total number of True labels in the batch). You can then pass these instances together to the network. This can be done with a generator:
def grouper(g, x, y):
    while True:
        for gr in g.unique():
            # this assigns indices to the entire set of values in g,
            # then subsets to all the rows in which g == gr
            indices = g == gr
            yield (x[indices], y[indices])

# train set
train_generator = grouper(df.loc[df['set'] == 'train', 'batch_id'], X_train, Y_train)
# validation set
val_generator = grouper(df.loc[df['set'] == 'val', 'batch_id'], X_val, Y_val)
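For context, a hypothetical way the per-row "batch ground truth" could be built before splitting into X/Y (known_totals is a made-up dict mapping each batch_id to its known number of True labels, not something from the original code):

# every row in a batch carries the same target: the batch's known total of True labels
df['y_batch_total'] = df['batch_id'].map(known_totals)
Y_train = df.loc[df['set'] == 'train', 'y_batch_total'].values
Y_val = df.loc[df['set'] == 'val', 'y_batch_total'].values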
2) Define a custom loss function that tracks how closely the total number of instances predicted as True matches the ground truth:
def custom_delta(y_true, y_pred):
    loss = K.abs(K.mean(y_true) - K.sum(y_pred))
    return loss

def custom_wrapper():
    def custom_loss_function(y_true, y_pred):
        return custom_delta(y_true, y_pred)
    return custom_loss_function
Note that here:
a) each y_true label is already the sum of the ground truth in our batch (because we don't have individual values), which is why y_true is not summed over;
b) K.mean is a bit of overkill for extracting a single scalar from this uniform tensor, in which all y_true values in each batch are identical; K.min or K.max would also work, but I haven't tested whether they are faster.
3) Use fit_generator instead of fit:
fmodel = Sequential()
# ...your layers...

# Create the loss function object using the wrapper function above
loss_ = custom_wrapper()
fmodel.compile(loss=loss_, optimizer='adam')
history1 = fmodel.fit_generator(train_generator,
                                steps_per_epoch=total_batches,
                                validation_data=val_generator,
                                validation_steps=df.loc[df['set'] == 'val', 'batch_id'].nunique(),
                                epochs=20, verbose=2)
This way the problem is basically treated as supervised learning, although without individual labels, which means that notions like true/false positives are meaningless here.
Not only did this approach give me a y_pred that closely matches the totals I know per batch; it actually finds two groups (True/False) that occupy the expected different portions of parameter space.

Tensorflow- How to display accuracy rate for a linear regression model

I have a linear regression model that seems to work. I first load the data into X and the target column into Y, after that I implement the following...
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data,
    Y_data,
    test_size=0.2
)

rng = np.random
n_rows = X_train.shape[0]

X = tf.placeholder("float")
Y = tf.placeholder("float")

W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

pred = tf.add(tf.multiply(X, W), b)
cost = tf.reduce_sum(tf.pow(pred - Y, 2) / (2 * n_rows))

optimizer = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(cost)

init = tf.global_variables_initializer()
init_local = tf.local_variables_initializer()

with tf.Session() as sess:
    sess.run([init, init_local])
    for epoch in range(FLAGS.training_epochs):
        avg_cost = 0
        for (x, y) in zip(X_train, Y_train):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        # display logs per epoch step
        if (epoch + 1) % FLAGS.display_step == 0:
            c = sess.run(
                cost,
                feed_dict={X: X_train, Y: Y_train}
            )
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(c))
    print("Optimization Finished!")
    accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0),
                                                predictions=tf.argmax(pred, 0))
    print(sess.run(accuracy))
I cannot figure out how to print out the model's accuracy. In sklearn, for example, it is simple: if you have a model, you just call model.score(X_test, Y_test). But I do not know how to do this in TensorFlow, or whether it is even possible.
I think I'd be able to calculate the Mean Squared Error. Does this help in any way?
EDIT
I tried implementing tf.metrics.accuracy as suggested in the comments but I'm having an issue implementing it. The documentation says it takes 2 arguments, labels and predictions, so I tried the following...
accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0),
                                            predictions=tf.argmax(pred, 0))
print(sess.run(accuracy))
But this gives me an error...
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value accuracy/count
[[Node: accuracy/count/read = Identity[T=DT_FLOAT, _class=["loc:@accuracy/count"], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
How exactly does one implement this?
It turns out that, since this is a multi-class linear regression problem and not a classification problem, tf.metrics.accuracy is not the right approach.
Instead of displaying the accuracy of my model as a percentage, I focused on reducing the Mean Squared Error (MSE).
From looking at other examples, tf.metrics.accuracy is never used for linear regression, only for classification. Normally tf.metrics.mean_squared_error is the right approach.
I implemented two ways of calculating the total MSE of my predictions to my testing data...
pred = tf.add(tf.matmul(X, W), b)
...
...
Y_pred = sess.run(pred, feed_dict={X: X_test})
mse = tf.reduce_mean(tf.square(Y_pred - Y_test))
OR
mse = tf.metrics.mean_squared_error(labels=Y_test, predictions=Y_pred)
They both do the same but obviously the second approach is more concise.
There's a good explanation of how to measure the accuracy of a Linear Regression model here.
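One caveat worth adding: tf.metrics.mean_squared_error is a streaming metric, so it returns both a value tensor and an update op and relies on local variables. A minimal sketch of evaluating it (assuming sess, Y_test and Y_pred as above):

mse, mse_update_op = tf.metrics.mean_squared_error(labels=Y_test, predictions=Y_pred)

sess.run(tf.local_variables_initializer())  # the metric's accumulators are local variables
sess.run(mse_update_op)                     # accumulate statistics for this batch
print("MSE: %.4f" % sess.run(mse))          # read out the aggregated value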
I didn't think this was clear at all from the Tensorflow documentation, but you have to declare the accuracy operation, and then initialize all global and local variables, before you run the accuracy calculation:
accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0),
                                            predictions=tf.argmax(pred, 0))
# ...
init_global = tf.global_variables_initializer()
init_local = tf.local_variables_initializer()
sess.run([init_global, init_local])
# ...
# run accuracy calculation
I read something on Stack Overflow about the accuracy calculation using local variables, which is why the local variable initializer is necessary.
After reading the complete code you posted, I noticed a couple of other things:
In your calculation of pred, you use pred = tf.add(tf.multiply(X, W), b). tf.multiply performs element-wise multiplication and will not give you the fully connected layers you need for a neural network (which I am assuming is what you are ultimately working toward, since you're using TensorFlow). To implement fully connected layers, where each layer i (including the input and output layers) has n_i nodes, you need separate weight and bias matrices for each pair of successive layers. The dimensions of the i-th weight matrix (the weights between the i-th layer and the (i+1)-th layer) should be (n_i, n_{i+1}), and the i-th bias matrix should have dimensions (n_{i+1}, 1). Then, going back to the multiplication operation: replace tf.multiply with tf.matmul, and you're good to go. I assume that what you have is probably fine for a single-class linear regression problem, but this is definitely the way you want to go if you plan to solve a multiclass regression problem or implement a deeper network.
Your weight and bias tensors have a shape of (1, 1). You give the variables the initial value of np.random.randn(), which, according to the documentation, generates a single floating-point number when no arguments are given. The dimensions of your weight and bias tensors need to be supplied as arguments to np.random.randn(). Better yet, you can initialize these to random values directly in TensorFlow: W = tf.Variable(tf.random_normal([dim0, dim1], seed=seed)) (I always initialize random variables with a seed value for reproducibility).
Just a note in case you don't know this already, but non-linear activation functions are required for neural networks to be effective. If all your activations are linear, then no matter how many layers you have, the network will reduce to a simple linear regression in the end. Many people use relu activation for hidden layers. For the output layer, use softmax activation for multiclass classification problems where the output classes are exclusive (i.e., where only one class can be correct for any given input), and sigmoid activation for multiclass classification problems where the output classes are not exclusive.
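To tie points 1 and 2 together, a hypothetical single hidden layer along those lines might look like this (a sketch only; n_input and n_hidden are made-up sizes, and the row-major convention that tf.matmul expects for X of shape (batch, features) is used):

seed = 13
n_input, n_hidden = 6, 16  # hypothetical layer sizes, not from the original code

X = tf.placeholder("float", shape=[None, n_input])

# weight matrix of shape (n_i, n_{i+1}) and matching bias, initialized in TensorFlow
W1 = tf.Variable(tf.random_normal([n_input, n_hidden], seed=seed), name="W1")
b1 = tf.Variable(tf.random_normal([n_hidden], seed=seed), name="b1")

# tf.matmul instead of tf.multiply, plus a non-linear activation
hidden = tf.nn.relu(tf.add(tf.matmul(X, W1), b1))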

Reporting accuracy and loss issues with MonitoredTrainingSession

I am performing transfer learning on InceptionV3 for a dataset of 5 types of flowers. All layers are frozen except the output layer. My implementation is heavily based on the Cifar10 tutorial from TensorFlow, and the input dataset is formatted in the same way as Cifar10.
I have added a MonitoredTrainingSession (like in the tutorial) to report the accuracy and loss after a certain number of steps. Below is the section of the code for the MonitoredTrainingSession (almost identical to the tutorial):
class _LoggerHook(tf.train.SessionRunHook):
    def begin(self):
        self._step = -1
        self._start_time = time.time()

    def before_run(self, run_context):
        self._step += 1
        return tf.train.SessionRunArgs([loss, accuracy])

    def after_run(self, run_context, run_values):
        if self._step % LOG_FREQUENCY == 0:
            current_time = time.time()
            duration = current_time - self._start_time
            self._start_time = current_time

            loss_value = run_values.results[0]
            acc = run_values.results[1]

            examples_per_sec = LOG_FREQUENCY / duration
            sec_per_batch = duration / LOG_FREQUENCY

            format_str = ('%s: step %d, loss = %.2f, acc = %.2f (%.1f examples/sec; %.3f sec/batch)')
            print(format_str % (datetime.now(), self._step, loss_value, acc,
                                examples_per_sec, sec_per_batch))

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

if MODE == 'train':
    file_writer = tf.summary.FileWriter(LOGDIR, tf.get_default_graph())
    with tf.train.MonitoredTrainingSession(
            save_checkpoint_secs=70,
            checkpoint_dir=LOGDIR,
            hooks=[tf.train.StopAtStepHook(last_step=NUM_EPOCHS*NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN),
                   tf.train.NanTensorHook(loss),
                   _LoggerHook()],
            config=config) as mon_sess:
        original_saver.restore(mon_sess, INCEPTION_V3_CHECKPOINT)
        print("Proceeding to training stage")

        while not mon_sess.should_stop():
            mon_sess.run(train_op, feed_dict={training: True})
            print('acc: %f' % mon_sess.run(accuracy, feed_dict={training: False}))
            print('loss: %f' % mon_sess.run(loss, feed_dict={training: False}))
When the two lines printing the accuracy and loss under mon_sess.run(train_op, ...) are removed, the loss and accuracy printed from after_run report, after surprisingly only 20 minutes of training, that the model is performing very well on the training set and that the loss is decreasing. Even the moving-average loss was reporting great results. It eventually reaches greater than 90% accuracy for multiple random batches.
After the training session had been reporting high accuracy for a while, I stopped it, restored the model, and ran it on random batches from the same training set. It performed poorly, only achieving between 50% and 85% accuracy. I confirmed it was restored properly because it did perform better than a model with an untrained output layer.
I then went back to training from the last checkpoint. The accuracy was initially low, but after about 10 mini-batch runs it went back above 90%. I then repeated the process, but this time added the two lines evaluating the loss and accuracy after the training operation. Those two evaluations reported that the model was having trouble converging and was performing poorly, while the evaluations via before_run and after_run now only occasionally showed high accuracy and low loss (the results jumped around). Still, after_run sometimes reported 100% accuracy (I think it is no longer consistent because after_run is also called for mon_sess.run(accuracy, ...) and mon_sess.run(loss, ...)).
Why would the results reported from MonitoredTrainingSession indicate the model is performing well when it really isn't? Aren't the two operations in SessionRunArgs fed the same mini-batch as train_op, indicating model performance on the batch before the gradient update?
Here is the code I used for restoring and testing the model (based on the cifar10 tutorial):
elif MODE == 'test':
    init = tf.global_variables_initializer()
    ckpt = tf.train.get_checkpoint_state(LOGDIR)
    if ckpt and ckpt.model_checkpoint_path:
        with tf.Session(config=config) as sess:
            init.run()
            saver = tf.train.Saver()
            print(ckpt.model_checkpoint_path)
            saver.restore(sess, ckpt.model_checkpoint_path)
            global_step = tf.contrib.framework.get_or_create_global_step()

            coord = tf.train.Coordinator()
            threads = []
            try:
                for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
                    threads.extend(qr.create_threads(sess, coord=coord, daemon=True, start=True))
                print('model restored')
                i = 0
                num_iter = 4 * NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / BATCH_SIZE
                print(num_iter)
                while not coord.should_stop() and i < num_iter:
                    print("loss: %.2f," % loss.eval(feed_dict={training: False}), end="")
                    print("acc: %.2f" % accuracy.eval(feed_dict={training: False}))
                    i += 1
            except Exception as e:
                print(e)
                coord.request_stop(e)
            coord.request_stop()
            coord.join(threads, stop_grace_period_secs=10)
Update:
So I was able to fix the issue. However, I am not sure why it worked. In the arg_scope for the Inception model I was passing an is_training boolean placeholder for the batch norm and dropout used by Inception. However, when I removed the placeholder and just set the is_training keyword to True, the accuracy on the training set when the model was restored was extremely high. This was the same model checkpoint that previously performed poorly. When I trained it, I always had the is_training placeholder set to True. Having is_training set to True while testing means batch norm is now using the sample mean and variance.
Why would telling batch norm to use the sample mean and sample standard deviation, as it does during training, increase the accuracy?
This would also mean that the dropout layer is dropping units, and that the model's accuracy during testing on both the training set and the test set is higher with the dropout layer enabled.
Update 2
I went through the TensorFlow Slim InceptionV3 model code that the arg_scope in the code above is referencing. I removed the final dropout layer after the 8x8 average pool and the accuracy remained at around 99%. However, when I set is_training to False only for the batch norm layers, the accuracy dropped back to around 70%. Here is the arg_scope from slim\nets\inception_v3.py and my modification:
with variable_scope.variable_scope(
        scope, 'InceptionV3', [inputs, num_classes], reuse=reuse) as scope:
    with arg_scope(
            [layers_lib.batch_norm], is_training=False):  # layers_lib.dropout], is_training=is_training):
        net, end_points = inception_v3_base(
            inputs,
            scope=scope,
            min_depth=min_depth,
            depth_multiplier=depth_multiplier)
I tried this both with the dropout layer removed and with the dropout layer kept while passing is_training=True to it.
(Summarizing from dylan7's debugging in the question's comments)
Batch norm relies on variables to save the summary statistics it normalizes with. These are only updated when is_training is True, through ops in the UPDATE_OPS collection (see the batch_norm documentation). If these update ops don't get run (or the variables are overwritten), there may be transient "reasonable" statistics based on each batch, which get lost when is_training is False (test data is not, and should not be, used to inform the batch_norm summary statistics).
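A minimal sketch of the usual remedy, making the train op depend on the batch-norm update ops so the moving statistics are actually refreshed during training (the optimizer here is just a placeholder; the original train_op construction is not shown in the question):

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # the batch-norm moving means/variances are updated before each training step
    train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)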
