Why acc of char-level cnn for text classification stay unchanged - nlp

I misused binary cross-entropy for softmax, changed to categorical cross-entropy. And did some reviewing about details of the problem below in my own answer
I am trying to using open source data: sogou_news_csv(converted to pinyin using jieba from for text classification following https://arxiv.org/abs/1502.01710 "Text understanding from scratch" by Xiang Zhang and Yann LeCun. (mainly follow the idea of using character level CNN, but the structure proposed in the paper).
I did the preprocessing by using one-hot encoding according to a alphabet collection and filling all those not in the alphabet collection with 0s.
As a result, I got the training data with the shape of (450000, 1000, 70),(data_size, sequence_length, alphabet_size).
Then I feed the data into a cnn structure following http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/.
Problem is
During the training, the loss and acc merely change, I tried preprocessing again for the data, and tried different learning rate settings, but not helpful, So what went wrong?
Below is one-hot encoding:
import numpy as np
all_letters = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_##$%^&*~`+-=<>()[]{}\n"
n_letters = len(all_letters)
def letterToIndex(letter):
"""
'c' -> 2
"""
return all_letters.find(letter)
def sets2tensors(clean_train, n_letters=n_letters, MAX_SEQUENCE_LENGTH=1000):
"""
From lists of cleaned passages to np.array with shape(len(train),
max_sequence_length, len(dict))
Arg:
obviously
"""
m = len(clean_train)
x_data = np.zeros((m, MAX_SEQUENCE_LENGTH, n_letters))
for ix in range(m):
for no, letter in enumerate(clean_train[ix]):
if no >= 1000:
break
letter_index = letterToIndex(letter)
if letter != -1:
x_data[ix][no][letter_index] = 1
else:
continue
return x_data
This is the Model:
num_classes = 5
from keras.models import Sequential
from keras.layers import Activation, GlobalMaxPool1D, Merge, concatenate, Conv1D, Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD
submodels = []
for kw in (3, 4, 5): # kernel sizes
submodel = Sequential()
submodel.add(Conv1D(32,
kw,
padding='valid',
activation='relu',
strides=1, input_shape=(1000, n_letters)))
submodel.add(GlobalMaxPool1D())
submodels.append(submodel)
big_model = Sequential()
big_model.add(Merge(submodels, mode="concat"))
big_model.add(Dense(64))
big_model.add(Dropout(0.5))
big_model.add(Activation('relu'))
big_model.add(Dense(num_classes))
big_model.add(Activation('softmax'))
print('Compiling model')
opt = SGD(lr=1e-6) # tried different learning rate from 1e-6 to 1e-1
# changed from binary crossentropy to categorical_crossentropy
big_model.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Some results
Train on 5000 samples, validate on 5000 samples
Epoch 1/5
5000/5000 [==============================] - 54s - loss: 0.5198 - acc: 0.7960 - val_loss: 0.5001 - val_acc: 0.8000
Epoch 2/5
5000/5000 [==============================] - 56s - loss: 0.5172 - acc: 0.7959 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 3/5
5000/5000 [==============================] - 56s - loss: 0.5198 - acc: 0.7965 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 4/5
5000/5000 [==============================] - 57s - loss: 0.5222 - acc: 0.7950 - val_loss: 0.4999 - val_acc: 0.8000
Epoch 5/5
5000/5000 [==============================] - 59s - loss: 0.5179 - acc: 0.7960 - val_loss: 0.4999 - val_acc: 0.8000

I found that the problem is I accidentally used binary cross-entropy(that I used for another dataset) with softmax, which should be categorical cross-entropy. Initially, I figured it is a stupid bug since I didn't carefully check the code and logic.
But then I found I don't really understand what is going on here, I mean, I know the difference between binary cross-entropy and categorical cross-entropy, but I don't really understand the details why softmax and categorical cross-entropy can't be chained together.
Luckily, I found a very nice explanation here(did not expect anyone would actually ask or answer this question)
https://www.reddit.com/r/MachineLearning/comments/39bo7k/can_softmax_be_used_with_cross_entropy/#cs2b4jx
Basically what it is saying is that in binary cross-entropy case, the loss function is treating two different values of a single bit as two different class: like 1 for A and 0 for B, despite that with categorical cross-entropy case, the loss function is taking a vector like [0,0,0,1,0] a label, in which the value of a bit stands for the confidence or probability for the corresponding training example being that particular class.
With description above, when we apply binary cross-entropy to softmax, we are misusing the definition of what one bit means in that setting, thus make no sense.

You have set SGD optimizer to 0.000001 (opt = SGD(lr=1e-6))
The default learning rate for SGD is 0.01
keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
I suspect that 1e-6 is to small, try increase it and/or try a different optimizer

Related

Zero validation loss and validation accuracy at classification problem

I'm running a multiclass classification problem using the below resnet model:
resnet = tf.keras.applications.ResNet50(
include_top=False ,
weights='imagenet' ,
input_shape=(96, 96, 3) ,
pooling="avg"
)
for layer in resnet.layers:
layer.trainable = True
model_resnet = tf.keras.Sequential()
model_resnet.add(resnet)
model_resnet.add(tf.keras.layers.Flatten())
model_resnet.add(tf.keras.layers.Dense(8, activation='softmax',name='output') )
model_resnet.compile( loss="sparse_categorical_crossentropy" , optimizer=tf.keras.optimizers.Adam(learning_rate=0.001) ,metrics=['accuracy'])
I also used a train and a test generator as below:
train_generator=img_gen.flow_from_dataframe(dataframe=train_dataset,x_col="file_loc",y_col='expr',target_size=(96, 96),batch_size=91,class_mode="raw")
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col=None,shuffle=False,class_mode=None)
when I am running the code below I get the wanted results and everything works fine
model_resnet.fit_generator(train_generator,
steps_per_epoch=STEP_SIZE_TRAIN_resnet,
epochs=20
)
I wanted to compute the validation accuracy of every epoch so I wrote something like this
model_path = f"/content/weights" + "{val_accuracy:.4f}.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(
model_path,
monitor='val_accuracy',
save_best_only=True,
mode='max',
verbose=1
)
history = model_resnet.fit_generator(
train_generator,
epochs=5,
steps_per_epoch=STEP_SIZE_TRAIN_resnet,
validation_data=test_generator,
validation_steps=STEP_SIZE_TEST_resnet,
max_queue_size=1,
shuffle=True,
callbacks=[checkpoint],
verbose=1
)
The problem is that for every epoch the validation loss and validation accuracy remain zero even though the training loss and accuracy change. I ran this code for over 20 epochs and it doesn't change at all. I can't find what am I doing wrong since without this it works perfectly,does anyone have any idea?
Epoch 1: val_accuracy improved from -inf to 0.00000, saving model to /content/weights0.0000.hdf5
500/500 [==============================] - 30s 60ms/step - loss: 1.0213 - accuracy: 0.6546 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/5
500/500 [==============================] - ETA: 0s - loss: 0.9644 - accuracy: 0.6672
Epoch 2: val_accuracy did not improve from 0.00000
500/500 [==============================] - 29s 58ms/step - loss: 0.9644 - accuracy: 0.6672 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Edit: I didn't specify the test labels of the test dataset because I used to compute the accuracy score as below:
y_pred = model_resnet.predict(test_generator)
y_pred_max = np.argmax(y_pred, axis=1)
y_true = test_dataset["expr"].to_numpy()
print("accuracy",accuracy_score(y_true, y_pred_max))
I changed the test_generator as below:
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col='expr',shuffle=False,class_mode=None)
but nothing has changed, it still results in zero
As #Dr.Snoopy said, the problems were that I didn't specify the test labels in these generator (which are required to compute accuracy) and I had different class modes in the generator,the correct was "raw" in both.

Why i've got a three different MSE values

I wrote an mlp and want start to tune it to fit a best results. But i've stucked with several different MSE.
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy
import joblib
# load dataset
#dataframe = read_csv("housing.csv", delim_whitespace=True, header=None)
dataframe = read_csv("100.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:6]
Y = dataset[:,6]
# define the model
def larger_model():
# create model
model = Sequential()
model.add(Dense(20, input_dim=6, kernel_initializer='normal', activation='relu'))
model.add(Dense(50, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='linear'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae','mse'])
return model
# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=larger_model, epochs=100, batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=2)
results = cross_val_score(pipeline, X, Y, cv=kfold)
pipeline.fit(X, Y)
prediction = pipeline.predict(X)
result_test = Y
print("%.2f (%.2f) MSE" % (results.mean(), results.std()))
print('Mean Absolute Error:', metrics.mean_absolute_error(prediction, result_test))
print('Mean Squared Error:', metrics.mean_squared_error(prediction, result_test))
Gives me that result:
Epoch 98/100
200/200 [==============================] - 0s 904us/step - loss: 0.0086 - mae: 0.0669 - mse: 0.0086
Epoch 99/100
200/200 [==============================] - 0s 959us/step - loss: 0.0032 - mae: 0.0382 - mse: 0.0032
Epoch 100/100
200/200 [==============================] - 0s 894us/step - loss: 0.0973 - mae: 0.2052 - mse: 0.0973
200/200 [==============================] - 0s 600us/step
21.959478
-0.03 (0.02) MSE
Mean Absolute Error: 0.1959771416462339
Mean Squared Error: 0.0705598179059006
So i see here a 3 different mse results. Why so and which one i should take in mind to understand an overall model score when i willbe tune it?
Basically what I understood was if you print the results variable then you will get 2 MSE because you used n_splits=2.
-0.03 (0.02) MSE
Above output is the mean or average of the results(MSE) and std of the results(MSE).
Epoch 100/100
200/200 [==============================] - 0s 894us/step - loss: 0.0973 - mae: 0.2052 - mse: 0.0973
Above outputs mse = 0.0973 this is I think for split=2 and it will take only 50% of whole data(X) because remaining 50% it will take as validation data.
Mean Squared Error: 0.0705598179059006
Above output is coming where you are predicting on whole data, not 50% by using best model so obviously, you will get 3 different MSEs for the above 3 prints.
I am also solving a very similar kind of problem, so do one thing divide the dataset into train and test and use train data for training and when you are predicting use test dataset then calculate MSE on test data or else keep this as it is and take Mean Squared Error: 0.0705598179059006 as your final mse.

Validation and Test accuracy at random performance, whereas Train accuracy very high

I am trying to build a classifier in TensorFlow2.1 for CIFAR10 using ResNet50 pre-trained over imagenet from keras.application and then stacking a small FNN on top of it:
# Load ResNet50 pre-trained on imagenet
resn = applications.resnet50.ResNet50(weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling='avg', include_top=False)
# Load CIFAR10
(c10_train, c10_test), info = tfds.load(name='cifar10', split=['train', 'test'], with_info=True, as_supervised=True)
# Make sure all the layers are not trainable
for layer in resn.layers:
layer.trainable = False
# Transfert Learning for CIFAR10: fine-tune the network by stacking a trainable FNN on top of Resnet
from tensorflow.keras import models, layers
def build_model():
model = models.Sequential()
# Feature extractor
model.add(resn)
# Small FNN
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.4))
model.add(layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
metrics=['accuracy'])
return model
# Build the resulting net
resn50_c10 = build_model()
I am facing the following issue when it comes to validate or test the accuracy:
history = resn50_c10.fit_generator(c10_train.shuffle(1000).batch(BATCH_SIZE), validation_data=c10_test.batch(BATCH_SIZE), epochs=20)
Epoch 1/20
25/25 [==============================] - 113s 5s/step - loss: 0.9659 - accuracy: 0.6634 - val_loss: 2.8157 - val_accuracy: 0.1000
Epoch 2/20
25/25 [==============================] - 109s 4s/step - loss: 0.8908 - accuracy: 0.6920 - val_loss: 2.8165 - val_accuracy: 0.1094
Epoch 3/20
25/25 [==============================] - 116s 5s/step - loss: 0.8743 - accuracy: 0.7038 - val_loss: 2.7555 - val_accuracy: 0.1016
Epoch 4/20
25/25 [==============================] - 132s 5s/step - loss: 0.8319 - accuracy: 0.7166 - val_loss: 2.8398 - val_accuracy: 0.1013
Epoch 5/20
25/25 [==============================] - 132s 5s/step - loss: 0.7903 - accuracy: 0.7253 - val_loss: 2.8624 - val_accuracy: 0.1000
Epoch 6/20
25/25 [==============================] - 132s 5s/step - loss: 0.7697 - accuracy: 0.7325 - val_loss: 2.8409 - val_accuracy: 0.1000
Epoch 7/20
25/25 [==============================] - 132s 5s/step - loss: 0.7515 - accuracy: 0.7406 - val_loss: 2.7697 - val_accuracy: 0.1000
#... (same for the remaining epochs)
Although the model seems to learn adequately from the training split, both the accuracy and loss for the validation set does not improve at all. What is causing this behavior?
I am excluding this is overfitting since I am applying Dropout and since the model seems to never really improve on the test set.
What I have done so far:
Check the one-hot labelling is consistent throughout train and test
Tried different FNN configurations
Tried the method fit_generator instead of fit
Preprocess the image, resized the images w/ different input_shapes
and experienced always the same problem.
Any hint would be extremely appreciated.
The problem is likely due to loading data using tfds and then passing to Keras .fit
Try to load your data with
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
And then
fit(x=x_train, y=y_train, batch_size=BATCH_SIZE, epochs=20, verbose=1, callbacks=None, validation_split=0.2, validation_data=None, shuffle=True)
Apparently the problem was caused uniquely by the use of ResNet50.
As a workaround, I downloaded and used other pre-trained deep networks such as keras.applications.vgg16.VGG16, keras.applications.densenet.DenseNet121 and the accuracy on the test set increased as expected.
UPDATE
The above part of this answer is just a palliative. In order to understand what is really happening and eventually use transfer learning properly with ResNet50, keep on reading.
The root cause appears to be found in how Keras handles the Batch Normalization layer:
During fine-tuning, if a Batch Normalization layer is frozen it uses the mini-batch statistics. I believe this is incorrect and it can lead to reduced accuracy especially when we use Transfer learning. A better approach in this case would be to use the values of the moving mean and variance.
As explained more in-depth here: https://github.com/keras-team/keras/pull/9965
Even though the correct approach has been implemented in TensorFlow 2 when we use tf.keras.applications we reference the TensorFlow 1.0 behavior for Batch Normalization. That's why we need to explicitly inject the reference to TensorFlow 2 by adding the argument layers=tf.keras.layers when loading modules. So in my case, the loading of ResNet50 will become
history = resn50_c10.fit_generator(c10_train.shuffle(1000).batch(BATCH_SIZE), validation_data=c10_test.batch(BATCH_SIZE), epochs=20, layers=tf.keras.layers)
and that will do the trick.
Credits for the solution to #rpeloff: https://github.com/keras-team/keras/pull/9965#issuecomment-549126009

loss function returns Nan in join mode of CRF keras-contrib

I use a BiLSTM-CRF architecture to assign some labels to a sequence of the sentences in a paper. We have 150 papers each of which contains 380 sentences and each sentence is represented by a double array with size 11 in range (0,1) and the number of class labels is 11.
input = Input(shape=(None,11))
mask = Masking (mask_value=0)(input)
lstm = Bidirectional(LSTM(50, return_sequences=True))(mask)
lstm = Dropout(0.3)(lstm)
lstm= TimeDistributed(Dense(50, activation="relu"))(lstm)
crf = CRF(11 , sparse_target=False , learn_mode='join') # CRF layer
out = crf(lstm)
model = Model(input, out)
model.summary()
model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
I use keras-contrib package to implement CRF layer. CRF layer has two learning modes: join mode and marginal mode. I know that join mode is a real CRF that uses viterbi algorithm to predict the best path. While, marginal mode is not a real CRF that uses categorical-crossentropy for computing loss function.
When I use marginal mode, the output is as follows:
Epoch 4/250: - 6s - loss: 1.2289 - acc: 0.5657 - val_loss: 1.3459 - val_acc: 0.5262
But, in the join mode, the value of loss function is nan:
Epoch 2/250 : - 5s - loss: nan - acc: 0.1880 - val_loss: nan - val_acc: 0.2120
I do not understand why this happens and would be grateful to anybody who could give a hint.

How to interpret Keras model.fit output?

I've just started using Keras. The sample I'm working on has a model and the following snippet is used to run the model
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
y_one_hot = label_binarizer.fit_transform(y_train)
model.compile('adam', 'categorical_crossentropy', ['accuracy'])
history = model.fit(X_normalized, y_one_hot, nb_epoch=3, validation_split=0.2)
I get the following response:
Using TensorFlow backend. Train on 80 samples, validate on 20 samples Epoch 1/3
32/80 [===========>..................] - ETA: 0s - loss: 1.5831 - acc:
0.4062 80/80 [==============================] - 0s - loss: 1.3927 - acc:
0.4500 - val_loss: 0.7802 - val_acc: 0.8500 Epoch 2/3
32/80 [===========>..................] - ETA: 0s - loss: 0.9300 - acc:
0.7500 80/80 [==============================] - 0s - loss: 0.8490 - acc:
0.8000 - val_loss: 0.5772 - val_acc: 0.8500 Epoch 3/3
32/80 [===========>..................] - ETA: 0s - loss: 0.6397 - acc:
0.8750 64/80 [=======================>......] - ETA: 0s - loss: 0.6867 - acc:
0.7969 80/80 [==============================] - 0s - loss: 0.6638 - acc:
0.8000 - val_loss: 0.4294 - val_acc: 0.8500
The documentation says that fit returns
A History instance. Its history attribute contains all information
collected during training.
Does anyone know how to interpret the history instance?
For example, what does 32/80 mean? I assume 80 is the number of samples but what is 32? ETA: 0s ??
ETA = Estimated Time of Arrival.
80 is the size of your training set, 32/80 and 64/80 mean that your batch size is 32 and currently the first batch (or the second batch respectively) is being processed.
loss and acc refer to the current loss and accuracy of the training set.
At the end of each epoch your trained NN is evaluated against your validation set. This is what val_loss and val_acc refer to.
The history object returned by model.fit() is a simple class with some fields, e.g. a reference to the model, a params dict and, most importantly, a history dict. It stores the values of loss and acc (or any other used metric) at the end of each epoch. For 2 epochs it will look like this:
{
'val_loss': [16.11809539794922, 14.12947562917035],
'val_acc': [0.0, 0.0],
'loss': [14.890108108520508, 12.088571548461914],
'acc': [0.0, 0.25]
}
This comes in very handy if you want to visualize your training progress.
Note: if your validation loss/accuracy starts increasing while your training loss/accuracy is still decreasing, this is an indicator of overfitting.
Note 2: at the very end you should test your NN against some test set that is different from you training set and validation set and thus has never been touched during the training process.
32 is your batch size. 32 is the default value that you can change in your fit function if you wish to do so.
After the first batch is trained Keras estimates the training duration (ETA: estimated time of arrival) of one epoch which is equivalent to one round of training with all your samples.
In addition to that you get the losses (the difference between prediction and true labels) and your metric (in your case the accuracy) for both the training and the validation samples.

Resources