GradCam applied to video sequence classification with TimeDistributed CNN and LSTM - keras

After days of working on this, I haven't found any reasonable way of doing it, so here I am.
I have a network that aims to predict the next video class, given the features of the current one. Each video is composed of 30 frames. The idea is to apply a feature extraction method to each input, then feed the result into an LSTM + Dense layer to make the prediction.
Here is the code:
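from keras.applications.inception_v3 import InceptionV3
from keras.initializers import glorot_normal, he_uniform
from keras.layers import Dense, GlobalAveragePooling2D, Input, LSTM, TimeDistributed
from keras.models import Model
from keras.optimizers import Adam

# One sample is a sequence of 30 RGB frames of 299x299 pixels
video = Input(shape=(30,299,299,3))
# Per-frame feature extractor: InceptionV3 without its classifier head
inc = InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
cnn_out = GlobalAveragePooling2D()(inc.output)
cnn = Model(inputs=inc.input, outputs=cnn_out)
# Apply the same CNN to every frame, then summarize the sequence with an LSTM
encoded_frames = TimeDistributed(cnn)(video)
encoded_sequence = LSTM(128, activation='relu', return_sequences=False, kernel_initializer=he_uniform(), bias_initializer='zeros', dropout=0.5)(encoded_frames)
hidden_layer = Dense(1024, activation='relu', kernel_initializer=he_uniform(), bias_initializer='zeros')(encoded_sequence)
outputs = Dense(4, activation="softmax", kernel_initializer=glorot_normal(), bias_initializer='zeros')(hidden_layer)
model = Model(inputs=[video], outputs=outputs)
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])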
I would like to visualize the feature activations at the CNN stage for each image, so that by looking at the saliency map for each input image I can understand which features are more important than others for this kind of prediction.
All the examples on the internet deal with just one CNN and one input image. Is there any way of doing this?
Any help is really appreciated, thanks!
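Not a full answer, but the per-frame Grad-CAM arithmetic can be sketched independently of the model. The sketch below assumes you have already extracted, for every frame, the activations of the last convolutional layer and the gradient of the predicted class score with respect to those activations (how to get them out of the TimeDistributed wrapper depends on your Keras version and is not shown). The names grad_cam_per_frame, acts and grads are made up for illustration, and random arrays stand in for real tensors. The 8x8x2048 shape is what InceptionV3 produces before pooling for a 299x299 input.

```python
import numpy as np

def grad_cam_per_frame(activations, gradients):
    """Grad-CAM heatmaps for a sequence of frames.

    activations: (frames, H, W, C) feature maps of the chosen conv layer.
    gradients:   (frames, H, W, C) d(class score) / d(activations).
    Returns:     (frames, H, W) heatmaps, each scaled to [0, 1].
    """
    # Channel weights: average the gradients over the spatial dimensions.
    weights = gradients.mean(axis=(1, 2))                   # (frames, C)
    # Weighted sum of the feature maps over channels, followed by ReLU.
    cams = np.einsum("fhwc,fc->fhw", activations, weights)
    cams = np.maximum(cams, 0.0)
    # Normalize each frame independently; leave all-zero maps untouched.
    peaks = cams.max(axis=(1, 2), keepdims=True)
    return cams / np.where(peaks > 0, peaks, 1.0)

# Hypothetical stand-in data: 30 frames of 8x8x2048 feature maps.
rng = np.random.default_rng(0)
acts = rng.random((30, 8, 8, 2048))
grads = rng.standard_normal((30, 8, 8, 2048))

heatmaps = grad_cam_per_frame(acts, grads)
print(heatmaps.shape)  # (30, 8, 8)
```

Each 8x8 map would then be resized to 299x299 and overlaid on its frame, giving one saliency map per input image.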

Related

Image Regression: Width of Squares

I have a dataset with lots of pictures and each of these pictures shows me a rectangle with a certain width. My task now is to automatically detect the width of these rectangles by image recognition, and I have trained a CNN for an image regression like in the code below.
However, this CNN gives me very bad values, i.e. MSEs in the range of 4,000,000 and very imprecise estimates of the actual widths. During my experiments I even used the training data set as the test data set for the time being, but even here the CNN doesn't seem to learn anything useful.
Do you have an idea what I could be doing wrong? Is it possible that I somehow distort the images themselves while reading them in?
I'm rather new to Machine Learning, so I'm happy about every input you give me! :-)
This is the model:
def create_model():
    model = Sequential()
    model.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))
    model.add(Flatten())
    model.add(Dense(64, activation="relu"))
    model.add(Dense(1))
    model.compile(loss="mse", optimizer="adam")
    return model
And this is the training code:
classifier = create_model()
# Getting image id and its corresponding square width
data = pd.read_csv('../data/data.csv')
id_width = data[['id', 'width']]
# Training the model
train_datagen = ImageDataGenerator()
training_set = train_datagen.flow_from_dataframe(dataframe=id_width, directory='../data/images',
                                                 x_col="id", y_col="width", has_ext=True,
                                                 class_mode="raw", target_size=(64, 64),
                                                 batch_size=32)
classifier.fit_generator(
    training_set,
    epochs=50,
    validation_data=training_set)
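A small sketch to put the reported loss in perspective, not a diagnosis. An MSE of 4,000,000 corresponds to a typical error of 2000 in the units of the width column. If the widths themselves are large numbers (an assumption, the values below are invented), one thing worth trying is to standardize the target before training and undo the scaling after predicting. The variable names are hypothetical.

```python
import numpy as np

# Root of the reported MSE: error in the same units as the widths.
rmse = float(np.sqrt(4_000_000))
print(rmse)  # 2000.0

# Hypothetical widths; in the question these would come from data['width'].
widths = np.array([1200.0, 2500.0, 3100.0, 4800.0])

# Standardize the regression target so the loss starts on a unit scale.
mean, std = widths.mean(), widths.std()
scaled = (widths - mean) / std

# After model.predict(...), map predictions back to the original units.
restored = scaled * std + mean
```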

Data preparation for variable length video classification

I am doing video classification for action detection using Keras (v.2.3.1). My model is a CNN plus an LSTM. My dataset consists of videos of 3-7 seconds, each representing a specific action. I use OpenCV to get frames from each video, but since the video lengths differ, I get a different number of frames for each video. However, I need the same number of frames for the LSTM layer. I searched a bit and it looks like padding along with a masking layer should do it, but I can't figure out how to do this with Keras. Any help is appreciated. Here is my model:
conv_base = VGG16(weights= 'imagenet', include_top=False)
model = models.Sequential()
model.add(TimeDistributed(conv_base, input_shape = (n_timesteps, img_length, img_height, channel)))
model.add(TimeDistributed((Flatten())))
model.add(LSTM(units = lstm_cells))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
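The padding step mentioned in the question can be sketched in plain NumPy, before any Keras layer is involved. This is only an illustration under assumptions: pad_videos is a made-up helper, the frames are assumed to be already resized to a common height and width, and the tiny all-ones arrays stand in for real frames. It returns the zero-padded batch together with a boolean mask marking which timesteps are real.

```python
import numpy as np

def pad_videos(videos, max_frames):
    """Zero-pad (or truncate) each video to exactly max_frames frames.

    videos: list of arrays shaped (n_frames_i, H, W, C), same H, W, C.
    Returns (batch, mask): batch is (n_videos, max_frames, H, W, C),
    mask is (n_videos, max_frames) and True where a real frame exists.
    """
    h, w, c = videos[0].shape[1:]
    batch = np.zeros((len(videos), max_frames, h, w, c), dtype=np.float32)
    mask = np.zeros((len(videos), max_frames), dtype=bool)
    for i, video in enumerate(videos):
        n = min(len(video), max_frames)
        batch[i, :n] = video[:n]   # copy the real frames
        mask[i, :n] = True         # remember which timesteps are real
    return batch, mask

# Hypothetical tiny videos with 3, 5 and 4 frames of 8x8 RGB.
videos = [np.ones((n, 8, 8, 3), dtype=np.float32) for n in (3, 5, 4)]
batch, mask = pad_videos(videos, max_frames=5)
print(batch.shape)       # (3, 5, 8, 8, 3)
print(mask.sum(axis=1))  # [3 5 4]
```

How the mask reaches the LSTM depends on how the model is built. Recurrent layers accept a mask argument when called in the functional API, which is one way to make the LSTM skip the padded timesteps.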

Return states Keras- multivariate output

With reference to this blog in the return state section:
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
I am trying to implement a multivariate (predicting 2 outputs- y1 & y2) stateful LSTM model. Here is the snippet:
## defining the model
def my_model():
    input_x = Input(batch_shape=(batch_size, look_back, x_train.shape[2]), name='input')
    drop = Dropout(0.5)
    lstm_1, state_h, state_c = LSTM(50, return_sequences=False,batch_input_shape=(batch_size, look_back, x_train.shape[2]),return_state=True)(input_x)
    lstm_1_drop = drop(lstm_1)
    y1 = Dense(1, activation='linear', name='op1')(lstm_1_drop)
    y2 = Dense(1, activation='linear', name='op2')(lstm_1_drop)
    model = Model(inputs=input_x, outputs=[y1,y2])
    optimizer = Adam(lr=0.0005, decay=0.00001)
    model.compile(loss='mse', optimizer=optimizer,metrics=['mse'])
    model.summary()
    return model
model = my_model()
history = model.fit(x_train, [y_11_train,y_22_train], epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
Question
I have some issues here. I am not sure whether this is implemented properly, because I didn't pass lstm_1, state_h, state_c in the outputs of Model() as explained in the blog; here I have two different predictions instead of the single one in the blog.
If I do have to pass lstm_1, state_h, state_c in the outputs of Model(), how can that be implemented, and how will it affect the model.fit section?
Any help will be highly appreciated.
Thanks
Reading the blog post, it seems the author added state_h to the outputs in order to inspect the internals of the LSTM layer. Such an output can be useful for an already trained network, but not for training.
For training you can safely leave the states out of your outputs.
If you want to have the information during prediction, simply define a second model:
model = Model(inputs=input_x, outputs=[y1,y2,state_h,state_c])
Keras will then reuse your already trained layers and you have the information in your output without worrying about your training.
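To see why the states add nothing during training, here is a minimal NumPy sketch of an LSTM forward pass. It is not the Keras implementation: lstm_forward, the gate order and the random weights are assumptions made for illustration. It shows that with return_sequences=False the layer output is the last hidden state, so state_h carries the same values as lstm_1, and state_c is the internal cell state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, W, U, b):
    """Run one LSTM layer over x of shape (timesteps, features).

    W: (features, 4*units), U: (units, 4*units), b: (4*units,).
    Gate order in this sketch: input, forget, candidate, output.
    Returns (output, state_h, state_c), mimicking return_state=True
    with return_sequences=False.
    """
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # new cell state
        h = sigmoid(o) * np.tanh(c)                   # new hidden state
    # The layer output is the last hidden state itself.
    return h, h, c

# Hypothetical sizes: look_back=10 timesteps, 3 features, 50 units.
rng = np.random.default_rng(0)
look_back, features, units = 10, 3, 50
x = rng.standard_normal((look_back, features))
W = rng.standard_normal((features, 4 * units)) * 0.1
U = rng.standard_normal((units, 4 * units)) * 0.1
b = np.zeros(4 * units)

output, state_h, state_c = lstm_forward(x, W, U, b)
print(np.array_equal(output, state_h))  # True
print(state_c.shape)                    # (50,)
```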

Emotion detection on text

I am a newbie in ML and was experimenting with emotion detection on text.
I have the ISEAR dataset, which contains tweets labeled with their emotion.
My current accuracy is 63% and I want to increase it to at least 70%, or more if possible.
Here's the code:
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            64,
                            input_length=MAX_LENGTH)(inputs)
# x = Flatten()(embedding_layer)
x = LSTM(32, input_shape=(32, 32))(embedding_layer)
x = Dense(10, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()
filepath="weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.1,
                    shuffle=True, epochs=10, callbacks=[checkpointer])
That's a pretty general question; optimizing the performance of a neural network may require tuning many factors.
For instance:
The optimizer chosen: in NLP tasks rmsprop is also a popular optimizer
Tweaking the learning rate
Regularization - e.g. dropout, recurrent_dropout, batch norm. This may help the model generalize better
More units in the LSTM
More dimensions in the embedding
You can try grid search, e.g. using different optimizers and evaluate on a validation set.
The data may also need some tweaking, such as:
Text normalization - better representation of the tweets - remove unnecessary tokens (#, #)
Shuffle the data before the fit - Keras validation_split creates the validation set from the last data records
There is no simple answer to your question.
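The shuffling point can be sketched as follows. Because validation_split takes the last records as the validation set, the features and labels need to be shuffled with the same permutation before calling fit. shuffle_together is a made-up helper and the small arrays are placeholders for X_train and y_train.

```python
import numpy as np

def shuffle_together(X, y, seed=0):
    """Shuffle X and y with one shared permutation so pairs stay aligned."""
    perm = np.random.default_rng(seed).permutation(len(X))
    return X[perm], y[perm]

# Placeholder data: row i of X belongs to label i.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

X_shuffled, y_shuffled = shuffle_together(X, y)
# Each row still matches its label after shuffling.
print((X_shuffled[:, 0] == y_shuffled).all())  # True
```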

Input for LSTM in case of time series data

I have a (non-text) dataset of size (1152,151). I want to divide it into 8 batches, each containing 144 samples, to train an LSTM network in Keras. I reshaped the data to (8,144,151) to feed it into the LSTM. Is this the right input shape? When I used this as input with return_sequences=False for this layer and for the next LSTM layer as well, I got an error:
expected n_dim=3, got n_dim=2.
X_train = X_train.reshape((8,144,151))
def deepmodel():
    model = Sequential()
    model.add(LSTM(8,input_shape=(144,151),return_sequences=False))
    model.add(LSTM(8,return_sequences=False))
    model.add(Dense(8))
    model.add(Activation('softmax'))
    adam=Adam()
    model.compile(loss = 'categorical_crossentropy', optimizer = adam)
    return model
You set the batch size in model.fit(...., batch_size=8); it is not part of the array shape. Look at the example below, it should clear your error message. If you are looking for multiple time lag, be sure to check out this wonderful blog post.
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
def deepmodel():
    model = Sequential()
    # return_sequences=True so that the next LSTM layer receives 3-D input
    model.add(LSTM(8,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
    model.add(LSTM(8,return_sequences=False))
    model.add(Dense(8))
    model.add(Activation('softmax'))
    adam=Adam()
    model.compile(loss = 'categorical_crossentropy', optimizer = adam)
    return model
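The shapes involved can be checked with NumPy alone. This is a small sketch with a zero-filled placeholder in place of the real data. It shows the (samples, timesteps, features) layout with a single timestep, and how many batches per epoch result from a given batch_size, which is passed to fit and does not appear in the array shape.

```python
import numpy as np

# Placeholder with the shape from the question: 1152 samples, 151 features.
X_train = np.zeros((1152, 151))

# (samples, features) -> (samples, timesteps, features) with one timestep.
X_seq = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
print(X_seq.shape)  # (1152, 1, 151)

# Batches per epoch for two candidate batch sizes.
batches = {bs: X_seq.shape[0] // bs for bs in (8, 144)}
print(batches)  # {8: 144, 144: 8}
```

With batch_size=8 there are 144 batches per epoch. Getting 8 batches of 144 samples each, as described in the question, corresponds to batch_size=144.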
