How effective is transfer learning? Keeping only two specific output features without resetting them - PyTorch

I want to keep only two specific output features without resetting the final layer.
Resetting it would lose the pre-trained weights.
For example, I don't want to do this:
# https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html?highlight=transfer%20learning%20ant%20bees
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)
The code above follows the transfer learning tutorial on PyTorch.
I want to do this to see how effective transfer learning is.
Even without transfer learning, the model might already be effective: if I drop 998 of the 1000 ImageNet categories and keep only two, ant and bee, the pre-trained model could classify well simply because it is left with only two choices.
I do not want to re-train the model; I want to use the weights as they are, otherwise it would just be the same as transfer learning.

You can certainly try this. You can reduce the model output to just the two logits you want to compare with:
# indices of the two ImageNet categories to keep
chosen_cats = torch.tensor([ant_index, bee_index], dtype=torch.long)

with torch.set_grad_enabled(phase == 'train'):
    outputs = model(inputs)                                  # (batch, 1000) logits
    outputs = torch.index_select(outputs, 1, chosen_cats)    # keep only (batch, 2)
    _, preds = torch.max(outputs, 1)
    loss = criterion(outputs, labels)
In this scenario, the preds will be 0 or 1, with 0 predicting ant and 1 predicting bee, so you will need to also modify your labels to reflect this.
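For completeness, here is a minimal end-to-end sketch of this idea; the ant/bee class indices, the dataloader, and the accuracy loop are illustrative assumptions, not code from the tutorial:
import torch
from torchvision import models

# Hypothetical ImageNet indices for illustration; look up the actual
# "ant" and "bee" indices in the ImageNet class list before relying on them.
ANT_INDEX, BEE_INDEX = 310, 309

model = models.resnet18(pretrained=True)
model.eval()  # no re-training: use the pre-trained weights as they are

chosen_cats = torch.tensor([ANT_INDEX, BEE_INDEX], dtype=torch.long)

@torch.no_grad()
def evaluate(dataloader):
    # dataloader is assumed to yield (images, labels) with labels already
    # remapped to 0 = ant, 1 = bee
    correct = total = 0
    for inputs, labels in dataloader:
        logits = model(inputs)                                    # (batch, 1000)
        two_logits = torch.index_select(logits, 1, chosen_cats)   # (batch, 2)
        preds = two_logits.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total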

Related

Keras LSTM network predictions align with input

The above may sound ideal, but I'm trying to predict one step ahead - i.e. with a look_back of 1. My code is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

def create_scaled_datasets(data, scaler_transform, train_perc=0.9):
    # Set training size
    train_size = int(len(data) * train_perc)
    # Reshape for scaler transform
    data = data.reshape((-1, 1))
    # Scale data to range (0, 1)
    data_scaled = scaler_transform.fit_transform(data)
    # Reshape again
    data_scaled = data_scaled.reshape((-1, 1))
    # Split into train and test data, keeping time order
    train, test = data_scaled[0:train_size + 1, :], data_scaled[train_size:len(data), :]
    return train, test

# Instantiate scaler transform
scaler = MinMaxScaler(feature_range=(0, 1))

# Build the model
model = Sequential()
model.add(LSTM(5, input_shape=(1, 1), activation='tanh', return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(12, input_shape=(1, 1), activation='tanh', return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(2, input_shape=(1, 1), activation='tanh', return_sequences=False))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Create train/test data sets (data is the raw 1-D series, loaded elsewhere)
train, test = create_scaled_datasets(data, scaler)

# Labels are the next value (look_back of 1)
trainY = []
for i in range(len(train) - 1):
    trainY = np.append(trainY, train[i + 1])

# Reshape to (samples, timesteps, features) for the LSTM
train = np.reshape(train, (train.shape[0], 1, train.shape[1]))
plotting_test = test
test = np.reshape(test, (test.shape[0], 1, test.shape[1]))

model.fit(train[:-1], trainY, epochs=150, verbose=0)
testPredict = model.predict(test)

plt.plot(testPredict, 'g')
plt.plot(plotting_test, 'r')
plt.show()
with an output plot (not reproduced here).
In essence, what I want to achieve is for the model to predict the next value, and I attempt to do this by training on the actual values as the features, with the labels being the actual values shifted along by one (look_back of 1). Then I predict on the test data. As you can see from the plot, the model does a pretty good job, except it doesn't seem to be predicting the future but rather the present. I would expect the plot to look similar, except with the green line (the predictions) shifted one point to the left. I have tried increasing the look_back value, but it always seems to do the same thing, which makes me think I'm training the model wrong, or attempting to predict incorrectly. If I am reading this wrong and the model is indeed doing what I want but I'm interpreting it incorrectly (also very possible), how do I then predict further into the future?
To add to @MSalters' comment, and somewhat building on this, it is possible, although not guaranteed, that you could "help" your model learn something better than the identity if you force it to learn not the actual value of the next step but instead the difference from the current step to the next.
To take this one step further, you could also keep an exponential moving average and learn the difference from that, somewhat like what was done here.
In short, it makes statistical sense for the model to predict the current value, as it is a low-risk guess; learning the difference instead may keep it from collapsing onto that trivial solution.
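As a rough illustration of learning differences instead of raw values (a sketch only; series is a hypothetical 1-D NumPy array, and the scaling and model details from the question are omitted):
import numpy as np

def make_difference_targets(series, look_back=1):
    # Features are the current values; labels are the change to the next step.
    X = series[:-look_back]
    y = series[look_back:] - series[:-look_back]
    # Reshape to (samples, timesteps, features) for the LSTM
    return X.reshape(-1, look_back, 1), y

# At prediction time, add the predicted difference back onto the last known value:
# next_value = current_value + model.predict(current_window)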
Other things I noticed:
Dropout - there is no need for any regularization until you have shown you can over-fit; it just complicates debugging.
Just one step into the past - you are likely losing a lot of required information, effectively forcing your net to guess the same value because it has nothing else to go on. If you gave it even one more value from the past, it could form a decent approximation of the derivative, and that sounds important (only you know your data).

Anomalies have similar error values to normal data

I have inertial measurement unit (IMU) data for which I am building an anomaly detection autoencoder neural net. I have about 5k training samples of which I am using 10% for validation. I also have about 50 (though I can make more) samples to test anomaly detection. My dataset has 12 IMU features. I train for about 10,000 epochs and I attain mean squared errors for reconstruction (MSE) of about 0.004 during training. After training, I perform an MSE calculation on the test data and I get values very similar to those in the train data (0.003) and I do not know why!
I am making my test set by slicing 50 samples from the overall data (not part of X_train) and changing one of the features to all zeros. I have also tried adding noise to one of the features as well as making multiple features zero.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

np.random.seed(404)
np.random.shuffle(all_imu_data)

# First len_slice samples are kept as "normal" data; the rest become the anomalous test set
norm_imu_data = all_imu_data[:len_slice]
anom_imu_data = all_imu_data[len_slice:]
anom_imu_data[:, 6] = 0  # zero out one feature to create the anomaly

scaler = MinMaxScaler()
norm_data = scaler.fit_transform(norm_imu_data)
anom_data = scaler.transform(anom_imu_data)

X_train = pd.DataFrame(norm_data)
X_test = pd.DataFrame(anom_data)
I have tried many different network sizes by varying the number of hidden layers and the number of hidden nodes per layer. As an example, here is a topology like [12-7-4-7-12]:
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

input_dim = num_features
input_layer = Input(shape=(input_dim, ))

encoder = Dense(7, activation="tanh", activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(4, activation="tanh")(encoder)
decoder = Dense(7, activation="tanh")(encoder)
decoder = Dense(input_dim, activation="tanh")(decoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse'])

# nb_epoch, batch_size and the checkpointer/tensorboard callbacks are defined elsewhere
history = autoencoder.fit(X_train, X_train,
                          epochs=nb_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_split=0.1,
                          verbose=1,
                          callbacks=[checkpointer, tensorboard]).history
pred_train = autoencoder.predict(X_train)
pred_test = autoencoder.predict(X_test)
mse_train = np.mean(np.power(X_train - pred_train, 2), axis=1)
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
print('MSE mean() - X_train:', np.mean(mse_train))
print('MSE mean() - X_test:', np.mean(mse_test))
After doing this, I get MSE mean numbers of 0.004 for Train and 0.003 for Test. Therefore, I cannot select a good threshold for anomalous data, as there are a lot of normal points that have larger MSE scores than the 'anomalous' data.
Any thoughts as to why this network is unable to detect these anomalies?
It is completely normal. You train your autoencoder on a sub-sample of your whole data, so anomalies also contaminate your training data. The purpose of the autoencoder is to find a perfect reconstruction of your original data, which it does, anomalies included. It is a very powerful tool, so if you show it anomalies during training, it will reconstruct them easily.
You need to first remove the anomalous data (say the most anomalous ~5%) with another anomaly detection algorithm, for example Isolation Forest, and do the sub-sampling on that cleaned part of the data (without outliers).
After that, you can find your outliers easily.
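A minimal sketch of that pre-filtering step, assuming norm_data is the scaled training array from the question; the 5% contamination rate is an illustrative choice to tune:
from sklearn.ensemble import IsolationForest
import pandas as pd

# Flag the ~5% most anomalous training points and drop them before
# fitting the autoencoder.
iso = IsolationForest(contamination=0.05, random_state=404)
inlier_mask = iso.fit_predict(norm_data) == 1   # 1 = inlier, -1 = outlier

X_train_clean = pd.DataFrame(norm_data[inlier_mask])

# Then fit exactly as before, but on the cleaned data only:
# autoencoder.fit(X_train_clean, X_train_clean, ...)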

Is it possible to predict a certain numerical value given a DNA sequence using LSTM?

I have DNA sequences of 16 letters. For each 16-letter sequence there is an output value, a so-called 'inhibition value', which ranges from 0 to 100. When I tried using an LSTM, the prediction only outputs a constant. Does the problem lie in the code, or is this simply not a suitable task for an LSTM, or RNNs in general, to solve?
I have tried increasing the batch size and the number of epochs, making the LSTM deeper, and changing the number of LSTM units, but none of it worked.
I was also wondering whether the encoding method matters. I tried a one-hot encoder at first, but it didn't work. Then I changed it to LabelEncoder, but that doesn't work either; the same constant output is produced.
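For reference, the one-hot encoding I mean would look roughly like this (a sketch with assumed names, not my exact preprocessing; with this encoding the Input shape would be (16, 4) instead of (16, 1)):
import numpy as np

BASES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_encode(sequence):
    # Encode a 16-letter DNA string as a (16, 4) one-hot array
    encoded = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for i, base in enumerate(sequence):
        encoded[i, BASES[base]] = 1.0
    return encoded

# Example batch of shape (n_samples, 16, 4)
X = np.stack([one_hot_encode(seq) for seq in ["ACGTACGTACGTACGT"]])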
Below here is the code for my model structure
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from keras.optimizers import Adam

def create_model():
    input1 = Input(shape=(16, 1))
    classifier = LSTM(64, return_sequences=True)(input1)
    for i in range(2):
        classifier = LSTM(32, return_sequences=True)(classifier)
    classifier = LSTM(32)(classifier)
    classifier = Dense(1, activation='relu')(classifier)
    model = Model(inputs=[input1], outputs=classifier)
    adam = Adam(lr=0.01)
    model.compile(loss='mean_squared_error', optimizer=adam)
    return model
If anyone is wondering why I use the functional API instead of Sequential, it is because of a possible modification where I need two input variables that have to be processed independently before being concatenated at the end.
Thank you in advance.

Why does a multilayer perceptron outperform an RNN in CartPole?

Recently, I compared two models for a DQN on CartPole-v0 environment. One of them is a multilayer perceptron with 3 layers and the other is an RNN built up from an LSTM and 1 fully connected layer. I have an experience replay buffer of size 200000 and the training doesn't start until it is filled up.
Although the MLP solved the problem in a reasonable number of training steps (meaning a mean reward of 195 over the last 100 episodes), the RNN model could not converge as quickly and its maximum mean reward did not even reach 195!
I have already tried increasing the batch size, adding more neurons to the LSTM's hidden state, increasing the RNN's sequence length and making the fully connected layer more complex - but every attempt failed, as I saw enormous fluctuations in mean reward and the model hardly converged at all. Could these be signs of early overfitting?
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_input, output_size, n_hidden, n_layers, dropout=0.3):
        super(DQN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lstm = nn.LSTM(input_size=n_input,
                            hidden_size=n_hidden,
                            num_layers=n_layers,
                            dropout=dropout,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fully_connected = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden_parameters):
        batch_size = x.size(0)
        output, hidden_state = self.lstm(x.float(), hidden_parameters)
        seq_length = output.shape[1]
        # Apply dropout and the linear head to every timestep...
        output1 = output.contiguous().view(-1, self.n_hidden)
        output2 = self.dropout(output1)
        output3 = self.fully_connected(output2)
        # ...then keep only the Q-values for the last timestep of each sequence
        new = output3.view(batch_size, seq_length, -1)
        new = new[:, -1]
        return new.float(), hidden_state

    def init_hidden(self, batch_size, device):
        # Zero-initialised (h_0, c_0) with the same dtype/device as the weights
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))
        return hidden
Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time series data.
Can anybody tell me the reason for this?
Also, I should state that I applied no feature engineering and both DQNs worked with raw data. Could the RNN outperform the MLP if it were fed normalized features? (I mean feeding both models normalized data.)
Is there anything you can recommend to improve training efficiency for RNNs and achieve the best results?
Contrary to what I expected, the simpler model gave a much better result than the other; even though RNNs are supposed to be better at processing time series data.
There is no time series in CartPole; the state already contains all the information needed for an optimal decision. It would be different if, for instance, you learned from images and had to estimate the pole velocity from a series of frames.
Also, it is not true that a more complex model should perform better. On the contrary, it is more likely to overfit. For CartPole you don't even need a neural network; a simple linear approximator with RBFs or random Fourier features would suffice. An RNN with an LSTM is certainly overkill for such a simple problem.
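As a rough illustration of that last point (a sketch only; the feature dimension, the bandwidth, and the use of scikit-learn's RBFSampler are my own choices here, not part of the original answer):
import numpy as np
from sklearn.kernel_approximation import RBFSampler

# Random Fourier features: map the 4-dimensional CartPole state into a
# space where a linear function can approximate Q(s, a).
rff = RBFSampler(gamma=1.0, n_components=100, random_state=0)
rff.fit(np.zeros((1, 4)))  # fixes the random projection; CartPole state is 4-D

n_actions = 2
weights = np.zeros((n_actions, 100))  # one linear Q-function per action

def q_values(state):
    # Linear Q-values on top of random Fourier features of the state
    features = rff.transform(state.reshape(1, -1))[0]
    return weights @ features  # shape: (n_actions,)

# These Q-values can then be trained with standard semi-gradient Q-learning updates.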

Why do I get the same prediction for all training samples?

I have a neural network with num_labels separate outputs where each output consists of a softmax layer with two nodes (Yes/No).
I take the output of a convolution_layer and feed it as input to a simple softmax_layer, which I then feed into each of said outputs:
from keras.layers import Dense

softmax_layer = Dense(num_labels, activation='softmax', name='softmax_layer')(convolution_layer)

outputs = list()
for i in range(num_labels):
    out_y = Dense(2, activation='softmax', name='out_{:d}'.format(i))(softmax_layer)
    outputs.append(out_y)
So far I was able to train the model by providing a list of training samples, but now I noticed that I am getting the exact same output for completely different samples in a batch (prediction printout not reproduced here).
Please note: each column consists of (2, 1) arrays, and each column is the prediction for one sample.
I've checked the samples; they are different. I've also tried, for example, feeding the convolution_layer directly into the outputs, and in that case the predictions are different. I only see this outcome when I do it the way shown above.
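Roughly, that variant just skips the shared softmax_layer (a sketch using the same names as above):
# Variant I tried: feed convolution_layer straight into each Yes/No head,
# bypassing the shared softmax_layer.
outputs = list()
for i in range(num_labels):
    out_y = Dense(2, activation='softmax', name='out_{:d}'.format(i))(convolution_layer)
    outputs.append(out_y)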
I could live with the outputs being merely "similar"; in that case I'd think that the network is just not learning what I want it to learn. But since they are exactly the same, I am not quite sure what the problem here is.
I've tried something similar with a simple feed forward network:
from keras.layers import Input, Dense
from keras.models import Model

class FeedForward:
    def __init__(self, input_dim, nb_classes):
        in_x = Input(shape=(input_dim, ), name='in_x')
        h1 = Dense(14, name='h1', activation='relu')(in_x)
        h2 = Dense(8, name='h2', activation='relu')(h1)
        out = Dense(nb_classes, name='out', activation='softmax')(h2)
        self.model = Model(inputs=[in_x], outputs=[out])

    def compile_model(self, optimizer='adam', loss='binary_crossentropy'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
But it behaves similarly. I can't imagine it's due to imbalanced data; there are 13 classes, and while there is some imbalance, it's not as if one class holds 90% of the mass.
Am I doing this right?
