I've been struggling for about three weeks on my one-shot learning project: I'm trying to unlock my computer with my face. Unfortunately, I'm still far from that goal.
First, I wanted to properly understand the concepts behind one-shot learning, and especially triplet loss, before anything else. So now I'm trying to train a network (in PyTorch) with transfer learning, which I hope will lead me to my goal.
What I understand so far:
One-shot learning
It's a method where a model should be able to minimise the Euclidean distance between the embeddings of two faces of the same person and, conversely, maximise the Euclidean distance between two faces of different persons. In other words, the model should map any face into a d-dimensional Euclidean space where faces of the same person are close to each other and faces of different persons are far away from each other.
The model does not need to be trained on the identities it will later be used on. In other words, once well trained, anyone could use it to compare a fixed reference photo of their face to another photo of themselves.
Face verification is the ability to maximise the distances between any face that doesn't belong to (say) the authorised person and minimise only the distances of faces belonging to the authorised person (a 1:1 problem).
Face recognition is the ability to maximise the distances between any face that doesn't belong to (say) the authorised persons and minimise the distances of faces belonging to a set of authorised persons (a 1:K problem).
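If this is right, verification should boil down to comparing the Euclidean distance between two embeddings to a threshold, something like this sketch (the 0.8 threshold is an arbitrary placeholder that would have to be tuned on a validation set):

import torch
import torch.nn.functional as F

def same_person(model, face_a, face_b, threshold=0.8):
    # face_a / face_b: preprocessed face tensors, e.g. shape (3, 160, 160) for InceptionResnetV1
    # the threshold is a placeholder; it has to be tuned on held-out pairs
    model.eval()
    with torch.no_grad():
        emb_a = model(face_a.unsqueeze(0))   # (1, 512) embedding
        emb_b = model(face_b.unsqueeze(0))
    distance = F.pairwise_distance(emb_a, emb_b).item()  # Euclidean distance in embedding space
    return distance < threshold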
Triplet mining
To ensure that the model actually learns something, one needs to feed it triplets that are well defined and not trivial. For a dataset of faces this leads to:
triplets of distinct indices (i, j, k) such that face[i] and face[j] belong to the same identity while face[k] belongs to a different one;
Those triplets are called "valid triplets", and the three faces are called the anchor, the positive and the negative.
triplets whose faces are not already far away from each other in the embedding space (this prevents trivial triplets whose loss collapses to zero). These are called semi-hard and hard triplets (see the sketch just after this list).
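In formula form, with d the Euclidean distance, the triplet loss is max(d(a, p) - d(a, n) + margin, 0). A small sketch of the loss and of the easy/semi-hard/hard categories (variable names are mine):

import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: (batch, embedding_dim) tensors
    d_ap = F.pairwise_distance(anchor, positive)   # anchor-positive distances
    d_an = F.pairwise_distance(anchor, negative)   # anchor-negative distances
    return F.relu(d_ap - d_an + margin).mean()     # same formula as nn.TripletMarginLoss

# For a single triplet:
#   easy      : d_an > d_ap + margin   -> loss is 0, nothing to learn
#   semi-hard : d_ap < d_an <= d_ap + margin
#   hard      : d_an <= d_ap           -> the negative is closer than the positive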
Starting from these basics, I looked for examples on the internet. I understood that the usual ways to produce triplets are online mining or offline mining. I used the marvellous code from https://omoindrot.github.io/triplet-loss to implement the batch-hard and batch-all strategies, which are online mining.
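As I understand it, batch-hard mining takes, for each anchor in the batch, the farthest positive and the closest negative. A minimal PyTorch sketch of that idea (my own adaptation, not the linked TensorFlow code):

import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    # embeddings: (B, 512) batch of embeddings; labels: (B,) identity ids
    dist = torch.cdist(embeddings, embeddings)               # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # (B, B) same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    pos_mask = same & ~eye        # positives: same identity, not the sample itself
    neg_mask = ~same              # negatives: different identity

    # hardest positive: the farthest same-identity sample for each anchor
    hardest_pos = (dist * pos_mask).max(dim=1).values
    # hardest negative: the closest different-identity sample for each anchor
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()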
My questions:
From this point I'm kind of lost. I've tried different approaches to build my dataset, but my loss never converges and the model doesn't seem to learn anything.
Description of my approach (in PyTorch)
Model and dataset
I'm using InceptionResnetV1 from the facenet_pytorch library, pretrained on CASIA-WebFace. I unfreeze the last two layers: the linear layer model.last_linear(1792, 512) and model.last_bn(), which leaves me with 918,528 trainable parameters and an output embedding of dimension (512, 1).
For the dataset, I'm using the Head Pose Image Database, which contains 15 persons with, for each, 2 front pictures and 186 various head-pose pictures. This gives a set of 2797 pictures (one person has 193 pictures) plus 30 front pictures.
My work
I understood that the model should see various identities. So first, I tried PyTorch's nn.TripletMarginLoss and provided an anchor (one of the two front pictures of each identity), a positive (one of the 183 pictures of the anchor's identity), and a negative (a random face with a different identity).
This was unsuccessful: the loss seems to decrease, but the model doesn't generalise to the test set.
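For reference, the triplet sampling described above can be sketched roughly like this (identity_to_images is a hypothetical dict; my actual loading code differs, but the sampling logic is the same):

import random
from torch.utils.data import Dataset

class TripletFaceDataset(Dataset):
    """Returns (anchor, positive, negative) tensors for nn.TripletMarginLoss."""
    def __init__(self, identity_to_images):
        # identity_to_images: {identity: {'front': [tensor, tensor], 'poses': [tensor, ...]}}
        self.identity_to_images = identity_to_images
        self.identities = list(identity_to_images.keys())
        self.samples = [(idt, k) for idt in self.identities
                        for k in range(len(identity_to_images[idt]['poses']))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        idt, k = self.samples[idx]
        anchor = random.choice(self.identity_to_images[idt]['front'])       # a front picture
        positive = self.identity_to_images[idt]['poses'][k]                 # same identity
        other = random.choice([i for i in self.identities if i != idt])     # different identity
        negative = random.choice(self.identity_to_images[other]['poses'])
        return anchor, positive, negative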
I thought maybe I didn't provide enough semi-hard or hard triplets to the loss, so I constructed 15 datasets, one per identity "i": each dataset contains the positive faces of identity "i" plus the other, negative faces. Each dataset therefore contains 2797 images and returns an image with its label (1 if the face's identity corresponds to the dataset's identity, else 0). I looped over the identity datasets (with a batch loop inside each dataset) and used batch hard this time (https://omoindrot.github.io/triplet-loss), but again without success.
Questions
Do I need to create a much simpler model and train it from scratch?
Does my method seem correct: should the anchor pass through the same model as the positive and the negative?
How should I set the margin?
About face verification: are my statements above correct? I expect to train my model without any pictures of me, and then to be able to minimise/maximise the Euclidean distances between any face embeddings. Is that correct?
Is this feasible with decent accuracy (i.e. something around 95%) as a small project?
Thanks all for your time, I hope my explanations were clear. I leave you a piece of code below.
import torch
from torch import nn
from facenet_pytorch import InceptionResnetV1

model = InceptionResnetV1(pretrained='casia-webface')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 10

# Freeze everything, then unfreeze only the last linear layer and the last batch norm
for param in model.parameters():
    param.requires_grad = False
model.last_linear.weight.requires_grad = True
model.last_bn.weight.requires_grad = True
. . .
. . .
. . .
Block8-511 [-1, 1792, 3, 3] 0
AdaptiveAvgPool2d-512 [-1, 1792, 1, 1] 0
Dropout-513 [-1, 1792, 1, 1] 0
Linear-514 [-1, 512] 917,504
BatchNorm1d-515 [-1, 512] 1,024
================================================================
Total params: 23,482,624
Trainable params: 918,528
Non-trainable params: 22,564,096
----------------------------------------------------------------
Input size (MB): 0.29
Forward/backward pass size (MB): 88.63
Params size (MB): 89.58
Estimated Total Size (MB): 178.50
----------------------------------------------------------------
import datetime

def model_loop(model, epochs, trainloader, batch_size, pytorchLoss, optimizer, device):
    ### This loop uses PyTorch's TripletMarginLoss with a training set of 4972 images
    ### split between Anchors / Positives / Negatives
    delta_time = datetime.timedelta(hours=1)
    timezone = datetime.timezone(offset=delta_time)
    model.to(device)
    train_loss_list = []
    size_train = len(trainloader.dataset)
    for epoch in range(epochs):
        t = datetime.datetime.now(tz=timezone)
        str_t = '{:%Y-%m-%d %H:%M:%S}'.format(t)
        print(f"{str_t} : Epoch {epoch+1} on {device} \n---------------------------")
        train_loss = 0.0
        model.train()
        for batch, (imgsA, imgsP, imgsN) in enumerate(trainloader):
            # Transfer data to GPU if available
            imgsA, imgsP, imgsN = imgsA.to(device), imgsP.to(device), imgsN.to(device)
            # Clear the gradients
            optimizer.zero_grad()
            # Make predictions & compute the mini-batch training loss
            predsA, predsP, predsN = model(imgsA), model(imgsP), model(imgsN)
            loss = pytorchLoss(predsA, predsP, predsN)
            # Compute the gradients
            loss.backward()
            # Update weights
            optimizer.step()
            # Aggregate mini-batch training losses
            train_loss += loss.item()
            train_loss_list.append(train_loss)
            if batch == 0 or batch == 19:
                loss, current = loss.item(), batch * batch_size + len(imgsA)
                print(f"mini-batch loss for training : {loss:>7f} [{current:>5d}/{size_train:>5d}]")
        # Compute the global training loss as the mean of the mini-batch losses
        train_loss /= len(trainloader)
        print(f"--Fin Epoch {epoch+1}/{epochs} \n Training Loss: {train_loss:>7f}")
        print('\n')
    return train_loss_list
train_loss = model_loop(model=model,
                        epochs=num_epochs,
                        trainloader=train_dataloader,
                        batch_size=256,
                        pytorchLoss=nn.TripletMarginLoss(margin=0.1),
                        optimizer=optimizer,
                        device=device)
2022-02-18 20:26:30 : Epoch 1 on cuda
-------------------------------
mini-batch loss for training : 0.054199 [ 256/ 4972]
mini-batch loss for training : 0.007469 [ 4972/ 4972]
--Fin Epoch 1/10
Training Loss: 0.026363
2022-02-18 20:27:48 : Epoch 5 on cuda
-------------------------------
mini-batch loss for training : 0.005694 [ 256/ 4972]
mini-batch loss for training : 0.011877 [ 4972/ 4972]
--Fin Epoch 5/10
Training Loss: 0.004944
2022-02-18 20:29:24 : Epoch 10 on cuda
-------------------------------
mini-batch loss for training : 0.002713 [ 256/ 4972]
mini-batch loss for training : 0.001007 [ 4972/ 4972]
--Fin Epoch 10/10
Training Loss: 0.003000
Stats over a dataset of 620 images:
TP : 11.25%
TN : 98.87%
FN : 88.75%
FP : 1.13%
The model accuracy is 55.06%
I have a neural network with one hidden layer implemented in both Keras and scikit-learn for solving a regression problem. In scikit-learn I used the MLPRegressor class with mostly default parameters, and in Keras I have a hidden Dense layer with parameters set to the same defaults as scikit-learn (which uses Adam with the same learning rate and epsilon, and a batch_size of 200). When I train the networks, the scikit-learn model has a loss value that is about half of Keras's, and its accuracy (measured in mean absolute error) is also better. Shouldn't the loss values be similar, if not identical, and the accuracies also be similar? Has anyone experienced something similar and been able to make the Keras model more accurate?
Scikit-learn model:
clf = MLPRegressor(hidden_layer_sizes=(1600,), max_iter=1000, verbose=True, learning_rate_init=.001)
Keras model:
inputs = keras.Input(shape=(cols,))
x = keras.layers.Dense(1600, activation='relu', kernel_initializer="glorot_uniform", bias_initializer="glorot_uniform", kernel_regularizer=keras.regularizers.L2(.0001))(inputs)
outputs = keras.layers.Dense(1,kernel_initializer="glorot_uniform", bias_initializer="glorot_uniform", kernel_regularizer=keras.regularizers.L2(.0001))(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(epsilon=1e-8, learning_rate=.001),loss="mse")
model.fit(x=X, y=y, epochs=1000, batch_size=200)
It is because the formula for the mean squared error (MSE) loss in scikit-learn is different from that of TensorFlow.
From the source code of scikit-learn:
def squared_loss(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean() / 2
while the MSE from TensorFlow is:
backend.mean(math_ops.squared_difference(y_pred, y_true), axis=-1)
As you can see, the scikit-learn one is divided by 2, consistent with what you said:
the scikit-learn model has a loss value that is about half of keras
That implies the models from Keras and scikit-learn actually achieved similar performance. It also implies that a learning rate of 0.001 in scikit-learn is not equivalent to the same learning rate in TensorFlow.
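A quick numerical check makes the factor of two visible (toy numbers):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.0, 2.0])

sklearn_loss = ((y_true - y_pred) ** 2).mean() / 2   # scikit-learn's squared_loss
keras_loss = ((y_true - y_pred) ** 2).mean()         # Keras/TensorFlow "mse"
print(sklearn_loss, keras_loss)                      # 0.375 0.75 -> exactly half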
Also, another smaller but significant difference is the formula for L2 regularization.
From the source code of scikit-learn,
# Add L2 regularization term to loss
values = 0
for s in self.coefs_:
    s = s.ravel()
    values += np.dot(s, s)
loss += (0.5 * self.alpha) * values / n_samples
while that of TensorFlow is loss = l2 * reduce_sum(square(x)).
Therefore, with the same L2 regularization parameter, the TensorFlow model has stronger regularization, which results in a poorer fit to the training data.
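If you want the Keras objective to track scikit-learn's more closely, a rough sketch (not a drop-in fix) is to halve the MSE and rescale the L2 factor; it reuses the question's variables, and n_samples below is my assumption about what scikit-learn divides the penalty by:

import tensorflow as tf
from tensorflow import keras

alpha = 0.0001        # scikit-learn's default L2 strength
n_samples = 200       # assumption: the number of samples the penalty is averaged over

def half_mse(y_true, y_pred):
    # reproduce scikit-learn's squared_loss: mean squared error divided by 2
    return 0.5 * tf.reduce_mean(tf.square(y_true - y_pred))

# scikit-learn adds 0.5 * alpha * ||W||^2 / n_samples, Keras adds l2 * ||W||^2,
# so a roughly equivalent Keras factor is 0.5 * alpha / n_samples
l2 = keras.regularizers.L2(0.5 * alpha / n_samples)

inputs = keras.Input(shape=(cols,))
x = keras.layers.Dense(1600, activation='relu', kernel_regularizer=l2)(inputs)
outputs = keras.layers.Dense(1, kernel_regularizer=l2)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(epsilon=1e-8, learning_rate=.001), loss=half_mse)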
I am training a U-Net model on 238 satellite images.
My val_loss is not decreasing below 0.3, despite the different architectures I have tried:
Conv2D(8-16-32-64-128-64-32-16-8)
Conv2D(16-32-64-128-256-128-64-32-16)
Conv2D(32-64-128-256-512-256-128-64-32)
activation function = relu
sigmoid (outputs)
validation_split=0.10, batch_size=10, epochs=30
loss='binary_crossentropy'
optimizers.Adam(learning_rate=0.001) -- I also tried 0.01 and 0.0001
If you have a lead, I'm interested.
Update: I now have 968 images.
After training a model with ImageDataGenerator(rescale=1./255), do I need to rescale images before predicting?
I thought it was necessary, but my experiment says no.
I trained a ResNet50 model which has 37 classes on the top layer.
The model was trained with ImageDataGenerator like this:
datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    directory=os.path.join(os.getcwd(), data_folder),
    target_size=(224, 224),
    batch_size=256,
    classes=None,
    class_mode='categorical')
history = model.fit_generator(generator, steps_per_epoch=generator.n / 256, epochs=10)
Accuracy reached 98% after 10 epochs on my train dataset.
The problem is, when I tried to predict each image in the TRAIN dataset, the prediction was wrong (the result was 33 whatever the input image was).
img_p = './data/pets/shiba_inu/shiba_inu_27.jpg'
img = cv2.imread(img_p, cv2.IMREAD_COLOR)
img = cv2.resize(img, (224,224))
img_arr = np.zeros((1,224,224,3))
img_arr[0, :, :, :] = img / 255.
pred = model.predict(img_arr)
yhat = np.argmax(pred, axis=1)
yhat is 5, but y is 33
When I replace this line
img_arr[0, :, :, :] = img / 255.
with this
img_arr[0, :, :, :] = img
yhat is exactly 33.
Someone might suggest using predict_generator() instead of predict(), but I want to understand what I did wrong here.
I found out what's wrong here.
I'm using an ImageNet-pretrained model, which does NOT rescale images by dividing them by 255. I have to use resnet50.preprocess_input before training/testing.
The preprocess_input function can be found here:
https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py
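A minimal sketch of the corrected prediction path (this assumes the model is also retrained with the same preprocess_input in the data pipeline, and uses the tf.keras import path):

import numpy as np
import cv2
from tensorflow.keras.applications.resnet50 import preprocess_input

img = cv2.imread(img_p, cv2.IMREAD_COLOR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)                # cv2 loads BGR, Keras preprocessing expects RGB
img = cv2.resize(img, (224, 224)).astype(np.float32)
img_arr = preprocess_input(np.expand_dims(img, axis=0))   # same preprocessing as the training pipeline
pred = model.predict(img_arr)
yhat = np.argmax(pred, axis=1)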
You must apply every preprocessing step that you apply to the training data to any data you feed to your trained network. When you rescale the training images and train a network, the network learns to take a matrix with entries between 0 and 1 and find the proper category. So if, after the training phase, you feed it an image without rescaling, you feed a matrix with entries between 0 and 255 to a network that never learned how to treat such a matrix.
If you are following exactly the same pre-processing as at training time, then you might look at the part of your code where you predict the class using yhat = np.argmax(pred, axis=1). My hunch is that there might be a class mismatch in the indexing. To check how your classes are indexed when you use flow_from_directory, use class_map = generator.class_indices; this returns a dictionary showing how your classes are mapped to indices.
Note: I mention this because I've faced a similar problem: with Keras' flow_from_directory the classes aren't necessarily indexed the way you expect, so it's quite possible that your class 1 lies at index 10 while np.argmax returns you 'class 1'.
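For example, something along these lines (a sketch, with made-up folder names) shows the mapping and uses it to decode the prediction:

import numpy as np

class_map = generator.class_indices                       # e.g. {'abyssinian': 0, 'beagle': 1, ...}
index_to_class = {v: k for k, v in class_map.items()}     # invert it: index -> folder name

pred = model.predict(img_arr)
predicted_index = int(np.argmax(pred, axis=1)[0])
print(predicted_index, index_to_class[predicted_index])   # index and the folder name it maps to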
I am implementing a custom loss function in Keras. The model is an autoencoder. The first layer is an Embedding layer, which embeds an input of size (batch_size, sentence_length) into (batch_size, sentence_length, embedding_dimension). Then the model compresses the embedding into a vector of a certain dimension, and finally must reconstruct the embedding (batch_size, sentence_length, embedding_dimension).
But the embedding layer is trainable, and the loss must use the weights of the embedding layer (I have to sum over all word embeddings of my vocabulary).
For example, suppose I want to train on the toy example "the cat": sentence_length is 2, embedding_dimension is 10 and the vocabulary size is 50, so the embedding matrix has shape (50, 10). The Embedding layer's output X has shape (1, 2, 10). It then passes through the model, and the output X_hat also has shape (1, 2, 10). The model must be trained to maximise the probability that the vector X_hat[0] representing 'the' is the most similar to the vector X[0] representing 'the' in the Embedding layer, and likewise for 'cat'. The loss is such that I have to compute the cosine similarity between X and X_hat, normalised by the sum of cosine similarities between X_hat and every embedding (50, since the vocabulary size is 50) in the embedding matrix, i.e. the rows of the embedding layer's weight matrix.
But how can I access the weights of the embedding layer at each iteration of the training process?
Thank you !
It seems a bit crazy, but it works: instead of creating a custom loss function that I would pass to model.compile, the network computes the loss (Eq. 1 from arxiv.org/pdf/1708.04729.pdf) in a function that I call with a Lambda layer:
loss = Lambda(lambda x: similarity(x[0], x[1], x[2]))([X_hat, X, embedding_matrix])
And the network has two outputs, X_hat and loss, but I give X_hat a weight of 0 and loss all the weight:
model = Model(input_sequence, [X_hat, loss])
model.compile(loss=mean_squared_error,
              optimizer=optimizer,
              loss_weights=[0., 1.])
When I train the model:
for i in range(epochs):
    for j in range(num_data):
        input_embedding = model.layers[1].get_weights()[0][[data[j:j+1]]]
        y = [input_embedding, 0]  # The embedding of the input
        model.fit(data[j:j+1], y, batch_size=1, ...)
That way, the model is trained to push the loss toward 0, and when I want to use the trained model's predictions I use the first output, which is the reconstruction X_hat.
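For completeness, the similarity function used in the Lambda above could look roughly like this; this is a loose reconstruction of Eq. 1 as a softmax over cosine similarities, not the exact code:

import keras.backend as K

def similarity(x_hat, x, embedding_matrix):
    # x_hat, x: (batch, sentence_length, embedding_dim); embedding_matrix: (vocab_size, embedding_dim)
    x_hat = K.l2_normalize(x_hat, axis=-1)
    x = K.l2_normalize(x, axis=-1)
    vocab = K.l2_normalize(embedding_matrix, axis=-1)

    # cosine similarity between each reconstructed vector and its target embedding
    num = K.exp(K.sum(x_hat * x, axis=-1))                          # (batch, sentence_length)
    # normaliser: exponentiated cosine similarities with every vocabulary embedding
    den = K.sum(K.exp(K.dot(x_hat, K.transpose(vocab))), axis=-1)   # (batch, sentence_length)

    # negative log-probability, averaged over the words of the sentence
    return K.mean(-K.log(num / den), axis=-1)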