The aim of my project is to extract specific facial features on a mobile phone. This is a verification application that uses the user's face. Given two different images of the same person, the extracted features should be as close as possible.
Right now, I use the pretrained model and weights from the VGGFace team as a feature extractor; you can download the model here. However, when I extracted features with this model, the results were not good enough. I describe what I did and what I want below:
I extract features from Emma Watson's images: image_1 returns feature_1, image_2 returns feature_2, and so on (vector length = 2048). If feature[i] > 0.0, I convert it to 1.
# Binarize: every strictly positive activation becomes 1
for i in range(0, 2048):
    if feature1[0][i] > 0.0:
        feature1[0][i] = 1
Then I compare the two feature vectors using Hamming distance. Hamming distance is just a naive way to compare; in the real project, I will quantize the features before comparing. However, the distance between two images of Emma is still large, even though I use two images with a neutral facial expression (same emotion; different emotions give worse results).
My question is: how can I train the model to extract the features of a target user? Imagine Emma is the target user, and her phone only needs to extract her features. When someone tries to unlock Emma's phone, the phone extracts this person's facial features and compares them with Emma's saved features. In addition, I don't want to train a model to classify two classes, Emma and not Emma. What I need is to compare extracted features.
To sum up: if we compare features from different images of the same person, the distance (difference) should be "close" (small). If we compare features from images of different people, the distance should be "far" (large).
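The binarize-then-compare step described above could be sketched roughly as follows. This is only an illustration on made-up 5-element vectors, not the real 2048-dimensional VGGFace output; the helper names binarize and hamming_distance are hypothetical. It also assumes that non-positive activations should map to 0 (the loop above leaves them unchanged, which gives the same result only if the features are non-negative, e.g. taken after a ReLU).

```python
import numpy as np

def binarize(features):
    """Map strictly positive activations to 1 and everything else to 0."""
    return (np.asarray(features) > 0.0).astype(np.uint8)

def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit vectors differ."""
    if bits_a.shape != bits_b.shape:
        raise ValueError("vectors must have the same shape")
    return int(np.count_nonzero(bits_a != bits_b))

# Made-up stand-ins for the extractor output (shape [1, n_features])
feature1 = np.array([[0.0, 0.7, 1.3, 0.0, 2.1]])
feature2 = np.array([[0.4, 0.0, 0.9, 0.0, 3.0]])

distance = hamming_distance(binarize(feature1[0]), binarize(feature2[0]))
print(distance)  # 2 (the vectors differ in the first two positions)
```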
Thank you so much.
I'd do the following: we want to compute the features from a deep layer of a ConvNet so that we can ultimately compare new images with a base image. Let's say this deep layer gives you the feature vector f. Now, create a dataset of image pairs, each with a label y: say, y = 1 if both images show the same person as the base image and y = 0 if they show different people. Then calculate the element-wise absolute difference of the two feature vectors and feed it into a logistic regression unit to get y_hat: y_hat = sigmoid(np.sum(np.multiply(W, np.abs(f1 - f2))) + b), where W is a weight vector with the same length as f and b is a scalar bias. You will have to create a "Siamese" network with two identical ConvNets, one giving you f1 for one image and the other giving you f2 for the other image of the same example pair. The two branches of a Siamese network must share exactly the same weights, so you need to ensure that their weights stay identical at all times. Once you train this new network, you should get the desired results.
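A minimal NumPy sketch of the forward pass of such a network, under loud assumptions: the "ConvNet" is replaced by a single random linear layer with ReLU (embed), the weights are random rather than trained, and all names (embed, predict_same_person, conv_weights) are hypothetical. The point is only that both images go through the same weights and that the logistic unit sees the weighted element-wise absolute difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embed(image, conv_weights):
    # Stand-in for the shared ConvNet: one linear map followed by ReLU.
    return np.maximum(conv_weights @ image.ravel(), 0.0)

def predict_same_person(img1, img2, conv_weights, W, b):
    # Both branches use the SAME conv_weights, which is what makes it Siamese.
    f1 = embed(img1, conv_weights)
    f2 = embed(img2, conv_weights)
    # Logistic regression on the element-wise absolute difference.
    return sigmoid(np.sum(W * np.abs(f1 - f2)) + b)

rng = np.random.default_rng(0)
conv_weights = 0.1 * rng.normal(size=(16, 8 * 8))  # untrained, illustrative only
W = rng.normal(size=16)
b = 0.0

img_a = rng.random((8, 8))
img_b = rng.random((8, 8))

p_ab = predict_same_person(img_a, img_b, conv_weights, W, b)
p_aa = predict_same_person(img_a, img_a, conv_weights, W, b)
```

With identical inputs the difference vector is zero, so the output reduces to sigmoid(b). In a real setup the shared weights, W and b would all be trained jointly on the labelled pairs.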
So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does two things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That's already working quite well and I'm happy with it.
Second, I'd like to hack the training (and possibly the layers) so that prediction is done in parallel on multiple images at a time. In the MNIST example, prediction (or output) would be done for an image made of 10 digits that I have concatenated. For clarity, each 10-image input will contain the digits 0-9 exactly once each. The key here is that each of the 10 digits gets a unique class/label from the CNN/ResNet and each class gets assigned exactly once, and that digits with high confidence prevent digits with lower confidence from using that label (a Hungarian-algorithm type of approach).
So in my use case I want to train on concatenated images (not single images) as in Fig A below and force the classifier to learn to predict the best unique label for each of the concatenated images and do this all at once. Such an approach should outperform single image classification - and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals which is impossible in my application.
I can already predict in series as in Fig B below. And indeed, by looking at the confidence of each prediction, I am able to implement a Hungarian-algorithm-like approach post-prediction to assign the best (most confident) unique IDs in each batch of 4 animals. But this doesn't always work, and I'm wondering whether ResNet can learn the greedy Hungarian assignment as well.
In particular, it's not clear that implementing A simply by augmenting the input data and labels in the training set will do it automatically, because I don't know how to penalize or disallow returning the same label twice for each group of images. So for now I can generate these training datasets like this:
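For reference, the post-prediction unique assignment could look roughly like the sketch below. It is not the Hungarian algorithm itself but an exhaustive search over label permutations, which finds the same optimum and is only practical for small groups such as 4 animals; the confidence values and the function name best_unique_assignment are made up. For larger groups, scipy.optimize.linear_sum_assignment solves the same problem in polynomial time.

```python
from itertools import permutations

def best_unique_assignment(confidences):
    """Return the label permutation with the highest total confidence.

    confidences[i][j] is the classifier's confidence that image i has label j.
    Exhaustive search over all permutations: fine for 4 images, not for 100.
    """
    n = len(confidences)
    best_labels, best_score = None, float("-inf")
    for labels in permutations(range(n)):
        score = sum(confidences[i][labels[i]] for i in range(n))
        if score > best_score:
            best_labels, best_score = labels, score
    return list(best_labels), best_score

# Made-up softmax outputs: images 0 and 1 both prefer label 0.
confidences = [
    [0.60, 0.30, 0.05, 0.05],
    [0.55, 0.35, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.15, 0.10, 0.70],
]

labels, score = best_unique_assignment(confidences)
print(labels)  # [0, 1, 2, 3]
```

A plain per-image argmax would return label 0 for both of the first two images; the assignment step gives image 1 its second choice instead.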
print (train_loader.dataset.data.shape)
print (train_loader.dataset.targets.shape)
torch.Size([60000, 28, 28])
torch.Size([60000])
And I guess I would want the targets to be [60000, 10]. And each input image would be [1, 28, 28, 10]? But I'm not sure what the correct approach would be.
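One possible way to build such grouped data is sketched below, as an assumption rather than a known-correct recipe. It uses NumPy arrays as a stand-in for train_loader.dataset.data and .targets, and it puts the group axis second, giving [N, 10, 28, 28] (which could be read as 10 channels in PyTorch's [N, C, H, W] layout) instead of [1, 28, 28, 10]. The function name make_groups is hypothetical. Note that this only shapes the data; it does not by itself stop a network from predicting the same label twice.

```python
import numpy as np

def make_groups(data, targets, n_groups, n_classes=10, seed=0):
    """Build groups that contain every class exactly once, in shuffled order."""
    rng = np.random.default_rng(seed)
    by_class = [np.flatnonzero(targets == c) for c in range(n_classes)]
    group_images, group_labels = [], []
    for _ in range(n_groups):
        order = rng.permutation(n_classes)               # digit shown in each slot
        picks = [rng.choice(by_class[c]) for c in order]  # one sample per digit
        group_images.append(data[picks])                 # [n_classes, H, W]
        group_labels.append(order)
    return np.stack(group_images), np.stack(group_labels)

# Synthetic stand-in for the MNIST tensors (blank images, cyclic labels)
data = np.zeros((200, 28, 28), dtype=np.uint8)
targets = np.arange(200) % 10

images, labels = make_groups(data, targets, n_groups=5)
print(images.shape)  # (5, 10, 28, 28)
print(labels.shape)  # (5, 10)
```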
Any advice or available links?
I think this is a specific type of training, but I forgot the name.
I am doing an image classification project using a CNN in Keras. I have a dataset of about 900 photos of about 70 people. Each person has multiple photos taken at different ages.
My goal is to predict the correct ID of a person when any one of his photos is given as input.
Here is the glimpse of the data.
My questions are:
1-What should be my target column? Is the target 'AGE' or 'ID'?
2-Do I need to do one-hot encoding of the target column? For example, if I use ID as my target, do I have to one-hot encode the ID column? If I use ID as my target, does it mean that after one-hot encoding I will have 70 classes?
3-I need information about the output layer. My goal is to find whether the photo belongs to the same ID or not, so what should the output layer be? Shall I use softmax with 70 outputs?
4-Another question about the output layer: can I use a softmax with 70 outputs and then feed it to a sigmoid layer with a single output?
You are going to identify the same person from images taken at different ages. For example, if the dataset has 100 different images of Khan and you train a model on them, then when you provide a 101st image of Khan, the model should identify him. So your target column should be ID.
Yes, you need one-hot encoding: there are 70 classes, so you get a one-hot encoded matrix of shape 900x70.
The output should be a softmax layer, because a sigmoid layer is used for binary or multilabel problems. As you have to tell 70 different people apart, you need a softmax layer with 70 outputs.
I don't think feeding the softmax into a single sigmoid output would work: that way your model would not be able to tell which person is in the image provided as a test.
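To make the 900x70 target matrix concrete, here is a small NumPy sketch. The IDs and logits are random placeholders (the real IDs would come from the dataset's ID column, mapped to integers 0..69, and the logits from the network), so this only illustrates the shapes and how a 70-way softmax output maps back to a person ID.

```python
import numpy as np

n_photos, n_people = 900, 70
rng = np.random.default_rng(0)

# Hypothetical integer IDs in 0..69, one per photo
person_ids = rng.integers(0, n_people, size=n_photos)

# One-hot targets: row i has a single 1 in column person_ids[i]
one_hot = np.eye(n_people, dtype=np.float32)[person_ids]
print(one_hot.shape)  # (900, 70)

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Placeholder for the network's raw outputs (one row of 70 scores per photo)
logits = rng.normal(size=(n_photos, n_people))
probs = softmax(logits)
predicted_ids = probs.argmax(axis=1)  # most probable person per photo
```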
Let me start by saying that I have 2 pre-trained models (in hdf5 files):
The first model is a YOLO-based model, trained on dataset A, which is used to locate humans in any image (note that a training image for this model may contain many people).
The second model is a CNN model which is used to classify the gender of a person (male or female) based on an image that contains only 1 person.
Suppose that I only want to use these 2 models and do not want to re-train them or modify anything in the dataset. How could I locate the female persons in a picture from dataset A?
A possible solution that I think could work:
First, use the first model to detect people, that is, to create bounding boxes around the persons in the image.
Crop the bounding boxes into separate images. Feed those images to the second model to see whether each person is female or male.
However, this solution is slow. So is there any way to speed up this solution or to perform this task in a different way?
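The two steps above can be sketched as a small function. Everything model-related here is a stand-in: detect_people and classify_gender are hypothetical callables that would wrap the two hdf5 models, and the toy versions below exist only so the sketch runs. Boxes are assumed to be (x1, y1, x2, y2) pixel coordinates.

```python
import numpy as np

def find_females(image, detect_people, classify_gender):
    """Return the bounding boxes of detected people classified as female."""
    female_boxes = []
    for (x1, y1, x2, y2) in detect_people(image):
        crop = image[y1:y2, x1:x2]  # cut the person out of the full image
        if crop.size == 0:
            continue                # skip degenerate boxes
        if classify_gender(crop) == "female":
            female_boxes.append((x1, y1, x2, y2))
    return female_boxes

# Toy stand-ins for the two pre-trained models
def fake_detector(image):
    return [(0, 0, 4, 4), (4, 0, 8, 4)]

def fake_classifier(crop):
    return "female" if crop.mean() > 0.5 else "male"

image = np.zeros((4, 8))
image[:, 4:] = 1.0  # right half is "bright", so the toy classifier says female

print(find_females(image, fake_detector, fake_classifier))  # [(4, 0, 8, 4)]
```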
There are 2 parts to this question. Suppose we are looking at sales S of a product across $> 1000$ stores where it is sold. For each of these 1000 stores we have 24 months of recorded data.
We want to be able to predict S_t <- f(S_{t-1}). We could build an RNN for each store's time series, calculate the test RMSE, and then take an average after taking care of normalizing values etc. But the problem is that there are very few samples per time series. If we were to segment the stores into groups (by, say, Dynamic Time Warping), could we create an analogue of text sentiment mining where, just as two sentences in a text are separated by a dot, two time series would be separated by, say, a special symbol? In that case, we would train an RNN model on
Train_1 | Train_2 |...|Train_t
data and predict on
Test_1 | Test_2 |...|Test_t
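One way to read the separator idea is that a single model is trained on all stores in a segment, but no training pair may cross a store boundary. A rough NumPy sketch of that data preparation for S_t <- f(S_{t-1}) follows. The sales are synthetic, the 18/6-month split is an arbitrary assumption, and lag1_pairs is a hypothetical helper; the model itself is left out.

```python
import numpy as np

def lag1_pairs(series):
    """Turn each row [S_1..S_T] into (S_{t-1}, S_t) pairs, store by store."""
    x = series[:, :-1].reshape(-1, 1)  # inputs  S_{t-1}
    y = series[:, 1:].reshape(-1)      # targets S_t
    return x, y

rng = np.random.default_rng(0)
sales = rng.random((1000, 24)) * 100   # synthetic: 1000 stores x 24 months

# First 18 months for training; the test part starts at month 18 so that
# the first test target (month 19) still has its input available.
train, test = sales[:, :18], sales[:, 17:]

# Normalize each store with statistics from its training months only
mean = train.mean(axis=1, keepdims=True)
std = train.std(axis=1, keepdims=True) + 1e-8
train_n, test_n = (train - mean) / std, (test - mean) / std

x_train, y_train = lag1_pairs(train_n)
x_test, y_test = lag1_pairs(test_n)
print(x_train.shape, x_test.shape)  # (17000, 1) (6000, 1)
```

Because the pairs are built per row before flattening, the last month of one store is never paired with the first month of the next.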
After this, we would like to set it up as a panel data problem where S_t <- f(x_{t1},x_{t2},...,x_{tn}). In that case, should I build a separate neural network for each t and then connect the hidden layers from t -> t+1 -> t+2 ...?
How should I implement these through packages like Keras/Theano/Mxnet etc.? Any help would be great!
For your first question, implementing this in MXNet Gluon is very straightforward. You can formulate your problem as an autoregression problem so that it has no dependency on the sequence length, or you can formulate it as a single prediction and require a specific sequence length for S in order to predict S_t. Either way, this Gluon tutorial can help get you started.
I'm using SVM to classify clinical images of patients belonging to two different groups (patients vs. controls). I use PCA to extract a vector of features from each image, but I'd like to add other clinical information (for example, the output value of a clinical exam) in order to include it in the classification process.
Is there a way to do this?
I didn't find exhaustive suggestions in literature.
Thanks in advance.
You could just append the new information at the end of each sample. Another approach you could try is to use two additional classifiers: a second one trained only on the additional information, and a third one that takes the outputs of the other two classifiers as input to produce a final prediction.
The question is pretty old, but I'll post my answer anyway.
If you have to scale your values, make sure that the new values are scaled to a range similar to that of the values in your PCA vector.
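A minimal sketch of the first suggestion (appending the clinical value to each PCA vector), using random placeholder data and a hypothetical standardize helper. Both parts are standardized so that neither dominates the SVM purely because of its units; in practice the mean and standard deviation should be computed on the training set only and reused for the test set.

```python
import numpy as np

def standardize(columns):
    """Scale each column to zero mean and unit variance."""
    mean = columns.mean(axis=0, keepdims=True)
    std = columns.std(axis=0, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (columns - mean) / std

rng = np.random.default_rng(0)
pca_features = rng.normal(size=(40, 10))                       # 40 images x 10 PCA components
exam_values = rng.normal(loc=120.0, scale=15.0, size=(40, 1))  # one clinical value per patient

# One combined feature vector per patient: PCA components, then the exam value
combined = np.hstack([standardize(pca_features), standardize(exam_values)])
print(combined.shape)  # (40, 11)
```

The combined matrix can then be passed to the SVM in place of the PCA features alone.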
If your PCA vectors of features have constant length, you just start enumerating your features from length+1 e.g. for SVM input (libsvm):
1 1:<PCAval1> ... N:<PCAvalN> N+1:<Clinical exam value 1> ...
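As an illustration of that layout, the following helper (hypothetical name to_libsvm_line) writes one sample in the "label index:value" text format, numbering the PCA values from 1 and continuing the numbering for the clinical values. The numbers are made up.

```python
def to_libsvm_line(label, pca_values, clinical_values):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' with 1-based indices."""
    features = list(pca_values) + list(clinical_values)
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"

# Three PCA values followed by one clinical exam value
line = to_libsvm_line(1, [0.25, -1.5, 0.75], [42.0])
print(line)  # 1 1:0.25 2:-1.5 3:0.75 4:42
```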
I ran a test adding such general features for cell recognition, and the accuracy improved.
This Guide describes how to use enumerator-features.
P.S.:
In my test I isolated the cells from a microscope image and shrank each one to a 16x16 matrix. Each pixel in this matrix was a feature, giving 256 features. Additionally, I added some features such as original size, moments, etc.