When to use to_categorical in Keras?

Can someone please explain when keras.utils.np_utils.to_categorical should be used?
I understand that it converts a class vector into a binary (one-hot) matrix, presumably for use in a deep learning model.
But if we go ahead and use the integer class vector itself, and then use model.predict_classes - what is the drawback?

A classification model with multiple classes doesn't work well unless the classes are encoded as a binary matrix.
Suppose you have three classes; the vectors look like this:
[1, 0, 0] = class 1
[0, 1, 0] = class 2
[0, 0, 1] = class 3
You use to_categorical to transform your training data before you pass it to your model.
If your training data uses classes as numbers, to_categorical transforms those numbers into proper one-hot vectors for use with models. With a categorical-crossentropy loss, you can't train a classification model without that.
Unfortunately, predict_classes is not documented, so it's probably better not to use it. But I suppose it does exactly the inverse of what to_categorical does: your model outputs the class-probability vectors, and predict_classes turns those vectors back into human-readable class indices.
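A minimal sketch of the round trip (to_categorical and np.argmax are real APIs; the model itself is assumed to already exist):
import numpy as np
from keras.utils.np_utils import to_categorical  # newer Keras: keras.utils.to_categorical

labels = np.array([0, 2, 1, 2])                  # integer class ids
one_hot = to_categorical(labels, num_classes=3)  # shape (4, 3), rows like [1, 0, 0]

# after training, the model outputs one value per class;
# argmax recovers the integer class id (what predict_classes did)
# predicted_ids = np.argmax(model.predict(x), axis=-1)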

I know this is an old thread, but figured I'd help clarify.
The reason you want to_categorical (even on numeric labels) is how the algorithm interprets the relationship between your labels.
For example, suppose you made a color classifier and marked red as 1, blue as 2, and orange as 3.
When you feed these numbers into the learning algorithm to decide what your input matches, the math will say that orange is higher than red. That ordering obviously isn't your intent, but the network will learn it anyway.
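To make that concrete, here is a small sketch of the same color example, shifted to the 0-based ids that to_categorical expects:
from keras.utils.np_utils import to_categorical

# red = 0, blue = 1, orange = 2
print(to_categorical([0, 1, 2], num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
# each class vector is equally distant from every other one,
# so the network no longer treats orange as "greater than" red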

Related

Training/Predicting with CNN / ResNet on all classes each iteration - concatenation of input data + Hungarian algorithm

So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does two things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That part is already working quite well and I'm happy with it.
Second, I'd like to hack the training (and possibly the layers) so that prediction is done in parallel on multiple images at a time. In the MNIST example, prediction (or output) would basically be done for an image that contains 10 digits at a time, concatenated by me. For clarity, each 10-image input contains the digits 0-9 exactly once each. The key here is that each of the 10 digits gets a unique class/label from the CNN/ResNet, each class is assigned exactly once, and digits predicted with high confidence prevent digits with lower confidence from using that label (a Hungarian-algorithm type of approach).
So in my use case I want to train on concatenated images (not single images), as in Fig A below, and force the classifier to learn to predict the best unique label for each of the concatenated images, all at once. Such an approach should outperform single-image classification, and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals, which is impossible in my application.
I can already predict in series, as in Fig B below. And indeed, looking at the confidence of each prediction, I am able to implement a Hungarian-algorithm-like approach post-prediction to assign the best (most confident) unique IDs within each batch of 4 animals. But this doesn't always work, and I'm wondering if ResNet can try to learn the greedy Hungarian assignment as well.
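For reference, the post-prediction assignment step described above can be written with scipy's Hungarian solver; this is only a sketch of that step, with a made-up confidence matrix:
import numpy as np
from scipy.optimize import linear_sum_assignment

# conf[i, j] = predicted confidence that animal/image i has ID j
conf = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.6, 0.3, 0.1, 0.0],
                 [0.1, 0.1, 0.7, 0.1],
                 [0.1, 0.2, 0.1, 0.6]])

# the Hungarian algorithm minimizes total cost, so negate the confidences
row_idx, assigned_ids = linear_sum_assignment(-conf)
# assigned_ids now holds one unique ID per image, maximizing total confidence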
In particular, it's not clear that simply augmenting the data input and labels in the training set will implement A automatically, because I don't know how to penalize or disallow returning the same label twice within each group of images. For now I can generate these training datasets like this:
print (train_loader.dataset.data.shape)
print (train_loader.dataset.targets.shape)
torch.Size([60000, 28, 28])
torch.Size([60000])
And I guess I would want the targets to be [60000, 10]. And each input image would be [1, 28, 28, 10]? But I'm not sure what the correct approach would be.
Any advice or available links?
I think this is a specific type of training, but I forgot the name.

Keras - understand example

I am currently trying to learn deep learning by focusing on Keras and the book "Deep Learning with Python" (which uses Keras).
I have an example where I understand the code but not the result, and that is where I need your help. The example analyzes movie reviews from the IMDB dataset that is included with Keras. The code goes as follows:
import numpy as np
from keras import models, layers
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels)
y_test = np.asarray(test_labels)

model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(10000,)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=4, batch_size=512)
In the explanation it is written that "the final layer will use a sigmoid activation so as to output a probability indicating how likely the sample is to have the target “1”".
I know that the sigmoid function ranges between [0,1]. Suppose the output of my network is 0.6
Why am I allowed to say that this value gives the probability to have the target "1" and not the target "0"?
I am kind of stuck and need some help :)
The interpretation of your output depends on the labels you used during training. So train_labels and test_labels consist of 0s and 1s.
During training, the network is optimized to yield the correct label for each input sequence. If your output is exactly 0 or 1, the network is giving a confident classification. But if your output is e.g. 0.5, the network is totally unsure which class your input belongs to.
Now assume your input actually belongs to class 1. With an output of 0.6, the predicted class is 1, but only with a confidence of 60 percent. The output describes the probability of class 1 because an output of 1 would match the label perfectly, while an output of 0 would be the worst possible classification for an input labelled 1. So the output ranges from 0 to 1, and the closer it is to 1 the better the classification - which is why it can be read as a probability.
But keep in mind that this reading only holds if you know that your input belongs to class 1. If it instead belongs to class 0, the interpretation has to be turned around.
So in the end you have two options. First, you can take these values as they are and treat them as the probability that an input belongs to class 1. Second, you can introduce a threshold - here it makes sense to set it to 0.5 - and say that if the output is larger than the threshold, the input is categorized as class 1, else as class 0. The closer the output is to 0.5, the more the network is just guessing.
The choice of the threshold has a direct influence on the performance of your network in the end. This can be evaluated for example with a ROC curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
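As a small sketch of the second option, reusing the model and X_test from the question's code:
probs = model.predict(X_test)                     # sigmoid outputs in [0, 1]
predicted_labels = (probs > 0.5).astype("int32")  # 1 if above the threshold, else 0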

Using regression instead of classification for multi class classification

I have a multi-class classification problem and I am using a random forest classifier. My boss has asked whether it is also possible to view our problem as a regression problem. I understand that for a classification task it is of course better to use a classifier, but is it possible to implement a regression model?
My data is as such:
I have a dataset consisting of software requirements, these are rated as either 1, 2, 3, 4 or 5.
I am then creating a feature matrix to use for training the model to make predictions on the class, with 10 features such as: num_words, num_sentences, num_syllables, weak_words, flesh_idx etc
My model works quite well with 93% accuracy.
Is there a way I can view this problem using regression? Such that the model would make predictions such as 1.5 for example, where the prediction doesn't fall into the class 1 or 2 but somewhere in the middle? Or maybe 2.2, 3.3 etc as opposed to 1, 2, 3, 4, or 5?
I guess the reason is just to see if we can see the software requirement scores in a continuous way.
Try softmax regression (or multinomial logistic regression), for example with mxnet or with tensorflow.
The way you can use regression for classification problems is with logistic regression. You can use it one-vs-rest to classify 1 vs not-1, 2 vs not-2, and so on for each class (don't do this), or use softmax, which in simple words scores each class and returns a probability for every class; you then pick the class with the maximum probability as your prediction. A lot of neural networks use softmax when working with multi-class classification.
Here is a great article from scikit-learn's documentation:
https://scikit-learn.org/stable/modules/neural_networks_supervised.html
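A minimal sketch of the softmax-regression idea with scikit-learn; X, y and X_test are assumed to already exist, with y holding the labels 1-5. As an extra illustration (not part of the answer above), the probability-weighted average of the class values gives the kind of continuous score the question asks about:
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
clf.fit(X, y)                        # y contains the labels 1..5

proba = clf.predict_proba(X_test)    # shape (n_samples, 5), one probability per class
hard_labels = clf.classes_[np.argmax(proba, axis=1)]  # usual classification output
soft_scores = proba @ clf.classes_   # e.g. 2.2, 3.3, ... instead of hard 1..5 labels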

How to weight imbalanced classes for the cross-entropy loss in Keras?

I am using Keras for image segmentation on a highly imbalanced dataset, and I want to re-weight the classes proportionally to the pixel counts of each class, as described here. If I have binary classes with weights = [0.8, 0.2], how can I modify K.sparse_categorical_crossentropy(y_true, y_pred) to re-weight the loss according to the class each pixel belongs to?
The input has shape (4, 256, 256, 1) (batch, height, width, channels) and the output is a vector of 0s and 1s of shape (4, 65536, 1) (positive and negative class). The model and data are similar to the one here, with the difference that the images are grayscale and the masks are binary (2 classes).
This is the custom loss function I used for my semantic segmentation project. It is modified from the categorical_crossentropy function found in keras/backend/tensorflow_backend.py.
import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    # clip to avoid log(0)
    output = tf.clip_by_value(output, 10e-8, 1. - 10e-8)
    weights = [0.8, 0.2]  # per-class weights, broadcast over the last (class) axis
    return -tf.reduce_sum(target * weights * tf.log(output))
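If you go this route, the function can be passed straight to compile; model here stands for whatever segmentation network you are using:
model.compile(optimizer="adam", loss=class_weighted_pixelwise_crossentropy)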
Note that my final version did not use class weighting - I found that it encouraged the model to use the underrepresented classes as filler for patches of the image that it was unsure about instead of making more realistic guesses, and thereby hurt performance.
Jessica's answer is clean and works well. I generally recommend it. But for the sake of variety:
I have found that sampling regions of interest with a better ratio between the classes is an effective way to quickly learn skewed pixelwise classes.
In my case I had two classes like you, which makes things easier. I look for areas in the image where the less-represented class appears and crop a constant-size bounding box around them with some random offset (I repeat the process multiple times per image). This yields a large set of small images with fairly equal ratios of each class.
I should add that the network has to be built with an input shape of (None, None, num_channels) for this to work on your original-size images afterwards.
Because you skip the vast majority of pixels (those belonging to the majority class), training is very fast, but it doesn't leverage all of the data for the majority class.
In TensorFlow 2.x, the model.fit method has a class_weight argument that does this natively: you pass a dictionary mapping each class index to its weight (see the Keras documentation).
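A minimal sketch of that argument, using the weights from the question (x_train and y_train are placeholders for your own data):
model.fit(x_train, y_train,
          epochs=10,
          batch_size=4,
          class_weight={0: 0.8, 1: 0.2})  # class index -> weight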

Quantifying Text Keywords for Neural Network Analysis

I am working on a small research project. I am looking to write a program that
a) Takes a large number of short texts (~100 words each, several thousand texts)
b) Identifies keywords in the texts
c) Presents all of them to a group of users who indicate whether they found them interesting or not
d) Has the software learn which keywords or combinations are likely to be preferable. Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification.
Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
2) How to prepare the dataset to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
This is a way to start with (a minimal sketch follows the list):
clean your input text (remove special tokens etc.)
use n-grams as features (you can start with just 1-grams).
treat the user's feedback "preferable or not" as a binary label.
learn a binary classifier (whatever model works: naive Bayes, logistic regression, ...).
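A minimal sketch of those steps with scikit-learn; texts is assumed to be a list of strings and labels a list of 0/1 feedback values:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),  # 1-gram bag of words
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)                    # labels: 1 = "interesting", 0 = not
print(clf.predict(["some new short text"]))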
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
You can skip this part in the first model you build. Treat each text as a bag of words (n-grams) to simplify the first working model. If you want, you can add this as a feature weight later.
2) How to prepare the dataset to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. For each training example the features are sparse, so your training examples look like the ones below:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
...
To use neural networks, you will need an embedding layer to turn the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features.
Using the same example as above, suppose we use a 4-dimensional embedding:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
At prediction time, you will need to use the same dictionary and the same feature-extraction steps (cleanup / n-gram generation / mapping n-grams to ids) so that your model understands the input.
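A small numpy sketch of that aggregation, using the same made-up ids and 4-dimensional embeddings as above:
import numpy as np

embedding = {                        # toy embedding table
    34:    [0.1, 0.2, -0.3,  0.0],
    68:    [0.0, 0.1, -0.1,  0.2],
    79293: [0.3, 0.0,  0.12, 0.0],
    23232: [0.4, 0.0,  0.0,  0.0],
}

feature_ids = [34, 68, 79293, 23232]
summed = np.sum([embedding[i] for i in feature_ids], axis=0)
dense = summed / np.sum(np.abs(summed))  # L1-normalize
print(dense)                             # roughly [0.51, 0.19, -0.18, 0.13]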
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False)
X_train = vectorizer.fit_transform(list_of_texts)
print(X_train.shape)
X_train is a scipy CSR sparse matrix. If your NN implementation doesn't support sparse matrices, you can convert it to a dense numpy matrix, but it might fill your RAM; it's better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does).
After training, you can use the parameters of the NN to find out which features have a high/low weight and so are more/less important for the particular label.
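A sketch of that inspection for the simplest case of a linear classifier clf trained on X_train (for a neural network you would look at the first-layer weights instead); get_feature_names_out needs a recent scikit-learn, older versions call it get_feature_names:
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
weights = clf.coef_[0]               # one weight per TF-IDF feature
top = np.argsort(weights)[-10:]      # features pushing hardest toward the positive label
print(list(zip(feature_names[top], weights[top])))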
