How to predict target label using scikit learn - python-3.x

Let's say I have a dataset, I will provide a toy example in this instance...
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
target = "A"
...which generates...
A B C D
0 75 38 81 58
1 36 92 80 79
2 22 40 19 3
... ...
This is clearly not enough data to give a good accuracy, but nevertheless, let's say I feed data and target to a random forest algorithm provided by scikit learn...
def random_forest(target, data):
# Drop the target label, which we save separately.
X = data.drop([target], axis=1).values
y = data[target].values
# Run Cross Validation on Random Forest Classifier.
clf_tree = ske.RandomForestClassifier(n_estimators=50)
unique_permutations_cross_val(X, y, clf_tree)
unique_permutations_cross_val is simply a cross-validation function I made, this is the function (it prints out the accuracy of the model as well)...
def unique_permutations_cross_val(X, y, model):
# Split data 20/80 to be used in a K-Fold Cross Validation with unique permutations.
shuffle_validator = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
# Calculate the score of the model after Cross Validation has been applied to it.
scores = model_selection.cross_val_score(model, X, y, cv=shuffle_validator)
# Print out the score (mean), as well as the variance.
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
Anyways, my main question is, how can I predict the target label using this model I created. For example, let's say I feed the model [28, 12, 33]. I want the model to predict the target which in this case is "A".

This model in the posted code is not fitted yet. You did cross validation, which will tell you how well (or not) the model is training on your data, but it will not fit the model object as you want. cross_val_score() uses clones of the supplied model object to find the scores.
For predicting the data, you need to explicitly call fit() on the model.
So maybe you can edit your random_forest method to return the fitted model. Something like this:
unique_permutations_cross_val(X, y, clf_tree)
clf_tree.fit(X, y)
return clf_tree
And then wherever you are calling the random_forest method, you can do this:
fitted_model = random_forest(target, data)
predictions = fitted_model.predict([data to predict])

Related

Threshold does not work on numpy array for accuracy metric

I am trying to implement logistic regression from scratch using numpy. I wrote a class with the following methods to implement logistic regression for a binary classification problem and to score it based on BCE loss or Accuracy.
def accuracy(self, true_labels, predictions):
"""
This method implements the accuracy score. Where the accuracy is the number
of correct predictions our model has.
args:
true_labels: vector of shape (1, m) that contains the class labels where,
m is the number of samples in the batch.
predictions: vector of shape (1, m) that contains the model predictions.
"""
counter = 0
for y_true, y_pred in zip(true_labels, predictions):
if y_true == y_pred:
counter+=1
return counter/len(true_labels)
def train(self, score='loss'):
"""
This function trains the logistic regression model and updates the
parameters based on the Batch-Gradient Descent algorithm.
The function prints the training loss and validation loss on every epoch.
args:
X: input features with shape (num_features, m) or (num_features) for a
singluar sample where m is the size of the dataset.
Y: gold class labels of shape (1, m) or (1) for a singular sample.
"""
train_scores = []
dev_scores = []
for i in range(self.epochs):
# perform forward and backward propagation & get the training predictions.
training_predictions = self.propagation(self.X_train, self.Y_train)
# get the predictions of the validation data
dev_predictions = self.predict(self.X_dev, self.Y_dev)
# calculate the scores of the predictions.
if score == 'loss':
train_score = self.loss_function(training_predictions, self.Y_train)
dev_score = self.loss_function(dev_predictions, self.Y_dev)
elif score == 'accuracy':
train_score = self.accuracy((training_predictions==+1).squeeze(), self.Y_train)
dev_score = self.accuracy((dev_predictions==+1).squeeze(), self.Y_dev)
train_scores.append(train_score)
dev_scores.append(dev_score)
plot_training_and_validation(train_scores, dev_scores, self.epochs, score=score)
after testing the code with the following input
model = LogisticRegression(num_features=X_train.shape[0],
Learning_rate = 0.01,
Lambda = 0.001,
epochs=500,
X_train=X_train,
Y_train=Y_train,
X_dev=X_dev,
Y_dev=Y_dev,
normalize=False,
regularize = False,)
model.train(score = 'loss')
i get the following results
however when i swap the scoring metric to measure over time from loss to accuracy ass follows model.train(score='accuracy') i get the following result:
I have removed normalization and regularization to make sure i am using a simple implementation of logistic regression.
Note that i use an external method to visualize the training/validation score overtime in the LogisticRegression.train() method.
The trick you are using to create your predictions before passing into the accuracy method is wrong. You are using (dev_predictions==+1).
Your problem statement is a Logistic Regression model that would generate a value between 0 and 1. Most of the times, the values will NOT be exactly equal to +1.
So essentially, every time you are passing a bunch of False or 0 to the accuracy function. I bet if you check the number of classes in your datasets having the value False or 0 would be :
exactly 51.7 % in validation dataset
exactly 56.2 % in training dataset.
To fix this, you can use a in-between threshold like 0.5 to generate your labels. So use something like dev_predictions>0.5

keras: unsupervised learning with external constraint

I have to train a network on unlabelled data of binary type (True/False), which sounds like unsupervised learning. This is what the normalised data look like:
array([[-0.05744527, -1.03575495, -0.1940105 , -1.15348956, -0.62664491,
-0.98484037],
[-0.05497629, -0.50935675, -0.19396862, -0.68990988, -0.10551919,
-0.72375012],
[-0.03275552, 0.31480204, -0.1834951 , 0.23724946, 0.15504367,
0.29810553],
...,
[-0.05744527, -0.68482282, -0.1940105 , -0.87534175, -0.23580062,
-0.98484037],
[-0.05744527, -1.50366446, -0.1940105 , -1.52435329, -1.14777063,
-0.98484037],
[-0.05744527, -1.26970971, -0.1940105 , -1.33892142, -0.88720777,
-0.98484037]])
However, I do have a constraint on the total number of True labels in my data. This doesn't mean I can build a classical custom loss function in Keras taking (y_true, y_pred) arguments as required: my external constraint is just on the predicted total of True and False, not on the individual labels.
My question is whether there is a somewhat "standard" approach to this kind of problems, and how that is implementable in Keras.
POSSIBLE SOLUTION
Should I assign y_true randomly as 0/1, have a network return y_pred as 1/0 with a sigmoid activation function, and then define my loss function as
sum_y_true = 500 # arbitrary constant known a priori
def loss_function(y_true, y_pred):
loss = np.abs(y_pred.sum() - sum_y_true)
return loss
In the end, I went with the following solution, which worked.
1) Define batches in your dataframe df with a batch_id column, so that in each batch Y_train is your identical "batch ground truth" (in my case, the total number of True labels in the batch). You can then pass these instances together to the network. This can be done with a generator:
def grouper(g,x,y):
while True:
for gr in g.unique():
# this assigns indices to the entire set of values in g,
# then subsects to all the rows in which g == gr
indices = g == gr
yield (x[indices],y[indices])
# train set
train_generator = grouper(df.loc[df['set'] == 'train','batch_id'], X_train, Y_train)
# validation set
val_generator = grouper(df.loc[df['set'] == 'val','batch_id'], X_val, Y_val)
2) define a custom loss function, to track how close the total number of instances predicted as true matches the ground truth:
def custom_delta(y_true, y_pred):
loss = K.abs(K.mean(y_true) - K.sum(y_pred))
return loss
def custom_wrapper():
def custom_loss_function(y_true, y_pred):
return custom_delta(y_true, y_pred)
return custom_loss_function
Note that here
a) Each y_true label is already the sum of the ground truth in our batch (cause we don't have individual values). That's why y_true is not summed over;
b) K.mean is actually a bit of an overkill to extract a single scalar from this uniform tensor, in which all y_true values in each batch are identical - K.min or K.max would also work, but I haven't tested whether their performance is faster.
3) Use fit_generator instead of fit:
fmodel = Sequential()
# ...your layers...
# Create the loss function object using the wrapper function above
loss_ = custom_wrapper()
fmodel.compile(loss=loss_, optimizer='adam')
history1 = fmodel.fit_generator(train_generator, steps_per_epoch=total_batches,
validation_data=val_generator,
validation_steps=df.loc[encs.df['set'] == 'val','batch_id'].nunique(),
epochs=20, verbose = 2)
This way the problem is basically addressed as one of supervised learning, although without individual labels, which means that notions like true/false positive are meaningless here.
This approach not only managed to give me a y_pred that closely matches the totals I know per batch. It actually finds two groups (True/False) that occupy the expected different portions of parameter space.

How to define custom cost function that depends on input when using ImageDataGenerator in Keras?

I would like to define a custom cost function
def custom_objective(y_true, y_pred):
....
return L
that will depend not only on y_true and y_pred, but on some feature of the corresponding x that produced y_pred. The only way I can think of doing this is to "hide" the relevant features in y_true, so that y_true = [usual_y_true, relevant_x_features], or something like that.
There are two main problems I am having with implementing this:
1) Changing the shape of y_true means I need to pad y_pred with some garbage so that their shapes are the same. I can do this by modyfing the last layer of my model
2) I used data augmentation like so:
datagen = ImageDataGenerator(preprocessing_function=my_augmenter)
where my_augmenter() is the function that should also give me the relevant x features to use in custom_objective() above. However, training with
model.fit_generator(datagen.flow(x_train, y_train, batch_size=1), ...)
doesn't seem to give me access to the features calculated with my_augmenter.
I suppose I could hide the features in the augmented x_train, copy them right away in my model setup, and then feed them directly into y_true or something like that, but surely there must be a better way to do this?
Maybe you could create a two part model with:
Inner model: original model that predicts desired outputs
Outer model:
Takes y_true data as inputs
Takes features as inputs
Outputs the loss itself (instead of predicted data)
So, suppose you already have the originalModel defined. Let's define the outer model.
#this model has three inputs:
originalInputs = originalModel.input
yTrueInputs = Input(shape_of_y_train)
featureInputs = Input(shape_of_features)
#the original outputs will become an input for a custom loss layer
originalOutputs = originalModel.output
#this layer contains our custom loss
loss = Lambda(innerLoss)([originalOutputs, yTrueInputs, featureInputs])
#outer model
outerModel = Model([originalInputs, yTrueInputs, featureInputs], loss)
Now, our custom inner loss:
def innerLoss(x):
y_pred = x[0]
y_true = x[1]
features = x[2]
.... calculate and return loss here ....
Now, for this model that already contains a custom loss "inside" it, we don't actually want a final loss function, but since keras demands it, we will use the final loss as just return y_pred:
def finalLoss(true,pred):
return pred
This will allow us to train passing just a dummy y_true.
But of course, we also need a custom generator, otherwise we can't get the features.
Consider you already have originalGenerator =datagen.flow(x_train, y_train, batch_size=1) defined:
def customGenerator(originalGenerator):
while True: #keras needs infinite generators
x, y = next(originalGenerator)
features = ____extract features here____(x)
yield (x,y,features), y
#the last y will be a dummy output, necessary but not used
You could also, if you want the extra functionality of randomizing batch order and use multiprocessing, implement a class CustomGenerator(keras.utils.Sequence) following the same logic. The help page shows how.
So, let's compile and train the outer model (this also trains the inner model so you can use it later for predicting):
outerModel.compile(optimizer=..., loss=finalLoss)
outerModel.fit_generator(customGenerator(originalGenerator), batchesInOriginalGenerator,
epochs=...)

Multi-label classification with class weights in Keras

I have a 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive output is same(i.e 10) but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and rest 990 have output 0.
For the multi-label classification, I am using 'binary-cross entropy' as cost function and 'sigmoid' as the activation function. When I tried this rule of 0.5 as the cut-off for 1 or 0. All of them were 0. I understand this is a class imbalance problem. From this link, I understand that, I might have to create extra output labels.Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
X_train, Y_train, batch_size=32, epochs=50,verbose=0,shuffle=False)
Could anyone help me with the code here and I would also highly appreciate if you could suggest a good 'accuracy' metric for this problem?
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metric there are several possibilities: In my case I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same amount of labels=1 you could take your top 10 results and see how many percent of them are right and average this result over your batch size. I didn't find a possibility to include this algorithm as a keras metric. Instead, I wrote a callback, which calculates the metric on epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter Class is really convenient for this purpose.
Edit: If found a method to include class weights without splitting the output.
You need a numpy 2d array containing weights with shape [number classes to predict, 2 (background and signal)].
Such an array could be calculated with this function:
def calculating_class_weights(y_true):
from sklearn.utils.class_weight import compute_class_weight
number_dim = np.shape(y_true)[1]
weights = np.empty([number_dim, 2])
for i in range(number_dim):
weights[i] = compute_class_weight('balanced', [0.,1.], y_true[:, i])
return weights
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
def get_weighted_loss(weights):
def weighted_loss(y_true, y_pred):
return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))

tflearn DNN gives zero loss

I am using pandas to extract my data. To get an idea of my data I replicated an example dataset...
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which yields a dataset of shape=(100,4)...
A B C D
0 75 38 81 58
1 36 92 80 79
2 22 40 19 3
... ...
I am using tflearn so I will need a target label as well. So I created a target label by extracting one of the columns from data and then dropped it out of the data variable (I also converted everything to numpy arrays)...
# Target label used for training
labels = np.array(data['A'].values, dtype=np.float32)
# Reshape target label from (100,) to (100, 1)
labels = np.reshape(labels, (-1, 1))
# Data for training minus the target label.
data = np.array(data.drop('A', axis=1).values, dtype=np.float32)
Then I take the data and the labels and feed it into the DNN...
# Deep Neural Network.
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='softmax')
net = tflearn.regression(net)
# Define model.
model = tflearn.DNN(net)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
This seems like it should work, but the output I get is as follows...
Notice that the loss remains at 0, so I am definitely doing something wrong. I don't really know what form my data should be in. How can I get my training to work?
Your actual output is in range 0 to 100 while the activation softmax in the outermost layer outputs in range [0, 1]. You need to fix that. Also the default loss for tflearn.regression is categorical cross entropy which is used for classification problems and makes no sense in your scenario. You should try L2 loss. The reason you are getting zero error in this setting is that your network predicts 0 for all training examples and if you fit that value in formula for sigmoid cross entropy, loss indeed is zero. Here is its formula , where t[i] denotes the actual probabilities (which doesnt make sense in your problem) and o[i] is the predicted probabilities.
Here is more reasoning about why default choice of loss function is not suitable for your case

Resources