How to add unrelated training data with an embedding layer? - Keras

I am a beginner in RNNs and would like to build a gated recurrent unit (GRU) model for predicting a user's action on an e-commerce website, the Google Merchandise Store, which sells Google-branded merchandise.
We have 5 different actions:
Add to cart
Quickview click
Product click
Remove from cart
Onsite click
My data_y, which is the target, looks like this (one column per action):
array([[0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       ...,
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)
Using only the URL (the page path) the user accessed, I achieved 68% prediction accuracy, but I am still trying to improve it by adding other inputs to the model.
My data_X looks like this:
pagePath
[googleredesign, bags]
[googleredesign, bags]
[googleredesign, electronics]
...
...
[googleredesign, bags, backpacks, home]
[googleredesign, bags, backpacks, googlealpine...
53087 rows × 2 columns
After getting the vocabulary length and the max sequence length, I tokenized it:
tokenizer = Tokenizer(num_words=vocab_length)
tokenizer.fit_on_texts(data_X['pagePath'])
sequences = tokenizer.texts_to_sequences(data_X['pagePath'])
word_index = tokenizer.word_index
model_inputs = pad_sequences(sequences, maxlen=max_seq_length)
data_X = model_inputs
This is what it looks like after tokenization:
array([[ 0,  0,  0,  1,  3],
       [ 0,  0,  0,  1,  3],
       [ 0,  0,  0,  1,  3],
       ...,
       [ 0,  1,  3, 12,  9],
       [ 0,  1,  3, 12,  9],
       [ 0,  1,  3, 12, 81]], dtype=int32)
After that, I split the data and trained the model:
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=0.3, random_state=2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(37160, 5)
(15927, 5)
(37160, 5)
(15927, 5)
embedding_dim = 64

inputs = tf.keras.Input(shape=(max_seq_length,))
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=embedding_dim,
    input_length=max_seq_length
)(inputs)
gru = tf.keras.layers.GRU(units=embedding_dim)(embedding)
outputs = tf.keras.layers.Dense(5, activation='sigmoid')(gru)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

batch_size = 32
epochs = 3

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only=True)
    ]
)
So my question is: how do I add another input to the model? For example, if I want to add a column representing the total time the user spent on the website, how do I combine it with the embedding layer, given that it is not tokenized and is unrelated to the tokenized pagePath column?

You could tokenize the main row in the dataset, I guess, and then feed the model the updated dataset. Also try fine-tuning the validation split; increasing the number of epochs may also give better results.
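For what it's worth, a common pattern for mixing a tokenized sequence with an unrelated numeric feature is the Keras functional API with two inputs, concatenating the extra feature after the GRU. A minimal sketch, not taken from the answer above; the time_on_site input name, the placeholder sizes, and the softmax output are assumptions:

import tensorflow as tf

max_seq_length, vocab_length, embedding_dim = 5, 100, 64  # placeholder values

# tokenized pagePath sequences go through the embedding + GRU as before
seq_input = tf.keras.Input(shape=(max_seq_length,), name='pagePath')
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length, output_dim=embedding_dim)(seq_input)
gru = tf.keras.layers.GRU(units=embedding_dim)(embedding)

# the untokenized scalar feature bypasses the embedding entirely
time_input = tf.keras.Input(shape=(1,), name='time_on_site')

# concatenate the GRU summary vector with the scalar feature
merged = tf.keras.layers.Concatenate()([gru, time_input])

# softmax since the five actions look mutually exclusive
# (the question used sigmoid with binary_crossentropy)
outputs = tf.keras.layers.Dense(5, activation='softmax')(merged)

model = tf.keras.Model([seq_input, time_input], outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit then takes one array per input, e.g. model.fit([X_train_seq, X_train_time], y_train, ...). Scaling the time feature (for example to [0, 1]) before feeding it in usually helps.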

Related

Representing uncertainty with torchmetrics

I am using torchmetrics to calculate metrics such as F1 score, recall, precision, and accuracy in a multilabel classification setting. With randomly initialized weights, the softmax output (i.e. the prediction) might look like this with a batch size of 8:
import torch
y_pred = torch.tensor([[0.1944, 0.1931, 0.2184, 0.1968, 0.1973],
                       [0.2182, 0.1932, 0.1945, 0.1973, 0.1968],
                       [0.2182, 0.1932, 0.1944, 0.1973, 0.1969],
                       [0.2182, 0.1931, 0.1945, 0.1973, 0.1968],
                       [0.2184, 0.1931, 0.1944, 0.1973, 0.1968],
                       [0.2181, 0.1932, 0.1941, 0.1970, 0.1976],
                       [0.2183, 0.1932, 0.1944, 0.1974, 0.1967],
                       [0.2182, 0.1931, 0.1945, 0.1973, 0.1968]])
With the correct labels (one-hot encoded):
y_true = torch.tensor([[0, 0, 1, 0, 1],
                       [0, 1, 0, 0, 1],
                       [0, 1, 0, 0, 1],
                       [0, 0, 1, 1, 0],
                       [0, 0, 1, 1, 0],
                       [0, 1, 0, 1, 0],
                       [0, 1, 0, 1, 0],
                       [0, 0, 1, 0, 1]])
And I can calculate the metrics by taking argmax:
import torchmetrics
torchmetrics.functional.f1_score(y_pred.argmax(-1), y_true.argmax(-1))
output:
tensor(0.1250)
The first prediction happens to be correct while the rest are wrong. However, none of the predictive probabilities are above 0.3, which means that the model is generally uncertain about the predictions. I would like to encode this and say that the f1 score should be 0.0 because none of the predictive probabilities are above a 0.3 threshold.
Is this possible with torchmetrics or sklearn library?
Is this common practice?
You need to threshold your predictions before passing them to your torchmetrics:
# inside a training/validation step; criterion and metrics are defined elsewhere
t0, t1, mask_gt = batch
mask_pred = self.forward(t0, t1)
loss = self.criterion(mask_pred.squeeze().float(), mask_gt.squeeze().float())

# threshold the sigmoid probabilities at 0.5 to get hard predictions
mask_pred = torch.sigmoid(mask_pred).squeeze()
mask_pred = torch.where(mask_pred > 0.5, 1, 0)

# integers to comply with metrics input type
mask_pred = mask_pred.long()
mask_gt = mask_gt.long()

f1_score = self.f1(mask_pred, mask_gt)
precision = self.precision_(mask_pred, mask_gt)
recall = self.recall(mask_pred, mask_gt)
jaccard = self.jaccard(mask_pred, mask_gt)
The torchmetrics are defined as:
self.f1 = F1Score(num_classes=2, average='macro', mdmc_average='samplewise')
self.recall = Recall(num_classes=2, average='macro', mdmc_average='samplewise')
self.precision_ = Precision(num_classes=2, average='macro', mdmc_average='samplewise') # self.precision exists in torch.nn.Module. Hence '_' symbol
self.jaccard = JaccardIndex(num_classes=2)
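Adapting that idea to the question's setup, a minimal sketch using sklearn; the tensors are abbreviated from the question, and zero_division=0 is an assumption to suppress warnings when nothing clears the threshold:

import torch
from sklearn.metrics import f1_score

y_pred = torch.tensor([[0.1944, 0.1931, 0.2184, 0.1968, 0.1973],
                       [0.2182, 0.1932, 0.1945, 0.1973, 0.1968]])
y_true = torch.tensor([[0, 0, 1, 0, 1],
                       [0, 1, 0, 0, 1]])

# keep only predictions above the 0.3 confidence threshold
y_pred_bin = (y_pred > 0.3).long()  # all zeros here: nothing is confident

# sklearn's f1_score accepts multilabel-indicator arrays directly
score = f1_score(y_true.numpy(), y_pred_bin.numpy(),
                 average='macro', zero_division=0)
print(score)  # 0.0, because no class probability cleared 0.3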

Keras model, training data format

I have some data in a CSV. I preprocessed it, and now one row looks like this:
[array([66, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([18, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([26, 34, 9, 41, 19, 23, 29, 30, 1, 0]), array([15, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
So for one line, it's an array of arrays. And I have an array of results: 0 if false and 1 if true.
With Keras and a lot of data, I want the result to be a float between 0 and 1.
For now, Keras gives me this error:
ValueError: setting an array element with a sequence.
So I was thinking that my data is not in the right format. If I take only one column, it works.
Do I have to concatenate all the arrays into one per row, or do I have the wrong Keras model?
Here is my Keras model definition:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

df = dataset.values.tolist()
X = df
y = dataset['result']
X = np.array(X)
y = np.array(y)

model = Sequential()
model.add(Dense(12, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# evaluate the keras model on the dataset
_, accuracy = model.evaluate(X, y)
If I don't convert the list to numpy, I get this error:
Please provide as model inputs either a single array or a list of arrays.
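One likely fix, sketched under the assumption that each row's arrays should become a single flat feature vector of uniform length:

import numpy as np

# one preprocessed row: a list of equal-length arrays
row = [np.array([66, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
       np.array([18, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
       np.array([26, 34, 9, 41, 19, 23, 29, 30, 1, 0]),
       np.array([15, 0, 0, 0, 0, 0, 0, 0, 0, 0])]

flat = np.concatenate(row)  # shape (40,): one fixed-length vector per row
print(flat.shape)

Applied to the whole dataset (X = np.array([np.concatenate(r) for r in df])), X becomes a proper 2D array that Keras can consume, and the first Dense layer then sees 40 input features per sample.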

Not able to use Stratified-K-Fold on a multi-label classifier

The following code is used to do KFold validation, but I am unable to train the model as it throws the error
ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)
My target variable has 7 classes. I am using LabelEncoder to encode the classes into numbers.
Seeing this error, I switched to MultiLabelBinarizer to encode the classes. Now I am getting the following error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
The following is the code for KFold validation
skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
idx = 0
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index+1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = None
    model = load_model()  # defined above
    scores[idx] = train_model(model, xtrain, ytrain, xval, yval)
    idx += 1
print(scores)
print(scores.mean())
I don't know what to do. I want to use Stratified K Fold on my model. Please help me.
MultiLabelBinarizer returns a vector whose length is the number of classes.
If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are trying to pass a target variable with dimensions [n_samples, n_classes].
A stratified split basically preserves your class distribution. And if you think about it, it does not make a lot of sense for a multi-label classification problem.
If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.
UPDATE:
The logic is something like this: assuming you have n classes, your target variable is a combination of these n classes, so there are 2^n - 1 possible combinations (not including all zeros). You can now create a new target variable by treating each combination as a new label.
For example, if n=3, you will have 7 unique combinations:
1. [1, 0, 0]
2. [0, 1, 0]
3. [0, 0, 1]
4. [1, 1, 0]
5. [1, 0, 1]
6. [0, 1, 1]
7. [1, 1, 1]
Map all your labels to this new target variable. You can now look at your problem as simple multi-class classification, instead of multi-label classification.
Now you can directly use StratifiedKFold with y_new as your target. Once the splits are done, you can map your labels back.
Code sample:
import numpy as np
np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[np.where(y.sum(axis=1) != 0)[0]]
OUTPUT:
array([[1, 1, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 1],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 1, 0, 1, 1],
       [0, 0, 1, 0, 0, 1, 1],
       [1, 0, 1, 0, 0, 1, 1],
       [0, 1, 1, 1, 1, 0, 0]])
Label encode your class vectors:
from sklearn.preprocessing import LabelEncoder

def get_new_labels(y):
    y_new = LabelEncoder().fit_transform([''.join(str(l)) for l in y])
    return y_new

y_new = get_new_labels(y)
OUTPUT:
array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
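To close the loop, here is a sketch of using the combination labels with StratifiedKFold and mapping back to the original multilabel rows inside each fold; the data is synthetic so that every combination appears at least twice, which StratifiedKFold requires:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

def get_new_labels(y):
    return LabelEncoder().fit_transform([''.join(str(l)) for l in y])

# synthetic multilabel targets: each combination occurs twice
y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
              [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]])
y_new = get_new_labels(y)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(np.zeros((len(y_new), 1)), y_new):
    y_train, y_val = y[train_idx], y[val_idx]  # back to multilabel rows
    print(y_train.sum(axis=0), y_val.sum(axis=0))  # same class counts per fold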

How to apply a sequential MSE model to data that is not binary?

I have been using this model with binary data to predict the likelihood of play, following this guide.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
model = keras.Sequential()
input_layer = keras.layers.Dense(3, input_shape=[3], activation='tanh')
model.add(input_layer)
output_layer = keras.layers.Dense(1, activation='sigmoid')
model.add(output_layer)
gd = tf.train.GradientDescentOptimizer(0.01)
model.compile(optimizer=gd, loss='mse')
sess = tf.Session() #NEW LINE
training_x = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 0], [-1, 1, 0], [-1, 0, 0], [-1, 0, 1],[0, 0, 1], [1, 1, 0], [1, 0, 0], [-1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [-1, 1, 1]])
training_y = np.array([[0], [0], [1], [1], [1], [0], [1],[0], [1], [1], [1], [1], [1], [0]])
init_op = tf.initializers.global_variables()
sess.run(init_op) #NEW LINE
model.fit(training_x, training_y, epochs=1000, steps_per_epoch = 10)
text_x = np.array([[1, 0, 0]])
test_y = model.predict(text_x, verbose=0, steps=1)
print(test_y)
All the current data is binary and the model works with binary data. Is there any model, or any way to convert non-binary data, to predict the likelihood of product_sold in the dataset below?
dataset:
number_infants  cost_of_infants  estimated_cost_infants  product_sold
5               1000             2000                    0
6               8919             1222                    1
7               10000            891                     1

product_sold: 1 = yes, 0 = no
Edit:
lst = array of the first three columns of the df, e.g.
[[5, 1000, 2000], [6, 8919, 1222]]
lst_1 = array of only the 4th column, e.g.
[[0, 1, 1]]
training_x = np.array(lst)
training_y = np.array(lst_1)
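For reference, a sketch of one way to handle this without binarizing anything: scale the continuous columns and keep a sigmoid output, which already yields a probability for product_sold. The min-max scaling and layer sizes are assumptions, not from the guide:

import numpy as np
from tensorflow import keras

# the three rows from the dataset above
training_x = np.array([[5, 1000, 2000],
                       [6, 8919, 1222],
                       [7, 10000, 891]], dtype=float)
training_y = np.array([[0], [1], [1]])

# min-max scale each column to [0, 1] so no feature dominates
x_min, x_max = training_x.min(axis=0), training_x.max(axis=0)
training_x = (training_x - x_min) / (x_max - x_min)

model = keras.Sequential([
    keras.layers.Dense(8, input_shape=[3], activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # P(product_sold = 1)
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(training_x, training_y, epochs=100, verbose=0)
print(model.predict(training_x))  # floats between 0 and 1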

How to calculate pos tagger accuracy by each tag

I've made a Keras POS tagger model using the NLP4Hackers article as a baseline. Currently I can compute the accuracy directly with the Keras model.evaluate method. Actually, I would like to calculate the accuracy per tag, as shown below:
'JJ':  98.56 accuracy,
'NNS': 99.01 accuracy,
'NN':  96.43 accuracy,
...
Any suggestion will be appreciated.
Thank you.
All the evaluation metrics you could imagine are in scikit-learn.
You have two possibilities. Either you compute the confusion matrix and look at the diagonal values:
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
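To turn that diagonal into per-tag accuracy, divide it by the row sums (the row sums are the true counts per tag); a small sketch:

from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
per_tag_acc = cm.diagonal() / cm.sum(axis=1)  # correct / total true, per tag
for tag, acc in zip(labels, per_tag_acc):
    print(f"{tag}: {acc:.2%}")  # ant: 100.00%, bird: 0.00%, cat: 66.67%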
Or you compute the F1 score label by label:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average=None)
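This returns one score per label, array([0.8, 0. , 0. ]) for the example above, which you can then zip with your tag names.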
