How to feed Bert embeddings to LSTM - keras

I am working on a BERT + MLP model for a text classification problem. Essentially, I am trying to replace the MLP model with a basic LSTM model.
Is it possible to create an LSTM that takes the BERT embeddings directly, or is it better to create an LSTM with an embedding layer?
More specifically, I am having a hard time building the embedding matrix from the BERT embeddings so that I can create the embedding layer.
def get_bert_embeddings(dataset='gap_corrected_train',
                        dataset_path=TRAIN_PATH,
                        bert_path=BERT_UNCASED_LARGE_PATH,
                        bert_layers=BERT_LAYERS):
    """Get BERT embeddings for all files in dataset_path and specified BERT layers and write them to file."""
    df = None
    for file in os.listdir(dataset_path):
        if df is None:
            df = pd.read_csv(dataset_path + '/' + file, sep='\t')
        else:
            next_df = pd.read_csv(dataset_path + '/' + file, sep='\t')
            df = pd.concat([df, next_df], axis=0)
            df.reset_index(inplace=True, drop=True)

    for i, layer in enumerate(bert_layers):
        embeddings_file = INTERIM_PATH + 'emb_bert' + str(layer) + '_' + dataset + '.h5'
        if not os.path.exists(embeddings_file):
            print('Embeddings file: ', embeddings_file)
            print('Extracting BERT Layer {0} embeddings for {1}...'.format(layer, dataset))
            print("Started at ", time.ctime())

            emb = get_bert_token_embeddings(df, bert_path, layer)
            emb.to_hdf(embeddings_file, 'table')

            print("Finished at ", time.ctime())
def build_mlp_model(input_shape):
    input_layer = layers.Input(input_shape)
    input_features = layers.Input((len(FEATURES),))

    x = layers.Concatenate(axis=1, name="concate_layer")([input_layer, input_features])

    x = layers.Dense(HIDDEN_SIZE, name='dense1')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(DROPOUT, seed=RANDOM)(x)

    x = layers.Dense(HIDDEN_SIZE//2, name='dense2')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(DROPOUT//2, seed=RANDOM)(x)

    x = layers.Dense(HIDDEN_SIZE//4, name='dense3')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(DROPOUT//2, seed=RANDOM)(x)

    output_layer = layers.Dense(3, name='output',
                                kernel_regularizer=regularizers.l2(LAMBDA))(x)
    output_layer = layers.Activation('softmax')(output_layer)

    model = models.Model(inputs=[input_layer, input_features], outputs=output_layer, name="mlp")
    return model

You can build a model that starts with an Embedding layer, followed by an LSTM and then a Dense layer. For example:
deep_inputs = Input(shape=(length_of_your_data,))
embedding_layer = Embedding(vocab_size, output_dim = 3000, trainable=True)(deep_inputs)
LSTM_Layer_1 = LSTM(512)(embedding_layer)
dense_layer_1 = Dense(number_of_classes, activation='softmax')(LSTM_Layer_1)
model_AdGroups = Model(inputs=deep_inputs, outputs=dense_layer_1)
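If you want to feed precomputed BERT embeddings rather than train an Embedding layer, you can also skip the Embedding layer entirely and pass the token vectors as a 3D tensor of shape (batch, sequence length, embedding dimension) straight into the LSTM. A minimal sketch, where max_seq_len and bert_dim are placeholders for your own sequence length and BERT hidden size (1024 for BERT-large):
from tensorflow.keras import layers, models

max_seq_len = 128   # placeholder sequence length
bert_dim = 1024     # hidden size of BERT-large; adjust to your embeddings

# Precomputed BERT token embeddings go in directly, so no Embedding layer is needed.
emb_input = layers.Input(shape=(max_seq_len, bert_dim), name='bert_embeddings')
x = layers.LSTM(128)(emb_input)                      # the LSTM reads the token vectors
output = layers.Dense(3, activation='softmax')(x)    # 3 classes, as in the MLP above

model = models.Model(inputs=emb_input, outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy')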

Related

LSTM multi-sequence input to one sequence output

I am new to neural networks and am currently trying to make an LSTM model that predicts an output sequence based on multiple parameters. Excuse my ignorance in advance.
I have obtained training and validation datasets, which look somewhat like the following:
For every ID, four rows are recorded, with columns holding certain parameters and the corresponding Y output. In practice there are thus ~122,000 / 4 = ~30,500 samples (I mistakenly gave 122,000 as the number of IDs; it is in fact the number of rows). Since the parameter values and the corresponding Y values follow temporal patterns, I am interested in whether a model such as an LSTM improves the prediction.
I want to predict the Y in my validation dataset (~73,000 / 4 = ~18,000 samples) based on the temporal patterns of the parameters. But is this possible? Most tutorials I followed use a single sequence, for which an LSTM is used to extend a similar input sequence. I therefore want an LSTM with 'multi-sequence' input that outputs one sequence. How do I go about this?
I'm using PyTorch as the framework. Below is a simple LSTM model I created from a tutorial, which does not incorporate the parameters:
training_y = traindf.reset_index()['Y']
validation_y = traindf.reset_index()['Y']
Then create a dataset for this:
class YDataset(Dataset):
    def __init__(self, data, seq_len=100):
        self.data = torch.from_numpy(data).float().view(-1)
        self.seq_len = seq_len

    def __len__(self):
        return len(self.data) - self.seq_len - 1

    def __getitem__(self, index):
        return self.data[index : index + self.seq_len], self.data[index + self.seq_len]


train_y = YDataset(training_y_df)
vali_y = YDataset(validation_y_df)
batch_size = 64

train_dataloader = DataLoader(train_y, batch_size, drop_last=True)
vali_dataloader = DataLoader(vali_y, batch_size, drop_last=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"
Then create the model:
class Lstm_model(nn.Module):
    def __init__(self, input_dim, hidden_size, num_layers):
        super(Lstm_model, self).__init__()
        self.num_layers = num_layers
        self.input_size = input_dim
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x, hn, cn):
        out, (hn, cn) = self.lstm(x, (hn, cn))
        final_out = self.fc(out[-1])
        return final_out, hn, cn

    def predict(self, x):
        hn, cn = self.init()
        out, (hn, cn) = self.lstm(x, (hn, cn))  # run the LSTM before the final layer
        final_out = self.fc(out[-1])
        return final_out

    def init(self):
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        return h0, c0


input_dim = 1
hidden_size = 50
num_layers = 3

model = Lstm_model(input_dim, hidden_size, num_layers).to(device)
Loss function and training loop (more or less the same as for validation):
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(dataloader):
    hn, cn = model.init()
    model.train()
    for batch, item in enumerate(dataloader):
        x, y = item
        x = x.to(device)
        y = y.to(device)
        out, hn, cn = model(x.reshape(100, batch_size, 1), hn, cn)
        loss = loss_fn(out.reshape(batch_size), y)
        hn = hn.detach()
        cn = cn.detach()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch == len(dataloader) - 1:
            loss = loss.item()  # .item() is a method call
            print(f"train loss: {loss:>7f} ")
Epochs and loss metrics:
epochs = 200  # takes really long for me
for epoch in range(epochs):
    print(f"epoch {epoch} ")
    train(train_dataloader)
    test(vali_dataloader)
Final metrics:
import math
from sklearn.metrics import mean_squared_error
import numpy as np

def calculate_metrics(data_loader):
    pred_arr = []
    y_arr = []
    with torch.no_grad():
        hn, cn = model.init()
        for batch, item in enumerate(data_loader):
            x, y = item
            x, y = x.to(device), y.to(device)
            x = x.view(100, 64, 1)
            pred = model(x, hn, cn)[0]
            pred = scalar.inverse_transform(pred.detach().cpu().numpy().reshape(-1))
            y = scalar.inverse_transform(y.detach().cpu().numpy().reshape(1, -1)).reshape(-1)
            pred_arr = pred_arr + list(pred)
            y_arr = y_arr + list(y)
    return math.sqrt(mean_squared_error(y_arr, pred_arr))
I used this code more as an example of how an LSTM would work. Nevertheless, I don't know if this is the right track for me. Does someone know what I should do, or a tutorial that works for my example? Thanks in advance!
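For reference, one common way to handle a 'multi-sequence' input like the one described above is to treat the parameters as input features, so each sample is a tensor of shape (sequence length, number of features) and nn.LSTM is built with input_size equal to that feature count. A minimal sketch with made-up sizes (4 rows per ID, 5 hypothetical parameters):
import torch
import torch.nn as nn

seq_len, num_features, batch = 4, 5, 8   # hypothetical sizes: 4 rows per ID, 5 parameters

class MultiFeatureLSTM(nn.Module):
    def __init__(self, num_features, hidden_size=50):
        super().__init__()
        # batch_first=True means inputs are (batch, seq_len, num_features)
        self.lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)           # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :])   # predict Y from the last time step

model = MultiFeatureLSTM(num_features)
x = torch.randn(batch, seq_len, num_features)   # dummy batch of parameter sequences
print(model(x).shape)                           # torch.Size([8, 1])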

Loss Not Decreasing for a Bert from Scratch PyTorch Model

I followed Aladdin Persson's YouTube video to code up just the encoder portion of the transformer model in PyTorch, except that I used PyTorch's built-in multi-head attention layer. The model seems to produce data of the correct shape. However, during training the training loss does not drop, and the resulting model always predicts the same output of 0.4761. The dataset used for training is the Sarcasm Detection Dataset from Kaggle. I would appreciate any help you can give on errors that I have made.
import pandas as pd
from transformers import BertTokenizer
import torch.nn as nn
import torch
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import ReduceLROnPlateau
import math
df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded_input = tokenizer(df['headline'].tolist(), return_tensors='pt',padding=True)
X = encoded_input['input_ids']
y = torch.tensor(df['is_sarcastic'].values).float()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
torch.cuda.empty_cache()
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout, expansion_ratio):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, expansion_ratio*embed_dim),
            nn.ReLU(),
            nn.Linear(expansion_ratio*embed_dim, embed_dim)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query):
        attention, _ = self.attention(value, key, query)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out
class Encoder(nn.Module):
    # the vocab size is one more than the max value in the X matrix.
    def __init__(self, vocab_size=30109, embed_dim=128, num_layers=1, num_heads=4, device="cpu",
                 expansion_ratio=4, dropout=0.1, max_length=193):
        super(Encoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_length, embed_dim)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(embed_dim, num_heads, dropout, expansion_ratio) for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier1 = nn.Linear(embed_dim, embed_dim)
        self.classifier2 = nn.Linear(embed_dim, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            # print(out.shape)
            out = layer(out, out, out)
        # Get the first output for classification.
        # Pooled output from Hugging Face is the last-layer hidden state of the first token of the
        # sequence (classification token), further processed by a Linear layer and a Tanh activation.
        # It will therefore differ from out[:, 0, :], which is the raw output at the CLS token.
        out = self.relu(self.classifier1(out[:, 0, :]))
        out = self.classifier2(out)
        return out
torch.cuda.empty_cache()

net = Encoder(device=device)
net.to(device)

batch_size = 32
num_train_samples = X_train.shape[0]
num_val_samples = X_test.shape[0]

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)
scheduler = ReduceLROnPlateau(optimizer, 'min', patience=5)

val_loss_hist = []
loss_hist = []
epoch = 0
min_val_loss = math.inf

print("Training Started")
patience = 0
for _ in range(100):
    epoch += 1
    net.train()
    epoch_loss = 0
    permutation = torch.randperm(X_train.size()[0])
    for i in range(0, X_train.size()[0], batch_size):
        indices = permutation[i:i+batch_size]
        features = X_train[indices].to(device)
        labels = y_train[indices].reshape(-1, 1).to(device)

        output = net.forward(features)
        loss = criterion(output, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    epoch_loss = epoch_loss / num_train_samples * num_val_samples
    loss_hist.append(epoch_loss)

    # print("Eval")
    net.eval()
    epoch_val_loss = 0
    permutation = torch.randperm(X_test.size()[0])
    for i in range(0, X_test.size()[0], batch_size):
        indices = permutation[i:i+batch_size]
        features = X_test[indices].to(device)
        labels = y_test[indices].reshape(-1, 1).to(device)

        output = net.forward(features)
        loss = criterion(output, labels)
        epoch_val_loss += loss.item()

    val_loss_hist.append(epoch_val_loss)
    scheduler.step(epoch_val_loss)

    # if epoch % 5 == 0:
    print("Epoch: " + str(epoch) + " Train Loss: " + format(epoch_loss, ".4f") +
          ". Val Loss: " + format(epoch_val_loss, ".4f") +
          " LR: " + str(optimizer.param_groups[0]['lr']))

    if epoch_val_loss < min_val_loss:
        min_val_loss = epoch_val_loss
        torch.save(net.state_dict(), "torchmodel/weights_best.pth")
        print('\033[93m' + "Model Saved" + '\033[0m')
        patience = 0
    else:
        patience += 1
    if patience == 10:
        break

print("Training Ended")

Custom noise layer in Keras DQN model

I'm trying to implement noisy networks in my DDDQN model as a replacement for epsilon-greedy exploration, as described in this paper: https://arxiv.org/pdf/1706.10295.pdf
But I'm not sure how to implement my own noisy layer in Keras. So far I have tried this by using Lambda layers for two hidden layers as well as for my output layers:
def create_model(self):
    input_node = tf.keras.Input(shape=(STACK_SIZE, ENVIROMENT_OBSERVATION_SPACE))
    input_layer = input_node

    # define state value function
    out = GRU(64, return_sequences=True, stateful=False, activation='tanh')(input_layer)
    out = Dropout(0.2)(out)
    out = GRU(32, return_sequences=False, stateful=False, activation='tanh')(out)
    out = Dropout(0.2)(out)
    out = Lambda(self.noisy_dense(12, out))
    out = Activation('relu')(out)
    out = Lambda(self.noisy_dense(8, out))
    out = Activation('relu')(out)
    state_value = Lambda(self.noisy_dense(1, out))
    state_value = Lambda(lambda s: K.expand_dims(s[:, 0], axis=-1), output_shape=(ACTION_SPACE,))(state_value)

    # define action advantage
    action_advantage = Lambda(self.noisy_dense(ACTION_SPACE, out))
    action_advantage = Lambda(lambda a: a[:, :] - K.mean(a[:, :], keepdims=True), output_shape=(ACTION_SPACE,))(action_advantage)

    # merge by adding
    Q = tf.keras.layers.add([state_value, action_advantage])

    # define model
    model = tf.keras.Model(inputs=input_node, outputs=Q)

    # Model compile settings:
    opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)

    # Compile model
    model.compile(
        loss='categorical_crossentropy',
        optimizer=opt,
        metrics=['accuracy']
    )
    print(model.summary())
    return model

def noisy_dense(self, units, input):
    w_shape = [units, input.shape[1]]
    mu_w = tf.Variable(initial_value=tf.random.truncated_normal(shape=w_shape))
    sigma_w = tf.Variable(initial_value=tf.constant(0.017, shape=w_shape))
    epsilon_w = tf.random.uniform(shape=w_shape)

    b_shape = [units]
    mu_b = tf.Variable(initial_value=tf.random.truncated_normal(shape=b_shape))
    sigma_b = tf.Variable(initial_value=tf.constant(0.017, shape=b_shape))
    epsilon_b = tf.random.uniform(shape=b_shape)

    w = tf.add(mu_w, tf.multiply(sigma_w, epsilon_w))
    b = tf.add(mu_b, tf.multiply(sigma_b, epsilon_b))

    return tf.matmul(input, tf.transpose(w)) + b
But when I run this code, I get an error in my first Lambda layer: "TypeError: Unsupported callable".
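One way to express a noisy layer in Keras (a rough sketch of the independent-Gaussian variant, not the paper's full factorised-noise scheme, and not a confirmed fix for the error above) is to subclass tf.keras.layers.Layer instead of wrapping a tensor in Lambda, so the mu/sigma parameters become trainable weights and fresh noise is drawn on each call:
import tensorflow as tf

class NoisyDense(tf.keras.layers.Layer):
    """Noisy linear layer: y = x @ (mu_w + sigma_w*eps_w) + (mu_b + sigma_b*eps_b)."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.mu_w = self.add_weight("mu_w", shape=(in_dim, self.units),
                                    initializer=tf.keras.initializers.TruncatedNormal())
        self.sigma_w = self.add_weight("sigma_w", shape=(in_dim, self.units),
                                       initializer=tf.keras.initializers.Constant(0.017))
        self.mu_b = self.add_weight("mu_b", shape=(self.units,),
                                    initializer=tf.keras.initializers.TruncatedNormal())
        self.sigma_b = self.add_weight("sigma_b", shape=(self.units,),
                                       initializer=tf.keras.initializers.Constant(0.017))

    def call(self, inputs):
        eps_w = tf.random.normal(tf.shape(self.mu_w))   # fresh noise each forward pass
        eps_b = tf.random.normal(tf.shape(self.mu_b))
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_b
        return tf.matmul(inputs, w) + b

# Usage inside create_model, e.g. out = NoisyDense(12)(out) instead of Lambda(self.noisy_dense(12, out)).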

ValueError: Error when checking input: expected embedding_1_input to have shape (4,) but got array with shape (1,)

I was working on seq2seq translation and got stuck here:-
def createModel(engVocab, frVocab, size, englishMaxlength, frenchMaxLength):
    model = Sequential()
    model.add(Embedding(input_dim=engVocab, output_dim=size, input_length=englishMaxlength, mask_zero=True))
    model.add(LSTM(units=size))
    model.add(RepeatVector(frenchMaxLength))
    model.add(LSTM(units=size, return_sequences=True))
    model.add(TimeDistributed(Dense(frenchVocabsize, activation='softmax')))
    return model

def DataGenerator(trainingDataEnglish, trainingDataFrench):
    while True:
        l = len(trainingDataFrench)
        for i in range(l):
            yield (trainingDataEnglish[i], trainingDataFrench[i])
I created my test and training data as follows:-
def encodeSequences(trainingData, tokenizer, maxlength):
    encoder = tokenizer.texts_to_sequences(trainingData)
    encoder = pad_sequences(encoder, maxlen=maxlength, padding='pre')
    return encoder

def encodeOutput(testData, vocabSize):
    y = []
    for sequence in testData:
        Seq = to_categorical(sequence, num_classes=vocabSize)
        y.append(Seq)
    y = np.array(y)
    return y
samples = 7000
trainingSize = 6000
trainEng = english[:trainingSize] #array of strings
trainFr = french[:trainingSize] #array of strings
testEng = english[trainingSize:samples] #array of strings
testFr = french[trainingSize:samples] #array of strings
englishTokenizer = createTokenizer(trainEng)
frenchTokenizer = createTokenizer(trainFr)
englishVocabSize = len(englishTokenizer.word_index) + 1
The use of encodeSequences and encodeOutput is as follows:-
trainX = encodeSequences(trainEng, englishTokenizer, englishMaxlength)
trainY = encodeSequences(trainFr, frenchTokenizer, frenchMaxLength)
trainY = encodeOutput(trainY, frenchVocabsize)
testX = encodeSequences(testEng, englishTokenizer, englishMaxlength)
testY = encodeSequences(testFr, frenchTokenizer, frenchMaxLength)
testY = encodeOutput(testY, frenchVocabsize)
And finally :-
model = createModel(engVocab = englishVocabSize, frVocab = frenchVocabsize, size = 256, englishMaxlength = englishMaxlength, frenchMaxLength = frenchMaxLength)
print(model.summary())
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy')
steps = len(trainX)
generator = DataGenerator(trainX, trainY)
model.fit_generator(generator, epochs = epochs, steps_per_epoch = steps, validation_data = (testX, testY))
model.save('Model.h5')
And I get the following error:-
ValueError: Error when checking input: expected embedding_1_input to have shape (4,) but got array with shape (1,)
How do I fix this?
Where did I go wrong?
Please help.
Thanks in advance.
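For reference, Keras generators are expected to yield whole batches (arrays with a leading batch dimension), while the DataGenerator above yields one unbatched sample pair at a time, which is one plausible source of the shape mismatch. A hedged sketch, reusing the variable names from above:
import numpy as np

def DataGenerator(trainingDataEnglish, trainingDataFrench, batch_size=64):
    n = len(trainingDataEnglish)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            # Each yielded array keeps its batch dimension: (batch, timesteps[, features]).
            yield (np.array(trainingDataEnglish[start:end]),
                   np.array(trainingDataFrench[start:end]))

generator = DataGenerator(trainX, trainY, batch_size=64)
steps = int(np.ceil(len(trainX) / 64))   # steps_per_epoch counts batches, not samples
model.fit_generator(generator, epochs=epochs, steps_per_epoch=steps,
                    validation_data=(testX, testY))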

Why do I get a low accuracy in this neural network (tensorflow)?

I have made a convolutional neural network with TensorFlow; I've trained it and tested it (about 98% accuracy). I saved the model with
saver = tf.train.Saver()
saver.save(sess, 'model.ckpt')
Then I restored it with the saver, but I always get an accuracy lower than 50%... why?
Here's the code:
import tensorflow as tf
import matplotlib.pyplot as plt
import pickle
import numpy as np

with open('X_train.pickle', 'rb') as y:
    u = pickle._Unpickler(y)
    u.encoding = 'latin1'
    X_train = u.load()

with open('X_test.pickle', 'rb') as y:
    u = pickle._Unpickler(y)
    u.encoding = 'latin1'
    X_test = u.load()
X_test = np.array(X_test).reshape(-1, 2500)

with open('y_train.pickle', 'rb') as y:
    u = pickle._Unpickler(y)
    u.encoding = 'latin1'
    y_train = u.load()

with open('y_test.pickle', 'rb') as y:
    u = pickle._Unpickler(y)
    u.encoding = 'latin1'
    y_test = u.load()

n_classes = 3
batch_size = 100

x = tf.placeholder('float', [None, 2500])
y = tf.placeholder('float')

keep_rate = 0.8
keep_prob = tf.placeholder(tf.float32)
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME')

def maxpool2d(x):
    # size of window / movement of window
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

def convolutional_neural_network(x):
    weights = {'W_conv1': tf.Variable(tf.random_normal([5,5,1,32])),
               'W_conv2': tf.Variable(tf.random_normal([5,5,32,64])),
               'W_fc': tf.Variable(tf.random_normal([13*13*64,1024])),
               'out': tf.Variable(tf.random_normal([1024, n_classes]))}

    biases = {'b_conv1': tf.Variable(tf.random_normal([32])),
              'b_conv2': tf.Variable(tf.random_normal([64])),
              'b_fc': tf.Variable(tf.random_normal([1024])),
              'out': tf.Variable(tf.random_normal([n_classes]))}

    x = tf.reshape(x, shape=[-1, 50, 50, 1])

    conv1 = tf.nn.relu(conv2d(x, weights['W_conv1']) + biases['b_conv1'])
    conv1 = maxpool2d(conv1)

    conv2 = tf.nn.relu(conv2d(conv1, weights['W_conv2']) + biases['b_conv2'])
    conv2 = maxpool2d(conv2)

    fc = tf.reshape(conv2, [-1, 13*13*64])
    fc = tf.nn.relu(tf.matmul(fc, weights['W_fc']) + biases['b_fc'])
    #fc = tf.nn.dropout(fc, keep_rate)

    output = tf.matmul(fc, weights['out']) + biases['out']
    return output
def use_neural_network(input_data):
    prediction = convolutional_neural_network(x)
    sess.run(tf.global_variables_initializer())

    result = sess.run(tf.argmax(prediction.eval(feed_dict={x: [input_data]}), 1))

    correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
    print('Accuracy:', accuracy.eval({x: X_test, y: y_test}))

    return result

with tf.Session() as sess:
    c = convolutional_neural_network(x)
    saver = tf.train.Saver()
    saver.restore(sess, "model.ckpt")

    sample = X_train[432].reshape(2500)
    res = use_neural_network(sample)
    if res == [0]: print('Go straight')
    elif res == [1]: print('Turn right')
    else: print('Turn left')
    img = sample.reshape(50,50)
    plt.imshow(img)
    plt.show()

    sample = X_train[1222].reshape(2500)
    res = use_neural_network(sample)
    if res == [0]: print('Go straight')
    elif res == [1]: print('Turn right')
    else: print('Turn left')
    img = sample.reshape(50,50)
    plt.imshow(img)
    plt.show()

    sample = X_train[2986].reshape(2500)
    res = use_neural_network(sample)
    if res == [0]: print('Go straight')
    elif res == [1]: print('Turn right')
    else: print('Turn left')
    img = sample.reshape(50,50)
    plt.imshow(img)
    plt.show()
The problem can't be overfitting, since I'm testing it with elements of the training dataset...
I'm quite sure that the problem is the saver, but I can't figure out how to solve it...
When you train a model using TensorFlow, make sure you are using TensorFlow version 1.0 or above. Once you have trained a model with a recent version, three files are created, named as follows:
modelname.data
This is a TensorBundle collection that saves the values of all variables.
modelname.index
The .index file stores the list of the saved variable names and shapes.
modelname.meta
This file describes the saved graph structure and includes the GraphDef, SaverDef, and so on.
To reload/restore your model, use model.load(modelname); it not only loads your model but also keeps the accuracy from fluctuating.
Note: please consider TFLearn. TFLearn introduces a high-level API that makes building and training neural networks fast and easy. For more detail visit http://tflearn.org/getting_started/
The simple and generalized way of building and using a CNN with TensorFlow is as follows:
Construct the network:
Here you will create n convolution and max-pool layers plus fully connected layers, apply whatever activation function you want, and return your model object.
Train the model:
Fit your training data to your model using model.fit(X, Y).
Save the model:
Save your model using model.save(modelName).
Reload the model:
Reload your model using model.load(modelName).
This is the generic and simplified way to build and use a CNN; a short sketch follows below.
Hope it may help you :)
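As a rough illustration of the TFLearn workflow described above (a sketch only, not tested; the layer sizes mirror the 50x50, 3-class setup from the question, and X_train/y_train are assumed to be NumPy arrays):
import tflearn
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression

# Construct the network (50x50 grayscale input, 3 classes, as in the question).
net = input_data(shape=[None, 50, 50, 1])
net = conv_2d(net, 32, 5, activation='relu')
net = max_pool_2d(net, 2)
net = conv_2d(net, 64, 5, activation='relu')
net = max_pool_2d(net, 2)
net = fully_connected(net, 1024, activation='relu')
net = fully_connected(net, 3, activation='softmax')
net = regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net)
model.fit(X_train.reshape(-1, 50, 50, 1), y_train, n_epoch=10)  # train the model
model.save('cnn.tfl')   # save the model
model.load('cnn.tfl')   # reload it later; the restored accuracy matches training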
