Unable to create custom dataset and dataloader using torchtext - python-3.x

I have a question about building a custom dataset and iterator using torchtext. I used the following code, found in this post, and adapted it to my case:
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text_field = Field(sequential=True, eos_token="[CLS]", tokenize=tokenizer)
label_field = Field(sequential=False, use_vocab=False)
data_fields = [("file", None),
("text", text_field),
("label", label_field)]
train, val = train_test_split(input_dt, test_size=0.1)
train.to_csv("train_output_path", index=False)
val.to_csv("val_output_path", index=False)
train, val = TabularDataset(path="path", train="train.csv", validation="val.csv",
format="csv", skip_header=True, fields=data_fields)
When I call text_field.build_vocab(train), I get this error: TypeError: '<' not supported between instances of 'list' and 'int'.
The only difference between my code and the post is the pre-trained word embeddings. In the post, the author used GloVe, whereas I use XLNetTokenizer from the transformers package. I also searched for other posts that used a similar method, but they all used pre-trained word embeddings, so they did not have this issue.
Does anyone know how to fix this issue? Many thanks!

I think that, since you are using a predefined tokenizer, you don't need to build a vocab. Instead, you can follow these steps. Here is an example of how to do it using the BERT tokenizer.
sentences: a list of text data
labels: the labels associated with the sentences
import torch
from transformers import BertTokenizer
# tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []
# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    # (1) Tokenize the sentence.
    # (2) Prepend the `[CLS]` token to the start.
    # (3) Append the `[SEP]` token to the end.
    # (4) Map tokens to their IDs.
    # (5) Pad or truncate the sentence to `max_length`
    # (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
        sent,                          # Sentence to encode.
        add_special_tokens = True,     # Add '[CLS]' and '[SEP]'
        max_length = 100,              # Pad & truncate all sentences.
        padding = 'max_length',        # Newer spelling of the deprecated `pad_to_max_length = True`.
        truncation = True,             # Cut sentences longer than `max_length`.
        return_attention_mask = True,  # Construct attn. masks.
        return_tensors = 'pt',         # Return pytorch tensors.
    )
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])
### Now combine the input IDs, masks and labels, and split the dataset
from torch.utils.data import TensorDataset, random_split
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)
# Create a 90-10 train-validation split.
# Calculate the number of samples to include in each set.
train_size = int(0.90 * len(dataset))
val_size = len(dataset) - train_size
# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
### Now create the data loaders for these datasets
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
train_dataset, # The training samples.
sampler = RandomSampler(train_dataset), # Select batches randomly
batch_size = batch_size # Trains with this batch size.
)
# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
val_dataset, # The validation samples.
sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
batch_size = batch_size # Evaluate with this batch size.
)
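Since the question asks about a custom dataset, here is a minimal sketch of the same idea written as a `torch.utils.data.Dataset` subclass instead of `TensorDataset`. The class name `EncodedTextDataset` and the dummy tensors are assumptions for illustration; in practice the tensors would be the `input_ids`, `attention_masks` and `labels` built above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class EncodedTextDataset(Dataset):
    """Hypothetical wrapper around already-encoded sentences."""

    def __init__(self, input_ids, attention_masks, labels):
        # All three tensors must share the same first dimension (number of samples).
        assert len(input_ids) == len(attention_masks) == len(labels)
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Returning a dict lets the default collate function batch each field by name.
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_masks[idx],
            "label": self.labels[idx],
        }


# Dummy data standing in for real tokenizer output: 10 samples, 100 token IDs each.
input_ids = torch.randint(0, 1000, (10, 100))
attention_masks = torch.ones(10, 100, dtype=torch.long)
labels = torch.randint(0, 2, (10,))

dataset = EncodedTextDataset(input_ids, attention_masks, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
first_batch = next(iter(loader))
```

Each batch is then a dict of stacked tensors, which tends to be easier to pass to a model by keyword than the positional tuple a `TensorDataset` yields.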

Related

Results from PyTorch tutorial using Google Colab not matching results in PyCharm

I'm following a tutorial on YouTube for PyTorch which uses torch.manual_seed to ensure results are the same, or at least in the same ballpark.
Now admittedly I'm no expert but on running the code in chapter 2 of the tutorial the resulting graph from my model seems way off from what it should be.
I've tried going through the code line by line but for the last 3 days I can't seem to find any differences between my code and the code used in the tutorial (other than variable names which I changed for clarity on my part and so I'm not just mindlessly copying).
I work a pretty busy menial job with variable work days, so I don't get a lot of time off, but I've spent 10 'off days' across a month trying to solve this and I just can't see it. Genuinely, any help would be appreciated. Even if it's an error on my part, I would be alright with that being stated without saying what it is; I just want to know if I've done anything wrong at all.
Here's a link to the doc file for the tutorial if that helps:
https://www.learnpytorch.io/02_pytorch_classification/#31-going-from-raw-model-outputs-to-predicted-labels-logits-prediction-probabilities-prediction-labels
Here's my code:
`from sklearn.datasets import make_circles
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split as tts
import torch
from torch import nn
from helper_functions import plot_predictions, plot_decision_boundary
import os
from pathlib import Path
# Generate 1000 samples
sample_number = 1000
# Create circles
Features, labels = make_circles(sample_number,
noise=0.03, # <- adds a little noise to the dots
random_state=42) # <- sets random state seed for consistency
# View the first 5 values for both parameters
print(f'First 5 Features (X):\n{Features[:5]}\n'
f'First 5 labels (y):\n{labels[:5]}')
# Make a DataFrame of circle data
circles = pd.DataFrame({"inputType1": Features[:, 0], # <- everything in the 0th index is type 1
"inputType2": Features[:, 1], # <- everything in the 1st index is type 2
"output": labels
})
print(f'Created dataframe:\n'
f'{circles.head(10)}')
# Check the different labels
print(f'Number of value per class:\n'
f'{circles.output.value_counts()}')
# Visualise the dataframe
plt.scatter(x=Features[:, 0],
y=Features[:, 1],
c=labels,
cmap=plt.cm.RdYlBu)
# Display plot
# plt.show()
# Check the shapes of the features and labels
# ML deals with numerical representation
# Ensuring the input and output shapes are compatible is crucial
print(f'Circle shapes: {Features.shape, labels.shape}')
# View the first example of features and labels
Features_samples = Features[0]
labels_samples = labels[0]
print(f'Values for one sample of X: {Features_samples} and the same for y: {labels_samples}')
print(f'Shapes for one sample of X: {Features_samples.shape}'
f'\nand the same for y: {labels_samples.shape}')
# ^ Features dataset has 1000 samples with two feature classes
# ^ labels dataset has 1000 samples with no feature classes since it's a scalar
# Turning datasets into tensors
Features = torch.from_numpy(Features).type(torch.float)
labels = torch.from_numpy(labels).type(torch.float)
# View the first five samples
print(f'First 5 Features:\n'
f'{Features[:5]}\n'
f'First 5 labels:\n'
f'{labels[:5]}\n')
# Split data into train and test sets
input_data_train, input_data_test, model_output_train, model_output_test = tts(Features,
labels,
test_size=0.2,
random_state=42)
# Check that splits follow this pattern:
# 80% train, 20% test
print(f'Number of samples for input train:\n'
f'{len(input_data_train)}\n'
f'Number of samples for input test:\n'
f'{len(input_data_test)}\n'
f'Number of samples for output train:\n'
f'{len(model_output_train)}\n'
f'Number of samples for output test:\n'
f'{len(model_output_test)}\n')
# Begin building learning model
# Make device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'Learning model processing on: {device}\n')
"""# Assign parameters to the device
input_data_train = input_data_train.to(device)
input_data_test = input_data_test.to(device)
model_output_train = model_output_train.to(device)
model_output_test = model_output_test.to(device)"""
# 1. Construct a model class that subclasses nn.Module
class CircleClassificationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers for handling Feature and labels, input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5)  # <- inputs 2 Features, outputs 5
        self.layer_2 = nn.Linear(in_features=5, out_features=1)  # <- inputs 5 Features, outputs 1 label
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, which is the same shape as the label
        # Computation goes through layer_1 first
        # The output of layer_1 goes through layer_2
        return self.layer_2(self.layer_1(x))
# 4. Create an instance of the model and send it to target device
classification_model_0 = CircleClassificationModel().to(device)
# Display model parameters
print(f'Model parameters for self defined model:\n'
f'{classification_model_0}\n')
# The above code can be written more succinctly using nn.Sequential
# Implements two layers of nn.Linear()
# Which calls the following equation
# y = ( x * (Weights).transposed ) + bias
classification_model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
# Display model parameters
print(f'Model (nn.Sequential) parameters:\n'
f'{classification_model_0}\n\n')
# Make predictions with the model
untrained_predictions = classification_model_0(input_data_test.to(device))
print(f'Length of predictions: {len(untrained_predictions)}, Shape: {untrained_predictions.shape}')
print(f'Length of test samples: {len(model_output_test)}, Shape: {model_output_test.shape}')
print(f'\nFirst 10 predictions:\n'
f'{untrained_predictions[:10]}')
print(f'\nFirst 10 test labels:\n'
f'{model_output_test[:10]}')
# Create a loss function
# Unlike the regression model, the classification model uses a different loss type
# Binary Cross Entropy will be used for this task
# torch.nn.BCELoss() - measures the BCE between the target(labels) and the input(features)
# Another version may be used:
# torch.nn.BCEWithLogitsLoss() - same, except it has a built-in Sigmoid function
# loss_fn = nn.BCELoss() # <- BCELoss = no sigmoid built-in
loss_function = nn.BCEWithLogitsLoss() # <- BCEWithLogitsLoss = sigmoid built-in
# Create an optimiser
optimiser = torch.optim.SGD(params=classification_model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
# This acts as an evaluation metric
# Offers perspective into how the model is going
# The loss function measures how wrong the model is, but
# Evaluation metrics measure how right it is
# Accuracy will be the first metric to be utilised
# Accuracy can be measured by dividing the total number of correct predictions
# By the total number of overall predictions
def accuracy_function(label_actual, label_predicted):
    # Calculates where 2 tensors are equal
    correct = torch.eq(label_actual, label_predicted).sum().item()
    accuracy = (correct / len(label_predicted)) * 100
    return accuracy
# View the first 5 results of the forward pass on test data
# labels_logits represents the output of the forward pass method above
# Which utilises two nn.Linear() layers
labels_logits = classification_model_0(input_data_test.to(device))[:5]
print(f'First 5 outputs of the forward pass:\n'
f'{labels_logits}')
# Use the sigmoid function on the model labels_logits
# Turns the output of the forward pass into prediction probabilities
# Measures the odds the model classifies a data point into one class or the other
# In the case of this problem the classes are either 0 or 1
# It uses the logic:
# If labels_prediction_probabilities >= 0.5 then assign the label class (1)
# If labels_prediction_probabilities < 0.5 then assign the label class (0)
labels_prediction_probabilities = torch.sigmoid(labels_logits)
print(f'Output of the sigmoid-ed forward pass:\n'
f'{labels_prediction_probabilities}')
# Find the predicted labels (round the prediction probabilities as well)
label_predictions = torch.round(labels_prediction_probabilities)
# In full
labels_predictions_classes = \
torch.round(torch.sigmoid(classification_model_0(input_data_test.to(device))[:5]))
# Check for equality
print(torch.eq(label_predictions.squeeze(), labels_predictions_classes.squeeze()))
# Get rid of the extra dimensions
label_predictions.squeeze()
# Display model predictions
print(f'Model Predictions:\n'
f'{label_predictions}')
# Display test labels for comparison with model predictions
print(f'\nFirst five test data:\n'
f'{model_output_test[:5]}')
# Building the training loop
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Process data on the target devices
input_data_train, model_output_train = input_data_train.to(device),\
model_output_train.to(device)
input_data_test, model_output_test = input_data_test.to(device),\
model_output_test.to(device)
# Build the training and evaluation loop
for epoch in range(epochs):
    # Training
    classification_model_0.train()
    # todo: Do the Forward pass
    # Model outputs raw labels_logits
    train_labels_logits = classification_model_0(input_data_train).squeeze()
    # ^ .squeeze() removes the extra dimensions, won't work if model and data on diff devices
    train_label_prediction = torch.round(torch.sigmoid(train_labels_logits))
    # ^ turns logits -> prediction probabilities -> prediction label classes
    # todo: Calculate the loss
    # 2. Calculates loss/accuracy
    # train_loss = loss_function(torch.sigmoid(train_labels_logits),
    #                            model_output_train)  # <- nn.BCELoss needs torch.sigmoid()
    train_loss = loss_function(train_labels_logits,
                               model_output_train)
    train_accuracy = accuracy_function(label_actual=model_output_train,
                                       label_predicted=train_label_prediction)
    # todo: Optimiser zero grad
    optimiser.zero_grad()
    # todo: Loss backwards
    train_loss.backward()
    # todo: optimiser step step step
    optimiser.step()
    # Testing
    # todo: evaluate the model
    classification_model_0.eval()
    with torch.inference_mode():
        # todo: Do the forward pass
        test_logits = classification_model_0(input_data_test).squeeze()
        test_predictions = torch.round(torch.sigmoid(test_logits))
        # todo: calculate the loss
        test_loss = loss_function(test_logits,
                                  model_output_test)
        test_accuracy = accuracy_function(label_actual=model_output_test,
                                          label_predicted=test_predictions)
    # todo: print model statistics every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch: {epoch} | Loss: {train_loss:.5f}, | Train Accuracy: {train_accuracy:.2f}% | '
              f'Test Loss: {test_loss:.5f}, | Test accuracy: {test_accuracy:.2f}%')
# Plot decision boundary for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("My Train")
plot_decision_boundary(classification_model_0, input_data_train, model_output_train)
plt.subplot(1, 2, 2)
plt.title("My Test")
plot_decision_boundary(classification_model_0, input_data_test, model_output_test)
plt.show()`
And here's the tutorial code:
from sklearn.datasets import make_circles
# Make 1000 samples
n_samples = 1000
# Create circles
X, y = make_circles(n_samples,
noise=0.03, # a little bit of noise to the dots
random_state=42) # keep random state so we get the same values
print(f"First 5 X features:\n{X[:5]}")
print(f"\nFirst 5 y labels:\n{y[:5]}")
# Make DataFrame of circle data
import pandas as pd
circles = pd.DataFrame({"X1": X[:, 0],
"X2": X[:, 1],
"label": y
})
circles.head(10)
# Check different labels
circles.label.value_counts()
# Visualize with a plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
y=X[:, 1],
c=y,
cmap=plt.cm.RdYlBu);
# Check the shapes of our features and labels
X.shape, y.shape
# View the first example of features and labels
X_sample = X[0]
y_sample = y[0]
print(f"Values for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shapes for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")
# Turn data into tensors
# Otherwise this causes issues with computations later on
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)
# View the first five samples
X[:5], y[:5]
# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2, # 20% test, 80% train
random_state=42) # make the random split reproducible
len(X_train), len(X_test), len(y_train), len(y_test)
# Standard PyTorch imports
import torch
from torch import nn
# Make device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device
# 1. Construct a model class that subclasses nn.Module
class CircleModelV0(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers capable of handling X and y input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5)  # takes in 2 features (X), produces 5 features
        self.layer_2 = nn.Linear(in_features=5, out_features=1)  # takes in 5 features, produces 1 feature (y)
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, the same shape as y
        # computation goes through layer_1 first then the output of layer_1 goes through layer_2
        return self.layer_2(self.layer_1(x))
# 4. Create an instance of the model and send it to target device
model_0 = CircleModelV0().to(device)
model_0
# Replicate CircleModelV0 with nn.Sequential
model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
model_0
# Make predictions with the model
untrained_preds = model_0(X_test.to(device))
print(f"Length of predictions: {len(untrained_preds)}, Shape: {untrained_preds.shape}")
print(f"Length of test samples: {len(y_test)}, Shape: {y_test.shape}")
print(f"\nFirst 10 predictions:\n{untrained_preds[:10]}")
print(f"\nFirst 10 test labels:\n{y_test[:10]}")
# Create a loss function
# loss_fn = nn.BCELoss() # BCELoss = no sigmoid built-in
loss_fn = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss = sigmoid built-in
# Create an optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item()  # torch.eq() calculates where two tensors are equal
    acc = (correct / len(y_pred)) * 100
    return acc
# View the first 5 outputs of the forward pass on the test data
y_logits = model_0(X_test.to(device))[:5]
y_logits
# Use sigmoid on model logits
y_pred_probs = torch.sigmoid(y_logits)
y_pred_probs
# Find the predicted labels (round the prediction probabilities)
y_preds = torch.round(y_pred_probs)
# In full
y_pred_labels = torch.round(torch.sigmoid(model_0(X_test.to(device))[:5]))
# Check for equality
print(torch.eq(y_preds.squeeze(), y_pred_labels.squeeze()))
# Get rid of extra dimension
y_preds.squeeze()
y_test[:5]
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Put data to target device
X_train, y_train = X_train.to(device), y_train.to(device)
X_test, y_test = X_test.to(device), y_test.to(device)
# Build training and evaluation loop
for epoch in range(epochs):
    ### Training
    model_0.train()
    # 1. Forward pass (model outputs raw logits)
    # squeeze to remove extra `1` dimensions, this won't work unless model and data are on same device
    y_logits = model_0(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits))  # turn logits -> pred probs -> pred labels
    # 2. Calculate loss/accuracy
    # loss = loss_fn(torch.sigmoid(y_logits),  # Using nn.BCELoss you need torch.sigmoid()
    #                y_train)
    loss = loss_fn(y_logits,  # Using nn.BCEWithLogitsLoss works with raw logits
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)
    # 3. Optimizer zero grad
    optimizer.zero_grad()
    # 4. Loss backwards
    loss.backward()
    # 5. Optimizer step
    optimizer.step()
    ### Testing
    model_0.eval()
    with torch.inference_mode():
        # 1. Forward pass
        test_logits = model_0(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # 2. Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out what's happening every 10 epochs
    if epoch % 10 == 0:
        print(
            f"Epoch: {epoch} | Loss: {loss:.5f}, Accuracy: {acc:.2f}% | Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")
from helper_functions import plot_predictions, plot_decision_boundary
# Plot decision boundaries for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Tut Train")
plot_decision_boundary(model_0, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Tut Test")
plot_decision_boundary(model_0, X_test, y_test)
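One difference between runs that is worth ruling out: in both listings above, `torch.manual_seed(42)` is called only after the model has been created, so the seed does not control the initial weights, and two runs can start training from different parameters. A minimal sketch of the effect, assuming nothing beyond `torch` itself (the helper `make_model` is hypothetical and not part of the tutorial):

```python
import torch
from torch import nn


def make_model(seed=None):
    """Build the two-layer model, optionally seeding the RNG first."""
    if seed is not None:
        torch.manual_seed(seed)  # seed BEFORE the layers draw their initial weights
    return nn.Sequential(
        nn.Linear(in_features=2, out_features=5),
        nn.Linear(in_features=5, out_features=1),
    )


def same_weights(model_a, model_b):
    """True if every parameter tensor of the two models is identical."""
    return all(torch.equal(p, q)
               for p, q in zip(model_a.parameters(), model_b.parameters()))


a = make_model(seed=42)
b = make_model(seed=42)  # seeded again, so it starts from the same weights as `a`
c = make_model()         # no reseed: weights come from wherever the RNG stream is now

same_ab = same_weights(a, b)
same_ac = same_weights(a, c)
```

If the initial weights turn out to be the only difference, comparing the first printed epoch (epoch 0) between the two environments should already show the gap.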

Forward outputs on multiple sequences is wrong

I am using T5 to summarize multiple sequences as a batch. I want to reproduce the output of model.generate(input_ids) by calling the forward function (model(**inputs)). I know that forward() and generate() work completely differently (see this). To make them behave the same way, I take some sequences and call model.generate() on them to produce the corresponding outputs, which gives me (text, summary) pairs. Calling the forward function on these pairs one at a time generates the same outputs. However, when I call the forward function on a batch of sequences, the output is not the same. What am I missing?
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")
model.eval()
# sequences
seq1 = "summarize: Calling the model (which means the forward method) uses the labels for teacher forcing. This means inputs to the decoder are the labels shifted by one"
output1 = "calling the model uses the labels for teacher forcing. inputs to the decoder"
seq2 = "summarize: When you call the generate method, the model is used in the autoregressive fashion"
output2 = "the model is used in the auto-aggressive fashion."
seq3 = "summarize: However, selecting the token is a hard decision, and the gradient cannot be propagated through this decision"
output3 = "the token is a hard decision, and the gradient cannot be propagated through this decision"
input_sequences = [seq1, seq2, seq3]
output_seq = [output1, output2, output3]
# encoding input and attention mask
encoding = tokenizer(
input_sequences,
padding="longest",
max_length=128,
truncation=True,
return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")
# labels
target_encoding = tokenizer(
output_seq, padding="longest", max_length=128, truncation=True
)
labels = target_encoding.input_ids
labels = torch.tensor(labels).to("cuda")
labels[labels == tokenizer.pad_token_id] = -100
# Call the models
logits = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).logits
# Apply softmax() and batch_decode()
X = logits
X = F.softmax(X, dim=-1)
ids = X.argmax(dim=-1)
y = tokenizer.batch_decode(sequences=ids, skip_special_tokens=True)
# results: batch_size=3
['call the model uses the labels for teacher forcing inputs to the decoder are',
'the model is used in the auto-aggressive fashion the the the',
'the token is a hard decision, and the gradient cannot be propagated through this decision ']
# results: batch_size =1 i.e. consider 1 seq each time
['call the model uses the labels for teacher forcing inputs to the decoder are']
['the model is used in the auto-aggressive fashion ']
['the token is a hard decision, and the gradient cannot be propagated through this decision ']
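A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the batched results only the trailing tokens differ, and those positions correspond to padding in `labels` (set to -100). The model still emits logits at those positions, so `argmax` returns arbitrary tokens there. Masking them before decoding could look like this. The dummy `logits`, the `labels` values and `pad_token_id = 0` (the padding ID used by T5 tokenizers) are assumptions for illustration; with a real tokenizer, read `tokenizer.pad_token_id` instead.

```python
import torch

pad_token_id = 0  # assumption: stand-in for tokenizer.pad_token_id

torch.manual_seed(0)
# Dummy stand-in for model(...).logits with shape [batch, target_len, vocab_size].
logits = torch.randn(2, 4, 10)
# Second sequence is shorter, so its last two label positions are ignored (-100).
labels = torch.tensor([[5, 6, 7, 1],
                       [8, 1, -100, -100]])

# argmax over the vocabulary; softmax is not needed because it preserves the ordering.
ids = logits.argmax(dim=-1)
# Overwrite predictions at ignored positions with the pad ID so that
# batch_decode(..., skip_special_tokens=True) would drop them.
masked_ids = ids.masked_fill(labels == -100, pad_token_id)
```

With the real model, `masked_ids` would be passed to `tokenizer.batch_decode` in place of `ids`.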

How to extract Sentence Embedding Using BERT model from [CLS] token

I am following this link:
BERT document embedding
I want to extract a sentence embedding from the BERT model using the [CLS] token. Here is the code:
import torch
from keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.
    Returns the vector stored as a numpy ndarray.
    '''
    # ===========================
    # STEP 1: Tokenization
    # ===========================
    MAX_LEN = 510
    # 'encode' will:
    # (1) Tokenize the sentence
    # (2) Prepend the '[CLS]' token to the start.
    # (3) Append the '[SEP]' token to the end.
    # (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                    # sentence to encode.
        add_special_tokens = True,  # Add '[CLS]' and '[SEP]'
        max_length = MAX_LEN,       # Truncate all sentences.
        #return_tensors = 'pt'      # Return pytorch tensors.
    )
    print(input_ids)
    print(tokenizer.decode(input_ids))
    # Calling the tokenizer directly returns both the input IDs and the
    # attention mask, truncated to MAX_LEN, so no manual padding is done
    # for this single piece of text.
    results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
    input_ids = results.input_ids
    attn_mask = results.attention_mask
    print(results)
    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)
    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)
    # ===========================
    # STEP 2: BERT Model
    # ===========================
    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    #model.eval()
    # Copy the inputs to the GPU
    #input_ids = input_ids.to(device)
    #attn_mask = attn_mask.to(device)
    # telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():
        # Forward pass, returns hidden states and predictions
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(input_ids = input_ids, token_type_ids = None, attention_mask = attn_mask)
        hidden_states = outputs[2]
    # Sentence Vectors
    # To get a single vector for our entire sentence we have multiple
    # application-dependent strategies, but a simple approach is to
    # average the second to last hidden layer of each token producing
    # a single 768 length vector.
    # `hidden_states` has shape [13 x 1 x ? x 768]
    # `token_vecs` is a tensor with shape [? x 768]
    token_vecs = hidden_states[-2][0]
    # Calculate the average of all ? token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    # Move to the CPU and convert to numpy ndarray.
    sentence_embedding = sentence_embedding.detach().cpu().numpy()
    return sentence_embedding
from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',output_hidden_states = True), # Whether the model returns all hidden-states.
#model.cuda()
from transformers import BertTokenizer
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
k=text_to_embedding(tokenizer, model, "I like to play cricket")
Output:
<ipython-input-14-f03410b60544> in text_to_embedding(tokenizer, model, in_text)
77 # This will return the logits rather than the loss because we have
78 # not provided labels.
---> 79 outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
80
81
TypeError: 'tuple' object is not callable
I get an error in this line outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
Instead of using average of hidden layer, I want to modify code to get embedding for input sentence using CLS token.
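As a side note on the traceback itself: the line that loads the model ends with a comma after the closing parenthesis, and in Python a trailing comma after an expression creates a one-element tuple. That would explain `'tuple' object is not callable`. A minimal sketch of the effect, using a hypothetical `load_model` as a stand-in for `BertModel.from_pretrained(...)` so that it runs without downloading anything:

```python
def load_model():
    """Hypothetical stand-in for BertModel.from_pretrained(...)."""
    def model(**kwargs):
        # Pretend forward pass: just report which keyword inputs were received.
        return sorted(kwargs)
    return model


broken = load_model(),  # trailing comma: this is a 1-element tuple, not the model
fixed = load_model()    # no trailing comma: this is the callable itself

try:
    broken(input_ids=[101, 102])
    message = None
except TypeError as exc:
    message = str(exc)

result = fixed(input_ids=[101, 102], attention_mask=[1, 1])
```

Removing the comma after the closing parenthesis of the `from_pretrained(...)` call should make `model` callable again.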
There are three ways you can approach your problem:
There is a very cool tool called bert-as-service. It maps a sentence to a fixed-length embedding based on the model you choose to use. The documentation is very well written.
Install
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of bert-serving-server
Download one of the pre-trained models available at the official BERT repo (link)
Start the server
bert-serving-start -model_dir /model_directory/ -num_worker=4
Generate embedding
from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)
There is an academic paper named Sentence-BERT, along with its GitHub repo.
You are doing a lot of manual work: padding, attention mask, etc.
The tokenizer does this for you automatically; check the documentation. And if you look at the implementation of the model's forward() call, it returns:
return (sequence_output, pooled_output) + encoder_outputs[1:]
For BERT base (hidden size 768), the sequence output contains the embeddings of all tokens in the sequence. So if your input length (max_len) is 510, each token is embedded in a 768-dimensional space, giving a sequence output of size 768 * 510 * 1.
The pooled output is a single vector of size 768 * 1 for the whole sequence; in BERT it is computed from the final hidden state of the [CLS] token, passed through a dense layer and a tanh.
So I think you will want to use the pooled output for simple embeddings.
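If the raw [CLS] vector is wanted instead of the pooled output, it is the first token position of the sequence output. A minimal sketch, assuming a tensor shaped [batch x tokens x hidden] like the first element returned by the model; the helper name `cls_embedding` and the small dummy tensor are assumptions used here so the example runs without loading BERT:

```python
import torch


def cls_embedding(sequence_output: torch.Tensor) -> torch.Tensor:
    """Return the vector at token position 0 ([CLS]) for every sequence in the batch."""
    return sequence_output[:, 0, :]


# Dummy stand-in for outputs[0]: 2 sequences, 4 tokens each, hidden size 3
# (a real BERT base model would give hidden size 768).
sequence_output = torch.arange(2 * 4 * 3, dtype=torch.float).reshape(2, 4, 3)
cls_vecs = cls_embedding(sequence_output)
```

With a real model this would be called as `cls_embedding(outputs[0])`, giving one vector per input sentence.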

BERT document embedding

I am trying to do document embedding using BERT. The code I use is a combination of two sources: BERT Document Classification Tutorial with Code, and BERT Word Embeddings Tutorial. In the code below, I feed the first 510 tokens of each document to the BERT model. Finally, I apply K-means clustering to these embeddings, but the members of each cluster are TOTALLY irrelevant. I am wondering how this is possible. Maybe something is wrong with my code. I would appreciate it if you could take a look and tell me if there is something wrong with it. I use Google Colab to run this code.
# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences
def text_to_embedding(tokenizer, model, in_text):
'''
Uses the provided BERT 'model' and 'tokenizer' to generate a vector
representation of the input string, 'in_text'.
Returns the vector stored as a numpy ndarray.
'''
# ===========================
# STEP 1: Tokenization
# ===========================
MAX_LEN = 510
# 'encode' will:
# (1) Tokenize the sentence
# (2) Prepend the '[CLS]' token to the start.
# (3) Append the '[SEP]' token to the end.
# (4) Map tokens to their IDs.
input_ids = tokenizer.encode(
in_text, # sentence to encode.
add_special_tokens = True, # Add '[CLS]' and '[SEP]'
max_length = MAX_LEN, # Truncate all sentences.
#return_tensors = 'pt' # Return pytorch tensors.
)
# Pad our input tokens. Truncation was handled above by the 'encode'
# function, which also makes sure that the '[SEP]' token is placed at the
# end *after* truncating.
# Note: 'pad_sequences' expects a list of lists, but we only have one
# piece of text, so we surround 'input_ids' with an extra set of brackets.
results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
value=0, truncating="post", padding="post")
# Remove the outer list.
input_ids = results[0]
# Create attention masks.
attn_mask = [int(i > 0) for i in input_ids]
# Cast to tensors.
input_ids = torch.tensor(input_ids)
attn_mask = torch.tensor(attn_mask)
# Add an extra dimension for the "batch" (even though there is only one
# input in this batch)
input_ids = input_ids.unsqueeze(0)
attn_mask = attn_mask.unsqueeze(0)
# ===========================
# STEP 1: Tokenization
# ===========================
# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
model.eval()
# Copy the inputs to the device the model is on (the GPU in this setup).
device = next(model.parameters()).device
input_ids = input_ids.to(device)
attn_mask = attn_mask.to(device)
# Telling the model not to build the backward graph will make this
# a little quicker.
with torch.no_grad():
# Forward pass. BertModel returns the last hidden state, the pooled
# output and (because output_hidden_states=True) the hidden states of
# all layers, in that order, so the hidden states are at index 2.
outputs = model(
input_ids = input_ids,
token_type_ids = None,
attention_mask = attn_mask)
hidden_states = outputs[2]
# Sentence vectors
# To get a single vector for the entire text there are multiple
# application-dependent strategies, but a simple approach is to
# average the second-to-last hidden layer over all tokens, producing
# a single 768-length vector.
# `hidden_states` has shape [13 x 1 x ? x 768]
# `token_vecs` is a tensor with shape [? x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all ? token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
# Move to the CPU and convert to numpy ndarray.
sentence_embedding = sentence_embedding.detach().cpu().numpy()
return sentence_embedding
from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True, # Whether the model returns all hidden-states.
)
model.cuda()
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
I don't know if this solves your problem, but here are my two cents:
You don't have to calculate the attention mask and do the padding manually. Have a look at the documentation. Just call the tokenizer itself:
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
# Cast to tensors
...
Instead of averaging the second-to-last hidden layer, you can try the same thing with the last hidden layer, or you can use the vector that represents [CLS] from the last layer.
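To make these pooling options concrete, here is a minimal sketch that uses random tensors in place of real BERT outputs, so only the shapes matter. One assumption worth flagging: since the question pads every document to MAX_LEN, a plain torch.mean over all positions also averages the padding vectors, which may blur the embeddings. The helper names masked_mean and cls_vector are my own, not part of transformers; masked_mean averages only the positions where the attention mask is 1.

```python
import torch


def masked_mean(token_vecs, attn_mask):
    """Average token vectors over real tokens only (hypothetical helper).

    token_vecs: [batch, seq_len, hidden] hidden states from one layer.
    attn_mask:  [batch, seq_len], 1 for real tokens and 0 for padding.
    """
    mask = attn_mask.unsqueeze(-1).to(token_vecs.dtype)  # [batch, seq_len, 1]
    summed = (token_vecs * mask).sum(dim=1)              # padding contributes 0
    counts = mask.sum(dim=1).clamp(min=1)                # avoid division by zero
    return summed / counts                               # [batch, hidden]


def cls_vector(token_vecs):
    """Return the vector at position 0, where BERT puts [CLS] (hypothetical helper)."""
    return token_vecs[:, 0, :]


# Stand-in for one layer of BERT output: 1 document, 6 positions, 768 dims.
torch.manual_seed(0)
layer = torch.randn(1, 6, 768)
# Only the first 4 positions are real tokens; the last 2 are padding.
mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

doc_mean = masked_mean(layer, mask)  # shape [1, 768]
doc_cls = cls_vector(layer)          # shape [1, 768]
```

With the code from the question, this would be called as masked_mean(hidden_states[-1], attn_mask) for the last layer, or with hidden_states[-2] for the second-to-last one.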

X has 7 features per sample; expecting 18282

I'm trying to build a text classification model with sklearn. I'm quite new to Python and to sklearn. I have already trained the model on some training data and saved it, but I get an error when I try to reuse the model in another Python program/file.
I have already looked at similar problems here on Stack Overflow, but I couldn't find a solution that works for me.
I added some comments so that the code is easier to read.
...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
errors='ignore').read()
# separate labels from text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
content = line.split()
labels.append(content[0])
texts.append(" ".join(content[1:]))
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned from train_y; re-fitting here could assign different codes
valid_y = encoder.transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
Since I was training several classifiers to see which one performed best, I wrote a train_model function.
...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
correct_model=False):
# fit the training dataset on the classifier
...
elif correct_model:
classifier.fit(feature_vector_train, label)
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'wb') as file:
pickle.dump(classifier, file)
# with open(pkl_filename, 'rb') as file:
# pickle_model = pickle.load(file)
# joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
# loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
# result = loaded_model.score(feat)
# print(pickle_model.predict(feature_vector_valid))
...
# predict the labels on validation dataset
predictions = classifier.predict(feature_vector_valid)
...
return metrics.accuracy_score(valid_y, predictions)
...
This is the "correct_model":
...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...
This model reaches around 80% accuracy on the validation data.
This is my test file, where I wanted to check whether I can load and reuse the model:
...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")
# create a dataframe using the texts
trainDF = pandas.DataFrame()
trainDF['text'] = texts
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the test data using count vectorizer object
test_data = count_vect.transform(trainDF['text'])
# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
pickle_model = pickle.load(file)
#reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
Then I get this error:
...
ValueError: X has 7 features per sample; expecting 18282
What I expected is that it would give me "3" as the result, which is the encoding of a specific label. These predictions work in the same file where I train the model, but somehow I cannot use new validation data.
I think I made a mistake when fitting and/or transforming the data.
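The 7 in the error message matches the number of distinct words in the test sentence: the test file fits a brand-new CountVectorizer on that one sentence, while the classifier was trained on 18282 columns. Below is a minimal sketch of one common way around this, not necessarily the only one. It assumes a made-up toy corpus and a hypothetical file name text_clf.pkl: the vectorizer and the classifier are kept together in a Pipeline, the fitted pipeline is pickled, and after loading only predict is called on raw text.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Re-fitting a vectorizer on the new sentence yields only its own vocabulary.
sentence = "Der Bus hat nicht an der Haltestelle gehalten"
fresh_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
n_features = fresh_vect.fit_transform([sentence]).shape[1]  # 7 distinct words

# Toy training data (made up for illustration; not the asker's corpus).
train_texts = [
    "Der Bus hat nicht an der Haltestelle gehalten",
    "Der Bus kam zu spät",
    "Der Fahrer war sehr freundlich",
    "Vielen Dank an den netten Fahrer",
]
train_labels = [0, 0, 1, 1]

# Vectorizer and classifier are fitted, and later pickled, as one object.
clf = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\w{1,}"),
    LogisticRegression(),
)
clf.fit(train_texts, train_labels)

# Hypothetical file name; written to a temporary directory here.
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# In the "other" program: load and predict on raw text, with no new fit.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)
prediction = loaded.predict([sentence])
```

If the vectorizer and classifier have to stay separate objects, pickling the fitted count_vect in the training script and calling only count_vect.transform (never fit) in the test script should have the same effect.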

Resources