How to create a custom parallel corpus for machine translation with recent versions of pytorch and torchtext? - pytorch

I am trying to train a model for NMT on a custom dataset. I found this great tutorial on youtube along with the accompanying repo, but it uses an old version of PyTorch and torchtext. More recent versions of torchtext have removed the Field and BucketIterator classes.
I looked for more recent tutorials. The closest thing I could find was this medium post (again with the accompanying code) which worked with a custom dataset for text classification. I tried to replicate the code with my problem and got this far:
from os import PathLike
from torch.utils.data import Dataset
from torchtext.vocab import Vocab
import pandas as pd
from .create_vocab import tokenizer
class ParallelCorpus(Dataset):
"""A parallel corpus for training a machine translation model"""
def __init__(self,
corpus_path: str | PathLike,
source_vocab: Vocab,
target_vocab: Vocab
):
super().__init__()
self.corpus = pd.read_csv(corpus_path)
self.source_vocab = source_vocab
self.target_vocab = target_vocab
def __len__(self):
return len(self.corpus)
def __getitem__(self, index: int):
source_sentence = self.corpus.iloc[index, 0]
source = [self.source_vocab["<sos>"]]
source.extend(
self.source_vocab.lookup_indices(tokenizer(source_sentence))
)
source.append(self.source_vocab["<eos>"])
target_sentence = self.corpus.iloc[index, 1]
target = [self.target_vocab["<sos>"]]
target.extend(
self.target_vocab.lookup_indices(tokenizer(target_sentence))
)
target.append(self.target_vocab["<eos>"])
return source, target
My question is: is this the correct way to implement parallel corpora for pytorch? And where can I find more information about this since the documentation wasn't much help.
Thank you in advance and sorry if this is against the rules.

Related

API to serve roberta ClassificationModel with FastAPI

I have trained transformer using simpletransformers model on colab ,downloaded the searialized model and i have little issues on using it to make inferences.
Loading the model on model on jupyter works but while using it with fastapi gives an error
This is how I,m using it on jupyter:
from scipy.special import softmax
label_cols = ['art', 'politics', 'health', 'tourism']
model = torch.load("model.bin")
pred = model.predict(['i love politics'])[1]
preds = softmax(pred,axis=1)
preds
It gives the following result:array([[0.00230123, 0.97465035, 0.00475409, 0.01829433]])
I have tried using fastapi as follows but keeps on getting an error:
from pydantic import BaseModel
class Message(BaseModel):
text : str
model = torch.load("model.bin")
#app.post("/predict")
def predict_health(data: Message):
prediction = model.predict(data.text)[1]
preds = softmax(prediction, axis=1)
return {"results": preds}
Could you please specify the error you get, otherwise its quite hard to see debug
Also it seems that the model.predict function in the jupyter code gets an array as input, while in your fastapi code you are passing a string directly to that function.
So maybe try
...
prediction = model.predict([data.text])[1]
...
It's hard to say without the error.
In case it helps you could have a look at this article that shows how to build classification with Hugging Face transformers (Bart Large MNLI model) and FastAPI: https://nlpcloud.io/nlp-machine-learning-classification-api-production-fastapi-transformers-nlpcloud.html

Deploy Keras model on Spark

I have a trained keras model.
https://github.com/qubvel/efficientnet
I have a large updating dataset I want to get predictions on. Meaning to run my spark job every 2 hours or so.
What is the way to implement this? MlLib does not support efficientNet.
When searching online I saw this kind of implementation using sparkdl, but it does not support efficentNet as modelName parameter.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
My naive approach would be
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
from sparkdl import readImages
image_df = readImages("flower_photos/sample/")
image_df.withcolumn("modelTags", efficient_net_udf($"image".data))
and creating a UDF that calls model.predict...
Another method I saw is
from keras.preprocessing.image import img_to_array, load_img
import numpy as np
import os
from pyspark.sql.types import StringType
from sparkdl import KerasImageFileTransformer
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
model.save("kerasModel.h5")
def loadAndPreprocessKeras(uri):
image = img_to_array(load_img(uri, target_size=(299, 299)))
image = np.expand_dims(image, axis=0)
return image
transformer = KerasImageFileTransformer(inputCol="uri", outputCol="predictions",
modelFile='path/kerasModel.h5',
imageLoader=loadAndPreprocessKeras,
outputMode="vector")
files = [os.path.abspath(os.path.join(dirpath, f)) for f in os.listdir("/data/myimages") if f.endswith('.jpg')]
uri_df = sqlContext.createDataFrame(files, StringType()).toDF("uri")
keras_pred_df = transformer.transform(uri_df)
What is the correct (and working) way to approach this?

Save and load a Pytorch model

i am trying to train a pytorch model on colab then save the model parameters and load it on my local computer.
After training, the model parameters are stored as below:
torch.save(Model.state_dict(),PATH)
loaded as below:
device = torch.device('cpu')
Model.load_state_dict(torch.load(PATH, map_location=device))
error:
AttributeError: 'Sequential' object has no attribute 'copy'
Does anyone know how to solve this issue?
Your question does not provide sufficient details to be answered correctly. If you are trying to save and load your own model and have a class definition for it see this well known answer and clarify why that's not sufficient for your use.
If you are loading a torch.nn.Sequential model then as far as I know simply loading the model directly and just using it should be sufficient. If it's not post on the pytorch forum what error you get.
For now look at my example show casing loading a sequential model and then using it without error:
# test for saving everything with torch.save
import torch
import torch.nn as nn
from pathlib import Path
from collections import OrderedDict
import numpy as np
import pickle
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
num_samples = 3
Din, Dout = 1, 1
lb, ub = -1, 1
x = torch.torch.distributions.Uniform(low=lb, high=ub).sample((num_samples, Din))
f = nn.Sequential(OrderedDict([
('f1', nn.Linear(Din,Dout)),
('out', nn.SELU())
]))
y = f(x)
# save data torch to numpy
x_np, y_np = x.detach().cpu().numpy(), y.detach().cpu().numpy()
db2 = {'f': f, 'x': x_np, 'y': y_np}
torch.save(db2, path / 'db_f_x_y')
db3 = torch.load(path / 'db_f_x_y')
f3 = db3['f']
x3 = db3['x']
y3 = db3['y']
xx = torch.tensor(x3)
yy3 = f3(xx)
print(yy3)
there should be an official answer how to save and load nn.Sequential models How does one save torch.nn.Sequential models in pytorch properly? but for now torch.save and torch.load seem to work just fine.

Pytorch DataLoader multiple data source

I am trying to use Pytorch dataloader to define my own dataset, but I am not sure how to load multiple data source:
My current code:
class MultipleSourceDataSet(Dataset):
def __init__ (self, json_file, root_dir, transform = None):
with open(root_dir + 'block0.json') as f:
self.result = torch.Tensor(json.load(f))
self.root_dir = root_dir
self.transform = transform
def __len__(self):
return len(self.result[0])
def __getitem__ (self):
None
The data source is 50 blocks under root_dir = ~/Documents/blocks/
I split them and avoid to combine them directly before since this is a very big dataset.
How can I load them into a single dataloader?
For DataLoader you need to have a single Dataset, your problem is that you have multiple 'json' files and you only know how to create a Dataset from each 'json' separately.
What you can do in this case is to use ConcatDataset that contains all the single-'json' datasets you create:
import os
import torch.utils.data as data
class SingeJsonDataset(data.Dataset):
# implement a single json dataset here...
list_of_datasets = []
for j in os.path.listdir(root_dir):
if not j.endswith('.json'):
continue # skip non-json files
list_of_datasets.append(SingeJsonDataset(json_file=j, root_dir=root_dir, transform=None))
# once all single json datasets are created you can concat them into a single one:
multiple_json_dataset = data.ConcatDataset(list_of_datasets)
Now you can feed the concatenated dataset into data.DataLoader.
I should revise my question as 2 different sub-questions:
How to deal with large datasets in PyTorch to avoid memory error
If I am separating large a dataset into small chunks, how can I load multiple mini-datasets
For question 1:
PyTorch DataLoader can prevent this issue by creating mini-batches. Here you can find further explanations.
For question 2:
Please refer to Shai's answer above.

Tensorflow Scikit Flow get GraphDef for Android (save *.pb file)

I want to use my Tensorflow algorithm in an Android app. The Tensorflow Android example starts by downloading a GraphDef that contains the model definition and weights (in a *.pb file). Now this should be from my Scikit Flow algorithm (part of Tensorflow).
At the first glance it seems easy you just have to say classifier.save('model/') but the files saved to that folder are not *.ckpt, *.def and certainly not *.pb. Instead you have to deal with a *.pbtxt and a checkpoint (without ending) file.
I'm stuck there since quite a while. Here a code example to export something:
#imports
import tensorflow as tf
import tensorflow.contrib.learn as skflow
import tensorflow.contrib.learn.python.learn as learn
from sklearn import datasets, metrics
#skflow example
iris = datasets.load_iris()
feature_columns = learn.infer_real_valued_columns_from_input(iris.data)
classifier = learn.LinearClassifier(n_classes=3, feature_columns=feature_columns,model_dir="modeltest")
classifier.fit(iris.data, iris.target, steps=200, batch_size=32)
iris_predictions = list(classifier.predict(iris.data, as_iterable=True))
score = metrics.accuracy_score(iris.target, iris_predictions)
print("Accuracy: %f" % score)
The files you get are:
checkpoint
graph.pbtxt
model.ckpt-1.meta
model.ckpt-1-00000-of-00001
model.ckpt-200.meta
model.ckpt-200-00000-of-00001
Many possible workarounds I found would require having the GraphDef in a variable (don't know how with Scikit Flow). Or a Tensorflow session which doesn't seem to be required using Scikit Flow.
To save as pb file, you need to extract the graph_def from the constructed graph. You can do that as--
from tensorflow.python.framework import tensor_shape, graph_util
from tensorflow.python.platform import gfile
sess = tf.Session()
final_tensor_name = 'results:0' #Replace final_tensor_name with name of the final tensor in your graph
#########Build your graph and train########
## Your tensorflow code to build the graph
###########################################
outpt_filename = 'output_graph.pb'
output_graph_def = sess.graph.as_graph_def()
with gfile.FastGFile(outpt_filename, 'wb') as f:
f.write(output_graph_def.SerializeToString())
If you want to convert your trained variables to constants (to avoid using ckpt files to load the weights), you can use:
output_graph_def = graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [final_tensor_name])
Hope this helps!

Resources