Load tokenizer from .model and .vocab files - pytorch

I trained a SentencePiece tokenizer, but I can't seem to load it using BertTokenizer.from_pretrained('/path to .model file'). The directory where I saved the SentencePiece tokenizer contains only a .model and a .vocab file.
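For reference, the .model file on its own can be loaded directly with the sentencepiece package. A minimal sketch, independent of BertTokenizer (the file name below is a placeholder for your own .model file):
import sentencepiece as spm
# Load the trained SentencePiece model directly (placeholder path)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
# Tokenize a sample sentence into subword pieces and ids
print(sp.EncodeAsPieces("hello world"))
print(sp.EncodeAsIds("hello world"))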

Related

Load pytorch model with correct args from files

Having followed Chris McCormick's tutorial for creating a BERT Fake News Detector (link here), at the end he saves the PyTorch model using the following code:
import os
output_dir = './model_save/'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
As he says himself, it can then be reloaded using from_pretrained(). Running this code creates an output directory with six files:
config.json
merges.txt
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.json
So how can I use the from_pretrained() method to load the model with all of its arguments and respective weights, and which of the six files do I need?
I understand that a model can be loaded like this (from the PyTorch documentation):
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
but how can I make use of the files in the output directory to do this?
Any help is appreciated!
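As a rough sketch of the reload path (assuming the tutorial's sequence-classification setup; the Auto* classes resolve the concrete model and tokenizer classes from config.json, so adjust if your model differs):
from transformers import AutoModelForSequenceClassification, AutoTokenizer
output_dir = './model_save/'
# config.json + pytorch_model.bin are read for the model; vocab.json, merges.txt,
# special_tokens_map.json and tokenizer_config.json are read for the tokenizer
model = AutoModelForSequenceClassification.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model.eval()  # switch to inference mode, as in the state_dict example above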

How to do language model training on BERT

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation.
They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?
The .raw extension only indicates that they use the raw version of WikiText; the files are regular text files containing the raw text:
We're using the raw WikiText-2 (no tokens were replaced before the tokenization).
The description of the data file options also says that they are text files. From run_language_modeling.py, L86-L88:
train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)
Therefore you can just specify your text files.
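For example, an invocation along these lines (flag names taken from that version of the script, so check python run_language_modeling.py --help for your checkout; the corpus path and output directory are placeholders):
$ python run_language_modeling.py \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --train_data_file my_corpus.txt \
    --mlm \
    --output_dir ./bert-finetuned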

Using download_data() and untar_data() in fastai library

I downloaded the Fashion MNIST dataset from Kaggle using the download_data() function in the fastai library.
downloaded_data = download_data("https://www.kaggle.com/zalando-research/fashionmnist/download")
output -
PosixPath('/root/.fastai/data/download.tgz')
download_data saves it as a .tgz file, so I then use untar_data().
path = untar_data('/root/.fastai/data/download.tgz')
output -
PosixPath('/root/.fastai/data/download.tgz')
Which did not extract the .tgz file. How do I use this dataset with the fastai library?
In the fastai library, download_data gives you back a pathlib.PosixPath pointing to the downloaded archive, not the extracted contents; you need to use another library to extract the data.
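For instance, a minimal sketch using Python's built-in tarfile module (assuming the download really is a gzipped tar archive at the path shown above):
import tarfile
from pathlib import Path
archive = Path('/root/.fastai/data/download.tgz')
extract_dir = archive.parent / 'fashionmnist'
extract_dir.mkdir(exist_ok=True)
# Unpack the whole archive into extract_dir
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(path=extract_dir)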
If you just need the MNIST data from fast ai, here's an easier way:
from fastai import datasets
import gzip, pickle
MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz')
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

pbtxt missing after saving a trained model

What I am trying to do is convert my trained CNN to TFLite and use it in my Android app. AFAIK I need the .pbtxt file in order to freeze the parameters and do the conversion.
However when I save my network using this standard code:
saver = tf.train.Saver(max_to_keep=4)
saver.save(sess=session, save_path="some_path", global_step=step)
I only get the
.data
.index
.meta
checkpoint
files. No pbtxt.
Is there a way to convert the trained network to tflite without a pbtxt or can I obtain the pbtxt from those files?
Thank you
Simply execute:
tf.train.write_graph(session.graph.as_graph_def(),
                     "path",
                     'model.pb',
                     as_text=False)
to get a .pb or
tf.train.write_graph(session.graph.as_graph_def(),
                     "path",
                     'model.pbtxt',
                     as_text=True)
to get the text version.
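As an aside, with TF 1.x you may not need the .pbtxt at all if the training session is still alive: tf.lite.TFLiteConverter.from_session can convert directly. A rough sketch, where input_tensor and output_tensor stand in for your graph's actual input and output tensors:
import tensorflow as tf  # TF 1.x
# input_tensor / output_tensor are placeholders for your network's real
# input and output tensors (e.g. via session.graph.get_tensor_by_name)
converter = tf.lite.TFLiteConverter.from_session(session, [input_tensor], [output_tensor])
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)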

How to Train GloVe algorithm on my own corpus

I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (~900Mb corpus.txt file).
I downloaded the files provided in the link above and compiled them using Cygwin (after editing the demo.sh file and changing it to VOCAB_FILE=corpus.txt; should I leave CORPUS=text8 unchanged?).
the output was:
cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt
How can I use those files to load them as a GloVe model in Python?
You can do it using the glove_python library:
Install it: pip install glove_python
Then:
from glove import Corpus, Glove
# `lines` must be an iterable of tokenised sentences (lists of tokens), e.g. read from corpus.txt
with open('corpus.txt', encoding='utf-8') as f:
    lines = [line.split() for line in f]
# Create a corpus object and fit it to generate the co-occurrence matrix used by GloVe
corpus = Corpus()
corpus.fit(lines, window=10)
# Train the GloVe embeddings on the co-occurrence matrix
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
Reference: word vectorization using glove
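If you later want to reuse the trained embeddings, the saved file can be reloaded. A quick usage sketch, assuming the glove_python API above ('word' is a placeholder for a token that occurs in your corpus):
from glove import Glove
glove = Glove.load('glove.model')
# Nearest neighbours of a word from your corpus
print(glove.most_similar('word'))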
This is how you run the model
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
To train it on your own corpus, you just have to make changes to one file: demo.sh.
Remove the script from if to fi after 'make'.
Replace the CORPUS name with your file name 'corpus.txt'.
There is another if block at the end of the demo.sh file:
if [ "$CORPUS" = 'text8' ]; then
Replace text8 with your file name.
Run the demo.sh once the changes are made.
$ ./demo.sh
Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
Your corpus should go into the CORPUS variable. The vectors.txt file is the output, which is what you will actually use. You can train GloVe in Python, but it takes more time and you need a C compilation environment; I tried it before and wouldn't recommend it.
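To then load vectors.txt in Python, one option is gensim. A minimal sketch, assuming gensim is installed (the glove2word2vec conversion step is needed for older gensim versions; gensim 4.x can also load the GloVe file directly with no_header=True):
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
# Convert the Stanford GloVe format (no header line) to word2vec text format
glove2word2vec('vectors.txt', 'vectors.word2vec.txt')
model = KeyedVectors.load_word2vec_format('vectors.word2vec.txt', binary=False)
print(model.most_similar('word'))  # 'word' is a placeholder token from your corpus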
Here is my take on this:
After cloning the repository, edit the demo.sh file: since you are training on your own corpus, replace the CORPUS name with your file's name.
Then remove the script between make and CORPUS, as that part only downloads an example corpus for you.
Then run make, which will create the four files in the build folder.
Now run ./demo.sh, which will train on your own corpus, do all the steps mentioned in the script, and generate the output as a vectors.txt file.
Note: don't forget to keep your corpus file directly inside the Glove folder.
