Pre-trained FastText hyperparameters - nlp

I'm using the pre-trained model:
import fasttext.util
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.en.300.bin')
Where can I find an exhaustive list of the values of the hyperparameters used to train the model?
https://fasttext.cc/docs/en/options.html lists the default values, which differ from the ones actually used: for example, the dimension of the word vectors is 300, not 100 (and https://fasttext.cc/docs/en/crawl-vectors.html doesn't list them all).

From looking at the _FastText Python model class in Facebook's source...
https://github.com/facebookresearch/fastText/blob/a20c0d27cd0ee88a25ea0433b7f03038cd728459/python/fasttext_module/fasttext/FastText.py#L99
...it looks like, at least when creating a model, all the hyperparameters are added as attributes on the object.
Have you checked if that's the case on your loaded model? For example, does ft.dim report 300, and other parameters like ft.minCount report anything interesting?
Update: As that didn't seem to work, it also looks like the _FastText model wraps an internal instance of a native (not-in-Python) FastText model in its .f attribute. (See a few lines up from the source code I pointed to earlier.)
That native instance is set up by the module defined in fasttext_pybind.cc. That code looks like it specifies a bunch of read-write class variables associated with the hyperparameters - see for example starting at:
https://github.com/facebookresearch/fastText/blob/a20c0d27cd0ee88a25ea0433b7f03038cd728459/python/fasttext_module/fasttext/pybind/fasttext_pybind.cc#L88
So: does ft.f.minCount or ft.f.dim return anything useful from a post-loaded model ft?

Citing NVS Abhilash from https://github.com/facebookresearch/fastText/issues/887#issuecomment-649018188, the right code to write is:
args_obj = ft.f.getArgs()
for hparam in dir(args_obj):
    if not hparam.startswith('__'):
        print(f"{hparam} -> {getattr(args_obj, hparam)}")
This will print all the hyperparameters of the trained model!
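If you only need a couple of specific values, the same args object exposes them as attributes. A minimal sketch, assuming the attribute names (dim, minCount, minn, maxn, epoch) show up in the dir() listing above for your model:
args_obj = ft.f.getArgs()
# Read individual hyperparameters directly (names assumed from the dir() output above)
print(args_obj.dim)                   # expected 300 for cc.en.300.bin
print(args_obj.minCount)
print(args_obj.minn, args_obj.maxn)   # subword n-gram range
print(args_obj.epoch)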

Related

How to find back the architecture of a PyTorch model having only the weight dictionary?

I wanted to use the multilingual-codesearch model, but first of all the code doesn't work and outputs the following error, which suggests that it cannot load from weights only:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ncoop57/multilingual-codesearch")
model = AutoModel.from_pretrained("ncoop57/multilingual-codesearch")
ValueError: Unrecognized model in ncoop57/multilingual-codesearch. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
Then I downloaded the PyTorch bin file, but it only contains the weight dictionary (the state dictionary, as mentioned here), which means that if I want to use the model I have to initialize the right architecture and then load the weights.
But how am I supposed to find the architecture fitting the weights of a model that complex? I saw that some methods could find back the model based on the weight dictionary, but I didn't manage to make them work (I'm thinking of enter link description here).
How can one find back the architecture from a weight dictionary in order to make the model work? Is it even possible?
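No full reconstruction is sketched here, but a useful first step is to read the layer names and tensor shapes straight out of the state dictionary, since they usually encode much of the architecture. A minimal sketch, assuming the downloaded file is a plain PyTorch state dict named pytorch_model.bin:
import torch

# Load only the weights: a dict mapping parameter names to tensors
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Parameter names typically reveal the structure (embeddings, layer indices, attention heads, ...)
for name, tensor in state_dict.items():
    print(f"{name}: {tuple(tensor.shape)}")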

how to find out the method definition of torch.nn.Module.parameters()

I am following this notebook:
One of the methods is:
def init_hidden(self, batch_size):
    ''' Initializes hidden state '''
    # Create two new tensors with sizes n_layers x batch_size x n_hidden,
    # initialized to zero, for hidden state and cell state of LSTM
    weight = next(self.parameters()).data
    if (train_on_gpu):
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
    else:
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
    return hidden
I would like to see what type weight is and how the new() method is used, so I was trying to find the parameters() method, since the data attribute comes from the parameters() method.
Surprisingly, I cannot find where it comes from after reading the source code of the nn module in PyTorch.
How do you figure out where to find the definition of methods you see in PyTorch?
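As a side note on the snippet itself: next(self.parameters()).data just grabs the tensor of one existing parameter, and tensor.new(...) then builds a fresh tensor with the same dtype and device. A minimal sketch with a hypothetical small LSTM module (the sizes are placeholders):
import torch.nn as nn

# Hypothetical module, just to have some parameters to grab
lstm = nn.LSTM(input_size=10, hidden_size=4, num_layers=2)

weight = next(lstm.parameters()).data   # tensor data of the first parameter
print(type(weight), weight.dtype, weight.device)

# new() creates an uninitialized tensor with the same dtype/device as `weight`;
# zero_() then fills it with zeros in place (n_layers x batch_size x n_hidden)
hidden = weight.new(2, 3, 4).zero_()
print(hidden.shape, hidden.dtype)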
so I was trying to find the parameters() method, since the data attribute comes from the parameters() method.
Surprisingly, I cannot find where it comes from after reading the source code of the nn module in PyTorch.
You can see the module definition under torch/nn/modules/module.py here at line 178.
You can then easily spot the parameters() method here.
How do you guys figure out where to see the definition of methods you saw from PyTorch?
The easiest way, which I always use myself, is VSCode's Go to Definition or its Peek -> Peek Definition feature.
I believe PyCharm has similar functionality as well.
You can also check the source code directly from the PyTorch documentation.
See here for the torch.nn.Module.parameters function (just click the orange "Source" link, which gets you here).
The source is linked as long as the function isn't written in C/C++/low-level code; in that case you have to go through GitHub and find your way around the project.
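You can also ask Python itself where a method lives, without any IDE. A minimal sketch using the standard inspect module:
import inspect
import torch.nn as nn

# File and starting line where parameters() is defined
print(inspect.getsourcefile(nn.Module.parameters))
print(inspect.getsourcelines(nn.Module.parameters)[1])

# Or print the definition itself
print(inspect.getsource(nn.Module.parameters))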

"strip" onnx graph from its constants (initializers)

I have an ONNX graph/model that has big constants in it, so it takes a lot of time to load and parse. Can I "strip" the data from the graph, so I can inspect the graph nodes without the data?
initializer is one of the fields in GraphProto. You should be able to clear the initializer field with a simple Python script. I haven't tested the following code, but it should be something like this:
import onnx

def clear_initializer(model_path, output_path):
    model = onnx.load_model(model_path)
    model.graph.ClearField('initializer')
    # save the stripped model under a new path so the original is untouched
    onnx.save_model(model, output_path)
references:
https://developers.google.com/protocol-buffers/docs/reference/python/google.protobuf.message.Message-class
https://github.com/onnx/onnx/blob/2e7099ee7c37b196c197c9a084a97698a41da232/onnx/__init__.py
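For completeness, a small usage sketch of the same idea (also untested; the file names and the node-inspection loop are my own additions, not from the original answer):
import onnx

# Strip the initializers, save the lighter model, then inspect just the graph structure
model = onnx.load_model('model.onnx')
model.graph.ClearField('initializer')
onnx.save_model(model, 'model_stripped.onnx')

for node in model.graph.node:
    print(node.op_type, node.name, list(node.input), list(node.output))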

Using lda2vec library with other types of wordvectors

Hi, I'm new to the NLP field and recently got interested in lda2vec.
After reading Moody's article about lda2vec, I've tried to use the code he posted, but I want to customize the word-vector generation part.
I'd like to use pretrained embeddings, but I have no idea what I should change, since I'm not very good at coding either.
Any suggestions?
Look at the corpus class, especially this method:
def compact_word_vectors(self, vocab, filename=None, array=None, top=20000):
    ...
    # code is omitted
From the docstring, the argument filename is a "Filename for SpaCy-compatible word vectors or if use_spacy=False then uses word2vec vectors via gensim."
Try passing your pre-trained word vectors to this method. Hope it helps!
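If the word2vec-via-gensim path is the one you take, loading the pretrained vectors is the part that is independent of lda2vec. A minimal sketch of just that piece (the file name is a placeholder for whatever word2vec-format vectors you have):
from gensim.models import KeyedVectors

# Load pretrained word2vec-format vectors (binary GoogleNews file as a placeholder)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(wv['computer'].shape)                 # e.g. (300,)
print(wv.most_similar('computer', topn=3))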

How to properly tag a list of documents with Gensim TaggedDocument()

I would like to tag a list of documents with Gensim TaggedDocument() and then pass these documents as input to Doc2Vec().
I have read the documentation about TaggedDocument here, but I haven't understood what exactly the parameters words and tags are.
I have tried:
texts = [[word for word in document.lower().split()]
         for document in X.values]
texts = [[token for token in text]
         for text in texts]
model = gensim.models.Doc2Vec(texts, vector_size=200)
model.train(texts, total_examples=len(texts), epochs=10)
But I get the error 'list' object has no attribute 'words'.
Doc2Vec expects an iterable collection of texts that are each (shaped like) the example TaggedDocument class, with both words and tags properties.
The words can be your tokenized text (as a list), but the tags should be a list of document-tags that should receive learned vectors via the Doc2Vec algorithm. Most often, these are unique IDs, one per document. (You can just use plain int indexes, if that works as a way to refer to your documents elsewhere, or string IDs.) Note that tags must be a list of tags, even if you're only providing one per document.
You are simply providing a list of lists-of-words, thus generating the error.
Try instead initializing texts in a single statement (after importing TaggedDocument):
from gensim.models.doc2vec import TaggedDocument

texts = [TaggedDocument(
             words=[word for word in document.lower().split()],
             tags=[i]
         ) for i, document in enumerate(X.values)]
Also, you don't need to call train() if you've supplied texts when the Doc2Vec was created. (By supplying the corpus at initialization, Doc2Vec will automatically do both an initial vocabulary-discovery scan and then your specified number of training passes.)
You should look at working examples for inspiration, such as the doc2vec-lee.ipynb runnable Jupyter notebook that's included with gensim. It will be somewhere in your install directory, if you can find it, but you can also view a (static, non-runnable) version inside the gensim source code repository at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
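Putting the answer together, a minimal end-to-end sketch (the toy corpus and the integer tags are my own illustration; gensim 4.x attribute names are assumed, where document vectors live in model.dv):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for X.values
raw_docs = ["the quick brown fox", "jumped over the lazy dog", "the dog barked back"]

texts = [TaggedDocument(words=doc.lower().split(), tags=[i])
         for i, doc in enumerate(raw_docs)]

# Supplying the corpus here triggers the vocabulary scan and training automatically
model = Doc2Vec(texts, vector_size=50, min_count=1, epochs=20)

print(model.dv[0][:5])                                # learned vector for the document tagged 0
print(model.infer_vector("a quick dog".split())[:5])  # vector for an unseen document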

Resources