I am new in NLU and I am doing a project on document embedding. I want to fine-tune the doc2vec model in gensim on my small dataset to see if it can help for document clustering. I read the tutorial on the website but they did not mention anything about fine-tuning. Where I can find doc2vec pertained on wikipedia or twitter on any big dataset.
Related
I want to create a language translation model using transformers. However, Tensorflow seems to only have a BERT model for English https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4 . If I want a BERT for another language, what is the best way to go about accomplishing this? Should I create a new BERT or can I train Tensorflow's own BertTokenizer on another language?
The Hugging Face model hub contains a plethora of pre-trained monolingual and multilingual transformers (and relevant tokenizers) which can be fine-tuned for your downstream task.
However, if you are unable to locate a suitable model for you language, then yes training from scratch is the only option. Beware though that training from scratch can be a resource-intensive task that will require significant compute power. Here is an excellent blog post to get you started.
Is Google's pretrained word2vec model CBO or skipgram.
We load pretrained model by:
from gensim.models.keyedvectors as word2vec
model= word2vec.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')
How can we specifically load pretrained CBOW or skipgram model ?
The GoogleNews word-vectors were trained by Google, using a proprietary corpus, but they're never explicitly described all the training-parameters used. (It's not encoded in the file.)
It's been asked a number of times on the Google Group devoted to the word2vec-toolkit code, without a definitive answer. For example, there's a response from word2vec author Mikolov that he doesn't remember the training parameters. Elsewhere, another poster thinks one of the word2vec papers implies skip-gram was used – but as that passage doesn't precisely match other aspects (like vocabulary-size) of the released GoogleNews vectors, I wouldn't be completely confident of that.
As Google hasn't been clear, and in any case hasn't released alternate versions based on different training modes, if you want to run any tests or make any conclusions about the different modes, you'll have to use other vector-sets, or train your own vectors in varying ways.
Late to the party, but Mikolov describes the hyperparameters here. The Google News pretrained vectors were trained using CBOW. I believe that's the only option for you to load; there is no pretrained skip-gram version available.
I am training word vectors on particular text corpus using fast text.
Fasttext provides all the necessary mechanics and options for training word vectors and when looked with tsne, the vectors are amazing. I notice gensim has a wrapper for fasttext which is good for accessing vectors.
for my task, I have many text corpuses. I need to use the above trained vectors again with new corpus and use the trained vectors again on new discovered corpuses. fasttext doesnot provide this function. I donot see any package that achieves this or may be I am lost. I see in google forum gensim provides intersect_word2vec_format, but cannot understand or find usage tutorial for this. There is another question open similar to this with no answer.
So apart from gensim, is there any other way to train the models like above.
I want to use an LDA(Latent Dirichlet Allocation) model for an NLP purpose.
To train a such a model from Wikipedia corpus takes about 5 to 6 hours and wiki corpus is about 8GB. Check the Tutorial
Rather than doing so, is there a place to download a built LDA model and use it directly with Gensim?
I did document similarity on my corpus using Doc2Vec and it outputting not that good of similarities. I was wondering if I could do a topic model from what Doc2Vec is giving me to increase the accuracy of my model in order to get better similarities?
You should train a new model (like LDA) from the original corpus.
If the native similarities given by the Doc2Vec process aren't very good, maybe you can improve them with tuning your process.
But if that doesn't work, then Doc2Vec hasn't distilled useful info from your data – and downstream calculations built on those (bad) raw numbers aren't likely to get magically better.