spaCy has great parsing capabilities and its API is very intuitive for the most part. Is there any way, from the spaCy API, to fine-tune its word embedding models? In particular, I would like to keep spaCy's tokens and give them a vector when possible.
The only thing I've come across so far is to train the embeddings using gensim (but then I wouldn't know how to get from spaCy to gensim) and then load them back into spaCy, as in: https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/. This doesn't help with the first part: training on spaCy tokens.
Any help appreciated.
From the spaCy documentation:
If you need to train a word2vec model, we recommend the implementation
in the Python library Gensim.
Besides Gensim, you can also use other implementations like FastText. The easiest way to use the custom vectors from spaCy is to create a model using the init-model command-line utility, like this:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en model --vectors-loc cc.la.300.vec.gz
Then simply load your model as usual: nlp = spacy.load('model'). There is detailed documentation on the spaCy website.
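For illustration, a minimal sketch of loading the resulting model and reading the vectors attached to spaCy tokens (the 'model' directory name matches the init-model command above; the sample sentence is arbitrary):

import spacy

# Load the model directory created by the init-model command above
nlp = spacy.load('model')

doc = nlp("This is a sample sentence.")
for token in doc:
    # has_vector is False for tokens that were not in the imported vector table
    print(token.text, token.has_vector, token.vector_norm)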
Related
I downloaded word embeddings from this link. I want to load them in Gensim to do some work but I am not able to load them. I have found many resources and none of them is working. I am using Gensim version 4.1.
I have tried
gensim.models.fasttext.load_facebook_model('/home/admin1/embeddings/crawl-300d-2M.vec')
gensim.models.fasttext.load_facebook_vectors('/home/admin1/embeddings/crawl-300d-2M.vec')
and it is showing me
NotImplementedError: Supervised fastText models are not supported
I also tried to load it using FastText.load('/home/admin1/embeddings/crawl-300d-2M.vec'), but then it showed UnpicklingError: could not find MARK.
Per the NotImplementedError, those are the one kind of full Facebook FastText model (-supervised mode) that Gensim does not support.
So sadly, the answer to "How do you load these?" is "you don't".
The .vec files contain just the full-word vectors in a plain-text format – no subword info for synthesizing OOV vectors, or supervised-classification output features. Those can be loaded into a KeyedVectors model:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('crawl-300d-2M.vec')
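Once loaded, the KeyedVectors object can be queried like any other word-vector set; for example (the query word here is just illustrative):

vec = kv_model['computer']                            # 300-dimensional vector for an in-vocabulary word
similar = kv_model.most_similar('computer', topn=5)   # nearest neighbours by cosine similarity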
I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classification. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / RoBERTa / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object, however, will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here is a discussion of how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed on the page below. Maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
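As a rough sketch of that approach, using the Hugging Face transformers library directly rather than spaCy (the model name and sentence are just illustrative), the last-layer hidden state of the [CLS] token can be taken as the sentence vector:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This is a sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first token of the last layer is [CLS]; shape: (hidden_size,)
cls_embedding = outputs.last_hidden_state[0, 0]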
I am tokenizing my text corpus, which is in German, using spaCy's German model.
Since spaCy currently only has a small German model, I am unable to extract the word vectors using spaCy itself.
So, I am using fastText's pre-trained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Now, Facebook used the ICU tokenizer for tokenization before extracting those word embeddings, and I am using spaCy.
Can someone tell me if this is okay?
I feel spaCy and the ICU tokenizer might behave differently, and if so then many tokens in my text corpus would not have a corresponding word vector.
Thanks for your help!
UPDATE:
I tried the above method and after extensive testing, I found that this works well for my use case.
Most (almost all) of the tokens in my data matched the tokens present in fastText, and I was able to obtain the word vector representations for them.
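For anyone wanting to run the same kind of coverage check, a rough sketch (the model and vector file names are illustrative; any German text would do), using gensim to read the fastText .vec file and spaCy only for tokenization:

import spacy
from gensim.models import KeyedVectors

nlp = spacy.load('de_core_news_sm')                            # small German model, used here only for tokenization
vectors = KeyedVectors.load_word2vec_format('cc.de.300.vec')   # pre-trained fastText vectors for German

doc = nlp("Dies ist ein kurzer Beispielsatz.")
tokens = [t.text for t in doc if not t.is_space]
covered = [t for t in tokens if t in vectors]                  # KeyedVectors supports membership tests
print(f"{len(covered)} of {len(tokens)} tokens have a fastText vector")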
I am training word vectors on a particular text corpus using fastText.
fastText provides all the necessary mechanics and options for training word vectors, and when viewed with t-SNE, the vectors look amazing. I notice gensim has a wrapper for fastText which is good for accessing the vectors.
For my task, I have many text corpora. I need to take the vectors trained above and keep training them on each newly discovered corpus. fastText does not provide this function. I do not see any package that achieves this, or maybe I am lost. I see in a Google forum that gensim provides intersect_word2vec_format, but I cannot understand it or find a usage tutorial for it. There is another similar open question with no answer.
So, apart from gensim, is there any other way to train the models as described above?
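As one illustration of what continued training might look like, gensim's own FastText implementation can load a full .bin model and keep training after a vocabulary update; this is only a sketch (file names and the toy corpus are placeholders), and whether it suits the use case above is not established:

from gensim.models.fasttext import load_facebook_model

# Load the full fastText model (.bin, not .vec) trained on the first corpus
model = load_facebook_model('first_corpus_model.bin')

# new_sentences: tokenized sentences from a newly discovered corpus
new_sentences = [["new", "domain", "sentence"], ["another", "example"]]

model.build_vocab(new_sentences, update=True)   # add newly seen words to the vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)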
I have been trying to use the NER feature of NLTK. I want to extract such entities from articles. I know that it cannot be perfect in doing so, but I wonder: if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible with the present model in NLTK to continually train the model (semi-supervised training)?
The plain vanilla NER chunker provided in NLTK internally uses a maximum entropy chunker trained on the ACE corpus. Hence it is not possible to identify dates or times unless you train it with your own classifier and data (which is quite a meticulous job).
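For reference, the vanilla chunker mentioned above is used roughly like this (the sentence is just an example; note that it tags entity types such as PERSON, ORGANIZATION and GPE, not dates or times):

import nltk

# Requires the nltk data packages: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words
sentence = "Mark Zuckerberg founded Facebook in Menlo Park."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# ne_chunk wraps the pre-trained maximum-entropy NE chunker
tree = nltk.ne_chunk(tagged)
print(tree)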
You could refer to this link for performing the same.
Also, there is a module called timex in nltk_contrib which might help you with your needs.
If you are interested in performing the same in Java, better look into Stanford SUTime; it is a part of Stanford CoreNLP.