Should I train embeddings using data from the training, validation, and test corpora?

I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of training embeddings on a general corpus combined with a domain-specific one.
My question is: can I use the training, validation, and test sets (after preprocessing) as a source for creating my own word embeddings? If not, I would appreciate hearing about your experience.
Based on my intuition and some experiments, a wider corpus appears to work better, but I'd like to know whether there is relevant research or other results on this.

"Can I use the training, validation, and test sets (after preprocessing) as a source for creating my own word embeddings?"
Sure. Embeddings are not the features of your machine learning model; they are the "computational representation" of your data. In short, they represent words as vectors in a vector space, which makes your data less sparse. Using word embeddings can be considered part of the pre-processing step in NLP.
Usually (that is, with the most widely used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words it commonly appears with).
Therefore, when creating embeddings, the larger the corpus, the better, since the model can place each word vector more accurately in the vector space (and hence compare it to similar words).
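As a rough illustration of what "defined by its surroundings" means, the sketch below builds the (center, context) word pairs that a skip-gram style model would learn from. Only raw text goes in; no class labels from any split are involved. The toy reviews, the window size, and the helper name context_pairs are assumptions made for this example and are not part of any library.

```python
from collections import Counter


def context_pairs(sentences, window=2):
    """Yield (center, context) word pairs from tokenized sentences."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, tokens[j]


# Toy, unlabeled text pooled from the train, validation, and test splits.
train_text = [["the", "pho", "was", "delicious"]]
valid_text = [["the", "broth", "was", "delicious"]]
test_text = [["the", "pho", "broth", "was", "salty"]]

corpus = train_text + valid_text + test_text

# Count how often each word appears next to each other word.
pair_counts = Counter(context_pairs(corpus, window=1))
```

Pooling the splits simply gives each word more contexts to be counted in, which is the sense in which a larger corpus helps.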

Related

Find list of Out Of Vocabulary (OOV) words from my domain-specific PDF while using FastText model

How can I find the list of Out Of Vocabulary (OOV) words in my domain-specific PDF while using a FastText model? I need to fine-tune FastText with my domain-specific words.
A FastText model will already be able to generate vectors for OOV words.
So there's not necessarily any need either to list the specific OOV words in your PDF or to 'fine tune' a FastText model.
You just ask it for vectors, it gives them back. The vectors for full in-vocabulary words, that were trained from relevant training material, will likely be best, while vectors synthesized for OOV words from word-fragments (character n-grams) shared with training material will just be rough guesses - better than nothing, but not great.
(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)
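If you still want to see which words are OOV, and which word fragments a vector for them would be synthesized from, a minimal sketch might look like the following. It assumes you already have the model's vocabulary as a set of strings and the PDF text as a list of tokens; the names find_oov and char_ngrams, the toy data, and the 3-to-6 n-gram range are assumptions for illustration, so check the settings of your own model.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def find_oov(tokens, vocabulary):
    """Return the distinct tokens missing from the vocabulary, in first-seen order."""
    seen = set()
    oov = []
    for tok in tokens:
        if tok not in vocabulary and tok not in seen:
            seen.add(tok)
            oov.append(tok)
    return oov


# Hypothetical model vocabulary and tokens extracted from a PDF.
vocabulary = {"the", "soup", "was", "good"}
pdf_tokens = ["the", "pho", "was", "good", "pho", "banhmi"]

oov_words = find_oov(pdf_tokens, vocabulary)
trigrams = char_ngrams("pho", min_n=3, max_n=3)
```

An OOV word whose fragments rarely occurred in the training material would get an especially rough vector, which is one way to judge whether retraining on domain text is worth it.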
If you think you need to do more, you should expand your question with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.
I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.
Anyone seeking such fine-tuning should have a clear idea of:
what their incremental data might be able to add to an existing model
what process/code they will be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives

How to find similar sentence from a corpus on word2vec?

I have implemented word2vec on my corpus using the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/word2vec#next_steps
Now I want to give a sentence as input and find a similar sentence in the corpus.
Any leads on how I can perform this?
A simple word2vec model is not capable of such a task, as it only relates the semantics of words to each other, not the semantics of whole sentences. Inherently, such a model has no generative function; it only serves as a look-up table.
Word2vec models map word strings to vectors in the embedding space. To find similar words for a given sample word, one can simply go through all vectors in the vocabulary and find the ones that are closest (in terms of the 2-norm) to the sample word vector. For further information you could go here or here.
However, this does not work for sentences, as it would require a whole vocabulary of sentences from which to pick similar ones, which is not feasible.
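The word-level look-up described above can be sketched in a few lines. This is only an illustration: the 2-dimensional vectors are made up, and nearest_words is a hypothetical helper, not a function from TensorFlow or any word2vec library.

```python
import math


def nearest_words(query, vectors, k=2):
    """Return the k vocabulary words closest to `query` by Euclidean (2-norm) distance."""
    target = vectors[query]
    distances = []
    for word, vec in vectors.items():
        if word == query:
            continue  # skip the query word itself
        distances.append((math.dist(target, vec), word))
    distances.sort()
    return [word for _, word in distances[:k]]


# Toy 2-dimensional embeddings, made up for illustration.
vectors = {
    "pho": (0.9, 0.1),
    "noodle": (0.8, 0.2),
    "soup": (0.7, 0.3),
    "taxi": (0.1, 0.9),
}

neighbors = nearest_words("pho", vectors, k=2)
```

The scan works because every candidate word already has a stored vector, which is exactly what is missing for arbitrary sentences.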
Edit: This seems to be a duplicate of this question.

Passing multiple sentences to BERT?

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.
I am wondering how I should use BERT to generate vector representations of these paragraphs and especially, whether it is fine to just pass the whole paragraph into BERT?
There have been informative discussions of related problems here and here. These discussions focus on how to use BERT for representing whole documents. In my case the paragraphs are not that long, and indeed could be passed to BERT without exceeding its maximum length of 512 tokens. However, BERT was trained on sentences. Sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done regularly).
I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as
"an arbitrary span of contiguous text, rather than an actual linguistic sentence."
It is therefore completely fine to pass whole paragraphs to BERT, and this definition is one reason why the model can handle them.

word2vec limit similar_by_vector() result to re-trained corpus

Assume you have a word2vec model pre-trained on Wikipedia, and you train it further on an additional corpus (very small, 1000 sentences).
Can you imagine a way to limit a vector-search to the "re-trained" corpus only?
For example
model.wv.similar_by_vector()
will simply find the closest word for a given vector, no matter if it is part of the Wikipedia corpus, or the re-trained vocabulary.
On the other hand, for 'word' search the concept exists:
model.wv.most_similar_to_given('house', ['garden', 'boat'])
I have tried training on the small corpus from scratch, and it somewhat works as expected. But of course it could be much more powerful if the assigned vectors came from a pre-trained set.
Sharing an efficient way to do this manually:
re-train word2vec on the additional corpus
create full unique word-index of corpus
fetch re-trained vectors for each word in the index
instead of the canned function "similar_by_vector", use scipy.spatial.KDTree.query()
This finds the closest word within the given corpus only and works as expected.
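The manual steps above can be sketched without any library. Here a brute-force scan stands in for scipy.spatial.KDTree.query(), which would be the faster choice for a large word index; the toy vectors and the helper name closest_in_corpus are assumptions made for this example.

```python
import math


def closest_in_corpus(query_vec, all_vectors, corpus_words):
    """Find the word nearest to query_vec, considering only words from the given corpus."""
    # Unique word index of the corpus, limited to words that have a vector.
    index = sorted(set(corpus_words) & set(all_vectors))
    best_word, best_dist = None, float("inf")
    for word in index:
        d = math.dist(query_vec, all_vectors[word])
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word


# Toy vectors standing in for the re-trained model's word vectors.
all_vectors = {
    "home": (0.95, 0.05),   # appears only in the large pre-training corpus
    "garden": (0.7, 0.3),
    "boat": (0.2, 0.8),
}
small_corpus = ["garden", "boat", "garden"]
query_vec = (1.0, 0.0)

unrestricted = closest_in_corpus(query_vec, all_vectors, list(all_vectors))
restricted = closest_in_corpus(query_vec, all_vectors, small_corpus)
```

With the full vocabulary the nearest word comes from the pre-training data, while the restricted search only ever returns a word that occurs in the small corpus.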
Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:
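# gensim 3.x API; in gensim 4.x the equivalent is KeyedVectors(vector_size) with add_vectors()
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])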
Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().

how to create word vector

How to create word vector? I used one hot key to create word vector, but it is very huge and not generalized for similar semantic word. So I have heard about word vector using neural network that finds word similarity and word vector. So I wanted to know how to generate this vector (algorithm) or good material to start creating word vector ?.
Word vectors, or so-called distributed representations, have a long history by now, starting perhaps from the work of Y. Bengio (Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.), where word vectors were obtained as a by-product of training a neural-net language model.
Much research has demonstrated that these vectors do capture semantic relationships between words (see for example http://research.microsoft.com/pubs/206777/338_Paper.pdf). Also, this important paper (http://arxiv.org/abs/1103.0398) by Collobert et al. is a good starting point for understanding word vectors and the way they are obtained and used.
Besides word2vec there are many other methods to obtain them. Examples include SENNA embeddings by Collobert et al. (http://ronan.collobert.com/senna/), RNN embeddings by T. Mikolov that can be computed using RNNToolkit (http://www.fit.vutbr.cz/~imikolov/rnnlm/), and many more. For English, ready-made embeddings can be downloaded from these web sites. Note that word2vec actually uses the skip-gram model (a shallow model rather than a deep neural network). Another fast code for computing word representations is GloVe (http://www-nlp.stanford.edu/projects/glove/). It is an open question whether deep neural networks are essential for obtaining good embeddings or not.
Depending on your application, you may prefer different types of word vectors, so it's a good idea to try several popular algorithms and see what works best for you.
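To make the contrast with the one-hot encoding from the question concrete, the sketch below compares cosine similarity under both representations. The dense vectors are hand-picked for illustration rather than learned, and the helper names one_hot and cosine are assumptions of this example.

```python
import math


def one_hot(word, vocab):
    """One-hot vector: 1.0 at the word's index, 0.0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


vocab = ["good", "great", "bad"]

# One-hot vectors grow with the vocabulary and are mutually orthogonal,
# so any two different words look equally unrelated.
sim_one_hot = cosine(one_hot("good", vocab), one_hot("great", vocab))

# Hand-picked dense vectors (not learned) standing in for trained embeddings.
dense = {"good": [0.9, 0.1], "great": [0.8, 0.2], "bad": [-0.9, 0.1]}
sim_dense_close = cosine(dense["good"], dense["great"])
sim_dense_far = cosine(dense["good"], dense["bad"])
```

In a trained model the dense vectors come out of the learning algorithm, but the effect is the same: similar words end up with a high similarity score, which one-hot vectors cannot express.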
I think what you mean is Word2Vec (https://code.google.com/p/word2vec/). It trains N-dimensional word vectors from the documents of a given corpus. In my understanding of word2vec, the neural network is just used to aggregate the dimensions of the document vector and to capture some relationships between words. It should be mentioned, though, that these relationships are not really semantic; they just reflect the structural relationships in your training corpus.
If you want to capture semantic relatedness, have a look at WordNet-based measures, for instance those implemented in these libraries:
Java: https://code.google.com/p/ws4j/
Perl: http://wn-similarity.sourceforge.net/
To get started with word2vec you can use their pretrained vectors. You should find all information about this at https://code.google.com/p/word2vec/.
If you are looking for a Java implementation, this is a good starting point: http://deeplearning4j.org/word2vec.html
I hope this helps
Best wishes
