How to get word vector representation when using Deep Learning in NLP - nlp

How do I get a word vector representation when using deep learning in NLP? The words are represented by a fixed-length vector; see http://machinelearning.wustl.edu/mlpapers/paper_files/BengioDVJ03.pdf for more details.

Deep Learning and NLP are quite complex subjects, so if you really want to understand them you'll need to follow a couple of courses in the field and read many papers. There are lots of different techniques for converting words into vector representations and it's a very active area of research. Socher's DL for NLP tutorial is a good next step if you are already well acquainted with NLP and Machine Learning (including deep learning).
With that said (and considering it's a programming forum), if you are just interested for now in using someone else's tools to quickly obtain vector representations which can be useful in some tasks, one library you should look at is word2vec. Take a look at its website: https://code.google.com/p/word2vec/. It's a very powerful tool, and for some basic tasks it can be used without much background knowledge.

To get the word vector for a word, you can use Google's pre-trained 300-dimensional Google News word2vec model.
Download the model from here - https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing - or from here -
https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz
After downloading, load the model using the gensim Python library as below -
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
Then just query the model for the word vector corresponding to a word, like
model['usa']
It returns a 300-dimensional word vector for 'usa'.
Note that you may not find word vectors for every word in this model.
Also, instead of this Google News model, other pre-trained models can be used.
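As a side note, newer gensim versions (1.0 and later) load pre-trained vectors through KeyedVectors instead of Word2Vec. A minimal sketch under that assumption, with the model file placed in the current directory -
import gensim
# Newer gensim API: pre-trained vectors are loaded as KeyedVectors.
wv = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Look up a single 300-dimensional vector.
vec = wv['usa']
# Find the words closest to a query word.
print(wv.most_similar('usa', topn=5))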

Related

Is there an embeddings technique to represent multilingual paragraphs?

I have a dataset that includes English, Spanish and German documents. I want to represent them using document embedding techniques to compute their similarities. However, as the documents are in different languages and each one is paragraph-sized, it is difficult to find a pre-trained model (I do not have enough data for training).
I found some interesting models like Sent2Vec and LASER that also work in a multilingual context. However, both of them were designed for sentence representation. The question is twofold:
Is there any model that could be used to represent multilingual paragraphs?
Is it possible to employ Sent2Vec (or LASER) to represent paragraphs (i.e. to represent each paragraph with a single embedding vector)?
Any help would be appreciated.

Gensim: What is difference between word2vec and doc2vec?

I'm kind of a newbie and not a native English speaker, so I have some trouble understanding Gensim's word2vec and doc2vec.
I think both give me the words most similar to a query word, via most_similar() (after training).
How can I tell in which case I should use word2vec versus doc2vec?
Could someone explain the difference in a few words, please?
Thanks.
In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you tag your text and you also get tag vectors. For instance, suppose you have different documents from different authors and use the authors as tags on the documents. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on author tags, i.e. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, then their vectors will be closer. AUTHOR_X is not a real word that is part of your corpus, just a label you choose, so you don't need to have it in your text or manually insert it. Gensim allows you to train doc2vec with or without word vectors (i.e. if you only care about tag similarities).
Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).
If you tell me what problem you are trying to solve, maybe I can suggest which method is more appropriate.
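To make the tag idea concrete, here is a minimal sketch of training doc2vec with author tags in gensim (assuming gensim 4.x; the author labels and toy documents are made up for illustration) -
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Toy corpus: each document is tagged with its (hypothetical) author.
docs = [
    TaggedDocument(words=['quantum', 'state', 'measurement'], tags=['AUTHOR_X']),
    TaggedDocument(words=['wave', 'function', 'collapse'], tags=['AUTHOR_X']),
    TaggedDocument(words=['market', 'price', 'demand'], tags=['AUTHOR_Y']),
]
# Train doc2vec; tag vectors are learned alongside the word vectors.
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
# Which author tags are closest to AUTHOR_X? (use model.docvecs in older gensim versions)
print(model.dv.most_similar('AUTHOR_X'))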

Biasing word2vec towards special corpus

I am new to stackoverflow. Please forgive my bad English.
I am using word2vec for a school project. I want to work with a domain-specific corpus (like a physics textbook) for creating the word vectors using word2vec. On its own this does not give good results due to the small size of the corpus. This especially hurts as we want to evaluate on words that may very well be outside the vocabulary of the textbook.
We want the textbook to encode the domain-specific relationships and semantic "nearness". "Quantum" and "Heisenberg" are especially close in this textbook, for example, which may not hold true for a background corpus. To handle generic words (like "any") we need a basic background model (like the one provided by Google on the word2vec site).
Is there any way we can supplement the background model with our newer corpus? Just training on the corpus alone does not work well.
Are there any attempts to combine vector representations from two corpora, one general and one specific? I could not find any in my searches.
Let's talk about gensim since you tagged your question with it. You can load a previously trained model in Python using gensim and then continue training it. Would that be useful?
import gensim
# Load a model from a previous gensim save:
model = gensim.models.Word2Vec.load(fname)
# ...or from the word2vec C binary format:
# model = gensim.models.Word2Vec.load_word2vec_format('/path/vectors.bin', binary=True)
# Continue training on the new sentences:
model.train(other_sentences)
model.save(fname)
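One caveat: in newer gensim versions (1.0 and later), only a full model loaded with Word2Vec.load() can be further trained (vectors loaded from the C binary format become read-only KeyedVectors), and continuing training on sentences with previously unseen words needs a vocabulary update plus an explicit epoch count. A minimal sketch under those assumptions -
# Newer gensim: grow the vocabulary with the new corpus before training.
model.build_vocab(other_sentences, update=True)
model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs)
model.save(fname)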

how to create word vector

How do I create word vectors? I used one-hot encoding to create word vectors, but they are very large and do not generalize to semantically similar words. I have heard about word vectors produced by neural networks that capture word similarity. I would like to know how to generate these vectors (the algorithm), or good material to start with for creating word vectors.
Word vectors, or so-called distributed representations, have a long history by now, starting perhaps from the work of Y. Bengio (Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS), where he obtained word vectors as a by-product of training a neural-net language model.
A lot of research has demonstrated that these vectors do capture semantic relationships between words (see for example http://research.microsoft.com/pubs/206777/338_Paper.pdf). Also, this important paper (http://arxiv.org/abs/1103.0398) by Collobert et al. is a good starting point for understanding word vectors, the way they are obtained, and how they are used.
Besides word2vec there are many methods to obtain them. Examples include SENNA embeddings by Collobert et al. (http://ronan.collobert.com/senna/), RNN embeddings by T. Mikolov that can be computed using RNNToolkit (http://www.fit.vutbr.cz/~imikolov/rnnlm/), and much more. For English, ready-made embeddings can be downloaded from these websites. Note that word2vec actually uses a shallow skip-gram (or CBOW) model rather than a deep neural network. Another fast code for computing word representations is GloVe (http://www-nlp.stanford.edu/projects/glove/). It is an open question whether deep neural networks are essential for obtaining good embeddings or not.
Depending on your application, you may prefer different types of word vectors, so it's a good idea to try several popular algorithms and see what works best for you.
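If you just want to try this in practice, here is a minimal sketch of training word vectors with gensim's word2vec implementation on a toy corpus (assuming gensim 4.x; the sentences are made up for illustration) -
from gensim.models import Word2Vec
# Toy corpus: a list of tokenized sentences.
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
    ['cats', 'and', 'dogs', 'are', 'pets'],
]
# sg=1 selects skip-gram; sg=0 would use CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
# Query the learned vectors.
print(model.wv['cat'])              # 100-dimensional vector
print(model.wv.most_similar('cat'))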
I think the thing you mean is Word2Vec (https://code.google.com/p/word2vec/). It trains N-dimensional word vectors based on a given corpus. In my understanding of word2vec, the neural network is just used to aggregate the dimensions of the document vectors while also capturing some relationships between words. What should be mentioned is that these relationships are not really semantic; they just reflect the structural (co-occurrence) relationships in your training corpus.
If you want to capture semantic relatedness, have a look at WordNet-based measures, for instance as implemented in these libraries:
Java: https://code.google.com/p/ws4j/
Perl: http://wn-similarity.sourceforge.net/
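Since the rest of this thread uses Python, here is a quick illustration of a WordNet-based similarity measure using NLTK instead (a different library from the two above, shown only as an example) -
from nltk.corpus import wordnet as wn
# Requires the WordNet data: nltk.download('wordnet')
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
# Path similarity ranges from 0 to 1; higher means more closely related.
print(dog.path_similarity(cat))
print(dog.wup_similarity(cat))   # Wu-Palmer similarity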
To get started with word2vec you can use their pretrained vectors. You should find all information about this at https://code.google.com/p/word2vec/.
If you are looking for a Java implementation, this is a good starting point: http://deeplearning4j.org/word2vec.html
I hope this helps
Best wishes

Feature Construction for Text Classification using Autoencoders

Autoencoders can be used to reduce the dimensionality of feature vectors, as far as I understand. In text classification, a feature vector is normally constructed via a dictionary, which tends to be extremely large. I have no experience using autoencoders, so my questions are:
Could autoencoders be used to reduce dimensionality in text classification? (Why? / Why not?)
Has anyone already done this? A source would be nice, if so.
The existing works use autoencoders for creating models at the sentence level. Basically, after training the model using an autoencoder, you can get a vector for a sentence. Since any document consists of sentences, you can get a set of vectors for the document and do the document classification. In my experience with various vector representations (e.g. those generated from autoencoders), doing so might give worse results than classification with bag of words.
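If you want to experiment anyway, here is a minimal sketch of an autoencoder that compresses bag-of-words vectors using Keras (the layer sizes and the random training matrix are placeholders, not recommendations) -
import numpy as np
from tensorflow.keras import layers, models
input_dim = 10000      # vocabulary size of the bag-of-words vectors
encoding_dim = 128     # size of the compressed representation
inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation='relu')(inputs)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = models.Model(inputs, decoded)   # trained to reconstruct its input
encoder = models.Model(inputs, encoded)       # used later to extract features
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# X stands in for a (num_documents, input_dim) bag-of-words matrix.
X = np.random.randint(0, 2, size=(100, input_dim)).astype('float32')
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
# The compressed representations can then be fed to any classifier.
features = encoder.predict(X)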
