In paragraph vector modeling, the paragraph is treated as a piece of memory that, together with the context words, is used to predict the target word. I can't see why the paragraph would be useful information for predicting the target word.
Should the paragraph include the target word?
Can anyone give me an example of how this works? What is D here? Is the paragraph ID also a one-hot vector?
For example, suppose I have paragraphs A, B, C and words a, b, c, d, e, f, g.
Paragraph B is the sequence abcdefg.
The document is A + B + C.
Say I train on this document and want to predict the word d.
What is the input paragraph here?
I know the word input should be the one-hot vectors of a, b, c, e, f, g if the window size is 7.
The image you posted is from the paper Distributed Representations of Sentences and Documents by Quoc Le and Tomas Mikolov. You can find a detailed explanation of paragraph vectors in section 2.2.
When training word embeddings we usually take the vectors of words from the neighborhood of a certain word. When using paragraph embeddings, you can think of it as adding one more word to each training sample we process. It is like a more global word that, in a way, describes the whole paragraph, not just the few words that were selected as context.
The representation of paragraphs is the same as the representation of words. You encode which paragraph you want to use with a one-hot vector, and the paragraph embedding itself is trained while the corpus is processed. During training you can again think of it as a hidden word inserted into every context of the given paragraph.
When calculating the values in the hidden layer you can use addition or concatenation. The paper I mentioned uses concatenation, so the resulting vector is one half paragraph vector and one half a vector calculated from the word embeddings.
I'm trying to create an RNN that would predict the next word, given the previous word. But I'm struggling with modeling this as a dataset, specifically with how to indicate the next word to be predicted as a 'label'.
I could use a one-hot encoded vector for each word in the vocabulary, but (a) it will have tens of thousands of dimensions, given the large vocabulary, and (b) I'll lose all the other information contained in the word vector. Perhaps that information would be useful in calculating the error, i.e. how far off the predictions were from the actual word.
What should I do? Should I just use the one-hot encoded vector?
I have 45,000 text records in my dataframe. I wanted to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.
After training a word2vec model with 300 features, the shape of the model came out as only 26,000. How can I preserve all of my 45,000 records?
In the classifier model, I need all 45,000 records so that they match the 45,000 output labels.
If you are splitting each entry into a list of words, that's essentially 'tokenization'.
Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve', no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 vectors at the end.
Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
If you only have word-vectors, one simplistic way to create a vector for a larger text is to just add all the individual word vectors together. Further options include choosing between using the unit-normed word-vectors or raw word-vectors of many magnitudes; whether to then unit-norm the sum; and whether to otherwise weight the words by any other importance factor (such as TF/IDF).
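A minimal sketch of this summing approach, assuming word_vecs is a plain dict of word-to-vector mappings (e.g. extracted from a trained Word2Vec model):

```python
import numpy as np

# Build a simplistic document vector by summing the individual word
# vectors, optionally unit-norming each word vector first and/or
# unit-norming the final sum. Words without a vector are skipped.
def doc_vector(words, word_vecs, unit_norm_words=True, unit_norm_result=True):
    vecs = []
    for w in words:
        if w in word_vecs:
            v = np.asarray(word_vecs[w], dtype=float)
            if unit_norm_words:
                v = v / np.linalg.norm(v)
            vecs.append(v)
    if not vecs:
        return None
    total = np.sum(vecs, axis=0)
    if unit_norm_result:
        total = total / np.linalg.norm(total)
    return total

# Toy 2-D vectors, purely for illustration.
word_vecs = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 2.0])}
print(doc_vector("the cat sat".split(), word_vecs))  # [0.7071... 0.7071...]
```

A TF/IDF weighting would simply multiply each word's vector by its weight before summing.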
Note that unless your documents are very long, this is a quite small training set for either Word2Vec or Doc2Vec.
As I was reading about tf-idf on Wikipedia, I was confused by what is meant by the word "document". Does it mean paragraph?
"The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient."
A document in the tf-idf context can typically be thought of as a bag of words. In a vector space model each word is a dimension in a very high-dimensional space, where a document's component along that dimension is the number of occurrences of the word (term) in the document. A document-term matrix is a matrix in which the rows represent documents and the columns represent terms, with each cell holding the number of occurrences of the term in the document. Hope that's clear.
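A tiny worked example of the document-term matrix and the idf formula from the quoted passage, on a made-up three-document corpus:

```python
import math

# Documents as bags of words.
docs = [["the", "cat", "sat"],
        ["the", "dog", "sat", "sat"],
        ["a", "dog", "barked"]]

vocab = sorted({w for d in docs for w in d})
# Rows: documents, columns: terms, cells: occurrence counts.
dtm = [[d.count(w) for w in vocab] for d in docs]

# idf(t) = log(N / df(t)): terms appearing in fewer documents
# get a larger weight.
N = len(docs)
df = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

print(vocab)                  # ['a', 'barked', 'cat', 'dog', 'sat', 'the']
print(dtm[1])                 # [0, 0, 0, 1, 2, 1]
print(round(idf["sat"], 3))   # log(3/2) = 0.405
```

Note that "sat" occurring twice in the second document only affects its term frequency, not its document frequency.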
A "document" is a distinct text. This generally means that each article, book, or so on is its own document.
If you wanted, you could treat an individual paragraph or even sentence as a "document". It's all a matter of perspective.
I am trying to classify different concepts in a text using n-grams. My data typically consists of six columns:
1) The word that needs classification
2) The classification
3) First word on the left of 1)
4) Second word on the left of 1)
5) First word on the right of 1)
6) Second word on the right of 1)
When I try to use an SVM in RapidMiner, I get the error that it cannot handle polynominal values. I know that this can be done because I have read it in different papers. I set the second column to 'label' and have tried to set the rest to 'text' or 'real', but it seems to have no effect. What am I doing wrong?
You have to use the Support Vector Machine (LibSVM) Operator.
In contrast to the classic SVM, which only supports two-class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.
One approach could be to create attributes with names equal to the words and values equal to the distance from the word of interest. Of course, all possible words would need to be represented as attributes so the input data would be large.
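A language-agnostic sketch of that attribute scheme (shown here in Python; the attribute and vocabulary names are made up): each vocabulary word becomes a numeric attribute whose value is its signed offset from the word of interest, which sidesteps the polynominal-value problem entirely.

```python
# For a target token, build one numeric attribute per vocabulary word:
# its signed distance from the target (negative = left, positive = right),
# or 0 if the word is absent from the context.
def context_attributes(tokens, target_index, vocab):
    attrs = {w: 0 for w in vocab}
    for i, w in enumerate(tokens):
        if i != target_index and w in attrs:
            attrs[w] = i - target_index
    return attrs

vocab = ["quick", "fox", "lazy", "dog"]
tokens = "the quick brown fox jumps".split()
print(context_attributes(tokens, 3, vocab))
# {'quick': -2, 'fox': 0, 'lazy': 0, 'dog': 0}
```

As noted above, the full vocabulary must appear as attributes, so the resulting table is wide but purely numeric.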
I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the sentence with the FIRST rank is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims" match.
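To make the idea concrete, here is a deliberately crude suffix-stripping stemmer (real systems use the Porter or Snowball algorithms, e.g. via NLTK; this toy version exists only to show the effect):

```python
# Strip a few common suffixes so related word forms map to the same
# root, collapsing a trailing doubled consonant ("swimm" -> "swim").
# This is NOT a real stemming algorithm, just an illustration.
def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            if len(word) >= 2 and word[-1] == word[-2]:
                word = word[:-1]
            return word
    return word

print(crude_stem("swimming"), crude_stem("swims"))  # swim swim
```

After stemming, "swimming" and "swims" count as occurrences of the same term.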
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words - this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slur words. So you may have another exclusion list with such words in it, a "bad list".
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes as input the similar words and gives a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by the total word frequency so that when uncommon words match, they are given more statistical weight.
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful as you can have a huge variety of sentence structures that really mean the same thing. The most useful information of similarity is just in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point you can compute the Soundex code for each word and then compare documents based on Soundex code frequencies.
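A simplified sketch of the Soundex coding (the standard algorithm has a few more edge cases around h/w and vowel separators): words that sound alike, such as "Robert" and "Rupert", map to the same 4-character code, so documents can be compared on code frequencies rather than exact spellings.

```python
# Map consonants to their Soundex digit groups; vowels, y, h, w
# carry no digit.
CODES = {c: d for d, letters in
         enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
         for c in letters}

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0])
    for c in word[1:]:
        code = CODES.get(c)
        if code is not None and code != prev:
            digits.append(str(code))
        if c not in "hw":   # h/w do not separate duplicate codes
            prev = code
    # Keep the first letter, pad with zeros, truncate to 4 characters.
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Counting these codes per document then gives a frequency vector that can be compared like any bag-of-words representation.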
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.