What are the negative & sample parameters? - doc2vec

I am new to NLP and doc2Vec. I want to understand the parameters of doc2Vec. Thank you
Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, sample = 0, seed=0)
vector_size: I believe this helps control over-fitting. A larger feature vector will learn more details, so it tends to over-fit. Is there a method to determine an appropriate vector size based on the number of documents or the total words across all documents?
negative: how many “noise words” should be drawn. What is a noise word?
sample: the threshold for configuring which higher-frequency words are randomly down-sampled. So what does sample=0 mean?

As a beginner, only vector_size will be of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There are no hard & fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frequent words are sometimes randomly skipped (down-sampled). (Very frequent words have so many redundant usage examples that the extras are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the down-sampling more aggressive (dropping more words). sample=0 would turn off such down-sampling completely, leaving all training text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add back interleaved skip-gram word-vector training).
hs is an alternate mode to train the neural network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data – so it's rare to use the hierarchical-softmax mode.
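To see how these parameters fit together, here's a minimal sketch of constructing a gensim Doc2Vec model. The toy corpus, tags, and the min_count=1 override are purely for illustration – keep the defaults on real data unless you can evaluate alternatives:

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus purely for illustration; real training needs far more data.
raw_docs = [
    "the quick brown fox jumps over the lazy dog",
    "doc2vec learns one vector per tagged document",
    "negative sampling draws random noise words to tune against",
]
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(raw_docs)]

model = Doc2Vec(
    corpus,
    dm=0,             # PV-DBOW mode, as in the question
    dbow_words=1,     # optionally also train skip-gram word-vectors
    vector_size=100,  # keep modest unless the corpus is large
    negative=5,       # 5 random 'noise' words per positive example (the default)
    hs=0,             # hierarchical softmax off, since negative sampling is on
    sample=1e-4,      # mild down-sampling of very frequent words; 0 disables it
    min_count=1,      # only because this toy corpus is tiny; keep the default (5) on real data
    epochs=20,
)

vector = model.infer_vector("a new unseen document".split())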

Related

Does adding a list of Word2Vec embeddings give a meaningful representation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence; it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
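As a quick illustration of the sum-vs-average point, here's a hedged sketch using the pre-trained vectors named in the question (the word list is a hypothetical set of image labels): summing and averaging give vectors that differ only in magnitude, so their cosine similarities to any probe vector match, while euclidean distances differ.

import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")   # the pre-trained vectors from the question (~1.6 GB download)

words = ["dog", "leash", "park"]            # hypothetical image labels
vecs = np.array([kv[w] for w in words])

summed = vecs.sum(axis=0)
averaged = vecs.mean(axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

probe = kv["puppy"]
print(cosine(summed, probe))     # same value as the next line:
print(cosine(averaged, probe))   # only the magnitude differs, not the direction
print(np.linalg.norm(summed - probe), np.linalg.norm(averaged - probe))  # but euclidean distances differ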
Separately:
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might give better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just an arithmetical combination of word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector first - and while they're more expensive to calculate, they sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point, you may want to try that as well.
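If you do want to try it later, gensim exposes Word Mover's Distance on its KeyedVectors; a rough sketch (the label lists are hypothetical, and wmdistance() needs the optional POT – or, on older gensim versions, pyemd – package installed):

import gensim.downloader as api

kv = api.load("word2vec-google-news-300")

labels_a = ["dog", "leash", "park"]
labels_b = ["puppy", "collar", "garden"]

print(kv.wmdistance(labels_a, labels_b))   # a distance, so lower means more similar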

Why word embedding techniques work

I have looked into some word embedding techniques, such as
CBOW: from context to a single word. The weight matrix produced is used as the embedding vectors.
Skip-gram: from word to context (from what I see, it's actually word to word, as a single prediction is enough). Again, the weight matrix produced is used as the embedding.
Introductions to these tools always quote "cosine similarity", which says that words of similar meaning map to similar vectors.
But these methods are all based on 'context', accounting only for the words around a target word. I would say they are 'syntagmatic' rather than 'paradigmatic'. So why does closeness in distance within a sentence indicate closeness in meaning? I can think of many counter-examples that frequently occur:
"Have a good day" ('good' and 'day' are vastly different in meaning, though close in distance).
"toilet" and "washroom" (two words of similar meaning, but a sentence containing one is unlikely to contain the other).
Any possible explanation?
This sort of "why" isn't a great fit for StackOverflow, but some thoughts:
The essence of word2vec & similar embedding models may be compression: the model is forced to predict neighbors using far less internal state than would be required to remember the entire training set. So it has to force similar words together, in similar areas of the parameter space, and force groups of words into various useful relative-relationships.
So, in your second example of 'toilet' and 'washroom', even though they rarely appear together, they do tend to appear around the same neighboring words. (They're synonyms in many usages.) The model tries to predict them both, to similar levels, when typical words surround them. And vice-versa: when they appear, the model should generally predict the same sorts of words nearby.
To achieve that, their vectors must be nudged quite close by the iterative training. The only way to get 'toilet' and 'washroom' to predict the same neighbors, through the shallow feed-forward network, is to corral their word-vectors to nearby places. (And further, to the extent they have slightly different shades of meaning – with 'toilet' more the device & 'washroom' more the room – they'll still skew slightly apart from each other towards neighbors that are more 'objects' vs 'places'.)
Similarly, words that are formally antonyms, but easily stand-in for each-other in similar contexts, like 'hot' and 'cold', will be somewhat close to each other at the end of training. (And, their various nearer-synonyms will be clustered around them, as they tend to be used to describe similar nearby paradigmatically-warmer or -colder words.)
On the other hand, your example "have a good day" probably doesn't have a giant influence on either 'good' or 'day'. Both words' more unique (and thus predictively-useful) senses are more associated with other words. The word 'good' alone can appear everywhere, so has weak relationships everywhere, but still a strong relationship to other synonyms/antonyms on an evaluative ("good or bad", "likable or unlikable", "preferred or disliked", etc) scale.
All those random/non-predictive instances tend to cancel-out as noise; the relationships that have some ability to predict nearby words, even slightly, eventually find some relative/nearby arrangement in the high-dimensional space, so as to help the model for some training examples.
Note that a word2vec model isn't necessarily an effective way to predict nearby words. It might never be good at that task. But the attempt to become good at neighboring-word prediction, with fewer free parameters than would allow a perfect-lookup against training data, forces the model to reflect underlying semantic or syntactic patterns in the data.
(Note also that some research shows that a larger window influences word-vectors to reflect more topical/domain similarity – "these words are used about the same things, in the broad discourse about X" – while a tiny window makes the word-vectors reflect a more syntactic/typical similarity - "these words are drop-in replacements for each other, fitting the same role in a sentence". See for example Levy/Goldberg "Dependency-Based Word Embeddings", around its Table 1.)
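A hedged way to probe that window effect yourself is to train two otherwise-identical Word2Vec models that differ only in window and compare their nearest neighbors. The small text8 corpus from gensim's downloader is used here purely as a convenient stand-in:

import gensim.downloader as api
from gensim.models import Word2Vec

sentences = list(api.load("text8"))   # small demo corpus; any tokenized corpus works

narrow = Word2Vec(sentences, vector_size=100, window=2, epochs=5, workers=4)
broad = Word2Vec(sentences, vector_size=100, window=15, epochs=5, workers=4)

# With a narrow window, neighbors lean toward drop-in syntactic substitutes;
# with a broad window, they drift toward topically-related words.
print(narrow.wv.most_similar("bank", topn=5))
print(broad.wv.most_similar("bank", topn=5))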
‘Embedding’ means a semantic vector representation, e.g. how to represent words such that synonyms are nearer than antonyms or other unrelated words.
Embedding algorithms like word2vec map entities, be they e-commerce items or words (say, in the English language), to N-dimensional vectors.
Now, since you have a mathematical representation of the entities in a Euclidean space, you can use the associated semantics, such as the distance between vectors. For example: for a given item, say ‘Levis Jeans’, recommend the most related items, which are often co-purchased with it.
This can be done easily: search for the nearest vectors to the vector of ‘Levis Jeans’ and recommend them. You will find that the nearest vectors correspond to items such as T-shirts etc., which are relevant to the Levis Jeans. Similarly, it preserves distance/similarity between words, e.g.: King - Queen = Man - Woman!
Yes, word2vec captures such co-occurrence relationships when mapping items/words to vectors, also called ‘item/word embeddings’.
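The classic way to see that King - Queen = Man - Woman relationship in code is gensim's analogy query (pre-trained GoogleNews vectors assumed here purely for illustration):

import gensim.downloader as api

kv = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" should land near "queen"
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))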
This is not specifically targeted at sentence embeddings, but it nevertheless gives some crucial insights that are extremely relevant to the core logic behind embedding generation.

Getting a score of less than 1 when checking the cosine similarity of the same document

I have used doc2vec to find the similarities between multiple documents, but when I check the same document I used to create my model, shouldn't the score be '1', since the training document and the document to be predicted are the same? Sadly, I am getting a different score when trying to find the similarities. Below is the attached code. Please tell me how to make this right; I can't find what is wrong here.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Tag each tokenized sentence with its index
df['Tagged_data'] = df['sent_to_word_tokenize_text'].apply(lambda x: [TaggedDocument(d, [i]) for i, d in enumerate(x)])

sadguru_model = Doc2Vec(df['Tagged_data'][0], vector_size=1000, window=500, dm=1, min_count=1, workers=2, epochs=100)

# Re-tokenize the same source text and infer a vector for it
test_doc = word_tokenize(' '.join([word for word in df['Sentence_Tokenized_Text'][0]]))

# Look up the training document most similar to the inferred vector
index0 = sadguru_model.docvecs.most_similar(positive=[sadguru_model.infer_vector(test_doc)], topn=1)

# output: index0 = [(4014, 0.5270981788635254)]
Doc2Vec doesn't discover true, unique vectors for every input document. Rather, it progressively learns useful-approximation vectors, using an internal algorithm that itself makes use of a lot of random initialization and random sampling. As a result:
if your training data includes the same document (same words) twice, with different document-ids, they won't get identical vectors
re-inferring vectors on a trained model, with the exact same words as an in-training document, won't result in identical vectors to the same original document
For more info, see the Gensim FAQ questions 11 & 12.
If your data & parameters are sufficient, then you can expect that two identical documents should have "very close" vectors, and a re-inference of the same document-words creates a vector "very close" to the same document in the original training set. (There's no precise definition of "very close", but in a working model, such same-word documents will be closer to each other than other documents in the training set.)
So you should expect 'high' similarities approaching 1.0, but essentially never 1.0 exactly, unless you've made two identical vectors on purpose with a lot of special effort.
However, you're not even seeing that 'very close' result, because it looks like your training parameters (and probably, training corpus) are way out-of-whack compared to normal or best practices. Specifically:
A vector_size=1000 is only appropriate for gigantic datasets, of millions (ideally tens-of-millions) of documents. If you're using vectors larger than your data can fill with meaningful distinctions, your models' results will appear increasingly random - especially in the case of identical or very-similar documents, because now instead of the stochastic, iterative process gradually nudging them to the same 'neighborhood' of values, they could wind up all over the place.
A window=500 is unprecedented. The default is 5, sometimes values up to 20 are used, or occasionally giant values if and only if the documents themselves are tiny, such that the effective window is still just "the whole document of a manageable size". On a real-sized corpus with documents over 500 words, window=500 would be amazingly expensive to calculate & likely result in far-worse vectors than a more typical value.
A min_count=1 is almost always a bad idea. Words that appear only once, or a few times, don't have the variety of subtly-varying uses that are needed for Doc2Vec (& related algorithms like Word2Vec, FastText, etc) to learn meaningful representations. Instead, single/rare uses contribute weird nonrepresentative examples, and often just function as noise preventing other words with enough examples from being better-understood. Far more people should be increasing the value over 5, as their training data grows, than reducing it.
An epochs=100 is highly uncommon, mostly used if struggling to squeeze some results from insufficient data by intensively re-training on it. (The cases where that makes the most sense would also be those where, due to small data, you decrease the vector_size to below the default of 100.) For Doc2Vec, epochs of 10-20 is most common in published results.
Try a vector_size no larger than the square root of the count of unique documents you have, leave the min_count at its default (or at least 2), leave the window at its default (unless you specifically have very-small documents), and try epochs=20 (unless you have very few documents and find improvement with slightly more).
Then you'll likely find your self-similarity test to return some high value – perhaps 0.9 or more – rather than 0.52, but still not 1.0.
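As a hedged sketch of that advice, here's a self-similarity check with more conventional parameters, using gensim's tiny built-in common_texts corpus as a stand-in for real data (which is why min_count=1 and a very small vector_size appear here):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts   # tiny toy corpus, stand-in for real documents

tagged_docs = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]

model = Doc2Vec(
    tagged_docs,
    vector_size=3,   # roughly sqrt of the 9 toy documents; use ~100+ on real corpora
    window=5,        # the default
    min_count=1,     # only because the toy corpus is tiny; use 2 or more on real data
    epochs=20,
    workers=2,
)

# Re-infer one of the training documents. On an adequately-trained model, expect its own tag
# back with a high, but not exactly 1.0, similarity; on this toy corpus results will be noisy.
# (model.dv is named model.docvecs in older gensim versions.)
inferred = model.infer_vector(tagged_docs[0].words)
print(model.dv.most_similar([inferred], topn=1))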

Questions about standardizing and scaling

I am trying to generate a model that uses several physico-chemical properties of a molecule (incl. number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS Regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The types and ranges of the features vary. Some are int64 while others are float. Some features generally have small (positive or negative) values, while others have very large values. I have tried using various scalers (e.g. standard scaler, normalize, minmax scaler, etc.). Yet, the R2/Q2 are still low. I have a few questions:
Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?
If yes, and I identify some important features (by expert knowledge), is it OK to scale all features except those? Or to scale only the important features?
Some of the features, although not always correlated, have values that are in a similar range (e.g. 100-400), compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that are within the same range?
The whole idea of scaling is to make models more robust to analysis of the feature space. For example, if you have two features given as 5 kg and 5000 g, we know both are the same, but algorithms that are sensitive to the metric space, such as KNN, PCA, etc., will be weighted more towards the second feature, so scaling must be done for those algorithms.
Now, coming to your questions:
Scaling doesn't affect the significance of features. As explained above, it helps in a better analysis of the data.
No, you should not, for the reason explained above.
If you want to include domain knowledge in your model, you can use it as prior information; for a linear model, this is the same as regularization. If you think you have many useless features, you can use L1 regularization, which creates a sparsity effect on the feature space, i.e. it assigns 0 weight to useless features.
One more point: some methods, such as tree-based models, don't need scaling. In the end, it mostly depends on the model you choose.
Lose significance? Yes. Contribute less? No.
No, it's not OK. It's either all or nothing.
No. The idea of scaling is not to decrease / increase significance / effect of a variable. It's to transform all variables to a common scale that can be interpreted.
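A minimal sketch of the 'all or nothing' approach both answers describe: scale every feature inside a scikit-learn pipeline feeding PLS regression, so the scaler is fit only on each training fold. The feature matrix below is synthetic, purely for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the molecular descriptors, with deliberately mixed scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * [1, 10, 100, 1000, 0.1, 5]
y = 2 * X[:, 0] + 0.01 * X[:, 2] + rng.normal(size=200)

model = make_pipeline(StandardScaler(), PLSRegression(n_components=3))

# Cross-validated R^2; every feature is standardized, within each fold
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())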

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or establishing features that give rise to some threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions, with say 20 topics (depending on your dataset's coverage of different topics). Assign the top r% of documents scoring highest on the relevant topics as relevant and the bottom nr% as non-relevant. Then train a classifier over those labelled documents (a simple baseline classifier is sketched after this list).
Just use bag-of-words and retrieve the top r nearest neighbours to your query (your conceptual space) as relevant and the bottom nr percent as not relevant, and train a classifier over them.
If you had the citations, you could run label propagation over the citation graph by labelling very few papers.
Don't forget to make the title words distinct from your abstract words, by changing each title word to title_word1, so that any classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval, as implemented in Lucene). Then you can manually go down your ranked results until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more literature.
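As a concrete starting point for the 'train a classifier over those labelled documents' steps above, here's a hedged sketch of a simple supervised baseline on title+abstract text. The tiny texts/labels lists are placeholders for the ~8,000 hand-coded articles, and all parameter choices are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Tiny placeholder sample, repeated only so the sketch runs end-to-end;
# in practice these would be the hand-coded title+abstract strings and their 0/1 codes.
relevant = ["workplace learning and skill development in organisations",
            "informal learning at work among nurses",
            "training transfer and on-the-job learning outcomes"]
irrelevant = ["crystal structure of a novel zeolite catalyst",
              "bayesian inference for cosmological parameters",
              "gene expression in drosophila wing development"]
texts = relevant * 5 + irrelevant * 5
labels = [1] * 15 + [0] * 15

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),   # consider min_df=5 on the real corpus
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Cross-validated F1 on the labelled sample gives a first read on whether 8,000 labels suffice
print(cross_val_score(clf, texts, labels, cv=3, scoring="f1").mean())

clf.fit(texts, labels)
# clf.predict(unlabelled_texts) would then score the remaining ~800k articles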

Resources