Why word embedding technique works - nlp

I have look into some word embedding techniques, such as
CBOW: from context to single word. Weight matrix produced used as embedding vector
Skip gram: from word to context (from what I see, its acutally word to word, assingle prediction is enough). Again Weight matrix produced used as embedding
Introduction to these tools would always quote "cosine similarity", which says words of similar meanning would convert to similar vector.
But these methods all based on the 'context', account only for words around a target word. I should say they are 'syntagmatic' rather than 'paradigmatic'. So why the close in distance in a sentence indicate close in meaning? I can think of many counter example that frequently occurs
"Have a good day". (good and day are vastly different, though close in distance).
"toilet" "washroom" (two words of similar meaning, but a sentence contains one would unlikely to contain another)
Any possible explanation?

This sort of "why" isn't a great fit for StackOverflow, but some thoughts:
The essence of word2vec & similar embedding models may be compression: the model is forced to predict neighbors using far less internal state than would be required to remember the entire training set. So it has to force similar words together, in similar areas of the parameter space, and force groups of words into various useful relative-relationships.
So, in your second example of 'toilet' and 'washroom', even though they rarely appear together, they do tend to appear around the same neighboring words. (They're synonyms in many usages.) The model tries to predict them both, to similar levels, when typical words surround them. And vice-versa: when they appear, the model should generally predict the same sorts of words nearby.
To achieve that, their vectors must be nudged quite close by the iterative training. The only way to get 'toilet' and 'washroom' to predict the same neighbors, through the shallow feed-forward network, is to corral their word-vectors to nearby places. (And further, to the extent they have slightly different shades of meaning – with 'toilet' more the device & 'washroom' more the room – they'll still skew slightly apart from each other towards neighbors that are more 'objects' vs 'places'.)
Similarly, words that are formally antonyms, but easily stand-in for each-other in similar contexts, like 'hot' and 'cold', will be somewhat close to each other at the end of training. (And, their various nearer-synonyms will be clustered around them, as they tend to be used to describe similar nearby paradigmatically-warmer or -colder words.)
On the other hand, your example "have a good day" probably doesn't have a giant influence on either 'good' or 'day'. Both words' more unique (and thus predictively-useful) senses are more associated with other words. The word 'good' alone can appear everywhere, so has weak relationships everywhere, but still a strong relationship to other synonyms/antonyms on an evaluative ("good or bad", "likable or unlikable", "preferred or disliked", etc) scale.
All those random/non-predictive instances tend to cancel-out as noise; the relationships that have some ability to predict nearby words, even slightly, eventually find some relative/nearby arrangement in the high-dimensional space, so as to help the model for some training examples.
Note that a word2vec model isn't necessarily an effective way to predict nearby words. It might never be good at that task. But the attempt to become good at neighboring-word prediction, with fewer free parameters than would allow a perfect-lookup against training data, forces the model to reflect underlying semantic or syntactic patterns in the data.
(Note also that some research shows that a larger window influences word-vectors to reflect more topical/domain similarity – "these words are used about the same things, in the broad discourse about X" – while a tiny window makes the word-vectors reflect a more syntactic/typical similarity - "these words are drop-in replacements for each other, fitting the same role in a sentence". See for example Levy/Goldberg "Dependency-Based Word Embeddings", around its Table 1.)

‘Embedding’ mean a semantic vector representation. e.g. how to represent words such that synonyms are nearer than antonyms or other unrelated words.
Embeddings algorithms like Word2vec maps entities be it e-commerce
items or words (say in English language), to N-dimensional vectors.
Now since you have a mathematical representation of the entities in
a Euclidean space, you can use associated semantics such as distance
between vectors. e.g:
For a given item say ‘Levis Jeans’ recommend the most related items
which are often co-purchased with it.
This can be easily done: search the nearest vectors to the vector of
‘Levis Jeans’, and recommend them. You will find that the nearest
vectors correspond to items such as T-shirts etc., which are
relevant to the Levis Jeans. Similarly it preserves
distance/similarity between words e.g.: King - Queen = Man - Woman !
Yes, Word2vec captures such co-occurrance relationships, when
mapping the items/words to vectors also called as ‘item/word
embeddings’.
This is not specifically targeted to sentence embeddings but nevertheless here you get some crucial insights extremely relevant to the core logic behind embedding generation. Read till the end.

Related

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the feelings of new words? Any other use you have in mind I will be happy to share.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe, you could go out to try to find extra texts with those new words, such as vis dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occuring most often near the unknown word, and look up your prior labels, and avergae them together (perhaps weighted by the number of occurrences.)
You'll then have a number for the new word.
Alternatively, you could train a word2vec set-of-word-vectors for all of your texts, including the unknown & know words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniqyes are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc, then you should discard your labels of individual words an acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

Does adding a list of Word2Vec embeddings give a meaningful represenation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
Separately:
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmatical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector 1st - and while they're more expensive to calculate, sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point,, you may want to try that as well.

Using Word2Vec for polysemy solving problems

I have some questions about Word2Vec:
What determines the dimension of the result model vectors?
What is elements of this vectors?
Can I use Word2Vec for polysemy solving problems (state = administrative unit vs state = condition), if I already have texts for every meaning of words?
(1) You pick the desired dimensionality, as a meta-parameter of the model. Rigorous projects with enough time may try different sizes, to see what works best for their qualitative evaluations.
(2) Individual dimensions/elements of each word-vector (floating-point numbers), in vanilla word2vec are not easily interpretable. It's only the arrangement of words as a whole that has usefulness – placing similar words near each other, and making relative directions (eg "towards 'queen' from 'king'") match human intuitions about categories/continuous-properties. And, because the algorithms use explicit randomization, and optimized multi-threaded operation introduces thread-scheduling randomness to the order-of-training-examples, even the exact same data can result in different (but equally good) vector-coordinates from run-to-run.
(3) Basic word2vec doesn't have an easy fix, but there's a bunch of hints of polysemy in the vectors, and research work to do more to disambiguate contrasting senses.
For example, generally more-polysemous word-tokens wind up with word-vectors that are some combination of their multiple senses, and (often) of a smaller-magnitude than less-polysemous words.
This early paper used multiple representations per word to help discover polysemy. Similar later papers like this one use clustering-of-contexts to discover polysemous words then relabel them to give each sense its own vector.
This paper manages an impressive job of detecting alternate senses via postprocessing of normal word2vec vectors.

Are the features of Word2Vec independent each other?

I am new to NLP and studying Word2Vec. So I am not fully understanding the concept of Word2Vec.
Are the features of Word2Vec independent each other?
For example, suppose there is a 100-dimensional word2vec. Then the 100 features are independent each other? In other words, if the "sequence" of the features are shuffled, then the meaning of word2vec is changed?
Word2vec is a 'dense' embedding: the individual dimensions generally aren't independently interpretable. It's just the 'neighborhoods' and 'directions' (not limited to the 100 orthogonal axis dimensions) that have useful meanings.
So, they're not 'independent' of each other in a statistical sense. But, you can discard any of the dimensions – for example, the last 50 dimensions of all your 100-dimensional vectors – and you still have usable word-vectors. So in that sense they're still independently useful.
If you shuffled the order-of-dimensions, the same way for every vector in your set, you've then essentially just rotated/reflected all the vectors similarly. They'll all have different coordinates, but their relative distances will be the same, and if "going toward word B from word A" used to vaguely indicate some human-understandable aspect like "largeness", then even after performing your order-of-dimensions shuffle, "going towards word B from word A" will mean the same thing, because the vectors "thataway" (in the transformed coordinates) will be the same as before.
The first thing to understand here is that how word2Vec is formalized. Shifting away from traditional representations of words, the word2vec model tries to encode the meaning of the world into different features. For eg lets say every word in the english dictionary can be manifested in a set of say '4' features. The features could be , lets say "f1":"gender", "f2":"color","f3":"smell","f4":"economy".
So now when a word2vec vector is written , what it signifies is how much manifestation of a particular feature it has. Lets take an example to understand this. Consider a Man(V1) who is dark,not so smelly and is not very rich and is neither poor. Then the first feature ie gender is represented as 1 (since we are taking 1 as male and -1 as female). The second feature color is -1 here as it is exactly opposite to white (which we are taking as 1). Smell and economy are similary given 0.3 and 0.4 values.
Now consider another man(V2) who also has the same anatomy and social status like the first man. Then his word2vec vector would also be similar.
V1=>[1,-1,0.3,0.4]
V2=>[1,-1,0.4,0.3]
This kind of representation helps us represent words into features that are independent or orthogonal to each other.The orthogonality helps in finding similarity or dissimilarity based on some mathematical operation lets say cosine dot product.
The sequence of the number in a word2vec is important since every number represents the weight of a particular feature: gender, color,smell,economy. So shuffling the positions would result in a completely different vector

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine using algorithms would help me identify which are the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few state which is somewhat interesting. But, what I really want to know the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams to practical n and then counting how many n-gram states each user has. Then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (if each state has 729 possibilities, and we're using trigrams, thats a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
Survival Style
Suggested by Iterator, I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not event accommodate variables which change over time (its good for differentiating between static attributes like gender and ethnicity)- so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, need clarity on how to identify patterns which lead to conversion instead a models predicts likely hood of conversion after a given sequence.
Theoretically, hidden markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider to limit your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user ise "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization techniques such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support machine or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert or will may drop out of the population, or will continue to appear in the data and not (yet) fall into neither camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman Filter may be more appealing. It is a generalization of HMMs, suggested by #AmaçHerdağdelen, which work for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likely hood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think Ngramm is most promising approach, because all sequnce in data mining are treated as elements depndent on few basic steps(HMM, CRF, ACRF, Markov Fields) So I will try to use classifier based on 1-grams and 2 -grams.

Resources