the difference between TF-IDF and TF in SVM linear kernel - svm

The IDF is a constant number for each term, so every value in that dimension gets multiplied by the same constant.
With an SVM linear kernel, will the result be any different?

Your initial question doesn't really make sense. You mix up two different worlds:
1) TF/IDF: features for text representation
2) SVM - Linear Kernel: The simplest approach for SVMs (indeed used for text).
The difference between TF and TF/IDF is whether the corpus frequencies of words are used or not. TF/IDF is by far the better choice, independent of the classifier.
Using only TF, we don't really care whether a word is common or not. Thus, common words such as articles receive a large weight even if they contribute no real information.
In TF/IDF, the more frequent a word is in the corpus, the smaller the weight it receives. Thus, common words like articles receive small weights, while rare words, which are assumed to carry more information, receive larger weights.
N.B. In the above, articles are used only as an example; they should normally be removed in a preprocessing step.
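For concreteness, here is a minimal sketch (scikit-learn assumed; the thread itself names no library) of the two feature options feeding the same linear-kernel SVM:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["the cat sat on the mat", "stocks fell on the finance news"]
labels = ["pets", "finance"]

tf_clf = make_pipeline(CountVectorizer(), LinearSVC())      # plain term frequencies (TF)
tfidf_clf = make_pipeline(TfidfVectorizer(), LinearSVC())   # TF rescaled by IDF

tf_clf.fit(docs, labels)
tfidf_clf.fit(docs, labels)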

Related

Finding both target and center word2vec matrices

I've read and heard (in Stanford's CS224) that the Word2Vec algorithm actually trains two matrices (that is, two sets of vectors): the U set and the V set, one for words being a target and one for words being the context. The final output is the average of these two.
I have two questions in mind. The first is:
Why do we take an average of the two vectors? Why does that make sense? Don't we lose some information?
The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et al. of Microsoft Research (in "Dual Embedding Space Models" & associated writeups), has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.
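For example, a minimal sketch (gensim assumed; attribute names vary somewhat between gensim versions) of reading both the input-side vector and the output-side row for one word:

from gensim.models import Word2Vec

sentences = [["my", "cat", "is", "running"], ["your", "dog", "is", "running"]]
model = Word2Vec(sentences, vector_size=50, negative=5, min_count=1)

input_vec = model.wv["running"]          # the usual word-vector (projection matrix row)
row = model.wv.key_to_index["running"]
output_vec = model.syn1neg[row]          # output-side weights for the same word (negative-sampling mode)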

What is the stochastic aspect of Word2Vec?

I'm vectorizing words on a few different corpora with Gensim and am getting results that are making me rethink how Word2Vec functions. My understanding was that Word2Vec was deterministic, and that the position of a word in a vector space would not change from training to training. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.
However, I've found that the value does indeed vary across models, and words keep changing where they sit in the vector space when I retrain the model. The differences are not always hugely meaningful, but they do indicate the existence of some random process. What am I missing here?
This is well-covered in the Gensim FAQ, which I quote here:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization
during training. (For example, the training windows are randomly
truncated as an efficient way of weighting nearer words higher. The
negative examples in the default negative-sampling mode are chosen
randomly. And the downsampling of highly-frequent words, as controlled
by the sample parameter, is driven by random choices. These
behaviors were all defined in the original Word2Vec paper's algorithm
description.)
Even when all this randomness comes from a
pseudorandom-number-generator that's been seeded to give a
reproducible stream of random numbers (which gensim does by default),
the usual case of multi-threaded training can further change the exact
training-order of text examples, and thus the final model state.
(Further, in Python 3.x, the hashing of strings is randomized each
re-launch of the Python interpreter - changing the iteration ordering
of vocabulary dicts from run to run, and thus making even the same
string-of-random-number-draws pick different words in different
launches.)
So, it is to be expected that models vary from run to run, even
trained on the same data. There's no single "right place" for any
word-vector or doc-vector to wind up: just positions that are at
progressively more-useful distances & directions from other vectors
co-trained inside the same model. (In general, only vectors that were
trained together in an interleaved session of contrasting uses become
comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as
useful, from run-to-run, as each other. Testing and evaluation
processes should be tolerant of any shifts in vector positions, and of
small "jitter" in the overall utility of models, that arises from the
inherent algorithm randomness. (If the observed quality from
run-to-run varies a lot, there may be other problems: too little data,
poorly-tuned parameters, or errors/weaknesses in the evaluation
method.)
You can try to force determinism, by using workers=1 to limit
training to a single thread – and, if in Python 3.x, using the
PYTHONHASHSEED environment variable to disable its usual string hash
randomization. But training will be much slower than with more
threads. And, you'd be obscuring the inherent
randomness/approximateness of the underlying algorithms, in a way that
might make results more fragile and dependent on the luck of a
particular setup. It's better to tolerate a little jitter, and use
excessive jitter as an indicator of problems elsewhere in the data or
model setup – rather than impose a superficial determinism.
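A minimal sketch of that workaround (gensim's Word2Vec API assumed; the tiny corpus is only a placeholder):

# PYTHONHASHSEED must be set before the interpreter starts, e.g.:
#   PYTHONHASHSEED=0 python train.py
from gensim.models import Word2Vec

sentences = [["my", "cat", "is", "running"], ["your", "dog", "cant", "be", "running"]]
model = Word2Vec(
    sentences,
    vector_size=100,
    min_count=1,
    seed=42,     # fixed RNG seed for reproducible initialization
    workers=1,   # a single worker thread avoids nondeterministic example ordering
)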
While I don't know any implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network with an embedding layer as the first layer. The weight matrix of this embedding layer contains the word vectors that we are interested in.
This being said, it is in general also quite common to initialize the weights of a neural network randomly. So there you have the origin of your randomness.
But how can the results be different, regardless of different (random) starting conditions?
A well trained model will assign similar vectors to words that have similar meaning. This similarity is measured by the cosine of the angle between the two vectors. Mathematically speaking, if v and w are the vectors of two very similar words then
dot(v, w) / (norm(v) * norm(w)) # this formula gives you the cosine of the angle between v and w (norm = Euclidean length)
will be close to 1.
Also, it will allow you to do arithmetics like the famous
king - man + woman = queen
For illustration purposes imagine 2D-vectors. Would these arithmetical properties get lost if you e.g. rotate everything by some angle around the origin? With a little mathematical background I can assure you: No, they won't!
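A tiny numpy check of that claim (illustrative only, not part of the original answer):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2D rotation about the origin

v = np.array([1.0, 2.0])
w = np.array([2.0, 1.5])

print(cosine(v, w))          # some value ...
print(cosine(R @ v, R @ w))  # ... and the same value after rotating both vectors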
So, your assumption
If "My cat is running" and "your dog can't be running" are the two
sentences in the corpus, then the value of "running" (or its stem)
seems necessarily fixed.
is wrong. The value of "running" is not fixed at all. What is (somehow) fixed, however, is the similarity (cosine) and arithmetical relationship to other words.

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency counts how many times a given word appeared across all documents, and document-frequency counts in how many documents the given word appeared.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (after stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?
A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class-labeled documents, where for each document D we know its class C.
P(C|D) ∝ P(D|C) * P(C) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times does a term occur in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor, the class prior P(C), you can estimate it by counting how many labeled documents you have from each class; maybe you have more sports articles than finance ones, which makes you believe that an unseen document is more likely to be classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plugin probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of Naive Bayes classifier can be found in the Stanford NLP package, Weka and Scipy among many others.
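As a minimal sketch of such a classifier (scikit-learn assumed; the answer itself points to Stanford NLP, Weka and Scipy):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["the team won the football match", "stocks fell on the finance news"]
train_labels = ["sports", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())   # term counts feed the P(t|C) estimates
clf.fit(train_docs, train_labels)
print(clf.predict(["a great football game"]))             # likely ['sports']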
It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurrences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurrence of several tokens is taken into account.
Bag of words + grammatical features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use the embeddings as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) will work best for your specific application - you have to experiment.
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have a special structure: representations of documents of one class should be close to each other in the sense of the Euclidean distance metric. This can sometimes be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning these transformations is what people call metric learning, and in general it is a problem that is not yet fully solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. Manhattan distance), as are different approaches not based on vector representations of texts. For example, you can calculate the Levenshtein distance between texts, based on the number of edit operations needed to transform one text into another. Or you can calculate the "Word Mover's Distance" - the sum of distances between word pairs with the closest embeddings.
For clustering, basic options are k-means and DBSCAN. Both of these models require your feature space to have this Euclidean property.
For anomaly detection you can use density estimates, which are produced by various probabilistic algorithms: classification (e.g. Naive Bayes or neural networks), clustering (e.g. Gaussian mixture models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential structure of language, estimating the probability of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets); these are called language models, and they are usually more effective than the bag-of-words assumption of Naive Bayes, which ignores word order. Several language models (one per class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have the resources or willingness to run multiple experiments, I would recommend choosing one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
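A minimal sketch of the first of these recipes (scikit-learn assumed): word counts, IDF weighting and L2 normalization, followed by Euclidean distance to the known documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = ["a fox jumps over another fox", "a hunter saw a fox"]
vectorizer = TfidfVectorizer(norm="l2")           # counts * idf, each row scaled to unit L2 norm
X = vectorizer.fit_transform(corpus)

new_doc = vectorizer.transform(["the hunter saw another fox"])
print(euclidean_distances(new_doc, X))            # smaller distance = more similar document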
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop stop-words. It is even probable that including stop-words or not would not make much of a difference: the Inverse Document Frequency measure automatically downweights stop-words, since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
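A minimal illustration of both points (scikit-learn assumed), including a new document made entirely of unknown terms:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_docs = ["fox jump another fox", "hunter see fox"]     # already stemmed, stop-words dropped
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(known_docs)

new_docs = vectorizer.transform(["hunter love fox", "deer are funny"])
print(cosine_similarity(new_docs, X))
# first row: nonzero similarity to both articles (shares "hunter" and "fox")
# second row: all zeros, because none of its terms appear in the vocabulary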
When I search on df-idf I find nothing.
tf-idf with cosine similarity is a very accepted and common practice.
Filtering out stop words does not break it. For common words, idf gives them low weight anyway.
tf-idf is used by Lucene.
I don't get why you want to reinvent the wheel here.
I don't get why you think the sum of df and idf is a similarity measure.
For classification, do you have some predefined classes and sample documents to learn from? If so, you can use Naive Bayes, with tf-idf.
If you don't have predefined classes, you can use k-means clustering, with tf-idf.
It depends a lot on your knowledge of the corpus and the classification objective. With something like litigation-support documents produced to you, you have no advance knowledge of the content. In the Enron case, they used names of raptors for a lot of the bad stuff, and there is no way you would have known that up front. k-means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits, stemming makes that very complex, since the stem will not be the same length as the original word.
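A minimal sketch of that clustering route (scikit-learn assumed): k-means over tf-idf vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "fox jump another fox",
    "hunter see fox",
    "deer are funny",
    "deer jump high",
]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # one cluster id per document; similar documents share an id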
Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.
While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings and many different use cases. Two texts can talk about the same things while using few to no matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following API can help. It is also doable to build something similar with dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there, comparing the arrays of synonyms will give much better accuracy and far fewer false positives.
A sufficient solution, in a possibly similar task:
Use a binary bag-of-words (BOW) approach for the vector representation (frequent words are not weighted higher than rare words), rather than a real TF approach.
The "word2vec" embedding approach is sensitive to sequence and distance effects. Depending on your hyper-parameters, it might make a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter'... so you have to decide whether this adds noise to your task, or, alternatively, use it only as an averaged vector over all of your text.
Extract words with high within-sentence correlation (e.g., by using mean-normalized cosine similarities between the variables).
Second step: use this list of highly correlated words as a positive list, i.e. as the new vocabulary for a new binary vectorizer (see the sketch after this list).
This isolates meaningful words for the second-step cosine comparisons - in my case, even for rather small amounts of training text.
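A minimal sketch of those two steps (scikit-learn assumed; the positive list here is a hypothetical stand-in for the high-correlation words):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

positive_list = ["fox", "hunter", "jumps"]                  # hypothetical high-correlation vocabulary
vectorizer = CountVectorizer(binary=True, vocabulary=positive_list)

X = vectorizer.fit_transform(["a fox jumps over another fox", "a hunter saw a fox"])
print(X.toarray())           # 0/1 presence indicators, not raw counts
print(cosine_similarity(X))  # second-step cosine comparison on the restricted vocabulary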

reducing word2vec dimension from Google News Vector Dataset

I loaded Google's News vectors (300-dimensional) dataset. Each word is represented by a 300-dimensional vector. I want to use this in my neural network for classification, but 300 dimensions per word seems too big. How can I reduce the vectors from 300 to, say, 100 dimensions without compromising quality?
tl;dr Use a dimensionality reduction technique like PCA or t-SNE.
This is not a trivial operation that you are attempting. In order to understand why, you must understand what these word vectors are.
Word embeddings are vectors that attempt to encode information about what a word means, how it can be used, and more. What makes them interesting is that they manage to store all of this information as a collection of floating point numbers, which is nice for interacting with models that process words. Rather than pass a word to a model by itself, without any indication of what it means, how to use it, etc, we can pass the model a word vector with the intention of providing extra information about how natural language works.
As I hope I have made clear, word embeddings are pretty neat. Constructing them is an area of active research, though there are a couple of ways to do it that produce interesting results. It's not incredibly important to this question to understand all of the different ways, though I suggest you check them out. Instead, what you really need to know is that each of the values in the 300 dimensional vector associated with a word were "optimized" in some sense to capture a different aspect of the meaning and use of that word. Put another way, each of the 300 values corresponds to some abstract feature of the word. Removing any combination of these values at random will yield a vector that may be lacking significant information about the word, and may no longer serve as a good representation of that word.
So, picking the top 100 values of the vector is no good. We need a more principled way to reduce the dimensionality. What you really want is to sample a subset of these values such that as much information as possible about the word is retained in the resulting vector. This is where a dimensionality reduction technique like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) comes into play. I won't describe in detail how these methods work, but essentially they aim to capture the essence of a collection of information while reducing the size of the vector describing said information. As an example, PCA does this by constructing a new vector from the old one, where the entries in the new vector correspond to combinations of the main "components" of the old vector, i.e. those components which account for most of the variety in the old data.
To summarize, you should run a dimensionality reduction algorithm like PCA or t-SNE on your word vectors. There are a number of Python libraries that implement both (e.g. scikit-learn has a PCA implementation). Be warned, however, that the dimensionality of these word vectors is already relatively low. To see why, consider the task of naively representing a word via its one-hot encoding (a one in one spot and zeros everywhere else). If your vocabulary is as big as the Google word2vec model's, then each word is suddenly associated with a vector containing millions of entries! As you can see, the dimensionality has already been reduced significantly to 300, and any reduction that makes the vectors significantly smaller is likely to lose a good deal of information.
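A minimal sketch of the PCA route (gensim + scikit-learn assumed; the file path is a placeholder and attribute names differ slightly across gensim versions):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

vocab = list(word_vectors.key_to_index)[:50000]        # a subset keeps memory manageable
matrix = np.stack([word_vectors[w] for w in vocab])    # shape (50000, 300)

pca = PCA(n_components=100)
reduced = pca.fit_transform(matrix)                    # shape (50000, 100)
print(pca.explained_variance_ratio_.sum())             # fraction of variance retained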
#narasimman I suggest that you simply keep the top 100 numbers in the output vector of the word2vec model. The output is of type numpy.ndarray so you can do something like:
>>> word_vectors = KeyedVectors.load_word2vec_format('modelConfig/GoogleNews-vectors-negative300.bin', binary=True)
>>> type(word_vectors["hello"])
<type 'numpy.ndarray'>
>>> word_vectors["hello"][:10]
array([-0.05419922, 0.01708984, -0.00527954, 0.33203125, -0.25 ,
-0.01397705, -0.15039062, -0.265625 , 0.01647949, 0.3828125 ], dtype=float32)
>>> word_vectors["hello"][:2]
array([-0.05419922, 0.01708984], dtype=float32)
I don't think that this will screw up the result if you do it to all the words (not sure though!)

What is relation between tsne and word2vec?

As far as I know, t-SNE reduces the dimensionality of word vectors.
Word2vec generates a word-embedding model from a huge amount of data.
What is the relation between the two?
Does Word2vec use t-SNE internally?
(I use Word2vec from Gensim)
Internally they both use gradient-descent to reach their final optimized states. And both can be considered dimensionality-reduction operations. But, word2vec does not internally use t-SNE (or vice-versa).
t-SNE ("t-distributed stochastic neighbor embedding") typically reduces many-dimensional data to 2- or 3-dimensions, for the purposes of plotting a visualization. It involves learning a mapping from the original dimensionality, to the fewer dimensions, which still keeps similar points near each other.
word2vec takes many text examples and learns a shallow neural-network that's good at predicting words from nearby words. A particular layer of that neural-network's weights, which represent individual words, then becomes the learned N-dimensional word-vectors, with the value of N often 100 to 600.
(There's an alternative way to create word-vectors called GloVe that works a little more like t-SNE, in that it trains directly from the high-dimensional co-occurrence matrix of words, rather than from the many in-context co-occurrence examples. But it's still not t-SNE itself.)
You could potentially run t-SNE with a target dimensionality of 100-400. But since that end-result wouldn't yet yield nice plots, the maintenance of 'nearness' that's central to t-SNE won't have delivered its usual intended benefit.
You could potentially learn word2vec (or GloVe) vectors of just 2 or 3 dimensions, but most of the useful similarities/arrangements that people seek from word-vectors would be lost in the crowding. And in a plot, you'd probably not see as strong a visual 'clumping' of related-word categories, because t-SNE's specific high-to-low-dimensionality nearness-preservation goal wasn't applied.
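To make the typical relationship concrete, a minimal sketch (gensim + scikit-learn assumed) of the usual pairing: train word2vec first, then apply t-SNE only to get 2-D coordinates for a plot:

import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

sentences = [["my", "cat", "is", "running"], ["your", "dog", "is", "running"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

words = list(model.wv.key_to_index)
vectors = np.stack([model.wv[w] for w in words])             # (n_words, 100)

coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)
for word, (x, y) in zip(words, coords):
    print(word, x, y)                                        # plot these points for the usual word map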
