Latent Semantic Analysis: How to choose component number to perform TruncatedSVD - scikit-learn

I am practicing using LSA to classify the Enron dataset (all emails). My understanding is that, to perform any further classification or clustering successfully, I need to compute a lower-rank approximation with TruncatedSVD that retains as much of the variance as possible.
I have done all the pre-processing I could think of, including 1) removing all punctuation, 2) removing words shorter than 2 characters, 3) removing documents smaller than 1500 bytes (tf-idf works better with longer text), and 4) removing stop words.
However, if I set the number of components to 100, as scikit-learn suggests for LSA, I only capture 35% of the variance (svd.explained_variance_ratio_.sum()). With 2000 components I can get 80%. (I read somewhere that retaining 90% of the variance is the usual recommendation?)
So my questions are: to perform a successful LSA, 1) how do I test and pick the number of components, 2) is such a high component count normal, and 3) is there anything I can do to increase the retained variance while keeping the number of components low?
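For context, a minimal sketch of how such a scan might look; load_enron_texts() is a hypothetical placeholder for the cleaned corpus, and fitting once with the largest candidate k then taking the cumulative sum of the ordered ratios approximates fitting each k separately:
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = load_enron_texts()   # hypothetical loader for the cleaned emails
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Components are ordered by explained variance, so the cumulative sum at
# position k-1 shows roughly what a k-component model would retain.
svd = TruncatedSVD(n_components=2000, random_state=0).fit(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)
for k in (100, 300, 500, 1000, 2000):
    print(k, round(float(cumulative[k - 1]), 3))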

Related

What are the negative & sample parameters?

I am new to NLP and Doc2Vec, and I want to understand its parameters. Thank you.
Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, sample=0, seed=0)
vector_size: I believe this controls over-fitting. A larger feature vector will learn more details, so it tends to over-fit. Is there a method to determine an appropriate vector size based on the number of documents or the total words across all documents?
negative: how many "noise words" should be drawn. What is a noise word?
sample: the threshold for configuring which higher-frequency words are randomly down-sampled. So what does sample=0 mean?
As a beginner, vector_size is the only parameter of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There are no hard-&-fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frequent words are sometimes randomly skipped (down-sampled). (So many redundant usage examples are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the down-sampling more aggressive (dropping more words). sample=0 would turn off such down-sampling completely, leaving all training-text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add back interleaved skip-gram word-vector training).
hs is an alternate mode to train the neural-network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable the negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data - so it's rare to use this mode.
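Putting those pieces together, a minimal sketch of a PV-DBOW setup along the lines discussed; corpus stands in for your own iterable of TaggedDocument objects, and the specific values are illustrative rather than prescriptive:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    documents=corpus,   # assumption: your iterable of TaggedDocument objects
    dm=0,               # PV-DBOW
    dbow_words=1,       # optionally also train word-vectors via interleaved skip-gram
    vector_size=300,
    negative=5,         # 5 noise words per positive example (the default)
    hs=0,               # hierarchical softmax off when negative sampling is used
    sample=1e-3,        # down-sampling threshold; smaller = more aggressive, 0 = off
    epochs=20,
    seed=0,
)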

Scikit-learn models give weight to a random variable? Should I remove features with less importance?

I do some feature selection by removing correlated variables and by backwards elimination. However, after all that was done, as a test I threw in a random variable and then trained logistic regression, random forest and XGBoost. All 3 models give the random feature an importance greater than 0. First, how can that be? Second, all models rank it toward the bottom, but it's not the lowest feature. Is this a valid step for another round of feature selection, i.e. removing all features that score below the random feature?
The random feature is created with
import numpy as np
model_data['rand_feat'] = np.random.randint(100, size=model_data.shape[0])
This can happen. What is random is the number you sample, but that random sampling can still generate a pattern by chance. I don't know whether you are doing classification or regression, but let's consider the simple example of binary classification. We have classes 1 and 0 with 1000 data points from each. When you sample a random number for each data point, it can happen that, for example, the majority of class 1 gets a value higher than 50, whereas the majority of class 0 gets a random number smaller than 50.
So, in effect, this might result in some pattern. I would guess that every time you run your code the random feature's importance changes. It is always ranked low because it is very unlikely that a good pattern is generated (e.g. all 1s getting values higher than 50 and all 0s getting values lower than 50).
Finally, yes, you should consider dropping the features with low importance.
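A minimal sketch of that effect; the synthetic dataset and random-forest model here are stand-ins, not the poster's setup. A purely random column still receives a small, seed-dependent importance:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.integers(0, 100, size=2000)])   # append a purely random feature

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
print(clf.feature_importances_)   # last entry is small but > 0, and shifts if you change the seeds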
I agree with berkay's answer that a random variable can have patterns that are by chance associated with your outcome variable. Secondly, I would neither include a random variable in model building nor use it as my filtering threshold, because if the random variable has, by chance, a significant or nearly significant association with the outcome, it will suppress the expression of important features of the original data, and you will probably end up losing those important features.
In early phase of model development I always include two random variables.
For me it is like a 'sanity check' since these are in effect junk variables or junk features.
If any of my features are less important than the junk features, that is a warning sign that I need to look more carefully at the worth of those features or to do some better feature engineering.
For example what does theory suggest about the inclusion of those features?
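A minimal sketch of that sanity check, assuming model_data and target are your existing feature DataFrame and label column (hypothetical names):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
model_data["junk_uniform"] = rng.random(len(model_data))             # junk feature 1
model_data["junk_int"] = rng.integers(0, 100, size=len(model_data))  # junk feature 2

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(model_data, target)
importances = pd.Series(clf.feature_importances_, index=model_data.columns)

junk_level = importances[["junk_uniform", "junk_int"]].max()
suspect = importances.drop(["junk_uniform", "junk_int"])
print(suspect[suspect < junk_level].sort_values())   # features to re-examine, not to drop blindly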

Getting a score of less than 1 when checking the cosine similarity of the same document

I have used doc2vec to find the similarities between multiple documents, but when I check the very document I trained my model on, shouldn't the score be 1, since the training document and the document being predicted are the same? Sadly, I am getting a different score when computing the similarity. The code is attached below. Please tell me how to make this right; I can't find what is wrong here.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

df['Tagged_data'] = df['sent_to_word_tokenize_text'].apply(lambda x: [TaggedDocument(d, [i]) for i, d in enumerate(x)])
sadguru_model = Doc2Vec(df['Tagged_data'][0], vector_size=1000, window=500, dm=1, min_count=1, workers=2, epochs=100)
test_doc = word_tokenize(' '.join([word for word in df['Sentence_Tokenized_Text'][0]]))
# Sadguru model document
index0 = sadguru_model.docvecs.most_similar(positive=[sadguru_model.infer_vector(test_doc)], topn=1)
# output: index0 = [(4014, 0.5270981788635254)]
Doc2Vec doesn't discover true, unique vectors for every input document. Rather, it progressively learns useful-approximation vectors, using an internal algorithm that itself makes use of a lot of random initialization and random sampling. As a result:
if your training data includes the same document (same words) twice, with different document-ids, they won't get identical vectors
re-inferring vectors on a trained model, with the exact same words as an in-training document, won't result in identical vectors to the same original document
For more info, see the Gensim FAQ questions 11 & 12.
If your data & parameters are sufficient, then you can expect that two identical documents should have "very close" vectors, and a re-inference of the same document-words creates a vector "very close" to the same document in the original training set. (There's no precise definition of "very close", but in a working model, such same-word documents will be closer to each other than other documents in the training set.)
So you should expect 'high' similarities approaching 1.0, but essentially never 1.0 exactly, unless you've made two identical vectors on purpose with a lot of special effort.
However, you're not even seeing that 'very close' result, because it looks like your training parameters (and probably, training corpus) are way out-of-whack compared to normal or best practices. Specifically:
A vector_size=1000 is only appropriate for gigantic datasets, of millions (ideally tens-of-millions) of documents. If you're using vectors larger than your data can fill with meaningful distinctions, your models' results will appear increasingly random - especially in the case of identical or very-similar documents, because now instead of the stochastic, iterative process gradually nudging them to the same 'neighborhood' of values, they could wind up all over the place.
A window=500 is unprecedented. The default is 5; sometimes values up to 20 are used, or occasionally giant values, if and only if the documents themselves are tiny, such that the effective window is still just "the whole document" of a manageable size. On a real-sized corpus with documents over 500 words, window=500 would be amazingly expensive to calculate & would likely result in far-worse vectors than a more typical value.
A min_count=1 is almost always a bad idea. Words that appear only once, or a few times, don't have the variety of subtly-varying uses that are needed for Doc2Vec (& related algorithms like Word2Vec, FastText, etc) to learn meaningful representations. Instead, single/rare uses contribute weird nonrepresentative examples, and often just function as noise preventing other words with enough examples from being better-understood. Far more people should be increasing the value over 5, as their training data grows, than reducing it.
An epochs=100 is highly uncommon, mostly used if struggling to squeeze some results from insufficient data by intensively re-training on it. (The cases where that makes the most sense would also be those where, due to small data, you decrease the vector_size to below the default of 100.) For Doc2Vec, epochs of 10-20 is most common in published results.
Try a vector_size no larger than the square root of the count of unique documents you have, leave the min_count at its default (or at least 2), leave the window at its default (unless you specifically have very-small documents), and try epochs=20 (unless you have very few documents and find improvement with slightly more).
Then you'll likely find your self-similarity test to return some high value – perhaps 0.9 or more – rather than 0.52, but still not 1.0.
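A minimal sketch of that re-run, assuming train_corpus is your list of TaggedDocument objects and test_doc the tokens of training document 0 (both hypothetical names; gensim 4.x API):
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    train_corpus,       # assumption: your TaggedDocument list
    vector_size=100,    # keep at or below sqrt(document count); 100 assumes ~10k+ docs
    dm=1,
    epochs=20,
    workers=2,
    seed=0,
)   # window & min_count left at their defaults (5 each)

inferred = model.infer_vector(test_doc)           # test_doc: tokens of training doc 0
print(model.dv.most_similar([inferred], topn=1))  # expect doc 0 near the top, ~0.9+, but not 1.0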

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then I take the GMM with the lowest AIC or BIC of the four, saved as best_eitherAB (not shown). I want to see if the label assignments of the predictions are stable across time (I want to run 1000 iterations), so I know I then need to calculate the entropy, which requires class-assignment probabilities. So I predict the probabilities of the class assignments via the GMM's method,
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each contains 59 rows (sample size) by 2 columns (for the 2 classes). Each inner row of two sums to 1 to make the probability.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers – as many as I have samples, a 2-item numpy matrix for each. I could feed in just one of the 1000 tests and get one small matrix of two items, or I could feed in just a single column and get a single value back. But I don't know what these numbers represent, and they are between 1 and 3.
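For reference, a minimal sketch of how scipy.stats.entropy treats the axes here; probs reuses one of the arrays described above, and base=2 with axis=1 are assumed choices, not something specified in the question:
import numpy as np
from scipy.stats import entropy

probs = all_probabilities[0]                 # one (59, 2) array of class probabilities

# entropy() normalizes and sums over axis 0 by default, so pass axis=1
# (SciPy >= 1.4) to get one value per sample instead of per column.
per_sample = entropy(probs, base=2, axis=1)  # shape (59,); bounded by 1 bit for 2 classes
print(per_sample.mean())                     # 0 = confident assignments, 1 = maximally uncertain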
So my questions are -- am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? If I'm not, what's the best way to find a single number entropy that tells me how good my model selection is?

How to reduce data of unknown size into data of fixed size? Please read details

Example:
Given n images, numbered 1 to n, where n is unknown in advance, I can calculate a property of every image that is a scalar quantity. Now I have to represent this property for all the images in a fixed-size vector (of length, say, 5 or 10).
One naive approach could be the vector [avg, max, min, std_deviation].
I also want to include the effect of the relative positions of those images.
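A minimal sketch of the naive summary-vector idea above; the segment averages used to capture relative position are an illustrative addition, not part of the question:
import numpy as np

def summarize(values, n_bins=5):
    # Map a variable-length list of per-image scalars to a fixed-size vector:
    # global statistics first, then the mean of each of n_bins consecutive
    # segments to coarsely keep order information (assumes len(values) >= n_bins).
    values = np.asarray(values, dtype=float)
    stats = [values.mean(), values.max(), values.min(), values.std()]
    segments = [seg.mean() for seg in np.array_split(values, n_bins)]
    return np.array(stats + segments)        # length 4 + n_bins, independent of n

print(summarize([0.2, 0.9, 0.4, 0.8, 0.1, 0.5, 0.7]))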
What you are looking for is called feature extraction.
There are many techniques for this; for images:
For your purpose try:
PCA
Auto-encoders
Convolutional Auto-encoders
You could also look into conventional (older) methods like SIFT, HOG, or edge detection, but they will all need an extra step to reduce them to a smaller, fixed size.
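For instance, a minimal PCA sketch along those lines; the random array stands in for your images, each flattened to the same pixel count:
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(200, 64 * 64)        # placeholder: n flattened grayscale images
pca = PCA(n_components=10).fit(images)
descriptors = pca.transform(images)          # one fixed-size (10-dim) descriptor per image
print(descriptors.shape)                     # (200, 10)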
