There's an example in the book "Introduction to Information Retrieval" on how to identify the terms in a corpus with the highest mutual information scores for each of six categories of documents.
Is there some implementation in scikit-learn that would let me calculate such scores?
mutual_info_classif seems to calculate only the mutual information between the features and the target variable, according to this example, which is great for feature selection but not for identifying the most relevant terms for each category.
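For concreteness, here is a rough sketch of the kind of per-category scores I have in mind, computing mutual information against a one-vs-rest binary label for each category (docs and labels below are placeholders for my corpus and its category labels):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

# docs: list of raw text documents; labels: one category label per document.
# Both are placeholders for my own data.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
terms = np.array(vectorizer.get_feature_names_out())
labels = np.array(labels)

for category in np.unique(labels):
    y_binary = (labels == category).astype(int)           # one-vs-rest label
    mi = mutual_info_classif(X, y_binary, discrete_features=True)
    print(category, terms[np.argsort(mi)[::-1][:10]])     # top 10 terms per category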
I have developed a pipeline to extract text from documents, preprocess the text, and train a gensim Doc2vec model on the given documents. Given a document in my corpus, I would like to recommend other documents in the corpus.
I want to know how I can evaluate my model without having a pre-defined list of "good" recommendations. Any ideas?
One simple self-check that can be used to catch some big problems with a Doc2Vec model training pipeline, like gross misparameterization or insufficient data/epochs, is to re-infer vectors for the training texts (using .infer_vector()) and check that generally (a rough code sketch follows further below):
the bulk-trained vector for a text is "close to" the vector re-inferred for the same text, e.g. it is the nearest neighbor, or one of the top neighbors, in a .most_similar() lookup on the re-inferred vector
the overall lists of nearest neighbors (from .most_similar()) for the bulk-trained vector and the re-inferred vector are very similar.
They won't necessarily be identical, for reasons explained in Q11 & Q12 of the Gensim Project FAQ, but if they're wildly-different, then something foundational has gone wrong, like:
insufficient (in quantity or quality/form) training data
misparameterizations, like too few epochs or too-large (overfitting-prone) vectors for the quantity of data
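For illustration, here is a rough sketch of that check, assuming gensim 4.x, a trained model, and the original TaggedDocument training corpus (the variable names are placeholders):

# Re-infer each training text and check whether its bulk-trained vector is
# among the top neighbors of the re-inferred vector. `model` is a trained
# gensim Doc2Vec; `train_corpus` is a list of TaggedDocument objects.
# (On gensim versions before 4.0, use model.docvecs instead of model.dv.)
hits = 0
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    neighbors = model.dv.most_similar([inferred], topn=10)
    if doc.tags[0] in [tag for tag, _ in neighbors]:
        hits += 1
print(f"{hits} of {len(train_corpus)} texts rank their own bulk-trained vector "
      f"in the top 10 neighbors of their re-inferred vector")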
Ultimately, though, the variety of data sources, intended uses, and possible dimensions of "recommendation-worthiness" means that you need custom evaluation inputs, based on your project's needs, usually from the intended audience (or your own ability to simulate/represent it).
In the original paper introducing the "Paragraph Vector" algorithm (what's inside the Doc2Vec class), and a followup evaluating it on Wikipedia & arXiv articles, several of the evaluations used triplets of documents, where 2 of the 3 were presumed to be similar based on some preexisting system's groupings, and the 3rd was randomly chosen.
The algorithm's performance, and relative performance under different parameter choices, was scored based on how often it placed the 2 presumptively-related documents closer-together than the 3rd randomly-chosen document.
For example, one of the original paper's evaluations used brief search-engine-result snippets as documents, and considered any 2 documents that appeared as sibling top-10 results for the same query as presumptively-related. Two of the followup paper's evaluations used the human-curated categories of Wikipedia or arXiv as signalling that articles of the same category should be presumptively-related.
It's imperfect, but allowed the creation of large evaluation sets from already-existing systems/data, which generally pointed results in the same direction as human senses-of-relatedness.
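If you build such triplets for your own corpus, scoring a model against them is straightforward. Here is a rough sketch, where each document is a list of tokens and the first two items of each triplet are the presumed-related pair (the names and structure are placeholders):

import numpy as np

# `model` is a trained gensim Doc2Vec; `triplets` is a list of
# (doc_a, doc_b, doc_random) token lists, with doc_a and doc_b presumed related.
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_accuracy(model, triplets):
    correct = 0
    for doc_a, doc_b, doc_random in triplets:
        vec_a = model.infer_vector(doc_a)
        vec_b = model.infer_vector(doc_b)
        vec_r = model.infer_vector(doc_random)
        # Success: the presumed-related pair is closer together than the
        # pair involving the randomly-chosen third document.
        if cosine(vec_a, vec_b) > cosine(vec_a, vec_r):
            correct += 1
    return correct / len(triplets)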
Perhaps you can find a similar preexisting guide for your data. Or, as you perform ad-hoc checking, be sure to capture every judgement you make, so that it becomes, over time, a growing dataset of desirable pairings that are either (a) better than some other result that was co-presented; or (b) just "presumably good enough" that they should usually rank higher than other random 3rd documents. A large amount of imprecision in such desirability data is tolerable: it evens out as the set of probe pairings grows, and the power of being able to automate bulk quantitative evaluations (reusing old assessments against new parameters/models) drives far more overall improvement than any small glitches in the evaluations cost.
I have looked at multiple tutorials on how to derive n-grams (here I will stick to bigrams) and include them as features in NLP analysis.
My question is whether we need to include all possible combinations of bigrams as features, because not all bigrams are meaningful.
For example, if we have a sentence such as "I like this movie because it was fun and scary" and consider bigrams as well, these include (after pre-processing):
bigrams=["like movie","movie fun", "fun scary"]
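(For reference, I derive the bigrams from the preprocessed tokens roughly like this:)

tokens = ["like", "movie", "fun", "scary"]                     # after preprocessing
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
# ['like movie', 'movie fun', 'fun scary']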
I am not sure whether this is a good approach, but what I can think of now is to include only some frequent bigrams as features.
Or are there other practical norms for efficiently including only meaningful bigrams (although "meaningful" may be subjective and context-dependent)?
We may consider each bigram as a feature of different importance. Then the question can be reformulated as "How to choose the most important features?". As you have already mentioned, one way is to keep only the top terms ordered by term frequency across the corpus. Other possible ways to choose the most important features are listed below (a small sketch combining some of them appears below):
Apply the TF-IDF weighting scheme. You will also be able to control two additional hyperparameters: max document frequency and min document frequency;
Use Principal Component Analysis to select the most informative features from a big feature set.
Train any estimator in scikit-learn and then select features based on the importances or coefficients of the trained model.
These are the most widespread methods of feature selection in the NLP field. It is still possible to use other methods, like recursive feature elimination or sequential feature selection, but these methods are not feasible when the number of informative features is low (like 1000) and the total number of features is high (like 10000).
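As a rough illustration of the first and third options above (the parameter values are arbitrary, and texts and y stand in for your documents and labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# texts: raw documents, y: labels (placeholders for your data).
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams + bigrams
    max_features=20000,   # keep only the most frequent n-grams
    min_df=5,             # drop very rare n-grams
    max_df=0.8,           # drop n-grams appearing in more than 80% of documents
)
X = vectorizer.fit_transform(texts)

# Let a sparse linear model decide which n-grams carry signal.
selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))
selector.fit(X, y)
kept_ngrams = vectorizer.get_feature_names_out()[selector.get_support()]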
Azure Machine Learning has an item called Train Matchbox Recommender. It can be configured with a Number of traits. Unfortunately, the documentation does not describe what such a trait is.
What are traits? Is this related to latent variables?
This page may have better descriptions on it.
Basically, traits are the features the algorithm will learn about each user in relation to each item. For example, in the restaurant ratings recommender, traits could include a user's birth year, whether they're a student or working professional, marital status, etc.
Hope that helps!
In my opinion yes, this is related to latent variables and defines the dimensions of the compressed trait matrix. The columns of this matrix (the latent trait vectors) can be interpreted as stereotypes of users. Hence, the number of traits corresponds to the number of stereotype users in the recommendation model.
You can find more explanation in the corresponding research paper "Matchbox: Large Scale Bayesian Recommendations":
Users and items are represented by feature vectors which are mapped into a low-dimensional 'trait-space' in which similarity is measured in terms of inner products (source).
In classical Collaborative Filtering this latent representation is often computed using Singular Value Decomposition (SVD) via a least-squares method, keeping only the first/largest k dimensions of the decomposition. This reduces the dimensionality of the user-item rating matrix. In this Hacker Noon article by Steeve Huang you can find a more detailed explanation.
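As a toy illustration of that idea with a plain truncated SVD (the rating values below are made up, and k plays the role of the number of traits):

import numpy as np

# Toy user-item rating matrix (values are made up for illustration).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

k = 2                                  # number of latent traits / stereotypes
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_traits = U[:, :k] * s[:k]         # each row: a user in trait-space
item_traits = Vt[:k, :].T              # each row: an item in trait-space

# Similarity / predicted affinity is then an inner product in trait-space.
predicted_ratings = user_traits @ item_traits.T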
Update 1: added interpretation of latent vectors
The Akaike Information Criterion (AIC) and the predictive sum of squares error (PSSE) are, respectively, an information-theoretic measure and a predictive-capability measure used to rank models.
I was trying to rank the models based on these two criteria. However, the rankings are completely contradictory: the 6 models that rank best based on AIC rank worst based on PSSE.
In this kind of situation, how do I decide which measure is best? I tried to look for articles or research papers, but unfortunately I could not find much. Any information would be appreciated.
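For reference, this is roughly how I compute the two scores. I assume Gaussian-error least-squares models, and I take PSSE to be a leave-one-out predictive sum of squares, which may not match every definition:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# X: design matrix of one candidate model, y: response (placeholders).
def aic_least_squares(X, y):
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    n, k = X.shape[0], X.shape[1] + 1         # +1 for the intercept
    return n * np.log(rss / n) + 2 * k        # AIC up to an additive constant

def psse_leave_one_out(X, y):
    preds = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
    return np.sum((y - preds) ** 2)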
Thank you!
I have a dataset in which the instances have about 200 features; about 11 of these features are numerical (integer) and the rest are binary (1/0). These features may be correlated, and they follow different probability distributions.
I have been looking for a while for a good similarity score that works for such a mixed vector and takes the correlation between the features into account.
Do you know of such a similarity score?
Thanks,
Arian
In your case, the similarity function relies heavily on the patterns in the input data. You might benefit from learning a distance metric for the input space from a given collection of pairs of similar/dissimilar points, such that the learned metric preserves the distance relations among the training data.
Here is a nice survey paper.
The numerous types of distance measures (Euclidean, Manhattan, etc.) are going to provide different levels of accuracy depending on the dataset. It is best to read papers covering your method of data fitting and see what heuristics they use. Note also that some methods require homogeneous data that is scaled consistently. Here is a paper that talks about a whole host of measures that you might find attractive.
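As a quick way to experiment, you can compute a few candidate metrics side by side. Mahalanobis distance, for example, uses the inverse covariance matrix and therefore accounts for correlation between features. This is only an illustration, with randomly generated stand-in data of the shape you describe:

import numpy as np
from scipy.spatial.distance import cdist

# Stand-in data: 300 instances with 11 integer features and 189 binary features.
rng = np.random.default_rng(0)
X = np.hstack([rng.integers(0, 10, size=(300, 11)),
               rng.integers(0, 2, size=(300, 189))]).astype(float)

D_euclidean = cdist(X, X, metric="euclidean")
D_manhattan = cdist(X, X, metric="cityblock")

# Mahalanobis weights feature differences by the inverse covariance matrix,
# so correlated features are not double-counted.
VI = np.linalg.pinv(np.cov(X, rowvar=False))
D_mahalanobis = cdist(X, X, metric="mahalanobis", VI=VI)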
And as always, test and cross validate to see if there really is an impact from the mixing of feature types.