I would like to try out some variations around Latent Semantic Analysis (LSA) with scikit-learn. Besides pure frequency counts from CountVectorizer() and the weighted result of TfidfTransformer(), I'd like to test weighting by entropy (and log-entropy) (used in the original papers and reported to perform very well).
Any suggestions on how to proceed? I know Gensim has an implementation (LogEntropyModel()) but would prefer to stick with scikit-learn.
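One way to stay within scikit-learn is a small custom transformer applied to the output of CountVectorizer(). The sketch below follows the usual log-entropy formulation from the LSA literature (local weight log(1 + tf_ij), global weight 1 + sum_j p_ij log p_ij / log n, with p_ij = tf_ij / gf_i); the class name and the toy documents are mine, not part of scikit-learn.

    import numpy as np
    from scipy import sparse
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer

    class LogEntropyTransformer(BaseEstimator, TransformerMixin):
        # Log-entropy weighting on top of raw term counts (illustrative sketch).
        def fit(self, X, y=None):
            X = sparse.csc_matrix(X, dtype=np.float64)
            n_docs = X.shape[0]
            gf = np.asarray(X.sum(axis=0)).ravel()             # global frequency of each term
            P = X @ sparse.diags(1.0 / np.maximum(gf, 1e-12))  # p_ij = tf_ij / gf_i
            logP = P.copy()
            logP.data = np.log(logP.data)                      # log(p_ij) on non-zero entries only
            ent = np.asarray(P.multiply(logP).sum(axis=0)).ravel()
            self.global_weights_ = 1.0 + ent / np.log(max(n_docs, 2))
            return self

        def transform(self, X):
            X = sparse.csr_matrix(X, dtype=np.float64, copy=True)
            X.data = np.log1p(X.data)                          # local log weighting
            return X @ sparse.diags(self.global_weights_)      # apply per-term global weights

    # usage: CountVectorizer -> log-entropy weighting -> e.g. TruncatedSVD for the LSA step
    docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
    counts = CountVectorizer().fit_transform(docs)
    weighted = LogEntropyTransformer().fit_transform(counts)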
I am working on a classification task — geolocation of Twitter users based on their tweets.
I ran many experiments using sklearn's SVC, NuSVC and LinearSVC with a bag-of-words model. The accuracies are roughly 35%, 60% and 80% respectively. The gap between SVC and LinearSVC is more than a factor of two, which is surprising.
I am not sure exactly why this is happening. Could it be overfitting or underfitting? Why is there such a large difference between the classifiers?
In general, non-linear kernels can model more complex decision functions than a linear kernel, but whether that helps depends on the data, the chosen hyperparameters (e.g. penalty and kernel) and how you evaluate your results.
LinearSVC
Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Source: sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
SVC
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
Source: sklearn.svm.SVC.html#sklearn.svm.SVC
First, you should test a LinearSVC model, because it has only a few hyperparameters and will give you a quick baseline. After that you can train a set of SVC models and pick the best one. For that I recommend a grid search over C, kernel, degree, gamma, coef0 and tol, as sketched below.
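To make the grid search concrete, here is a rough sketch; the synthetic data just stands in for your bag-of-words features and location labels, and the parameter values are illustrative, not tuned:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # toy stand-in for the tweet bag-of-words matrix and geolocation labels
    X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)

    param_grid = {
        'C': [0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf', 'poly'],
        'degree': [2, 3],               # only used by the poly kernel
        'gamma': ['scale', 0.01, 0.1],
        'coef0': [0.0, 1.0],            # only used by poly/sigmoid kernels
        'tol': [1e-3, 1e-4],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)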
Say we have used the TFIDF transform to encode documents into continuous-valued features.
How would we now use this as input to a Naive Bayes classifier?
Bernoulli naive Bayes is out, because our features aren't binary anymore.
It seems we can't use multinomial naive Bayes either, because the values are continuous rather than categorical.
As an alternative, would it be appropriate to use Gaussian naive Bayes instead? Are TF-IDF vectors likely to hold up well under the Gaussian-distribution assumption?
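For what it's worth, both options are easy to try side by side; note that GaussianNB in scikit-learn requires a dense array, which is only practical for small vocabularies or after dimensionality reduction (the documents and labels below are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    docs = ["cheap loans approved today", "team standup at ten",
            "limited offer cheap loans", "notes from the standup"]
    labels = [1, 0, 1, 0]

    X = TfidfVectorizer().fit_transform(docs)

    mnb = MultinomialNB().fit(X, labels)            # accepts the sparse tf-idf matrix directly
    gnb = GaussianNB().fit(X.toarray(), labels)     # needs a dense array
    print(mnb.predict(X), gnb.predict(X.toarray()))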
The scikit-learn documentation for MultinomialNB suggests the following:
The multinomial Naive Bayes classifier is suitable for classification
with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts.
However, in practice, fractional counts such as tf-idf may also work.
Isn't it fundamentally impossible to use fractional values with MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete counts (since it involves counting/factorials).
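For reference, the standard multinomial likelihood underlying MNB is

    P(\mathbf{x} \mid C_k) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \, \prod_i p_{ki}^{x_i}

where x_i is the count of term i in the document and p_{ki} is the probability of term i under class C_k; the factorial terms only make literal sense for integer counts.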
How would TFIDF values even work with this formula?
Technically, you are right. The (traditional) multinomial NB model treats a document D as a vocabulary-sized feature vector x, where each element xi is the count of term i in document D. By definition, this vector x then follows a multinomial distribution, leading to the characteristic classification function of MNB.
When using TF-IDF weights instead of term counts, our feature vectors (most likely) no longer follow a multinomial distribution, so the classification function is no longer theoretically well-founded. However, in practice, tf-idf weights instead of counts turn out to work (much) better.
How would TFIDF values even work with this formula?
In the exact same way, except that the feature vector x is now a vector of tf-idf weights and not counts.
You can also check out the sublinear tf-idf weighting scheme, implemented in scikit-learn's TfidfVectorizer (sublinear_tf=True). In my own research I found it to perform even better: it uses a logarithmic version of the term frequency. The idea is that when a query term occurs 20 times in document a and once in document b, document a should (probably) not be considered 20 times as important, but more likely log(20) times as important.
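For a concrete (toy) example, sublinear term frequency is just a constructor flag on TfidfVectorizer; the documents and labels below are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["the cat sat on the mat", "dogs chase cats",
            "stocks rallied on earnings", "the market fell sharply"]
    labels = ["pets", "pets", "finance", "finance"]

    # sublinear_tf=True replaces tf with 1 + log(tf) before the idf weighting
    clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["the cat chased the market"]))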
I'd like to understand whether it is a valid approach to train an MNB model with SGD. My application is text classification. In sklearn I've found out that there is no MNB available, and by default it's SVM; however, NB is a linear model, isn't it?
So, if my likelihood parameters (with Laplace smoothing) can be estimated as
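In the standard textbook form, the Laplace-smoothed MNB estimate is

    \hat{\theta}_{ci} = \frac{N_{ci} + 1}{N_c + |V|}

where N_{ci} is the total count of word i in training documents of class c, N_c = \sum_i N_{ci}, and |V| is the vocabulary size.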
Can I update my parameters with SGD and minimize the cost function?
Please let me know if SGD is irrelevant here. Thanks in advance.
UPDATE:
So I got the answer, and I hope I understood it correctly: MNB's parameters are estimated from the word occurrences in the given input text (or tf-idf weights). But I still don't clearly understand why we can't use SGD for MNB training. I would understand it if it were explained explicitly or with some mathematical interpretation. Thanks.
In sklearn I've found out that there is no MNB available
Multinomial naive Bayes is implemented in scikit-learn. There is no gradient descent to use. This implementation just uses relative frequency counts (with smoothing) to find the parameters of the model in a single pass, which is the standard and most efficient way to fit an MNB model:
http://scikit-learn.org/stable/modules/naive_bayes.html
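As a toy illustration (placeholder documents), fitting MultinomialNB really is a single closed-form pass over the smoothed counts, with no iterative optimizer such as SGD involved:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["free money now", "meeting at noon", "win money prizes", "project meeting notes"]
    labels = ["spam", "ham", "spam", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    clf = MultinomialNB(alpha=1.0).fit(X, labels)   # alpha is the (Laplace/Lidstone) smoothing
    print(clf.predict(vec.transform(["free prizes at the meeting"])))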
I've been working on classifying emails from two authors. I was successful doing this with supervised learning, using TF-IDF vectorization of the text, PCA and SelectPercentile feature selection, all with the scikit-learn package.
Now I want to try the same task with unsupervised learning, using the KMeans algorithm to cluster the emails into two groups. I have created a dataset in which each data point is a single line in a Python list. Since I am new to unsupervised learning, I wanted to ask whether I can apply the same dimensionality reduction tools used in the supervised setting (TF-IDF, PCA and SelectPercentile). If not, what are their counterparts? I am using scikit-learn to code it up.
I looked around on Stack Overflow but couldn't find a satisfactory answer.
I am really stuck at this point.
Please help!
The following dimensionality reduction techniques can be applied in the unsupervised setting:
PCA (principal component analysis):
    Exact PCA
    Incremental PCA
    Approximate PCA
    Kernel PCA
    SparsePCA and MiniBatchSparsePCA
Random projections:
    Gaussian random projection
    Sparse random projection
Feature agglomeration
StandardScaler (this only standardizes features; it does not reduce dimensionality by itself)
These are some of the approaches that can be used for dimensionality reduction of large datasets in the unsupervised setting; a minimal clustering sketch follows below. You can read more about the details in the scikit-learn documentation.
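As a rough sketch of such an unsupervised pipeline for the email task (the emails below are placeholders; TruncatedSVD is used instead of plain PCA because it works directly on a sparse tf-idf matrix):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    emails = ["let's schedule the trading call", "the model results look great",
              "please review the attached code", "quarterly trading report attached"]

    # tf-idf -> truncated SVD (LSA) -> length normalization
    lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                        TruncatedSVD(n_components=2),
                        Normalizer(copy=False))
    X_reduced = lsa.fit_transform(emails)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
    print(km.labels_)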
I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model whose output is constrained to lie between 0.0 and 1.0. I also want to be able to specify a regularization norm and strength for the model, ideally in Python (though an R implementation would be helpful as well). All the logistic regression packages I've found seem to be suited only for classification, whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikit-learn for my classification and regression needs, so if this regression model can be implemented in scikit-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy with any solution in Python and/or R.
The question has two parts: penalized estimation, and fractional or proportion data as the dependent variable. I have worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1-regularized Logit (and other discrete models like Poisson) for some time. In recent months there has been a lot of effort to support more penalization, but it is not in statsmodels yet. Elastic net for linear models and the Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLMs, like L2 penalization for GAMs and splines or SCAD penalization, will follow over the next months based on pull requests that still need work.
Two examples of the current L1 fit_regularized for Logit:
Difference in SGD classifier results and statsmodels results for logistic with l1
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note that the penalization weight alpha can be a vector, with zeros for coefficients (such as the constant) that should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
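A minimal sketch of the current L1 fit_regularized for Logit (the data are synthetic and the alpha values illustrative; the zero entry leaves the constant unpenalized):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 3)))
    p = 1 / (1 + np.exp(-(X @ np.array([0.3, 1.0, -0.8, 0.0]))))
    y = (rng.uniform(size=200) < p).astype(float)

    # per-coefficient penalty weights: 0 for the constant so it is not penalized
    alpha = np.array([0.0, 1.0, 1.0, 1.0])
    res = sm.Logit(y, X).fit_regularized(method='l1', alpha=alpha)
    print(res.params)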
Fractional models
Binary and binomial models in statsmodels do not require the dependent variable to be binary; they work as long as the dependent variable is in the [0, 1] interval.
Fractions or proportions can be estimated with Logit as a quasi-maximum likelihood estimator. The estimates are consistent if the mean function (the logistic, cumulative normal or similar link function) is correctly specified, but we should use a robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through the fit keyword cov_type='HC0'.
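A minimal sketch of the quasi-ML fractional logit described above (synthetic proportions data; this assumes a statsmodels version that, as stated, accepts a dependent variable anywhere in [0, 1]):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = sm.add_constant(rng.normal(size=(500, 2)))
    mu = 1 / (1 + np.exp(-(X @ np.array([0.2, 1.0, -0.5]))))
    y = np.clip(mu + rng.normal(scale=0.05, size=500), 0, 1)   # fractions in [0, 1], not 0/1

    # ordinary Logit fit, but with a robust sandwich covariance for valid inference
    res = sm.Logit(y, X).fit(cov_type='HC0')
    print(res.summary())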
The best documentation is Stata's, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and the approach works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.