model interaction between words for a sentiment analysis task - nlp

I am wondering what is the most appropriate way to model the interaction between two words/variables in a language model for a sentiment analysis task. For example, in the following dataset:
You didn't solve my problem,NEU
I never made that purchase,NEU
You never solve my problems,NEG
The words "solve" and "never", in isolation, doesn't have a negative sentiment. But, when they appear together, they do. Formally speaking: assuming we have a feature «solve» that takes the value 0 when the word «solve» is absent, and 1 when the word is present, and another feature «never» with the same logic: the difference in the probability of Y=NEG between «solve»=0 and «solve»=1 is different when «never»=0 and «never»=1.
But a basic logistic regression (using, for example, sklearn), wouldn't be able to handle this kind of situation.
With sklearn.preprocessing.PolynomialFeatures it is possible to add interaction coefficients, but it is hardly the most effective option.

Related

Decision Trees - Scikit, Python

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check P statistics etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model, (for example, .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant - if so how can I tell?
I have been very unsuccessful in visualising the decision tree. On the various different ways I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on. I don't know what any of the features actually are. Has anybody come across these difficulties with visualising / have a simple solution?
The confusion matrix that I have generated is as follows:
[[ 0 395]
[ 0 3319]]
i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid off the least important feature in each loop according to feature_importances (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).
The most straight forward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?

Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf?

I am working on a text classification problem. The problem is explained below:
I have a dataset of events which contains three columns - name of the event, description of the event, category of the event. There are about 32 categories in the dataset, such as, travel, sport, education, business etc. I have to classify each event to a category depending on its name and description.
What I understood is this particular task of classification is highly dependent on keywords, rather than, semantics. I am giving you two examples:
If the word 'football' is found either in the name or description or in both, it is highly likely that the event is about sport.
If the word 'trekking' is found either in the name or description or in both, it is highly likely that the event is about travel.
We are not considering multiple categories for an event(however, that's a plan for future !! )
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem. My question is:
Should I do stop word removal and stemming before applying tf-idf or should I apply tf-idf just on raw text? Here text means entries in name of event and description columns.
The question is too generic and you are not providing samples of the dataset, code, and not even indicating the language you are using. To this regard, I will presume that you are using English, since the two words that you are providing as an example are "football" and "trekking". The answer will however necessarily be generic.
Should I do stop word removal
Yes. Have a look at this to see the most frequent words in the English language. As you can see they have no semantic meaning, and thus would not contribute to solving the classification task that you have proposed. if stopwords is a list containing stopwords, the parameter stop_words=stopwords passed to the CountVectorizer or TfidfVectorizer constructor will automatically exclude the stopwords when invoking the .fit_transform() method.
Should I do stemming
It depends. Languages other than English, whose grammar rules allow for a big number of possible prefixes-suffixes, normally require stemming when performing classification task, in order to reach any useful result. The English language however has very poor grammar rules, and thus you can often get away without stemming/lemmatization. You should check the results obtained against the desired accuracy first, and if it is insufficient, try adding a stemming/lemmatization step in the preprocessing of your data. Stemming is a computationally expensive process for large corpora, and I personally use it only for languages that require it.
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem
Careful with this. While tf-idf in practice works with Naive Bayesian classifiers, this is not the way that specific classifier is meant to be used. From the documentation,
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. It is in your best interest to tackle the classification task with CountVectorizer first and score it, and after you have a baseline accuracy for evaluating the TfidfVectorizer, check whether its results are better or worse than those of the CountVectorizer.
If you post some code and a sample of your dataset we can help you with that, otherwise this should be enough.

Latent Class Analysis Model Selection

When conducting Latent Class Analysis sometimes the information criterion (i.e., AIC, BIC, aBIC) don't select the same model. Such is the case in a study of substance use patterns that I am conducting among 774 men who have sex with men. Figure 1 shows the fit criterion plotted for each number of latent classes. BIC and CAIC select the three class model (See Figure 2). However, the aBIC selects a five class model (See Figure 2).
How do you select a model solution under these circumstances? Is there a way to select variables or collapse variables down in order to optimize results?
It is never easy to select the number of classes for LCA, but there are some rules of thumb that I follow:
Based on Nylund, Asparouhov & Muthén (2007) you want to follow BIC and bootstrap likelihood ratio test (BLRT). Even then, they seldom agree – BLRT will tell you to pick a model with more classes, BIC will be more conservative and suggest fewer classes. But this is as close as you can get by using statistical tests.
Rely on the available theory underlying your model. Look for potential discrepancies with your theoretical knowledge and try to deduce from the theory how many classes are to be expected. There is no golden rule, LCA is a good method, but without theory it is quite meaningless. If you have little theory, what you can do to double check your findings is to relate your latent variable to a distal outcome (covariate) about which you might have some theory and see if it works out. For example, you suspect that one of your latent classes will be dominated by one gender: associate your latent variable with gender and see.
Parsimony rule: simple models are preferred to complex ones (Collins & Lanza, 2010). If a simpler model does all the work, why choose a complex one?
In your case, I would start with a 3 class model, since it is suggested by BIC and parsimony. After finishing the analysis and interpreting the findings, I would re-run the model with 4/5 classes and see if I would reach substantially different findings - something that is worth reporting on, any important or contradicting findings to what I have found with a 3 class model. If it just adds complexity, but does not contradict or improve what I have already known, I'd stick to a 3 class model.
Looking at the results, I think that the 5 class model does not provide anything beyond the 3 classes. In the 3 class model, you have one class of extensive drug users (16%), moderate drug users dominated by cannabis, popper, hallucinogens and cocaine (40%), and finally a class of light users dominated by alcohol and cannabis (44%). The 5 class model split the first two groups into specific smaller sub-groups, but you have to decide whether these splits are important for your research - whether they make sense for your research question.
I would also recommend checking bivariate residuals. It is possible that the model misfit that is suggesting more classes is generated by a residual association between your indicators. If you can justify it theoretically (for example by finding some similarity between the indicators beyond the latent class), you can add the residual association and obtain a similarly good fit with the 3 class model.
One last point, avoid using AIC for LCA altogether - it is a very poorly performing index! Use cAIC, BIC and aBIC instead. AIC does not correct for the sample size, which can be quite problematic with larger samples.
Sources:
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.

How to create Training data for Text classification on 4 categories

My machine learning goal is to search for potential risks (will cost more money) and opportunities (will save money) from a Project Requirements document.
My idea is to classify sentences from the data into one of these categories: Risk, Opportunity and Irrelevant (no risk, no opportunity, default categorie).
I will use a multinomial Bayes classifier for this with tf-dif.
Now I need to have data for my training set and test set. The way I will do this is label every sentence from requirement documents with 1 of the 3 categories. Is this a good approach?
Or should I only label sentences which are obviously a risk/opportunity/irrelevant?
Also, is the Irrelevant categorie a good idea?
I believe the three-class approach is a good one. This is similar to sentiment analysis, where you typically have positive, negative and neutral documents (or sentences). The neutral comprises the vast majority of the instances, so your classification problem will be unbalanced. That is not necessarily an issue, but for difficult problems like this one, a naive bayes classifier might simply classify everything in the neutral/irrelevant bucket since the prior for neutral will be quite high.
your sampling (labeling) should be representative of the reality. Don't try to create a dataset of 1000 risk, 1000 opportunity, 1000 irrelevant. Instead, take a sample of say 10000 requirements, and assign the proper label to each, even if it means having much more 'Irrelevant' than 'Risk' for instance.
text classification models require many instances, since the search space is vast. I wonder if you have considered the fact that to get reliable results (say over 90%), you may need to manually label thousands of instances.
and even if you have thousands of training instances, your problem looks particularly difficult, unless there are some obvious keywords to trigger 'risk' or 'opportunity' that I don't understand. Ask yourself: would this be easy for a human to judge? If you asked 3 judges to classify your requirements, would they all come up with the same answer? If not, then it might be 10s of thousands of training documents that you will need, and the classification accuracy may still be disappointing.

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD based dimensional reduction and use the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~ 3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variable that had 65 levels).
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for doing the text preprocessing (packages: tm, snowball,lsa) and building classification models (package: kernelsvm)
Thank you.
Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'binary documentMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document, and 0 for non-existing term; moving to other scoring scheme
(e.g. tf/idf) might lead to better results.
LSA is a good metric for dimensional reduction in some cases, but less so in others. So depending in the exact nature of your data, it might be a good idea to consider additional methods, e.g. Infogain.
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?
This might not be the best tailored answer. Hope these suggestions may help.
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
One instance:
go,goes,going ->Lemma: go,go,go ||Stemming: go, goe, go
And use some predefined set of rules; such that short term words are generalized. For instance:
I'am -> I am
should't -> should not
can't -> can not
How to deal with parentheses inside a sentence.
This is a dog(Its name is doggy)
Text inside parentheses often referred to alias names of the entities mentioned. You can either removed them or do correference analysis and treat it as a new sentence.
Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

Resources