I am trying to build a classifier to detect subjectivity. I have text files tagged with subjective and objective . I am little lost with the concept of features creation from this data. I have found the lexicon of the subjective and objective tag. One thing that I can do is to create a feature of having words present in respective dictionary. Maybe the count of words present in subjective and objective dictionary. After that I intend to use naive bayes or SVM to develop the model
My problem is as follow
Is my approach correct ?
Can I create more features ? If possible suggest some or point me to some paper or link
Can I do some test like chi -sq etc to identify effective words from the dictionary ?
You are basically on the right track. I would try and apply classifier with features you already have and see how well it will work, before doing anything else.
Actually best way to improve your work is to google for subjectivity classification papers and read them (there are a quite a number of them). For example this one lists typical features for this task.
And yes Chi-squared can be used to construct dictionaries for text classification (other commonly used methods are TD*IDF, pointwise mutal information and LDA)
Also, recently new neural network-based methods for text classification such as paragraph vector and dynamic convolutional neural networks with k-max pooling demonstrated state-of-the-art results on sentiment analysis, thus they should probably be good for subjectivity classification as well.
Related
To improve the recomender system for Buyer Material Groups, our company is willing to train a model using customer historial spend data. The model should be trained on historical "Short text descriptions" to predict the appropriate BMG. The dataset has more that 500.000 rows and the text descriptions are multilingual (up to 40 characters).
1.Question: can i use supervised learning if i consider the fact that the descriptions are in multiple languages? If Yes, are classic approaches like multinomial naive bayes or SVM suitable?
2.Question: if i want to improve the first model in case it is not performing well, and use unsupervised multilingual emdedding to build a classifier. how can i train this classifier on the numerical labels later?
if you have other ideas or approaches please feel free :). (It is a matter of a simple text classification problem)
Can I use supervised learning if i consider the fact that the descriptions are in multiple languages?
Yes, this is not a problem except it makes your data more sparse. If you actually only have 40 characters (is that not 40 words?) per item, you may not have enough data. Also the main challenge for supervised learning will be whether you have labels for the data.
If Yes, are classic approaches like multinomial naive bayes or SVM suitable?
They will work as well as they always have, though these days building a vector representation is probably a better choice.
If i want to improve the first model in case it is not performing well, and use unsupervised multilingual emdedding to build a classifier. how can i train this classifier on the numerical labels later?
Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001 and the model can learn representations of them if you want to make an unsupervised recommender.
Honestly these days I wouldn't start with Naive Bayes or classical models, I'd go straight to word vectors as a first test for clustering. Using fasttext or word2vec is pretty straightforward. The main problem is that if you really only have 40 characters per item, that just might not be enough data to cluster usefully.
I am building an NLP pipeline and I am trying to get my head around in regards to the optimal structure. My understanding at the moment is the following:
Step1 - Text Pre-processing [a. Lowercasing, b. Stopwords removal, c. stemming, d. lemmatisation,]
Step 2 - Feature extraction
Step 3 - Classification - using the different types of classifier(linearSvC etc)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction ?
I read online that you can do [a. Vectorising usin ScikitLearn b. TF-IDF]
but also I read that you can use Part of Speech or word2Vec or other embedding and Name entity recognition.
b. What is the optimal process/structure of using these?
c. On the text pre-processing I am ding the processing on a text column on a df and the last modified version of it is what I use as an input in my classifier. If you do feature extraction do you do that in the same column or you create a new one and you only send to the classifier the features from that column?
Thanks so much in advance
The preprocessing pipeline depends mainly upon your problem which you are trying to solve. The use of TF-IDF, word embeddings etc. have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. In order to make the best use of the data, we need to implement the proper pipeline.
Specifically for text related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem needs to be solved emphasising the words with lesser frequency. Word embeddings, on the other hand, convert the text to a N-dimensional vector which may show up similarity with some other vector. This could bring a sense of association in your data and the model can learn the best features possible.
In simple cases, we can use a bag of words representation to tokenize the texts.
So, you need to discover the best approach for your problem. If you are solving a problems which closely resembles the famous NLP problems like IMDB review classification, sentiment analysis on Twitter data, then you can find a number of approaches on the internet.
I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.
Autoencoders can be used to reduce dimensionallity in feature vectors - as far as I understand. In text classification a feature vector is normally constructed via a dictionary - which tends to be extremely large. I have no experience in using autoencoders, so my questions are:
Could autoencoders be used to reduce dimensionallity in text classification? (Why? / Why not?)
Has anyone already done this? A source would be nice, if so.
The existing works use auto encoder for creating models in the sentence level. Basically after training the model using Autoencode, you can get a vector for a sentence. Since any document consists of sentences you can get a set of vectors for the document, and do the document classification. In my experience with various vector representation (e.g. those generated from autoencodes) doing so might give answers worse than classification with bag of words.
I want to classify text messages into several categories like, "relation building", "coordination", "information sharing", "knowledge sharing" & "conflict resolution". I am using NLTK library to process these data. I would like to know which classifier, in nltk, is better for this particular multi-class classification problem.
I am planning to use Naive Bayes Classification, is it advisable?
Naive Bayes is the simplest and easy to understand classifier and for that reason it's nice to use. Decision Trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend be more complex, and SVM requires some tuning to get right.
Most important is the choice of features + the amount/quality of data you provide!
With your problem, I would focus first on ensuring you have a good training/testing dataset and also choose good features. Since you are asking this question you haven't had much experience with machine learning for NLP, so I'd say start of easy with Naive Bayes as it doesn't use complex features- you can just tokenize and count word occurrences.
EDIT:
The question How do you find the subject of a sentence? and my answer are also worth looking at.
Yes, Training a Naive Bayes Classifier for each category and then labeling each message to a class based on which Classifier provides the highest score is a standard first approach to problems like this. There are more sophisticated single class classifier algorithms which you could substitute in for Naive Bayes if you find performance inadequate, such as a Support Vector Machine ( Which I believe is available in NLTK via a Weka plug in, but not positive). Unless you can think of anything specific in this problem domain that would make Naieve Bayes especially unsuitable, its ofen the go-to "first try" for a lot of projects.
The other NLTK classifier I would consider trying would be MaxEnt as I believe it natively handles multiclass classification. (Though the multiple binary classifer approach is very standard and common as well). In any case the most important thing is to collect a very large corpus of properly tagged text messages.
If by "Text Messages" you are referring to actual cell phone text messages these tend to be very short and the language is very informal and varied, I think feature selection may end up being a larger factor in determining accuracy than classifier choice for you. For example, using a Stemmer or Lemmatizer that understands common abbreviations and idioms used, tagging part of speech or chunking , entity extraction, extracting probably relationships between terms may provide more bang than using more complex classifiers.
This paper talks about classifying Facebook status messages based on sentiment, which has some of the same issues, and may provide some insights into this. The links is to a google cache because I'm having problems w/ the original site:
http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1