I am trying to do a timeline detection problem using text classification. As a newbie I am confused as to how I can go about with this. Is this a classification problem? i.e, Can I use the years(timelines) as outcomes and solve this as a classification problem?
You should be able to solve this as a classification problem as you suggest. An option could be to find or build a corpus consisting of texts tagged with the period in which they're set, and train a classification algorithm on this data set.
Another option could be to train a word space model on such a data set, and generate vectors for different periods of time (e.g. the 50s, 60s etc.). You could then create a document vector for the text you wish to classify, and find which of these time vectors yields the best match.
Might not work, but it could be interesting to see what results you get.
Hope this helps!
Related
i am working with bert for relation extraction from binary classification tsv file, it is the first time to use bert so there is some points i need to understand more?
how can i get an output like giving it a test data and show the classification results whether it is classified correctly or not?
how bert extract features of the sentences, and is there a method to know what are the features that is chosen?
i used once the hidden layers and another time i didn't use i got the accuracy of not using the hidden layer higher than using it, is there an reason for that?
i been trying to learn a bit of machine learning for a project that I'm working in. At the moment I managed to classify text using SVM with sklearn and spacy having some good results, but i want to not only classify the text with svm, I also want it to be classified based on a list of keywords that I have. For example: If the sentence has the word fast or seconds I would like it to be classified as performance.
I'm really new to machine learning and I would really appreciate any advice.
I assume that you are already taking a portion of your data, classifying it manually and then using the result as your training data for the SVM algorithm.
If yes, then you could just append your list of keywords (features) and desired classifications (labels) to your training data. If you are not doing it already, I'd recommend using the SnowballStemmer on your training data features.
I am building an NLP pipeline and I am trying to get my head around in regards to the optimal structure. My understanding at the moment is the following:
Step1 - Text Pre-processing [a. Lowercasing, b. Stopwords removal, c. stemming, d. lemmatisation,]
Step 2 - Feature extraction
Step 3 - Classification - using the different types of classifier(linearSvC etc)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction ?
I read online that you can do [a. Vectorising usin ScikitLearn b. TF-IDF]
but also I read that you can use Part of Speech or word2Vec or other embedding and Name entity recognition.
b. What is the optimal process/structure of using these?
c. On the text pre-processing I am ding the processing on a text column on a df and the last modified version of it is what I use as an input in my classifier. If you do feature extraction do you do that in the same column or you create a new one and you only send to the classifier the features from that column?
Thanks so much in advance
The preprocessing pipeline depends mainly upon your problem which you are trying to solve. The use of TF-IDF, word embeddings etc. have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. In order to make the best use of the data, we need to implement the proper pipeline.
Specifically for text related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem needs to be solved emphasising the words with lesser frequency. Word embeddings, on the other hand, convert the text to a N-dimensional vector which may show up similarity with some other vector. This could bring a sense of association in your data and the model can learn the best features possible.
In simple cases, we can use a bag of words representation to tokenize the texts.
So, you need to discover the best approach for your problem. If you are solving a problems which closely resembles the famous NLP problems like IMDB review classification, sentiment analysis on Twitter data, then you can find a number of approaches on the internet.
I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.
Autoencoders can be used to reduce dimensionallity in feature vectors - as far as I understand. In text classification a feature vector is normally constructed via a dictionary - which tends to be extremely large. I have no experience in using autoencoders, so my questions are:
Could autoencoders be used to reduce dimensionallity in text classification? (Why? / Why not?)
Has anyone already done this? A source would be nice, if so.
The existing works use auto encoder for creating models in the sentence level. Basically after training the model using Autoencode, you can get a vector for a sentence. Since any document consists of sentences you can get a set of vectors for the document, and do the document classification. In my experience with various vector representation (e.g. those generated from autoencodes) doing so might give answers worse than classification with bag of words.