NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to at least one category and at most two.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. My training dataset contains about 265,000 items, which together contain roughly 10,000,000 features (unique three-word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seems to do much of what I'm looking for, but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three-word phrases produced by far the best results. Although I could reduce the number of features by using single words or two-word phrases, that would most likely produce inferior results and would still leave a large number of features.

Since writing this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
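For anyone trying it, a minimal sketch of the input format and the training/prediction commands (the file names train.vw, test.vw and model.vw are hypothetical; check the VW documentation for the exact flags in your version). VW reads one example per line as "label | feature1 feature2 ...", so the one-of-30 task could use labels 1-30 with the sparse trigram features joined by underscores:

7 | the_quick_brown quick_brown_fox brown_fox_jumps

vw -d train.vw --oaa 30 -b 26 --passes 5 -c -f model.vw
vw -d test.vw -t -i model.vw -p predictions.txt

Here --oaa runs one-against-all multiclass training, -b sets the feature-hash bit width, and -t/-i/-p load the trained model and write predictions. The multilabel and default-category tasks would need different reductions.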

MALLET has a number of classifiers (NB, MaxEnt, CRF, etc.). It's written by Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively, some sort of online clustering like k-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while, though! Unfortunately, it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that if you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than about which machine learning method you use. 3-grams might be good, but how about character n-grams across those? I.e., all the character n-grams within a 3-gram, to account for spelling variations/stemming/etc.? Named entities might also be useful, or some sort of lexicon.
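As a rough illustration of what character n-grams within a 3-gram could look like (a hypothetical helper, not anyone's library code):

def char_ngrams(phrase, n=4):
    # all character 4-grams of a three-word phrase, to smooth over spelling/stemming variation
    s = " ".join(phrase.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

char_ngrams("natural language processing")
# ['natu', 'atur', 'tura', 'ural', ...]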

I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built on top of Apache Hadoop (map/reduce), so scaling is inherent.
Take a look at the classification section at the link below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.
Also, be aware that with many NLP solutions you needn't input the "features" yourself (such as the n-grams, i.e. the three-word and two-word phrases mentioned in the question) but can instead rely on the various NLP functions to produce their own statistical model.

Related

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity, and right now my only features are the embeddings from Doc2Vec. Since that is not showing good results, even after hyperparameter optimization and training word embeddings before the doc embeddings, what other features can I add to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics (labels), each document having one topic. Documents are labeled at the document level, and that labeling is currently used only for evaluation purposes.
Edit: The following answers gojomo's questions and elaborates on my comment on his answer:
The evaluation of the model is done on the training set. I compare whether the label is the same as that of the most similar document returned by the model. For this I first get the document vector using the model's 'infer_vector' method and then use 'most_similar' to get the most similar document. The current results I am getting are 40-50% accuracy. A satisfactory score would be at least 65% and upwards.
Due to the purpose of this research and its further use case, I am unable to get a larger dataset. That is why a professor (this is a university project) recommended that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents, with the labels for now being used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance, and bag of words; I just presumed I would get better results using Doc2Vec.
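For reference, a minimal sketch of the evaluation loop described above (infer_vector followed by most_similar). The gensim 4.x model.dv attribute, the tokenized_docs/labels lists, the str(i) tags, and the choice to skip the self-match are assumptions here; older gensim uses model.docvecs instead:

# model: a trained gensim Doc2Vec where document i was trained with tag str(i)
# tokenized_docs: list of token lists; labels: the topic label of each document (both hypothetical)
correct = 0
for i, words in enumerate(tokenized_docs):
    vec = model.infer_vector(words)
    # take the top 2 neighbours and skip a self-match, so the score isn't trivially inflated
    top = [tag for tag, _ in model.dv.most_similar([vec], topn=2) if tag != str(i)][0]
    if labels[int(top)] == labels[i]:
        correct += 1
print("accuracy:", correct / len(tokenized_docs))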
You should try creating TF-IDF vectors with 2- and 3-grams to generate a vector representation for each document. You will have to fit the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details and doc page for sklearn.
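A minimal sketch of that suggestion with scikit-learn (the docs list of 150 raw document strings is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(ngram_range=(2, 3), sublinear_tf=True)
tfidf = vectorizer.fit_transform(docs)      # (150, n_ngram_features) sparse matrix
similarities = cosine_similarity(tfidf)     # 150 x 150 pairwise cosine similarities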
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: typical datasets used in published Doc2Vec results include tens of thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
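As an illustration of the adjustments mentioned above (smaller vectors, more training epochs), a hedged sketch with gensim; tokenized_docs and the specific parameter values are assumptions for a tiny corpus, not recommendations from the answer:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words, tags=[str(i)]) for i, words in enumerate(tokenized_docs)]
model = Doc2Vec(vector_size=50, epochs=200, min_count=2, dm=0)   # small vectors, many epochs
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)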
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, potentially including word n-grams or even character n-grams, or TF-IDF weighting.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)
As others have already suggested your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4GB) DBOW model trained on English Wikipedia pages, working with 300-dimensional document vectors. I obtained the link from the jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
This, however, may not work for you in case your documents are very specific. But you can still give it a try.

Multiclass text classification with python and nltk

I am given the task of classifying news text data into one of the following 5 categories: Business, Sports, Entertainment, Tech and Politics.
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (BBC news data)
I am currently using NLP with the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (excluding stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Here's the actual code.
This algorithm does predict new data fairly accurately, but I am interested in other simple algorithms that I can implement to achieve better results. I have used the Naive Bayes algorithm to classify data into two classes (spam or not spam, etc.) and would like to know how to implement it for multiclass classification, if that is feasible.
Thank you.
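For reference, NLTK's NaiveBayesClassifier already handles more than two classes; a minimal sketch, where labelled_data is a hypothetical list of (text, category) pairs:

import nltk
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))

def features(text):
    # bag-of-words presence features, stopwords and non-alphabetic tokens removed
    return {w: True for w in nltk.word_tokenize(text.lower()) if w.isalpha() and w not in stop}

train_set = [(features(text), label) for text, label in labelled_data]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("The new smartphone was unveiled at the trade show")))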
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good features lead to better systems more quickly than tuning or swapping algorithms and parameters.
In your case, you can go with word embeddings as already said, but you can also design your own custom features that you think will help discriminate the classes (whatever the number of classes is). For instance, how is a spam e-mail often presented? Lots of mistakes, syntactic inversions, bad translations, odd punctuation, slang words... lots of possibilities! Try to think about your case with sport, business, news, etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since you're dealing with words, I would propose word embeddings, which give more insight into the relationships/meanings of words with respect to your dataset, and thus much better classifications.
If you are looking for other classification implementations, you can check my sample code here; these models from scikit-learn can easily handle multiple classes. Take a look at the scikit-learn documentation here.
If you want an easy-to-use framework around this kind of classification, you can check out rasa-nlu; it uses the spacy_sklearn model, and sample implementation code is here. All you have to do is prepare the dataset in the given format and train the model.
If you want something more sophisticated, you can check out my Keras implementation here; it uses a CNN for text classification.
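As an illustration of that kind of model (a generic sketch, not the linked implementation; the vocabulary size, sequence length and layer sizes are made-up values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

model = Sequential([
    Embedding(20000, 100, input_length=400),    # word ids -> dense vectors
    Conv1D(128, 5, activation="relu"),          # n-gram-like filters over the sequence
    GlobalMaxPooling1D(),
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dense(5, activation="softmax"),             # 5 news categories
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(X_train, y_train, ...) where X_train holds padded sequences of word ids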
Hope this helps.

Natural Language Generation - how to go beyond templates

We've built a system that analyzes some data and outputs the results in plain English (i.e. no charts, etc.). The current implementation relies on lots of templates and some randomization in order to give as much diversity to the text as possible.
We'd like to switch to something more advanced, in the hope that the produced text is less repetitive and sounds less robotic. I've searched a lot on Google but cannot find anything concrete to start from. Any ideas?
EDIT: The data fed to the NLG mechanism are in JSON format. Here is an example involving web analytics data. The JSON file may contain, for example, a metric (e.g. visits), its value in the last X days, whether the last value is expected or not, and which dimensions (e.g. countries or marketing channels) affected its change.
The current implementation could give something like this:
Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
We're looking for a way to depend less on templates, sound even more natural, and increase the vocabulary.
What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate 100% meaningful, diverse, and natural sentences. One approach to generating sentences is using n-grams. With this method you can generate sentences that look more natural and diverse, but they will probably be meaningless and grammatically incorrect.
A more up-to-date approach is deep learning. In any case, if you want to generate meaningful sentences, your best bet may be your current template-based method.
You can find an introduction to basics of n-gram based NLG here:
Generating Random Text with Bigrams
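As a rough illustration of the bigram idea (a sketch, not code from the linked tutorial; tokens is a hypothetical list of words taken from a reference corpus or from your existing template outputs):

import random
import nltk

cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))   # counts of which word follows which

def generate(seed_word, length=15):
    word, out = seed_word, [seed_word]
    for _ in range(length):
        followers = list(cfd[word].keys())
        if not followers:
            break
        word = random.choice(followers)   # uniform choice; could also sample by observed frequency
        out.append(word)
    return " ".join(out)

print(generate("visits"))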
This tool seems to implement some of the best-known techniques for natural language generation: simplenlg
Have you tried neural networks, especially LSTM and GRU architectures? These models are the most recent developments in predicting sequences of words. Generating natural language means generating a sequence of words such that it makes sense with respect to the input and the earlier words in the sequence. This is equivalent to predicting a time series. LSTMs are designed for predicting time series, and hence are commonly used to predict a sequence of words given an input sequence, an input word, or any other input that can be embedded in a vector.
Deep learning libraries such as Tensorflow, Keras, and Torch all have sequence to sequence implementations that can be used for generating natural language by predicting a sequence of words given an input.
Note that usually these models need a huge amount of training data.
You need to meet two criteria in order to benefit from such models:
You should be able to represent your input as a vector.
You need a relatively large amount of input/target pairs.
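As an illustration of the sequence-prediction setup described above, a minimal next-word LSTM sketch in Keras (the vocabulary size, context length and layer sizes are made-up values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 5000, 20
model = Sequential([
    Embedding(vocab_size, 128, input_length=seq_len),
    LSTM(256),
    Dense(vocab_size, activation="softmax"),   # distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# X: (n_samples, seq_len) arrays of word ids, y: (n_samples,) ids of the word that follows
# model.fit(X, y, epochs=50, batch_size=64)

Text is then generated by repeatedly predicting, sampling a word, and appending it to the input window.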

Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (pdf documents anywhere from 1 to 200 pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDAesk approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n topics for your final decisions, you just create your guidedLDA algorithm with n seed topics, each of which contains the keywords that make up its topic name. E.g.: I want to cluster into the known topics "biochemistry" and "physics", so I seed my guidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; in any case, the guidedLDA implementation I'm using (the Python version) makes it relatively easy to identify the top n words for a given topic. You can run guidedLDA once with only basic seed words, then use the top-n-words output to find more words to add to the topics. These top n words are also potentially helpful for the other approach I mention below.
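A hedged sketch of that seeding step, assuming the third-party guidedlda package's documented interface; X (a document-term count matrix) and word2id (a vocabulary mapping) are hypothetical inputs:

import guidedlda

seed_topic_list = [["biochemistry"], ["physics"]]    # one keyword list per known topic
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list) for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
# the top n words per topic can then be read from the fitted model's topic-word weights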
Non-LDAesk approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to alternatives for my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. Eg: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension & embedding model as my topic vectors; I've found that just averaging all the word2vec vectors in the doc has been the best solution for me so far, after a bit of preprocessing like removing stopwords). Then I just find the k closest topic vectors to the input document vector.
I should note that there's some ability to hand-tune this by changing the words that make up the topic vectors. One way to potentially identify further keywords is to use the guidedLDA model I mentioned earlier.
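A minimal sketch of this ~KNN approach (wv is a gensim KeyedVectors word-embedding model; topic_words, a dict mapping topic name to keyword list, and doc_tokens are hypothetical inputs):

import numpy as np

def avg_vector(words, wv):
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

topic_vectors = {name: avg_vector(words, wv) for name, words in topic_words.items()}
doc_vec = avg_vector(doc_tokens, wv)
ranked = sorted(topic_vectors, key=lambda t: cosine(doc_vec, topic_vectors[t]), reverse=True)
print(ranked[:3])   # the k closest topics, here k = 3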
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model? Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo. Just install the package and run demo(slda).

News Article Categorization (Subject / Entity Analysis via NLP?); Preferably in Node.js

Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put my two cents for anyone who happens to come back to this question.
Having worked in NLP, I would suggest you look into the following approach to solving the problem.
Don't look for a single-package solution. There are great packages out there, no doubt, for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be at least 3 or 4 iterations behind what's there in academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest, yes, but also with very, very (note the double stress) reasonable classification accuracy, runtime performance, and ease of use.
The thing you would need to work on is the feature set used to represent your news article/text/tags. You could use a bag-of-words model and add named entities as additional features. You could also use article location/time as features (though for simple category classification this might not give you much improvement).
The bottom line is: SVMs work great, they have multiple implementations, and at runtime you don't really need much ML machinery.
Feature engineering, on the other hand, is very task specific. But given a basic set of features and good labelled data, you can train a very decent classifier.
Here are some resources for you:
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this, but from the code it's a binary SVM classifier, which means that if you have a known set of N tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each of the N category tags.
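If you end up prototyping the SVM approach outside Node, a minimal scikit-learn sketch of the multiclass (one-vs-rest) setup; texts and categories are hypothetical parallel lists of articles and their labels:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
                    LinearSVC())            # trains one binary classifier per category (one-vs-rest)
clf.fit(texts, categories)
print(clf.predict(["Shares tumbled after the quarterly earnings report"]))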
Hope this helps.
