Stanford CoreNLP Train custom NER model - nlp

I was making some tests by training custom models with crf, and since i don't have a proper training file i would like to make by myself a list of 5 tags and maybe 10 words only to start with and the plan is to keep improving the model with more incoming data in the future. but the results i get are plenty of false positives (it tags many words which have nothing to do with the original one in the training file) i imagine since the models created are probabilistic and take into considerarion more than just separate words
Let's say i want to train corenlp to detect a small list of words without caring about the context are there some special settings for that? if not, is there a way to calculate how much data is needed to get an accurate model?

After some tests and research find out a really good option for my case is RegexNER which works in a deterministic way and can also be combined with NER. So far tried with smaller set of rules and does the job pretty well. Next step is to determine how scalable and usable is in a high traffic stress scenario (the one i'm interested of) and compare with other solutions based in python

Related

Can we compare word vectors from different models using transfer learning?

I want to train two word2vec/GLoVe models on different corpora and then compare the vectors of a single word. I know that it makes no sense to do so as different models start at different random states, but what if we use pre-trained word vectors as the starting point. Can we assume that the two models will continue to build upon the pre-trained vectors by incorporating the respective domain-specific knowledge, and not go into completely different states?
Tried to find some research papers which discuss this problem, but couldn't find any.
Simply starting your models with pre-trained bectors would eliminate some of the randomness, but with each training epoch on your new corpora:
there's still randomness introduced by negative-sampling (if using that default mode), by frequent-word downsampling (if using default values of the sample parameter in word2vec), and by the interplay of different threads
each epoch with your new corpora will be pulling the word-vectors for present words to new, better positions for that corpora, but leaving original words unmoved. The net movements over many epochs could move words arbitrarily far from where they started, in response to the whole-corpus-effects on all words.
So, doing so wouldn't necessarily achieve your goal in a reliable (or theoretically-defensible) way, though it might kinda-work – at least better than starting from purely random initialization – especially if your corpora are small and you do few training epochs. (That's usually a bad idea – you want big varied training data and enough passes for extra passes to make little incremental difference. But doing those things "wrong" could make your results look "better" in this scenario, where you don't want your training to change the original coordinate-space "too much". I wouldn't rely on such an approach.)
Especially if the words you need to compare are a small subset of the total vocabulary, a couple things you could consider:
combine the corpora into one training corpus, shuffled together, but for those words you need to compare, replace them with corpora-specific tokens. For example, replace 'sugar' with 'sugar_c1' and 'sugar_c2' – leaving the vast majority of surrounding words to be the same tokens (and thus learn a single vector across the whole corpus). Then, the two variant tokens for the "same word" will learn different vectors, based on their differing contexts that still share many of the same tokens.
using some "anchor set" of words that you know (or confidently conjecture) either do mean the same across both contexts, or should mean the same, train two models but learn a transformation between the two space based on those guide words. Then, when you apply that transformation to other words, that weren't used to learn the transformation, they'll land in contrasting positions in each others' spaces, maybe achieving the comparison you need. This is a technique that's been used for language-to-language translation, and there's a helper class and example notebook included with the Python gensim library.
There may be other better approaches, these are just two quick ideas that might work without much change to existing libraries. A project like 'HistWords', which used word-vector training to try to track evolving changes in word-meaning over time, might also have ideas for usable techniques.

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/ the mmr vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: How big is big enough? I want to avoid problems of overfitting (so I'll do a test train split) but as I include more and more tweets, the number of features needing to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets on a 3-way split seems quite ambitious, I would even consider 1000 per class for a 3-way split on tweets quite low. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (esp. with so little data), i.e. label a set vaccine vs non-vaccine, and within the vaccine subset you'd have a pro vs anti set.
In my experience trying to model a catch-all "neutral" class, that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to be working quite well.

News Article Categorization (Subject / Entity Analysis via NLP?); Preferably in Node.js

Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (eg, alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put my two cents for anyone who happens to come back to this question.
Having worked in NLP i would suggest you look into the following approach to solve the problem.
Don't look for a single package solution. There are great packages out there, no doubt for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be atleast 3 or 4 iterations behind whats there is academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest yes, but also with very very (see the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing which you would need to work on would be the feature set used to represent your news article/text/tag. You could use a bag of words model. add named entities as additional features. You can use article location/time as features. (though for a simple category classification this might not give you much improvement).
The bottom line is. SVM works great. they have multiple implementations. and during runtime you don't really need much ML machinery.
Feature engineering on the other hand is very task specific. But given some basic set of features and a good labelled data you can train a very decent classifier.
here are some resources for you.
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this but from the code its a binary classifier SVM. which means if you have a known set of tags of size N you want to classify the text into, you will have to train N binary SVM classifiers. One each for the N category tags.
Hope this helps.

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD based dimensional reduction and use the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~ 3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variable that had 65 levels).
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for doing the text preprocessing (packages: tm, snowball,lsa) and building classification models (package: kernelsvm)
Thank you.
Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'binary documentMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document, and 0 for non-existing term; moving to other scoring scheme
(e.g. tf/idf) might lead to better results.
LSA is a good metric for dimensional reduction in some cases, but less so in others. So depending in the exact nature of your data, it might be a good idea to consider additional methods, e.g. Infogain.
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?
This might not be the best tailored answer. Hope these suggestions may help.
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
One instance:
go,goes,going ->Lemma: go,go,go ||Stemming: go, goe, go
And use some predefined set of rules; such that short term words are generalized. For instance:
I'am -> I am
should't -> should not
can't -> can not
How to deal with parentheses inside a sentence.
This is a dog(Its name is doggy)
Text inside parentheses often referred to alias names of the entities mentioned. You can either removed them or do correference analysis and treat it as a new sentence.
Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to one category, and at most two categories.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seem to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
After this post and based on the personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc). It's written Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that is you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? Ie, all the character ngrams in a 3-gram to account for spelling variations/stemming/etc? Named entities might also be useful, or some sort of lexicon.
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built over Apache Hadoop(map/reduce), so scaling is inherent.
Take a look at classification section below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, I my focus was on topic modeling rather than classification per se.
Also, beware that with many NLP solutions you needn't input the "features" yourself (as the N-grams, i.e. the three-words-phrases and two-word-phrases mentioned in the question) but instead rely on the various NLP functions to produce their own statistical model.

Resources