Stanford NER prop file meaning of DistSim - nlp

In one of the example .prop files coming with the Stanford NER software there are two options I do not understand:
useDistSim = true
distSimLexicon = /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters
Does anyone have a hint what DistSim stands for and where I can find any more documentation on how to use these options?
UPDATE: I just found out that DistSim means distributional similarity. I still wonder what that means in this context.

"DistSim" refers to using features based on word classes/clusters, built using distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words that are similar, semantically and/or syntactically, and allow an NER system to generalize better, including better handling of words that do not appear in the NER system's training data. Many of our distributed models use distributional similarity clustering features as well as word identity features, and gain significantly from doing so. In Stanford NER, a number of flags/properties affect how distributional similarity is interpreted/used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, and unknownWordDistSimClass. You need to look at the code in NERFeatureFactory.java to decode the details, but in the simple case you only need the first two, and they must be set when training the model as well as at test time. The default format of the lexicon is a plain text file in which each line has two tab-separated columns: the word and its cluster name. The cluster names are arbitrary.

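As a rough illustration of the default lexicon format described in the answer above, the following Python sketch writes and reads a file with two tab-separated columns (word, cluster name). The words, the cluster names, and the helper names are made up for the example; real clusters would come from a clustering tool such as a Brown clustering implementation.

```python
import io

def write_distsim_lexicon(word_to_cluster, out):
    """Write one 'word<TAB>clusterName' line per word."""
    for word, cluster in sorted(word_to_cluster.items()):
        out.write(f"{word}\t{cluster}\n")

def read_distsim_lexicon(lines):
    """Parse 'word<TAB>clusterName' lines back into a dict."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, cluster = line.split("\t")
        lexicon[word] = cluster
    return lexicon

# Hypothetical clusters; the cluster names are arbitrary labels.
clusters = {"London": "c17", "Paris": "c17", "Monday": "c42"}
buffer = io.StringIO()
write_distsim_lexicon(clusters, buffer)
lexicon = read_distsim_lexicon(buffer.getvalue().splitlines())
```

A file written this way is what distSimLexicon would point to, with useDistSim = true set both when training and when applying the model.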
Related

Domain-specific word similarity

Does anyone know of an accurate tool or method that can be used to compute word embeddings or find similarity among domain-specific words? I'm working on an NLP project that involves computing cosine similarity between technical terms, such as "address" and "socket", but pre-trained models like word2vec aren't giving useful embeddings or accurate cosine similarities because they aren't specific to technical terms. Since the more general, non-technical meanings of "address" and "socket" aren't similar to one another, these pretrained models aren't giving them sufficiently high similarity scores for the purposes of my project. I would appreciate any advice people are able to offer. Thank you!
With sufficient data from your specific domain, you can train your own word2vec model - whose resulting word-vectors, being only influenced by your domain data, will be far more reflective of the in-domain meanings.
Similarly, if you have a mixture of data where you have hints that some word uses are for different senses of a polysemous word, you could try preprocessing your text, using those hints, replacing the ambiguous tokens (like say 'address') with a larger number of distinct tokens (like 'address*networking', 'address*delivery', etc). Even with a lot of error in such a process, its results might be sufficient for a specific purpose.
For example, maybe you'd assume all docs of a certain type – like articles from a particular publication – always mean 'address*networking' when they write 'address'. That crude replacement, on just some subset of docs sufficient to collect enough varied examples of 'address*networking' usage, might leave you with a good-enough word-vector for 'address*networking'.
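The crude replacement described above could look roughly like the following sketch. The source names, the sense mapping, and the function name are hypothetical; the only idea taken from the answer is that every document from a given source is assumed to use one particular sense.

```python
import re

# Hypothetical mapping: for each document source, the sense its authors
# are assumed to mean whenever they use an ambiguous word.
SENSE_BY_SOURCE = {
    "networking_journal": {
        "address": "address*networking",
        "socket": "socket*networking",
    },
    "postal_newsletter": {"address": "address*delivery"},
}

def tag_senses(text, source):
    """Tokenize one document and swap ambiguous tokens for sense-specific stand-ins."""
    replacements = SENSE_BY_SOURCE.get(source, {})
    tokens = re.findall(r"\w+", text.lower())
    return [replacements.get(token, token) for token in tokens]

tagged = tag_senses("Bind the socket to an address", "networking_journal")
untagged = tag_senses("Bind the socket to an address", "unknown_source")
```

The resulting token lists would then be used as training sentences for word2vec, so that 'address*networking' gets its own vector.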
(More generally, deciding which word sense of multiple candidates is meant by a particular word is called "word sense disambiguation", and it might be possible to use other preexisting code for performing that to help preprocess texts - replacing ambiguous tokens with more-specific stand-ins – before performing word2vec training.)
Even without such assistive pre-processing, there've been a number of research attempts to extend word2vec to better model words with multiple contrasting meanings. Googling for [word2vec polysemy] or [polysemous embeddings] should turn up a bunch of examples.
But I don't know any of those techniques that have become widely-used, or that are explicitly supported by major word2vec libraries, so I can't specifically recommend or show working code for any. I don't know a standard best-practice or off-the-shelf solution – you'd have to treat adopting those ideas from research papers as an R&D project, performing a lot of your own implementation/evaluation to see if any help with your goals.

Cluster similar words using word2vec

I have various restaurant labels, along with some words that are unrelated to restaurants, like the ones below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels that are related to food choices and leave out labels like oil and lube or transportation.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way to use NLP or Word2Vec to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But, if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often reflects other forms of close relation, including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out and tinker with. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common in the same topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
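To make the point about multi-word tokens concrete, here is a small sketch of how labels from a controlled vocabulary could be merged into single tokens such as oil_and_lube before training. The function names and the example sentence are illustrative assumptions, not part of any word2vec library.

```python
import re

def label_to_token(label):
    """Turn a label such as 'Oil and Lube' into the single token 'oil_and_lube'."""
    return "_".join(label.lower().split())

def merge_labels_in_text(text, labels):
    """Lowercase the text and replace each multi-word label with its single-token form."""
    merged = text.lower()
    # Longest labels first, so longer phrases win over phrases they contain.
    for label in sorted(labels, key=len, reverse=True):
        phrase = " ".join(label.lower().split())
        if " " in phrase:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            merged = re.sub(pattern, label_to_token(label), merged)
    return merged.split()

labels = ["vegan", "pizza", "Oil and Lube"]
tokens = merge_labels_in_text("Cheap Oil and Lube next to the pizza place", labels)
```

Training on text prepared this way gives the multi-word label its own vector, which can then be compared with the single-word labels.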
OK, thank you for the details.
In order to train word2vec you should take the following into account:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the data you need to obtain what you want.
2. Put one sentence/phrase per line.
3. For preprocessing, delete punctuation and convert all strings to lower case.
4. Do NOT lemmatize or stem, because that makes the text less complex!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say that CBOW (continuous bag-of-words) provided better results, on different training sets, than Skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: Vector length = 300 is OK.
Now run the training algorithm. Then use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. vectors) with cosine similarity. From my experience, cosine provides a satisfactory result: the similarity between two words is given by a double that, for related words, typically lies between 0 and 1 (cosine similarity as such ranges from -1 to 1). Synonyms have high cosine values; you must find the cutoff between words that are synonyms and words that are not.
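The preprocessing and comparison steps above can be sketched in plain Python as follows. The three-dimensional vectors are toy values standing in for trained word vectors, and the threshold is a hypothetical cutoff that would have to be tuned on real data.

```python
import math
import string

def preprocess(sentence):
    """Lowercase a sentence and strip punctuation, without lemmatizing or stemming."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not the output of a real model).
vectors = {
    "vegan": [0.9, 0.8, 0.1],
    "vegetarian": [0.8, 0.9, 0.2],
    "transportation": [0.1, 0.0, 0.9],
}
THRESHOLD = 0.8  # hypothetical cutoff between "related" and "unrelated"

def is_related(word_a, word_b):
    """True when the two words' vectors are at least THRESHOLD similar."""
    return cosine_similarity(vectors[word_a], vectors[word_b]) >= THRESHOLD
```

With these toy values, is_related("vegan", "vegetarian") is true and is_related("vegan", "transportation") is false; on real vectors the cutoff has to be found by inspecting known pairs.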

How does TreeTagger get the lemma of a word?

I am using TreeTagger to get the lemmas of words in Spanish, but I have observed that too many words are not transformed as they should be. I would like to know how this operation works: whether it uses techniques such as decision trees or machine learning algorithms, or whether it simply contains a list of words with their corresponding lemmas. Does anyone know?
Thanks!!
On the basis of personal communication via email with H. Schmid, the author of TreeTagger, the answer to your question is:
The lemmatization function is based on the XTAG Project, which includes a morphological analyzer. Within the XTAG project several corpora have been analyzed. For TreeTagger, the analysis of the Penn Treebank Corpus seems especially relevant, since this corpus is the training corpus for the English parameter file of TreeTagger. For lemmatization, the lemmata have simply been stored in a lexicon, which TreeTagger uses as a lookup table.
Hence, with TreeTagger you can only retrieve the lemmata that are available in the lexicon.
If you need lemmatization functionality beyond what TreeTagger offers, you will need a morphological analyzer and, depending on your approach, a suitable training corpus, although the latter does not seem mandatory, since several analyzers perform quite well even when applied directly to the corpus of interest.
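The practical consequence of a lookup-table lemmatizer can be shown with a small sketch. The lexicon entries, the (word form, tag) key, and the function names are assumptions made for illustration; this is not TreeTagger's actual data structure or API.

```python
# Toy Spanish lexicon keyed by (word form, part-of-speech tag).
# The key layout is an assumption for this example.
LEMMA_LEXICON = {
    ("cantaba", "VERB"): "cantar",
    ("casas", "NOUN"): "casa",
    ("niños", "NOUN"): "niño",
}

def lookup_lemma(word, tag, lexicon=LEMMA_LEXICON):
    """Return the stored lemma, or None when the form is not in the lexicon."""
    return lexicon.get((word.lower(), tag))

def lemmatize(tagged_words, lexicon=LEMMA_LEXICON):
    """Lemmatize (word, tag) pairs, falling back to the surface form for unknown words."""
    lemmas = []
    for word, tag in tagged_words:
        lemma = lookup_lemma(word, tag, lexicon)
        lemmas.append(lemma if lemma is not None else word)
    return lemmas

result = lemmatize([("casas", "NOUN"), ("bailábamos", "VERB")])
```

Any form missing from the table, such as the second word here, passes through unchanged, which matches the behaviour described in the question: words outside the lexicon are simply not transformed.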

Semantic Similarity across multiple languages

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).
So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)?
Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.
It should be possible to do what you've suggested, provided that:
you have good sets of word-vectors for each language's words
the coordinate spaces of the word-vectors are compatible, meaning the words for the exact-same things in both languages have nearly-identical coordinates (and other words with similar meanings have close coordinates)
That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally-separate English/Dutch corpuses won't likely place equivalent words at the same coordinates.
But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
In fact this very idea was used in one of the first word2vec papers:
"Exploiting Similarities among Languages for Machine Translation"
If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
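A minimal sketch of learning such a transformation from anchor word-pairs is shown below. It uses a closed-form least-squares fit from NumPy, which is one simple way to obtain a linear map and not necessarily the optimization procedure used in the paper. The toy data is constructed so that the 'foreign' space is an exact rotation of the 'canonical' one.

```python
import numpy as np

def fit_translation_matrix(source_vectors, target_vectors):
    """Least-squares W such that source_vectors @ W approximates target_vectors.

    Row i of each array holds the vector of the i-th anchor word-pair.
    """
    W, _residuals, _rank, _singular_values = np.linalg.lstsq(
        source_vectors, target_vectors, rcond=None
    )
    return W

def translate(vectors, W):
    """Project vectors from the source space into the target space."""
    return vectors @ W

# Toy anchor pairs: 10 two-dimensional vectors (assumed data).
rng = np.random.default_rng(0)
canonical = rng.normal(size=(10, 2))
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
foreign = canonical @ rotation

W = fit_translation_matrix(foreign, canonical)
mapped = translate(foreign, W)
```

With real embeddings the fit will not be exact, so the quality of the mapping depends on how many anchor pairs are available and how well they cover the vocabulary.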
Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
I've recently produced a Python implementation of the technique mentioned in the paper from @gojomo's answer: transvec.
You'll need to provide word translation pairs as training data (I just threw words from my corpus into Google Translate to get as many such pairs as I can) and then you can use a wrapper model from transvec to produce comparable word embeddings for multiple languages. Here's an example:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
For the case of documents rather than words, things are a little trickier because Doc2Vec can't use pre-trained Word2Vec models as a starting point. However, you can get an approximate document vector by simply taking the mean of all the word vectors from that document. If you provide a 2d array to TranslationWordVectorizer's transform method, it will do exactly this and provide you with an approximate document vector so you can find documents with similar meaning even if the languages are different.
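The averaging idea can be written out in a few lines of plain Python. This is a sketch of the general technique with toy embeddings and made-up function names, not the code that transvec runs internally.

```python
def mean_vector(word_vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(word_vectors)
    return [sum(components) / count for components in zip(*word_vectors)]

def document_vector(tokens, embeddings):
    """Average the vectors of tokens that have an embedding; None if none do."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    return mean_vector(known) if known else None

# Toy two-dimensional embeddings (assumed values).
embeddings = {"good": [1.0, 0.0], "food": [0.0, 1.0]}
doc_vec = document_vector(["good", "food", "unseen"], embeddings)
```

Once the word vectors of both languages live in the same coordinate space, document vectors built this way can be compared with cosine similarity regardless of language.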

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to at least one category and at most two categories.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
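The three assignment rules above could be expressed on top of any classifier that produces a score per category, as in this sketch. The thresholds, the ratio used for the second category, the default name, and the function names are hypothetical choices, and the scores are assumed to be non-negative.

```python
def pick_one_or_two(scores, second_ratio=0.8):
    """Always return the best category, plus the runner-up if it scores close to the best."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:1]
    if len(ranked) > 1 and scores[ranked[1]] >= second_ratio * scores[ranked[0]]:
        chosen.append(ranked[1])
    return chosen

def pick_all_strong(scores, threshold=0.7):
    """Return every category whose score reaches the threshold (possibly none)."""
    return sorted(c for c, s in scores.items() if s >= threshold)

def pick_one_or_default(scores, threshold=0.7, default="other"):
    """Return the best category if it is a strong match, else the default."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else default

scores = {"sports": 0.9, "politics": 0.75, "tech": 0.2}
first_type = pick_one_or_two(scores)
second_type = pick_all_strong(scores)
third_type = pick_one_or_default({"a": 0.4, "b": 0.3})
```

Keeping the decision rules separate from the scoring model makes it possible to swap classifiers while leaving the three categorization schemes unchanged.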
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seems to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
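For reference, the three-word-phrase features discussed above can be extracted with a few lines of Python. Whitespace tokenization and lowercasing are simplifying assumptions; the original feature extraction may differ.

```python
from collections import Counter

def word_ngrams(text, n=3):
    """Return all phrases of n consecutive words from the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Count how often each three-word phrase occurs in one item.
counts = Counter(word_ngrams("the quick brown fox jumps over the quick brown fox"))
```

Because each item only contains a small fraction of all possible phrases, a sparse mapping like this Counter is far more compact than a dense vector over millions of features.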
After reading this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc). It's written by Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that if you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? I.e., all the character n-grams in a 3-gram, to account for spelling variations/stemming/etc.? Named entities might also be useful, or some sort of lexicon.
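One way to read the character n-gram suggestion is sketched below: each three-word phrase is broken into overlapping character n-grams, so phrases that differ only in spelling still share most of their features. The n-gram length of 4, the space padding, and the overlap measure (Jaccard) are assumptions chosen for the example.

```python
def char_ngrams(phrase, n=4):
    """Overlapping character n-grams of a phrase, padded with one space on each side."""
    padded = f" {phrase} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_fraction(phrase_a, phrase_b, n=4):
    """Jaccard overlap between the character n-gram sets of two phrases."""
    grams_a = set(char_ngrams(phrase_a, n))
    grams_b = set(char_ngrams(phrase_b, n))
    union = grams_a | grams_b
    if not union:
        return 0.0  # both phrases too short to produce any n-gram
    return len(grams_a & grams_b) / len(union)

spelling_variant = shared_fraction("color of paint", "colour of paint")
different_phrase = shared_fraction("color of paint", "speed of light")
```

The spelling variants share most of their character n-grams while the unrelated phrase shares almost none, which is the effect the suggestion is after.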
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built on top of Apache Hadoop (map/reduce), so scaling is inherent.
Take a look at classification section below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.
Also, beware that with many NLP solutions you needn't input the "features" yourself (as the N-grams, i.e. the three-words-phrases and two-word-phrases mentioned in the question) but instead rely on the various NLP functions to produce their own statistical model.

Resources