Moses is a piece of software for building machine translation models, and KenLM is the de facto language model toolkit that Moses uses.
I have a text file with 16GB of text, and I use it to build a language model like this:
bin/lmplz -o 5 <text > text.arpa
The resulting file (text.arpa) is 38GB. Then I binarized the language model like this:
bin/build_binary text.arpa text.binary
And the binarized language model (text.binary) grows to 71GB.
In Moses, after training the translation model, you should tune the model weights using the MERT algorithm, which can be done with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.
MERT works fine with a small language model, but with the big language model it takes several days to finish.
A Google search turned up KenLM's filter, which promises to reduce the language model to a smaller size: https://kheafield.com/code/kenlm/filter/
But I'm clueless as to how to make it work. The command's help output is:
$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file
copy mode just copies, but makes the format nicer for e.g. irstlm's broken
parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel. Each sentence is on
a separate line. A separate file is created for each sentence by appending
the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
multiple mode.
context means only the context (all but last word) has to pass the filter, but
the entire n-gram is output.
phrase means that the vocabulary is actually tab-delimited phrases and that the
phrases can generate the n-gram when assembled in arbitrary order and
clipped. Currently works with multiple or union mode.
The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
text. This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.
threads:m sets m threads (default: concurrency detected by boost)
batch_size:m sets the batch size for threading. Expect memory usage from this
of 2*threads*batch_size n-grams.
There are two inputs: vocabulary and model. Either may be given as a file
while the other is on stdin. Specify the type given as a file using
vocab: or model: before the file name.
For ARPA format, the output must be seekable. For raw format, it can be a
stream i.e. /dev/stdout
But when I tried the following, it got stuck and did nothing:
$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
What should one do to the language model after binarization? Are there any other steps for manipulating large language models to reduce the computational load when tuning?
What is the usual way to tune with a large LM file?
How do I use KenLM's filter?
(more details at https://www.mail-archive.com/moses-support@mit.edu/msg12089.html)
Answering how to use the filter command of KenLM:
cat small_vocabulary_one_word_per_line.txt \
| filter single \
"model:LM_large_vocab.arpa" \
output_LM_small_vocab.
Note that single can be replaced with union or copy. Read more in the help text, which is printed if you run the filter binary without arguments.
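The vocabulary file is just a list of words, one per line. As a rough illustration (not part of the original answer), here is a minimal Python sketch that builds such a file from a tokenized text file; the file names are placeholders, and which text you extract the vocabulary from (for MERT tuning, typically the words that can appear in translations of your dev set) depends on your setup:

# Build the one-word-per-line vocabulary file expected by KenLM's filter.
# File names are illustrative placeholders.
vocab = set()
with open('dev_relevant_text.txt', encoding='utf-8') as f:
    for line in f:
        vocab.update(line.split())   # assumes the text is already tokenized

with open('small_vocabulary_one_word_per_line.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(sorted(vocab)) + '\n')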
Related
I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt
One is pretrained by Stanford and the other is trained by me.
The two have different vocabularies. To reduce OOV, I'd like to add the words that appear in file2 but not in file1 to file1.
How do I do that easily?
This is how I load and save the files in gensim 3.4.0.
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/thefile')
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
I don't know an easy way.
In particular, word-vectors that weren't co-trained together won't have compatible/comparable coordinate-spaces. (There's no one right place for a word – just a relatively-good place compared to the other words that are in the same model.)
So, you can't just append the missing words from another model: you'd need to transform them into compatible locations. Fortunately, it seems to work to use some set of shared anchor-words, present in both word-vector-sets, to learn a transformation, and then apply that to the words you want to move over.
There's a class, TranslationMatrix, and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors with a superset of the words in either of your source models.
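To make the idea concrete, here is a rough sketch of my own (not gensim's demo) that learns the transformation with ordinary least squares, the same idea TranslationMatrix implements, and writes out a merged word2vec-format text file. It assumes gensim 3.x and that both files are already loadable with load_word2vec_format, as in the snippet above; the file names follow the question:

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

big   = KeyedVectors.load_word2vec_format('Glove.840b.300.txt')    # file1
small = KeyedVectors.load_word2vec_format('custom_glove.300.txt')  # file2

# anchor words present in both vocabularies (capped to keep the fit cheap)
shared = [w for w in small.vocab if w in big.vocab][:20000]

# learn a linear map W taking file2's space into file1's space by least squares
S = np.array([small[w] for w in shared])
B = np.array([big[w] for w in shared])
W, _, _, _ = np.linalg.lstsq(S, B, rcond=None)

# append the transformed missing words to a merged word2vec-format text file
missing = [w for w in small.vocab if w not in big.vocab]
with open('merged.300.txt', 'w', encoding='utf-8') as out:
    out.write('%d 300\n' % (len(big.vocab) + len(missing)))
    for w in big.vocab:
        out.write(w + ' ' + ' '.join('%.6f' % x for x in big[w]) + '\n')
    for w in missing:
        vec = small[w].dot(W)
        out.write(w + ' ' + ' '.join('%.6f' % x for x in vec) + '\n')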
I'm totally new to word2vec, so please bear with me. I have a set of text files, each containing between 1,000 and 3,000 tweets. I have chosen a common keyword ("kw1") and I want to find semantically related terms for "kw1" using word2vec. For example, if the keyword is "apple", I would expect to see related terms such as "ipad", "os", "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file, as word2vec would be trained on the individual files separately (e.g., 5 input files, run word2vec 5 times, once on each file).
My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.
My questions/doubts are:
Does it make sense to use word2vec for a task like this? Is it technically appropriate, considering the small size of each input file?
I have downloaded the code from code.google.com: https://code.google.com/p/word2vec/ and have just given it a dry run as follows:
time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
./distance vectors.bin
From my results I see I'm getting many noisy terms (stop words) when I use the 'distance' tool to get terms related to "kw1". So I removed stop words and other noisy terms such as user mentions. But I haven't seen it stated anywhere that word2vec requires cleaned input data.
How do you choose the right parameters? I see that the results (from running the distance tool) vary greatly when I change parameters such as '-window' and '-iter'. Which technique should I use to find the correct values for the parameters? (Manual trial and error is not possible for me, as I'll be scaling up the dataset.)
First Question:
Yes, for almost any task that I can imagine word2vec being applied to, you are going to have to clean the data, especially if you are interested in semantics (not syntax), which is the usual reason to run word2vec. Also, it is not just about removing stopwords, although that is a good first step. Typically you will also want a tokenizer and a sentence segmenter; I think if you look at the documentation for deeplearning4j (which has a word2vec implementation) it shows the use of these tools. This is important since you probably don't care about the relationship between apple and the number "5", apple and "'s", etc.
For more discussion on preprocessing for word2vec see https://groups.google.com/forum/#!topic/word2vec-toolkit/TI-TQC-b53w
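For illustration, here is a minimal cleaning sketch in Python using NLTK's tweet tokenizer and stop word list (my own choice of tools; the answer only says to use a tokenizer and remove noise). It assumes raw tweets, one per line, and that the NLTK stopwords corpus has been downloaded:

from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stop = set(stopwords.words('english'))

with open('tweets_raw.txt', encoding='utf-8') as fin, \
     open('tweets_clean.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = tokenizer.tokenize(line)
        # keep alphabetic, non-stopword tokens (drops numbers, punctuation, URLs)
        tokens = [t for t in tokens if t.isalpha() and t not in stop]
        fout.write(' '.join(tokens) + '\n')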
Second Question:
There is no automatic tuning available for word2vec AFAIK, since that would imply the author of the implementation knows what you plan to do with it. Typically the implementation's default values are the "best" values for whoever implemented it, on the task (or set of tasks) they cared about. Sorry, word2vec isn't a turn-key solution. You will need to understand the parameters and adjust them to fit your task.
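If you would rather experiment from Python than with the C tool, gensim exposes the same parameters. Below is a sketch (assuming gensim 3.x, where the keyword names are size and iter, and using the cleaned file from the sketch above) that mirrors the command line shown earlier and then inspects the neighbours of the keyword:

from gensim.models import Word2Vec

# one tweet per line, space-separated tokens
sentences = [line.split() for line in open('tweets_clean.txt', encoding='utf-8')]

model = Word2Vec(sentences,
                 size=200,      # vector dimensionality (-size)
                 window=10,     # context window (-window)
                 negative=25,   # negative samples (-negative)
                 hs=1,          # hierarchical softmax (-hs)
                 sample=1e-3,   # subsampling threshold (-sample)
                 sg=0,          # CBOW, like -cbow 1
                 iter=50,       # training epochs (-iter)
                 workers=4,
                 min_count=5)

# neighbours of the keyword for this file ("kw1" is the placeholder keyword)
print(model.wv.most_similar('kw1', topn=10))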
I'm working with the Stack Exchange data dump and attempting to identify unique and novel words in the corpus. I'm doing this by referencing a very large word list and extracting the words not present in it.
The problem I am running up against is that a number of the unique tokens are non-words: directory names, error codes, and other such strings.
Is there a good method for distinguishing word-like strings from non-word-like strings?
I'm using NLTK, but am not limited to that toolkit.
This is an interesting problem because it's so difficult to define what makes a combination of characters a word. I would suggest using supervised machine learning.
First, take the current output from your program and manually annotate each example as word or non-word.
Then, come up with some features, e.g.
number of characters
first three characters
last three characters
preceding word
following word
...
Then use a library like scikit-learn to train a model that captures these differences and can predict the likelihood of "wordness" for any sequence of characters.
Potentially a one-class classifier would be useful here. In any case, prepare some held-out data so that you can evaluate the accuracy of this or any other approach.
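As a toy illustration of that pipeline (my own sketch, with made-up examples and only some of the features listed above), scikit-learn's DictVectorizer plus a logistic regression yields a probability of "wordness":

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical manually annotated examples: label 1 = word, 0 = non-word
annotated = [('directory', 1), ('usr/local/bin', 0),
             ('stackexchange', 1), ('0x7fff5fbff8a8', 0)]

def features(token):
    # a subset of the features listed above (context words omitted in this toy sketch)
    return {
        'length': len(token),
        'prefix3': token[:3],
        'suffix3': token[-3:],
        'has_digit': any(c.isdigit() for c in token),
    }

X = [features(tok) for tok, _ in annotated]
y = [label for _, label in annotated]

clf = make_pipeline(DictVectorizer(), LogisticRegression(solver='liblinear'))
clf.fit(X, y)

# predicted probability that a new string is a word
print(clf.predict_proba([features('libxml2-dev')])[0][1])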
Suppose I have two features which are both text based; for example, say I'm trying to predict sports games, and I've got:
1) Excerpt from sports commentary (a body of text)
2) Excerpt from Internet fan predictions (also a body of text).
If I were to use a text vectorizer (say HashingVectorizer) on feature 1), with fit_transform(), would it be bad to use it again (fit_transform()) on feature 2, or should I create a new vectorizer for that? I'm just wondering whether reusing fit_transform() on multiple features with the same vectorizer might perhaps have bad side effects.
I would say it depends on whether you want the text-to-vector conversion step to be reproducible. For example, if you want to keep using the classifier (or whatever) you built from the first data set, you need to reuse the same vectorizer. If you fit a new one on a different data set, it will build a different vocabulary, i.e. pull out different tokens and lay the vectors out differently. (HashingVectorizer itself is stateless, so this concern really applies to vocabulary-based vectorizers such as CountVectorizer or TfidfVectorizer.) A fresh fit might be what you want with a very different data set (if you're going to retrain): the second data set could contain new words that are critical for prediction, and those would be missed if you reused the old vectorizer.
By the way, the vectorizers can be pickled if you want to save them to disk. For an example, see: how to pickle customized vectorizer?.
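For instance, here is a sketch of the fit-once/transform-later pattern and of pickling, using TfidfVectorizer so the vocabulary effect is visible (for HashingVectorizer the question is largely moot, since it needs no fitting); the data is made up:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

commentary = ["the striker scored twice in the second half", "a tense goalless draw"]
fan_posts  = ["they will win two nil for sure", "no chance against that defence"]

# fit once on one text feature, then reuse the same fitted vectorizer
vec = TfidfVectorizer()
X_commentary = vec.fit_transform(commentary)
X_fans = vec.transform(fan_posts)   # transform only: same vocabulary, same columns

# persist the fitted vectorizer so the conversion is reproducible at prediction time
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vec, f)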
I'm trying to perform document classification into two categories (category1 and category2), using Weka.
I've gathered a training set consisting of 600 documents belonging to both categories and the total number of documents that are going to be classified is 1,000,000.
To perform the classification, I apply the StringToWordVector filter and set the following options to true:
- IDF transform
- TF transform
- OutputWordCounts
I'd like to ask a few questions about this process.
1) How many documents should I use as the training set so that over-fitting is avoided?
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?
3) As the classification method I usually choose NaiveBayes, but the results I get are the following:
-------------------------
Correctly Classified Instances 393 70.0535 %
Incorrectly Classified Instances 168 29.9465 %
Kappa statistic 0.415
Mean absolute error 0.2943
Root mean squared error 0.5117
Relative absolute error 60.9082 %
Root relative squared error 104.1148 %
----------------------------
and if I use SMO the results are:
------------------------------
Correctly Classified Instances 418 74.5098 %
Incorrectly Classified Instances 143 25.4902 %
Kappa statistic 0.4742
Mean absolute error 0.2549
Root mean squared error 0.5049
Relative absolute error 52.7508 %
Root relative squared error 102.7203 %
Total Number of Instances 561
------------------------------
So, for document classification, which one is the "better" classifier?
Which one is better for small data sets, like the one I have?
I've read that NaiveBayes performs better with big data sets, but if I increase my data set, will that cause over-fitting?
Also, regarding the Kappa statistic: is there an accepted threshold, or does it not matter in this case because there are only two categories?
Sorry for the long post, but I've been trying for a week to improve the classification results with no success, even though I tried to gather documents that fit each category better.
1) How many documents should I use as the training set so that over-fitting is avoided?
You don't need to choose the size of the training set yourself; in Weka, just use 10-fold cross-validation. Coming back to the question: the choice of machine learning algorithm has a much bigger influence on over-fitting than the size of the data set.
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?
It definitely plays a role, but whether the result gets better cannot be promised.
3) As the classification method I usually choose NaiveBayes, but the results I get are the following:
Usually, to judge whether a classification algorithm is good, ROC/AUC/F-measure values are considered the most important indicators. You can read about them in any machine learning book.
To answer your questions:
I would use (10-fold) cross-validation to evaluate your method. The model is trained 10 times, each time on 90% of the data and tested on the remaining 10%, using a different split each time. The results are therefore less biased towards your current (random) selection of training and test set.
Removing stop words (i.e., frequently occurring words with little discriminating value like the, he or and) is a common strategy to improve your classifier. Weka's StringToWordVector allows you to select a file containing these stop words, but it should also have a default list with English stop words.
Given your results, SMO performs the best of the two classifiers (e.g., it has more Correctly Classified Instances). You might also want to take a look at (Lib)SVM or LibLinear (You may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing for easy installation), which can perform quite well on document classification as well.
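The question is about Weka, but for what it's worth, here is a rough scikit-learn analogue of the same experiment (TF-IDF features, English stop word removal, 10-fold cross-validation, Naive Bayes vs. a linear SVM); the documents are toy stand-ins for your 600 labelled ones:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# toy stand-ins for the labelled training documents
docs = ["invoice payment overdue", "quarterly revenue report",
        "match report and goals", "league standings update"] * 50
labels = ["category1", "category1", "category2", "category2"] * 50

for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(stop_words='english'), clf)
    scores = cross_val_score(pipe, docs, labels, cv=10, scoring='f1_macro')
    print(type(clf).__name__, round(scores.mean(), 3))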
Regarding the second question:
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?
I was building a classifier and training it on the famous 20 Newsgroups dataset. When I tested it without preprocessing, the results were not good, so I pre-processed the data according to the following steps:
- Substitute TAB, NEWLINE and RETURN characters by SPACE.
- Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
- Turn all letters to lowercase.
- Substitute multiple SPACES by a single SPACE.
- The title/subject of each document is simply added at the beginning of the document's text.
- no-short: obtained from the previous file by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
- no-stop: obtained from the previous file by removing the 524 SMART stopwords. Some of them had already been removed because they were shorter than 3 characters.
- stemmed: obtained from the previous file by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/
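A small Python sketch of those steps, as my own approximation: NLTK's English stop word list stands in for the 524 SMART stopwords, and the title/subject is assumed to already be prepended to the text:

import re
from nltk.corpus import stopwords       # stand-in for the 524 SMART stopwords
from nltk.stem import PorterStemmer     # requires nltk.download('stopwords')

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(text):
    # keep only letters (tabs/newlines, punctuation, digits become spaces), lowercase
    text = re.sub(r'[^A-Za-z]+', ' ', text).lower()
    tokens = text.split()                              # also collapses multiple spaces
    tokens = [t for t in tokens if len(t) >= 3]        # no-short
    tokens = [t for t in tokens if t not in stop]      # no-stop
    return ' '.join(stemmer.stem(t) for t in tokens)   # stemmed

print(preprocess("Subject: Re: OS upgrade\nHe said it's done, 100%!"))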