Mallet: Tokenization by N-grams (1,2) - topic-modeling

I was wondering whether it would be possible to tokenize words in Mallet by n-gram size between 1 and 2?
This is the code that I have used so far:
bin\mallet import-dir --input sample-data\web\en --output sample.txt --keep-sequence-bigrams --remove-stopwords
bin\mallet train-topics --input sample.txt --num-topics 20 --optimize-interval 10 --output-doc-topics sample_composition.txt --output-topic-keys sample_keys.txt
Thank you in advance.

The topic model trainer doesn't use the bigrams feature; supporting it would make the code much more complicated. There are two ways to get bigram-like behavior. The first is to modify the input data file before importing it, so that
the cat sat
would become
the cat sat the_cat cat_sat
The second is to create a post-hoc report that identifies pairs of words that frequently occur together and get assigned to the same topic, using --xml-topic-phrase-report FILENAME.
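For the first approach, here is a minimal preprocessing sketch in Python (my own illustration, not part of MALLET; the whitespace tokenization and stdin/stdout file handling are assumptions) that appends underscore-joined bigrams to each document before import:
# Hypothetical sketch: append underscore-joined bigrams to each
# whitespace-tokenized line so MALLET imports them as extra "words".
import sys

def add_bigrams(line):
    tokens = line.split()
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return " ".join(tokens + bigrams)

for line in sys.stdin:
    print(add_bigrams(line.strip()))
Running each file through this before bin\mallet import-dir gives the importer both unigrams and underscore-joined bigrams as ordinary tokens.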

Related

Get most similar words using GloVe

I am new to GloVe. I successfully ran their demo.sh as given on their website. After running the demo I got several files created, such as vocab, vectors, etc. But there isn't any documentation that describes which files we need to use, and how to use them, to find the most similar words.
How can I find the most similar words to a given word in GloVe (using cosine similarity), e.g. like most_similar in Gensim word2vec?
Please help me!
It doesn't really matter how the word vectors are generated; you can always calculate cosine similarity between them. The easiest way to achieve what you asked for (assuming you have gensim) is:
python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
This will convert the GloVe vector file to word2vec format. You can do it manually too: just add an extra line at the top of your GloVe file containing the total number of vectors and their dimensionality. It looks something like:
180000 300
<The rest of your file>
After that you can just load the file into gensim and everything works as if it were a regular word2vec model.
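For example, a minimal sketch of loading the converted file with gensim and querying nearest neighbours by cosine similarity (the file name is a placeholder for whatever you passed to --output above):
# Load the converted (word2vec-format) GloVe vectors and query neighbours.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.w2v.txt", binary=False)

# Cosine-similarity nearest neighbours, like most_similar in word2vec models.
print(vectors.most_similar("apple", topn=10))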

applying regression on bag of words

I have a text document and have cleaned the text. Now I have a list of words that I want to apply regression on, but I don't know how to do it. Can anyone please help?
And can I use other machine learning algorithms on the list of words?
Please provide details on what kind of prediction you are doing.
In the general case (using scikit-learn):
Step 1: Use a Snowball stemmer to stem the words.
Step 2: Using this parsed data, create feature and label arrays and split them into training and test sets.
Step 3: Convert the text to numeric feature vectors using TfidfVectorizer.
Step 4: Since this will produce a huge number of features, select the top 10 (or whatever you want) percentile using SelectPercentile to remove low-weight features.
Now you can use your feature set for whatever purpose you want!
Hope this helps :)
PS: You will need to do some research on NLTK and the vectorizer for appropriate parameters and tuning; a sketch of these steps follows below.
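Here is a minimal sketch of those steps with scikit-learn; the documents, target values, regressor choice (Ridge), and percentile are my own placeholder assumptions, not from the question:
# Hypothetical sketch: TF-IDF features + percentile selection + regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

documents = [
    "great quality and fast shipping",
    "terrible quality very slow shipping",
    "good value for the price",
    "poor value not worth the price",
    "excellent product would buy again",
]
targets = [5.0, 1.0, 4.0, 2.0, 5.0]  # numeric labels to regress on

X_train, X_test, y_train, y_test = train_test_split(
    documents, targets, test_size=0.2, random_state=0)

# Step 3: convert text to TF-IDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Step 4: keep only the highest-scoring features (use a small percentile,
# e.g. 10, when the vocabulary is large; 50 here because the toy data is tiny).
selector = SelectPercentile(f_regression, percentile=50)
X_train_sel = selector.fit_transform(X_train_tfidf, y_train)
X_test_sel = selector.transform(X_test_tfidf)

model = Ridge()
model.fit(X_train_sel, y_train)
print(model.predict(X_test_sel))
Stemming (Step 1) would be applied to the documents before vectorization, e.g. with NLTK's SnowballStemmer.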

applying word2vec on small text files

I'm totally new to word2vec so please bear with me. I have a set of text files, each containing a set of tweets (between 1000 and 3000). I have chosen a common keyword ("kw1") and I want to find semantically relevant terms for "kw1" using word2vec. For example, if the keyword is "apple" I would expect to see related terms such as "ipad", "os", "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file, as word2vec would be trained on individual files (e.g., for 5 input files, run word2vec 5 times, once on each file).
My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.
My questions/doubts are:
Does it make sense to use word2vec for a task like this? Is it technically sound, considering the small size of each input file?
I have downloaded the code from code.google.com: https://code.google.com/p/word2vec/ and have just given it a dry run as follows:
time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
./distance vectors.bin
From my results I saw that I'm getting many noisy terms (stopwords) when I use the 'distance' tool to get terms related to "kw1". So I removed stopwords and other noisy terms such as user mentions. But I haven't seen it stated anywhere that word2vec requires cleaned input data. Does it?
How do you choose the right parameters? I see that the results (from running the distance tool) vary greatly when I change parameters such as '-window' and '-iter'. Which technique should I use to find good values for the parameters? (Manual trial and error is not possible for me as I'll be scaling up the dataset.)
First Question:
Yes, for almost any task that I can imagine word2vec being applied to, you are going to have to clean the data, especially if you are interested in semantics (not syntax), which is the usual reason to run word2vec. Also, it is not just about removing stopwords, although that is a good first step. Typically you will also want a tokenizer and a sentence segmenter; I think if you look at the documentation for deeplearning4j (which has a word2vec implementation) it shows how to use these tools. This is important since you probably don't care about the relationship between "apple" and the number "5", "apple" and "'s", etc.
For more discussion on preprocessing for word2vec see https://groups.google.com/forum/#!topic/word2vec-toolkit/TI-TQC-b53w
Second Question:
There is no automatic tuning available for word2vec AFAIK, since that would imply the author of the implementation knows what you plan to do with it. Typically the default values are the "best" values for whoever implemented it, on the task (or set of tasks) they had in mind. Sorry, word2vec isn't a turn-key solution. You will need to understand the parameters and adjust them to fit your task accordingly.
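As an illustration of what cleaning and parameter tuning can look like, here is a hedged sketch using gensim's Word2Vec (gensim >= 4.0) rather than the C tool from the question; the file name, stopword list, tokenizer, and parameter values are assumptions to experiment with, not recommended settings:
# Minimal sketch: train word2vec on one small tweet file and query
# terms related to a keyword. The cleaning here is deliberately simple.
import re
from gensim.models import Word2Vec

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}  # tiny example list

def tokenize(line):
    line = re.sub(r"@\w+", " ", line.lower())      # drop user mentions
    tokens = re.findall(r"[a-z][a-z0-9']+", line)  # crude word tokenizer
    return [t for t in tokens if t not in STOPWORDS]

with open("tweets_file_1.txt", encoding="utf-8") as f:
    sentences = [tokenize(line) for line in f if line.strip()]

# Small corpora usually call for smaller vectors and more epochs;
# these values are starting points to tune, not definitive choices.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1, epochs=50)
print(model.wv.most_similar("kw1", topn=20))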

How to tune a Machine Translation model with huge language model?

Moses is a piece of software for building machine translation models, and KenLM is the de facto language model software that Moses uses.
I have a text file with 16GB of text and I use it to build a language model like this:
bin/lmplz -o 5 <text > text.arpa
The resulting file (text.arpa) is 38GB. Then I binarized the language model as such:
bin/build_binary text.arpa text.binary
And the binarized language model (text.binary) grows to 71GB.
In Moses, after training the translation model, you should tune the weights of the model using the MERT algorithm. This can simply be done with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.
MERT works fine with a small language model, but with the big language model it takes days to finish.
I did a Google search and found KenLM's filter, which promises to filter the language model down to a smaller size: https://kheafield.com/code/kenlm/filter/
But I'm clueless as to how to make it work. The command help gives:
$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file
copy mode just copies, but makes the format nicer for e.g. irstlm's broken
parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel. Each sentence is on
a separate line. A separate file is created for each sentence by appending
the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
multiple mode.
context means only the context (all but last word) has to pass the filter, but
the entire n-gram is output.
phrase means that the vocabulary is actually tab-delimited phrases and that the
phrases can generate the n-gram when assembled in arbitrary order and
clipped. Currently works with multiple or union mode.
The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
text. This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.
threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading. Expect memory usage from this
of 2*threads*batch_size n-grams.
There are two inputs: vocabulary and model. Either may be given as a file
while the other is on stdin. Specify the type given as a file using
vocab: or model: before the file name.
For ARPA format, the output must be seekable. For raw format, it can be a
stream i.e. /dev/stdout
But when I tried the following, it gets stuck and does nothing:
$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
What should one do to the language model after binarization? Are there any other steps to manipulate large language models to reduce the computing load when tuning?
What is the usual way to tune with a large LM file?
How does one use KenLM's filter?
(more details on https://www.mail-archive.com/moses-support@mit.edu/msg12089.html)
Answering how to use the filter command of KenLM:
cat small_vocabulary_one_word_per_line.txt \
| filter single \
"model:LM_large_vocab.arpa" \
output_LM_small_vocab.
Note that single can be replaced with union or copy. Read more in the help, which is printed if you run the filter binary without arguments.
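The vocabulary input is just one word per line. As a hedged illustration (the file names and whitespace tokenization are my assumptions, not from the thread), you could build that vocabulary file from the text you plan to decode during tuning with a few lines of Python before running filter:
# Hypothetical sketch: collect a one-word-per-line vocabulary so that
# filter keeps only n-grams whose words occur in the tuning data.
words = set()
with open("tuning_set.txt", encoding="utf-8") as f:
    for line in f:
        words.update(line.split())

with open("small_vocabulary_one_word_per_line.txt", "w", encoding="utf-8") as out:
    for w in sorted(words):
        out.write(w + "\n")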

about lda inference

Right now, I'm using the LDA topic modelling tool from the MALLET package to do some topic detection on my documents. Everything was fine initially; I got 20 topics from it. However, when I try to infer topics for a new document using the model, the results are kind of baffling.
For instance, I deliberately ran my model over a document that I manually created, which contains nothing but keywords from one of the topics ("FLU"), but the topic distribution I got was <0.1 for every topic. I then tried the same thing on one of the already sampled documents, which had a high score of 0.7 for one of the topics. Again the same thing happened.
Can someone give me a clue as to the reason?
I tried asking on the MALLET mailing list but apparently no one has replied.
I also know very little about MALLET, but the docs mention this...
Topic Inference
--inferencer-filename [FILENAME]  Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
Maybe you forgot to do this? It does sound to me like the data you are training on is not in the same format as the data you are testing on.
I had the same difficulty with Mallet.
Later I found the problem is that the documents must be read in through the Pipe that was once used to read in the training documents.
Here is the sample to read in training documents:
ImportExample importerTrain = new ImportExample(); // this is an example class in MALLET to import docs
InstanceList training = importerTrain.readDirectory(new File(trainingDir));
training.save(new File(outputFile));
And here is how to read in docs for topic inference:
InstanceList training = InstanceList.load(new File(outputFile));
Pipe pipe = training.getPipe();
ImportExample importer = new ImportExample();
importer.pipe = pipe; //use the same pipe
InstanceList testing = importer.readDirectory(new File(testDir));
I got my clue from one question posted in their archive: http://thread.gmane.org/gmane.comp.ai.mallet.devel/829
Disclosure: I'm familiar with the techniques and the math generally used for topic inference, but I have minimal exposure to MALLET.
I hope these semi-educated guesses lead you to a solution. No warranty ;-)
I'm assuming you are using the mallet command hlda for training the model.
A few things that may have gone wrong:
Ensure you used the --keep-sequence option during the import phase of the process. By default mallet saves the inputs as plain bags of words, losing the order in which the words originally appear. This may be OK for basic classification tasks, but not for topic modeling.
Remember that the Gibbs sampling used by mallet is a stochastic process; expect variations, in particular with small samples. During tests you may want to specify the same random seed for each run to ensure the results are reproducible.
What is the size of your training data? 20 topics seems a lot for initial tests, which are typically based on small, manually crafted and/or quickly assembled training and testing sets.
Remember that topic inference is based on sequences of words, not isolated keywords (your description of the manually crafted test document mentions "keywords" rather than, say, "expressions" or "phrases").
Here's how I infer topic distributions for new documents using MALLET. I thought I would post this since I have been looking into how to do this, and there are a lot of answers but none of them are comprehensive. This includes the training steps as well, so you get an idea of how the different files connect to each other.
Create your training data:
$BIN_DIR/mallet import-file --input $DIRECTORY/data.input --output $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
where data.input is a document containing your file ID, label, and a sequence of tokens or token IDs. Then train your model on this data with the parameters you like. For example:
$BIN_DIR/mallet train-topics --input $DIRECTORY/data.mallet \
--num-topics $TOPICS --output-state $DIRECTORY/topic-state.gz \
--output-doc-topics $DIRECTORY/doc-topics.gz \
--output-topic-keys $DIRECTORY/topic-words.gz --num-top-words 500 \
--num-iterations 1000
Later, you can create an inferencer using your trained model and training data:
bin/mallet train-topics --input $DIRECTORY/data.mallet --num-topics NUMBER --input-state $DIRECTORY/topic-state.gz --no-inference --inferencer-filename $DIRECTORY/inferencer-model
Now, create a file for the new documents using the pipe from the training data:
bin/mallet import-file --input $DIRECTORY/new_data.input --output $DIRECTORY/new_data.mallet --use-pipe-from $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
Infer topics on new documents:
bin/mallet infer-topics --inferencer $DIRECTORY/inferencer-model --input $DIRECTORY/new_data.mallet --output-doc-topics $DIRECTORY/new_data_doc_topics --num-iterations 1000
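To use the inferred distributions downstream, you can parse the doc-topics output with a short script. Here is a hedged sketch, assuming the layout of document index, document name, then one proportion per topic in topic order (older MALLET versions instead print sorted topic/proportion pairs, so check the header line of your file first):
# Hypothetical parser for new_data_doc_topics: skip "#" header lines,
# then read doc index, doc name, and the per-topic proportions.
doc_topics = {}
with open("new_data_doc_topics", encoding="utf-8") as f:
    for line in f:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        doc_name = fields[1]
        proportions = [float(x) for x in fields[2:]]
        doc_topics[doc_name] = proportions

# For example, print the dominant topic for each new document.
for name, props in doc_topics.items():
    print(name, props.index(max(props)))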
