About LDA inference - NLP

Right now I'm using the LDA topic modelling tool from the MALLET package to do some topic detection on my documents. Everything was fine initially; I got 20 topics from it. However, when I try to infer topics for a new document using the model, the results are baffling.
For instance, I deliberately ran my model over a document I created manually that contains nothing but keywords from one of the topics ("FLU"), but the topic distribution I got was <0.1 for every topic. I then tried the same thing on one of the already-sampled documents, which had a high score of 0.7 for one of the topics. Again, the same thing happened.
Can someone give me a clue as to the reason?
I tried asking on the MALLET mailing list, but apparently no one has replied.

I also know very little about MALLET, but the docs mention this...
Topic Inference
--inferencer-filename [FILENAME]  Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference. Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
Maybe you forgot to do this? It does sound to me like the data you are training on is not in the same format as the data you are testing on.

I had the same difficulty with MALLET.
Later I found that the problem is that new documents must be read in through the same Pipe that was originally used to read in the training documents.
Here is a sample that reads in the training documents:
ImportExample importer = new ImportExample(); // an example class in MALLET for importing docs
InstanceList training = importer.readDirectory(new File(trainingDir));
training.save(new File(outputFile));
And when reading in the documents for topic inference:
InstanceList training = InstanceList.load(new File(outputFile));
Pipe pipe = training.getPipe();
ImportExample importer = new ImportExample();
importer.pipe = pipe; //use the same pipe
InstanceList testing = importer.readDirectory(new File(testDir));
I got my clue from a question posted in their archive: http://thread.gmane.org/gmane.comp.ai.mallet.devel/829

Disclosure: I'm familiar with the techniques and the math generally used for topic inference, but I have minimal exposure to MALLET.
I hope these semi-educated guesses lead you to a solution. No warranty ;-)
I'm assuming you are using the mallet command hlda for training the model.
A few things that may have gone wrong:
Ensure you used the --keep-sequence option during the import phase of the process. By default mallet saves the inputs as plain bags of words, losing the order in which the words were originally found. This may be OK for basic classification tasks but not for topic modeling.
Remember that the Gibbs sampling used by mallet is a stochastic process; expect variations, in particular with small samples. During tests you may want to specify the same random seed for each run to ensure the results are comparable.
What is the size of your training data? 20 topics seems like a lot for initial tests, which are typically based on small, manually crafted and/or quickly assembled training and testing sets.
Remember that topic inference is based on sequences of words, not isolated keywords (your description of the manually crafted test document mentions "keywords" rather than, say, "expressions" or "phrases").

Here's how I infer topic distributions for new documents using MALLET. I thought I would post since I have been looking into how to do this, and there are a lot of answers, but none of them are comprehensive. This includes the training steps as well, so you get an idea of how the different files connect to each other.
Create your training data:
$BIN_DIR/mallet import-file --input $DIRECTORY/data.input --output $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
where data.input is a file with one document per line: a file ID, a label, and a sequence of tokens or token IDs. Then train your model on this data with whatever parameters you like. For example:
$BIN_DIR/mallet train-topics --input $DIRECTORY/data.mallet \
--num-topics $TOPICS --output-state $DIRECTORY/topic-state.gz \
--output-doc-topics $DIRECTORY/doc-topics.gz \
--output-topic-keys $DIRECTORY/topic-words.gz --num-top-words 500 \
--num-iterations 1000
Later, you can create an inferencer using your trained model and training data:
bin/mallet train-topics --input $DIRECTORY/data.mallet --num-topics NUMBER --input-state $DIRECTORY/topic-state.gz --no-inference --inferencer-filename $DIRECTORY/inferencer-model
Now, create a .mallet file for the new documents, using the pipe from the training data:
bin/mallet import-file --input $DIRECTORY/new_data.input --output $DIRECTORY/new_data.mallet --use-pipe-from $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
Infer topics on new documents:
bin/mallet infer-topics --inferencer $DIRECTORY/inferencer-model --input $DIRECTORY/new_data.mallet --output-doc-topics $DIRECTORY/new_data_doc_topics --num-iterations 1000

Related

Fine-tune a davinci model to be similar to InstructGPT

I have a few-shot GPT-3 text-davinci-003 prompt that produces "pretty good" results, but I quickly run out of tokens per request for interesting use cases. I have a data set (n~20) which I'd like to use to train the model further, but there is no way to fine-tune these InstructGPT models, only base GPT models.
As I understand it I can either:
A: Find a way to harvest 10x more data (I don't see an easy option here)
or B: Find a way to fine-tune Davinci into something capable of simpler InstructGPT behaviours
(Please let me know if there's a third option. I've attempted to increase epochs from 4 to 10 but the quality is really nowhere near as good).
Is there any way to fine-tune Davinci up to the point where it can model some of the things Instruct does? I don't need the full capabilities, but if I can narrow it down to my use case, that would be ideal.
--
By the way, there is a common misconception that fine-tuning a base GPT-3 model (davinci, ada, babbage, etc.) will train it on top of the latest instruction-tuned model, e.g. text-davinci-003. This is not how it works, as explained in OpenAI's blog posts and support articles:
https://help.openai.com/en/articles/6819989-can-i-fine-tune-on-text-davinci-003
Please don't claim that openai api fine_tunes.create -t "model_prepared.jsonl" -m "davinci" will create a model based on text-davinci-003; it does not. It uses the base davinci model.
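For what it's worth, the same call in the legacy (pre-1.0) openai Python package looked roughly like the sketch below (the JSONL file name is a placeholder); the point stands that model="davinci" is the base model, not text-davinci-003:
# Sketch only, assuming the legacy (pre-1.0) openai Python client;
# the v1+ client moved fine-tuning to a different interface.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Upload the prepared JSONL training file (placeholder file name).
upload = openai.File.create(file=open("model_prepared.jsonl", "rb"),
                            purpose="fine-tune")

# This fine-tunes the *base* davinci model, not text-davinci-003.
job = openai.FineTune.create(training_file=upload.id, model="davinci")
print(job.id)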

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer to this. As a concrete example: I want to build a model with Doc2Vec and then train some ML models on its vectors. After that, how can I get the vector of a raw string from the exact same trained Doc2Vec model? I need to feed my ML model a vector of the same size and meaning as the training vectors.
There are a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text (in the same notebook linked above).
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
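Putting that together, a minimal gensim sketch (the model path and my_tokenize are placeholders; in older gensim the parameter is called steps, in gensim 4+ it is epochs):
# Minimal sketch, assuming an older gensim where infer_vector takes steps/alpha;
# in gensim 4+ the parameter is named epochs instead of steps.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")  # placeholder path to your trained model

def my_tokenize(text):
    # Placeholder: must match the preprocessing used on the training documents.
    return text.lower().split()

tokens = my_tokenize("some new raw string to vectorize")

# More inference passes and a training-like alpha usually help on short texts.
vector = model.infer_vector(tokens, steps=50, alpha=0.025)
print(vector.shape)  # same dimensionality as the trained document vectors
The resulting vector has the same dimensionality as the training document vectors, so it can be fed straight to whatever downstream ML model you trained on them.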

applying word2vec on small text files

I'm totally new to word2vec, so please bear with me. I have a set of text files, each containing a set of tweets, between 1000-3000. I have chosen a common keyword ("kw1") and I want to find semantically relevant terms for "kw1" using word2vec. For example, if the keyword is "apple" I would expect to see related terms such as "ipad", "os", "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file, as word2vec would be trained on individual files (e.g., 5 input files, run word2vec 5 times, once on each file).
My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.
My questions/doubts are:
Does it make sense to use word2vec for a task like this? Is it technically sound, considering the small size of each input file?
I have downloaded the code from code.google.com: https://code.google.com/p/word2vec/ and have just given it a dry run as follows:
time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
./distance vectors.bin
From my results I saw that I'm getting many noisy terms (stopwords) when I use the 'distance' tool to get terms related to "kw1". So I removed stopwords and other noisy terms such as user mentions. But I haven't seen anywhere that word2vec requires cleaned input data?
How do you choose the right parameters? I see the results (from running the distance tool) vary greatly when I change parameters such as '-window' and '-iter'. Which technique should I use to find the correct values for the parameters? (Manual trial and error is not possible for me, as I'll be scaling up the dataset.)
First Question:
Yes, for almost any task that I can imagine word2vec being applied to, you are going to have to clean the data, especially if you are interested in semantics (not syntax), which is the usual reason to run word2vec. Also, it is not just about removing stopwords, although that is a good first step. Typically you will also want a tokenizer and sentence segmenter; I think if you look at the documentation for deeplearning4j (which has a word2vec implementation) it shows how to use these tools. This is important since you probably don't care about the relationship between "apple" and the number "5", "apple" and "'s", etc.
For more discussion on preprocessing for word2vec see https://groups.google.com/forum/#!topic/word2vec-toolkit/TI-TQC-b53w
Second Question:
There is no automatic tuning available for word2vec AFAIK, since that would imply the author of the implementation knows what you plan to do with it. Typically the default values are the ones that worked "best" for whoever implemented it, on whatever task (or set of tasks) they were targeting. Sorry, word2vec isn't a turn-key solution. You will need to understand the parameters and adjust them to fit your task accordingly.
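If you'd rather experiment in Python than with the C tool, here is a rough gensim equivalent of the command above (gensim is a swap-in alternative, not the tool used in the question; older gensim uses size/iter, gensim 4+ uses vector_size/epochs, and the input file name, stopword list, and mention-stripping regex are placeholders):
# Rough gensim sketch mirroring the C-tool flags above (a swap-in alternative).
# Assumes an older gensim (size/iter); in gensim 4+ use vector_size/epochs.
import re
from gensim.models import Word2Vec

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is"}  # toy list

def clean(line):
    line = re.sub(r"@\w+", " ", line.lower())          # drop user mentions
    return [t for t in re.findall(r"[a-z]+", line) if t not in STOPWORDS]

with open("tweets.txt") as fh:                          # placeholder input file
    sentences = [clean(line) for line in fh if line.strip()]

model = Word2Vec(sentences, size=200, window=10, negative=25, hs=1,
                 sample=1e-3, workers=4, iter=50, min_count=5)

# Rough equivalent of the ./distance tool for the chosen keyword.
print(model.wv.most_similar("apple", topn=20))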

topic modeling on mallet

I'm currently getting started with topic modeling (beginner).
I was thinking of using MALLET as a tool to help me understand this area, but my problem is this: I'd like to train a model on, let's say, 1000 documents, and then use that model on a single new document to generate its potential topics.
But as far as I can tell from the MALLET tutorial, this tool or API is always described as working on a corpus of texts, which means it's used to find topics across several documents.
Is there a way that it can find topics in a single document based on the model (or the inference parameters it learned/constructed from the 1000 documents)?
Is there any other tool that can do this?
Thanks a lot!
You can refer to the example code in src/cc/mallet/examples/TopicModel.java, which shows how to cluster documents and infer topics for a new instance.
Actually, when you run plain LDA on a directory, the model assigns topic proportions to each of the documents in that directory based on a model already trained on part of your corpus. So topic proportions are assigned with a certain probability to each of the documents (already ranked by the probability that the topic appears in that specific document).
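If you'd rather drive this from Python, gensim releases before 4.0 also shipped a thin MALLET wrapper that handles both training and single-document inference; a rough sketch, assuming that older gensim, a local MALLET install, and a toy corpus in place of your 1000 documents:
# Rough sketch, assuming gensim < 4.0 (models.wrappers was removed in 4.0)
# and a local MALLET install; paths and the toy corpus are placeholders.
from gensim import corpora
from gensim.models.wrappers import LdaMallet

train_docs = [["flu", "fever", "virus", "vaccine"],
              ["stock", "market", "price", "shares"]]       # toy corpus
dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]

lda = LdaMallet("/path/to/mallet/bin/mallet", corpus=corpus,
                num_topics=2, id2word=dictionary)            # e.g. 20 for a real corpus

# Infer topic proportions for a single new document with the trained model.
new_bow = dictionary.doc2bow(["flu", "outbreak", "vaccine"])
print(lda[new_bow])   # list of (topic id, proportion) pairs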

Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (PDF documents anywhere from 1 to 200 pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDA-esque approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n topics for your final decisions, you just create your GuidedLDA algorithm with n seed topics, each of which contains the keywords that make up its topic name. E.g.: I want to cluster into the known topics "biochemistry" and "physics". Then I seed my GuidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; the guidedLDA implementation I'm using (Python version) also makes it relatively easy to identify the top n words for a given topic, so you can run GuidedLDA once with only basic seed words and then use the top-n-words output to find more words to add to the topics. These top n words are also potentially helpful for the other approach I'm mentioning.
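A rough sketch of that seeding, assuming the Python guidedlda package and a toy vocabulary and document-term matrix X in place of a real corpus:
# Rough sketch, assuming the Python guidedlda package; the vocabulary and the
# documents-by-vocabulary count matrix X are toy placeholders.
import numpy as np
import guidedlda

vocab = ["biochemistry", "enzyme", "protein", "physics", "quantum", "energy"]
word2id = {w: i for i, w in enumerate(vocab)}
X = np.array([[3, 2, 2, 0, 0, 1],
              [0, 0, 1, 3, 2, 2]])

seed_topic_list = [["biochemistry"], ["physics"]]     # one seed list per known topic
seed_topics = {word2id[w]: t
               for t, words in enumerate(seed_topic_list) for w in words}

model = guidedlda.GuidedLDA(n_topics=len(seed_topic_list), n_iter=100,
                            random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

# Top words per topic, handy for growing the seed lists iteratively.
for topic_id, dist in enumerate(model.topic_word_):
    top = np.argsort(dist)[::-1][:3]
    print(topic_id, [vocab[i] for i in top])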
Non-LDA-esque approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to alternatives for my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. Eg: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension & embedding model as my topic vectors - I've found just averaging all the word2vec vectors in the doc has been the best solution for me so far, after a bit of preprocessing like removing stopwords). Then I just find the k closest topic vectors to the input document vector.
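A rough sketch of that averaging-plus-nearest-topic idea with gensim word vectors (the vector file, topic word lists, and the crude whitespace tokenizer are placeholders, and cosine similarity stands in for whatever distance you prefer):
# Rough sketch of the "topic vector" / nearest-topic idea with gensim KeyedVectors.
# The vector file, topic word lists and tokenizer below are placeholders.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("my_word2vec.kv")          # placeholder pretrained vectors

topics = {
    "Biochemistry / Molecular Biology": ["biochemistry", "molecular", "biology"],
    "Physics": ["physics", "quantum", "mechanics"],
}

def average_vector(words):
    vecs = [wv[w] for w in words if w in wv]      # ignore out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

topic_vectors = {name: average_vector(words) for name, words in topics.items()}

def rank_topics(doc_text, k=2):
    doc_vec = average_vector(doc_text.lower().split())   # crude tokenizer placeholder
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scores = {name: cosine(doc_vec, vec) for name, vec in topic_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

print(rank_topics("the enzyme kinetics measured in molecular biology experiments"))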
I should note that there's some ability to hand-tune this by changing the words that make up the topic vectors. One way to potentially identify further keywords is to use the GuidedLDA model I mentioned earlier.
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model? Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo: just install the package and run demo(slda).
