How to get document-topics using models.hdpmodel – Hierarchical Dirichlet Process in gensim

I have just started studying gensim for topic modeling. When I use
lda_model = gensim.models.ldamodel.LdaModel(...)
the resulting lda_model has two methods: get_topics() and get_document_topics(). With them I can find the topic-word and document-topic distributions. But I want to try:
hdp_lda_model = gensim.models.hdpmodel.HdpModel(...)
I can only find get_topics() on the result, and nothing like get_document_topics(), so I cannot find the relation between documents and topics. It should be somewhere, though. I read the documentation at https://radimrehurek.com/gensim/models/hdpmodel.html but did not find anything (maybe I missed something?). Is there a function in the HDP model that works like get_document_topics() in the LDA model?

Both models have a __getitem__ method that does what you want.
For LDA it is actually a wrapper around get_document_topics:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L1503
For HDP it wraps the inference method, but does somewhat more than just call it:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/hdpmodel.py#L427
So, to answer your question, you can do the following for both models:
lda_model[bow_doc]
or
hdp_lda_model[bow_doc]
to get a topic distribution for bow_doc.
The result looks something like:
[(5, 0.05342164806543596),
(7, 0.04307238446604077),
(11, 0.5281130394662548),
(31, 0.28899472194287035),
(60, 0.07985460856925444)]

Related

Simple MultiGPU during inference with huggingface

I have two GPUs.
How can I use them for inference with a huggingface pipeline?
The huggingface documentation seems to say that we can easily use the DataParallel class with a huggingface model, but I have not seen any example.
For example, with pytorch it is very easy to do the following:
net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
output = net(input_var) # input_var can be on any device, including CPU
Is there an equivalent with huggingface?

I found that this is not possible with pipelines, so there are two ways:
Do it with the Trainer object in huggingface, which also supports inference, but it is not optimal.
Use Queues from the multiprocessing standard library, but this creates a lot of boilerplate code.
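The queue-based option could look roughly like the sketch below. It is only an outline under assumptions: load_model is a hypothetical stand-in that returns a dummy callable so the code runs without any GPU, and NUM_GPUS, worker and run_inference are illustrative names, not huggingface APIs. In real use, load_model would build one pipeline per process on its own device.

```python
import multiprocessing as mp

NUM_GPUS = 2  # assumption: the two GPUs mentioned in the question


def load_model(gpu_id):
    # Hypothetical stand-in so the sketch runs anywhere. In real use this
    # would create the model on one device, for example something like
    # transformers.pipeline("text-classification", device=gpu_id).
    return lambda text: {"gpu": gpu_id, "length": len(text)}


def worker(gpu_id, tasks, results):
    # Each worker process owns one copy of the model on one device.
    model = load_model(gpu_id)
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        index, text = item
        results.put((index, model(text)))


def run_inference(texts, num_gpus=NUM_GPUS):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(gpu_id, tasks, results))
             for gpu_id in range(num_gpus)]
    for proc in procs:
        proc.start()
    # Feed (index, text) pairs, then one sentinel per worker.
    for item in enumerate(texts):
        tasks.put(item)
    for _ in procs:
        tasks.put(None)
    # Collect results and restore the input order using the index.
    outputs = [None] * len(texts)
    for _ in texts:
        index, output = results.get()
        outputs[index] = output
    for proc in procs:
        proc.join()
    return outputs


if __name__ == "__main__":
    print(run_inference(["a short text", "another, longer text"]))
```

Results are drained before the workers are joined so that no process blocks on a full queue. With real CUDA models, the "spawn" start method (mp.get_context("spawn")) is generally needed instead of the platform default.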

LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn

I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn.
Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).
I am interested in leveraging scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid using the gensim to scikit-learn wrapper i.e. actually leverage sklearn’s LDA). From my research, there is seemingly no scikit-learn equivalent to Gensim’s CoherenceModel.
Is there a way to either:
1 - Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline, either through manually converting the scikit-learn model into gensim format or through a scikit-learn to gensim wrapper (I have seen the wrapper the other way around) to generate Topic Coherence?
Or
2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
I have done quite a bit of research on implementations for this use case online but haven’t seen any solutions. The only leads I have are the documented equations from scientific literature.
If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!
*Side note: I understand that perplexity and log-likelihood are available in scikit-learn for performance measurements, but these are not as predictive from what I have read.

Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline
As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted this myself, but this strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn-wrapper which you might want to look at, but it seems deprecated - maybe you can get some information/inspiration from that.
Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
The summing-up of vectors you need can be easily achieved with a loop. You can find code samples for a "manual" coherence calculation for NMF. Calculation depends on the specific measure, of course, but sklearn should return you the data you need for the analysis pretty easily.
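As one possible illustration of such a loop, here is a sketch of the UMass coherence measure (one of several measures, picked here only as an example) in plain Python. The function name and the toy documents are made up. With scikit-learn, top_words would come from sorting a row of lda.components_ and mapping the indices through the vectorizer's feature names.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of a single topic.

    top_words: the topic's words, most probable first.
    documents: iterable of token collections, one per document.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() preserves order, so `higher` always outranks `lower`.
    for higher, lower in combinations(top_words, 2):
        score += math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
    return score


# Toy documents, invented for illustration only.
docs = [["dog", "cat"], ["dog", "cat"], ["dog"], ["cow"]]
print(umass_coherence(["dog", "cat"], docs))  # words that co-occur
print(umass_coherence(["dog", "cow"], docs))  # words that never co-occur
```

Scores closer to zero indicate that the topic's top words tend to appear in the same documents. Averaging the per-topic scores gives a single number for the whole model.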
Resources
It is unclear to me why you would categorically exclude gensim - the topic coherence pipeline is pretty extensive, and documentation exists.
See, for example, these three tutorials (in Jupyter notebooks).
Demonstration of the topic coherence pipeline in Gensim
Performing Model Selection Using Topic Coherence
Benchmark testing of coherence pipeline on Movies dataset
The formulas for several coherence measures can be found in this paper here.

TensorflowJS text/string classification

Subject
Hello. I want to implement a text classification feature using Tensorflow.js in NodeJS.
Its job will be to match a string with some pre-defined topics.
Examples:
Input: String: "My dog loves walking on the beach"
Pre-defined topics: Array<String>: ["dog", "cat", "cow"]
Output: There are many output variants I am comfortable with. These are some examples, but if you can suggest something better, please do!
String (the most likely topic) - Example: "dog"
Object (every topic with a predicted score) Example: {"dog": 0.9, "cat": 0.08, "cow": 0.02}
Research
I know similar results can be achieved by filtering the strings for the topic names and applying some algorithms, but they can also be achieved with ML.
There are already some posts about using strings, classifying text and creating autocomplete with TensorFlow (though I am not sure about TFjs), like these:
https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub
http://ruder.io/text-classification-tensorflow-estimators/
https://machinelearnings.co/tensorflow-text-classification-615198df9231
How you can help
My goal is to do topic prediction with TensorflowJS. I just need an example of the best way to train models with strings or to classify text; I will then extend the rest myself.

Text classification has an added challenge: first turning words into vectors. There are various approaches depending on the nature of the problem. Before building the model, one should make sure to have vectors associated with all the words of the corpus. A naive vector representation of the corpus then suffers from another issue, sparsity; hence the need for word embeddings. The two most popular algorithms for this task are Word2Vec and GloVe. There are some implementations in js. Alternatively, one can create vectors using the bag-of-words approach as outlined here.
Once the vectors are available, a fully connected neural network (FCNN) will suffice to predict the topic of a text. Another thing to take into consideration is the length of the text: if a text is too short, it can be padded, and so on. Here is a model:
const model = tf.sequential();
model.add(tf.layers.dense({units: 100, activation: 'relu', inputShape: [lengthSentence]}));
model.add(tf.layers.dense({units: numTopics, activation: 'softmax'}));
model.compile({optimizer: 'sgd', loss: 'categoricalCrossentropy'});
Key Takeaways of the model
The model simply connects the input to the categorical output. It is a very simple model. In some scenarios, however, adding an embedding layer after the input layer can be considered:
model.add(tf.layers.embedding({inputDim: inputDimSize, inputLength: lengthSentence, outputDim: embeddingDims}))
In some other cases, an LSTM layer can be relevant:
tf.layers.lstm({units: lstmUnits, returnSequences: true})

I am working on something like this.
My code: https://github.com/ran-j/ChatBotNodeJS/blob/master/routes/index.js
Based on https://chatbotsmagazine.com/contextual-chat-bots-with-tensorflow-4391749d0077
And then
classify('is your shop open today?')
[('opentoday', 0.9264171123504639)]
But my code does not work for prediction yet.

Spark LDA model prediction on new documents [duplicate]

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations in here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:
val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
This is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm, which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimization of the alphas is pretty important for obtaining good topics.
