I'm looking to use Mallet to classify different documents by topics that I have defined. I know that Mallet will first determine the topics, then classify the documents but I want to skip the first step because I already have a list of topics with words associated with them. Is there any way to use pre-defined topic lists that I have created to classify documents with Mallet?
Any guidance is appreciated. Thanks!
If you're doing unsupervised learning (without training examples, i.e. docs for each topic), you cannot trivially just set the topics. The point is that the training algorithm does not know anything about the docs in advance. It just tries to separate/distribute them, based on the features you provide.
If you're doing supervised learning, topics are actually classes and you have documents for each class. Then the algorithm tries to learn which features are significant for each class. In mallet you should use the Classification module.
There are probably some fancy topic modelling ideas, which incorporate / skew the topic distributions according to specific keywords, but I don't think that's possible with Mallet.
Related
I'm investigating various NLP algorithms and tools to solve the following problem; NLP newbie here, so pardon my question if it's too basic.
Let's say, I have a messaging app where users can send text messages to one or more people. When the user types a message, I want the app to suggest to the user who the potential recipients of the message are?
If user "A" sends a lot of text messages regarding "cats" to user "B" and some messages to user "C" and sends a lot of messages regarding "politics" to user "D", then next time user types the message about "cats" then the app should suggest "B" and "C" instead of "D".
So I'm doing some research on topic modeling and word embeddings and see that LDA and Word2Vec are the 2 probable algorithms I can use.
Wanted to pick your brain on which one you think is more suitable for this scenario.
One idea I have is, extract topics using LDA from the previous messages and rank the recipients of the messages based on the # of times a topic has been discussed (ie, the message sent) in the past. If I have this mapping of the topic and a sorted list of users who you talk about it (ranked based on frequency), then when the user types a message, I can again run topic extraction on the message, predict what the message is about and then lookup the mapping to see who can be the possible recipients and show to user.
Is this a good approach? Or else, Word2Vec (or doc2vec or lda2vec) is better suited for this problem where we can predict similar messages using vector representation of words aka word embeddings? Do we really need to extract topics from the messages to predict the recipients or is that not necessary here? Any other algorithms or techniques you think will work the best?
What are your thoughts and suggestions?
Thanks for the help.
Since you are purely looking at topic extraction from previous posts, in my opinion LDA would be a better choice. LDA would describe the statistical relationship of occurrences. Semantics of the words would mostly be ignored (if you are looking for that then you might want to rethink). But also I would suggest to have a look at a hybrid approach. I have not tried it myself but looks quiet interesting.
lda2vec new hybrid approach
Also, if you happen to try it out, would love to know your findings.
I think you're looking for recommender systems (Netflix movie suggestions, amazon purchase recommendations, ect) or Network analysis (Facebook friend recommendations) which utilize topic modeling as an attribute. I'll try to break them down:
Network Analysis:
FB friends are nodes of a network whose edges are friendship relationships. Calculates betweenness centrality, finds shortest paths between nodes, stores shortest edges as a list, closeness centrality is the sum of length between nodes.
Recommender Systems:
recommends what is popular, looks at users similar and suggests things that the user might be interested in, calculates cosine similarity by measuring angels between vectors that point in the same direction.
LDA:
topic modeler for text data - returns topics of interest might be used as a nested algorithm within the algorithms above.
Word2Vec:
This is a neccassary step in building an LDA it looks like this: word -> # say 324 then count frequency say it showed up twice in a sentence:
This is a sentence is.
[(1,1), (2,2), (3,1), (4,1), (2,2)]
It is a neural net you will probably have to use as a pre-processing step.
I hope this helps :)
I have used three different ways to calculate the matching between the resume and the job description. Can anyone tell me that what method is the best and why?
I used NLTK for keyword extraction and then RAKE for
keywords/keyphrase scoring, then I applied cosine similarity.
Scikit for keywords extraction, tf-idf and cosine similarity
calculation.
Gensim library with LSA/LSI model to extract keywords and calculate
cosine similarity between documents and query.
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
As already mentioned in answers, try ti use Doc2Vec.
Seems using Doc2Vec from Gensim on both corpora (CVs and job descriptions) separately and then using cosine similarity between the two vectors is the easiest flow to work. It works better than others on documents which are not similar in form and words content but similar in context and sematics, so merely keywords would not help much here.
Then you can try to train CNN on the corpus of pairs of matched CV&JD with labels like yes/no if available and use it to qulaify CVs/resumees against job descriptions.
Basically I'm going to try these aproaches in my pretty much the same task, pls see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila
Since its highly likely that job description and resume content can be different, you should think from semantics point of view. One thing possible you can do is use some domain knowledge. But its pretty difficult to gain domain knowledge for a variety of job types. Researchers sometimes use dictionary to augment the similarity matching between documents.
Researchers are using deep neural networks to capture both syntactic and semantic structure of documents. You can use doc2Vec to compare two documents. Gensim can produce doc2Vec representation for you. I believe that will give better results compared to keyword extraction and similarity computation. You can build your own neural network model to train on job descriptions and resumes. I guess neural networks will be effective for your work.
I'm currently doing the topic modeling things (beginner)
I was thinking using mallet for some tool to get me understand this area, but, my problem is, I'd like to train a model based on, let's say, 1000 documents, to construct a model and using the model on a new single document to generate its potential topics.
But, as far as I read about mallet tutorial, it always says like this tool or API is useful on a corpus of texts, which means, it's used to find topics within several documents.
Is there a way that it can find topic on single document based on the model (or inference parameter it learned / constructed from the 1000 documents?)
Is there any other tool that can do this?
Thanks a lot!
You can refer the example code src/cc/mallet/examples/TopicModel.java which describes how to clustering and infer the new instance.
Actually when you run the simple LDA on a directory the model assigns topic proportions to each of the documents of that directory based on "an already" trained model from a part of your corpus. So, topic proportions are assigned with a certain probability to each of the documents (already ranked by the probability of appearance of that topic to that specific document).
Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (pdf documents anywhere from 1 to 200
pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDAesk approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n-topics for your final decisions you just create your guidedLDA algorithm with n-seed topics, each of which contain the keywords that makeup their topic name. Eg: I want to cluster into known topics "biochemistry" and "physics". Then I seed my guidedLDA with d = {0: ['biochemsitry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them, however the guidedLDA algorithm I'm using (python version) makes it relatively easy to identify the top n-words for a given topic. You can run guidedLDA once with only basic seed words then use the top n-words output to consider for more words to add to topics. These top n-words also are potentially helpful for the other approach I'm mentioning.
Non-LDAesk approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to alternatives for my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. Eg: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension & embedding model as how I made my topic vectors - I've found just averaging all the word2vec vectors in the doc has been the best solution for my so far, after a bit of preprocessing like removing stopwords). Then I just find the k-closest topic vectors to the input document vector.
I should note that there's some ability to hand tune this by changing the words that makeup the topic vectors. One way to potentially identify further keywords is to use the guidedLDA model I mentioned earlier.
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model. Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo. Just install the package and run demo(slda).
I have encountered a very unusual problem. I have a set of phrases (noun phrases) extracted from a large corpus of documents. These phrases are >=2 and <=3 words of length. There is a need to cluster these phrases because the number of phrases extracted are very large in number and showing them as a simple list might not be useful for the user.
We are thinking of nice very simple ways of clustering these. Is there a quick tool/software/method that I could use to cluster these so that all phrases inside a cluster belong to a particular theme/topic, if I keep the number of topics as a fixed initially? I don't have any training set or any other clusters that I can use as a training set.
Topic classification is not an easy problem.
The conventional methods used to classify long documents (100's of words) are usually based on frequent words, and not suitable for very short messages. I believe that your problem is somewhat similar to tweet classification.
Two very interesting papers are:
Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia
(presented at HCI International 2011)
Eddi: Interactive Topic-based Browsing of Social Status Streams (presented at UIST'10)
If you want to include knowledge about the world so that, e.g., cat and dog will be clustered together, you can use WordNet's domains hierarchy.