I'm working on text classification and I want to use Topic models (LDA).
My corpus consists of at least 24, 000 Persian news documents. each doc in the corpus is in format of (keyword, weight) pairs extracted from the news.
I saw two Java toolkits: mallet and lingpipe.
I've read mallet tutorial on importing the data and it gets data in plain text, not the format that I have. is there any way that I could change it?
Also read a little about the lingpipe, the example from tutorial was using arrays of integers. Is it convenient for large data?
I need to know which implementation of LDA is better for me? Are there any other implementation that suits my data? (in Java)
From the keyword-weight file you can create an artificial text containing the words in random order with the given weights. Run mallet on the so-generated texts to retrieve the topics.
Related
I am beginner to Machine Learning and NLP, I have to create a bot based on FAQ dataset, Each FAQ dataset excel file contains 2 columns "Questions" and its "Answers".
Eg. A record from an excel file (A question & it's answer).
Question - What is RASA-NLU?
Answer - Rasa NLU is trained to identify intent and entities. Better the training, better the identification...
We have 3K+ excel files which has around 10K to 20K such records each excel.
To implement the bot, I would have followed exactly this FAQ bot approach which uses RASA-NLU, but the RASA,Chatterbot also Microsoft's QnA maker are not allowed in my organization.
And Spacy does the NER extraction perfectly for me, so I am looking for a bot creation using Spacy. but I don't know how to proceed further after extracting the entities. (IMHO, I will have to predict the exact question from dataset (and its answer from knowlwdge base) from user query to the bot)
I don't know what NLP algorithm/ ML process to be used or is there any easiest way to create that FAQ bot using extracted NERs.
One way to achieve your FAQ bot is to transform the problem into a classification problem. You have questions and the answers can be the "labels". I suppose that you always have multiple training questions which map to the same answer. You can encode each answer in order to get smaller labels (for instance, you can map the text of the answer to an id).
Then, you can use your training data (the questions) and your labels (the encoded answers) and feed a classifier. After the training your classifier can predict the label of unseen questions.
Of course, this is a supervised approach, so you will need to extract features from your training sentences (the questions). In this case, you can use as a feature the bag-of-word representations and even include the named entities.
An example of how to do text classification in spacy is available here: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
I am building an NLP pipeline and I am trying to get my head around in regards to the optimal structure. My understanding at the moment is the following:
Step1 - Text Pre-processing [a. Lowercasing, b. Stopwords removal, c. stemming, d. lemmatisation,]
Step 2 - Feature extraction
Step 3 - Classification - using the different types of classifier(linearSvC etc)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction ?
I read online that you can do [a. Vectorising usin ScikitLearn b. TF-IDF]
but also I read that you can use Part of Speech or word2Vec or other embedding and Name entity recognition.
b. What is the optimal process/structure of using these?
c. On the text pre-processing I am ding the processing on a text column on a df and the last modified version of it is what I use as an input in my classifier. If you do feature extraction do you do that in the same column or you create a new one and you only send to the classifier the features from that column?
Thanks so much in advance
The preprocessing pipeline depends mainly upon your problem which you are trying to solve. The use of TF-IDF, word embeddings etc. have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. In order to make the best use of the data, we need to implement the proper pipeline.
Specifically for text related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem needs to be solved emphasising the words with lesser frequency. Word embeddings, on the other hand, convert the text to a N-dimensional vector which may show up similarity with some other vector. This could bring a sense of association in your data and the model can learn the best features possible.
In simple cases, we can use a bag of words representation to tokenize the texts.
So, you need to discover the best approach for your problem. If you are solving a problems which closely resembles the famous NLP problems like IMDB review classification, sentiment analysis on Twitter data, then you can find a number of approaches on the internet.
I'm currently doing the topic modeling things (beginner)
I was thinking using mallet for some tool to get me understand this area, but, my problem is, I'd like to train a model based on, let's say, 1000 documents, to construct a model and using the model on a new single document to generate its potential topics.
But, as far as I read about mallet tutorial, it always says like this tool or API is useful on a corpus of texts, which means, it's used to find topics within several documents.
Is there a way that it can find topic on single document based on the model (or inference parameter it learned / constructed from the 1000 documents?)
Is there any other tool that can do this?
Thanks a lot!
You can refer the example code src/cc/mallet/examples/TopicModel.java which describes how to clustering and infer the new instance.
Actually when you run the simple LDA on a directory the model assigns topic proportions to each of the documents of that directory based on "an already" trained model from a part of your corpus. So, topic proportions are assigned with a certain probability to each of the documents (already ranked by the probability of appearance of that topic to that specific document).
I am doing Latent Dirichlet Analyses for some research and keep running into a problem. Most lda software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document. However, Blei's lda-c and dynamic topic model software requires that data be in the format: [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count] where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared
in the document. Note that [term_1] is an integer which indexes the
term; it is not a string.
Does anyone know of a utility that will let me quickly convert to this format? Thank you.
If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.
example <- c("I am the very model of a modern major general",
"I have a major headache")
corpus <- lexicalize(example, lower=TRUE)
Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.
So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.
The Mallet package from University of Massachusetts Amherst is another option.
http://mallet.cs.umass.edu/
http://mallet.cs.umass.edu/topics.php
And here is an excellent step-by-step demo on how to use Mallet:
http://programminghistorian.org/lessons/topic-modeling-and-mallet
You can use mallet with just normal text files as input source.
Gensim offers an implementation of Blei's corpus format. See here. You could write a quick corpus based on your CSV file in Python and then save it in lda-c with gensim. It should not be too hard.
For Python, there is an available function for this(may not be available at the time of the question).
lda.utils.dtm2ldac
The document is https://pythonhosted.org/lda/api.html#module-lda.utils
I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice)
I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??)
Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm.
Here are the troubles I am having:
All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?)
I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
Why dont you use WSD. Use Disambiguation tool to find senses. and use map polarity to the senses instead of word. In this case you will get a bit more accurate results as compared to word index polarity.