Topic modeling using MALLET - nlp

I'm trying to use topic modeling with Mallet but have a question.
How do I know when I need to rebuild the model? For instance, I have a collection of documents I crawled from the web; using Mallet's topic modeling I can build a model and infer topics for documents with it. But over time, as new data is crawled, new subjects may appear. In that case, how do I know whether I should rebuild the model over everything crawled so far?
I was thinking of doing so for the documents crawled each month. Can someone please advise?
Also, is topic modeling more suitable for text with a fixed number of topics (the input parameter k, the number of topics)? If not, how do I determine what number to use?

The answers to your questions depend in large part on the kind of data you're working with and the size of the corpus.
Regarding frequency, I'm afraid you'll just have to estimate how often your data changes in a meaningful way and remodel at that rate. You could start with a week and see if the new data lead to a significantly different model. If not, try two weeks and so on.
The number of topics you select is determined by what you're looking for in the model. The higher the number, the more fine-grained the results. If you want a broad overview of what's in your corpus, you could select, say, 10 topics. For a closer look, you could use 200 or some other suitably high number.
I hope that helps.
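For a concrete starting point, here is a rough sketch of comparing a few candidate topic counts by topic coherence. It uses gensim's LDA as a stand-in rather than MALLET itself, and the documents are tiny placeholders you would swap for your own tokenized corpus (where ranges like 10 to 200 topics make more sense). The same coherence score, computed on a newly crawled batch, can also hint at when the existing model has drifted enough to be worth rebuilding.

```python
# Sketch: train LDA for several candidate topic counts and score each
# with topic coherence. gensim stands in for MALLET; docs are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [["topic", "modeling", "mallet"],            # placeholder tokenized docs
        ["crawl", "web", "documents", "monthly"],
        ["new", "subjects", "appear", "over", "time"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in (5, 10, 20):                               # candidate topic counts
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=42)
    coherence = CoherenceModel(model=lda, texts=docs,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(f"k={k}: coherence={coherence:.3f}")
```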

Related

What are the best metrics for Multi-Object Tracking (MOT) evaluation and why?

I want to compare multiple computer vision Multi-Object Tracking (MOT) methods on my own dataset, so first I want to choose the best metrics for this task. I have carried out some research in the scientific literature and have come to the conclusion that there are three main sets of metrics:
Metrics from "Tracking of Multiple, Partially Occluded Humans based on Static Body Part
Detection"
CLEAR MOT metrics
ID scores
Therefore, I wonder which of the above metrics I should attach the greatest importance to.
I would also like to ask whether anyone has encountered a similar issue and has any thoughts that could justify and help me choose the best metrics for this task.
I know this is old but I see nobody mentioning HOTA (https://arxiv.org/pdf/2009.07736.pdf). This metric has become the new standard for multi-object tracking as can be seen in the latest SOTA tracking research: https://arxiv.org/abs/2202.13514 and https://arxiv.org/pdf/2110.06864.pdf
The reason for using a metric other than MOTA or IDF1 is that they overemphasize detection and association, respectively. HOTA explicitly measures both types of errors and combines them in a balanced way. HOTA also incorporates the localization accuracy of tracking results, which isn't present in either MOTA or IDF1.
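If you also want to compute the classic scores for comparison, a common starting point is the py-motmetrics package, which implements the CLEAR MOT metrics and ID scores (HOTA's reference implementation lives in the separate TrackEval project). A minimal sketch, with made-up IDs and boxes for a single frame:

```python
# Score one frame of tracking output against ground truth with py-motmetrics.
# All IDs and boxes below are invented illustration data.
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

gt_ids = [1, 2]                        # ground-truth object IDs in this frame
hyp_ids = [1, 2, 3]                    # tracker hypothesis IDs in this frame
gt_boxes = [[10, 10, 20, 20], [40, 40, 20, 20]]                    # x, y, w, h
hyp_boxes = [[12, 11, 20, 20], [41, 39, 20, 20], [200, 200, 20, 20]]

# IoU-based distance matrix (rows = ground truth, cols = hypotheses)
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "idf1", "num_switches"],
                     name="demo")
print(summary)
```

In practice you would call acc.update() once per frame over the whole sequence before computing the summary.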
You can refer to the metrics used in the MOT Challenge.
Here are the results for the MOT20 Challenge, and they have included the metrics used there:
https://motchallenge.net/results/MOT20/
In the MOT20 paper, section 4.1.7 (page 7), they say:
As we have seen in this section, there are a number of reasonable performance measures to assess the quality of a tracking system, which makes it rather difficult to reduce the evaluation to one single number. To nevertheless give an intuition on how each tracker performs compared to its competitors, we compute and show the average rank for each one by ranking all trackers according to each metric and then averaging across all performance measures.
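As a rough illustration of that averaged-rank idea, here is a small pandas sketch; the tracker names, scores, and choice of metrics are invented, and each metric's rank direction has to respect whether higher or lower is better:

```python
# Rank trackers per metric, then average the ranks across metrics.
# Numbers are invented placeholders.
import pandas as pd

scores = pd.DataFrame(
    {"MOTA": [62.1, 58.4, 60.3],       # higher is better
     "IDF1": [60.5, 63.2, 59.1],       # higher is better
     "IDSW": [1200, 900, 1500]},       # lower is better
    index=["trackerA", "trackerB", "trackerC"])

higher_is_better = {"MOTA": True, "IDF1": True, "IDSW": False}

ranks = pd.DataFrame({m: scores[m].rank(ascending=not better)
                      for m, better in higher_is_better.items()})
print(ranks.mean(axis=1).sort_values())   # lower average rank = better overall
```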
The metrics you choose should relate to your goal after multi-object tracking. For example, if your goal is to track the people inside a scene, you will want the ID switch metric to be very low, and so on.
In short, find the metrics that relate to your goals.

How to extract categories out of short text documents?

My data contains the answers to the open-ended question: what are the reasons for recommending the organization you work for?
I want to use an algorithm / technique that, using this data, learns the categories (i.e. the reasons) that occur most frequently, so that a new answer to this question can be placed in one of these categories automatically.
I initially thought of topic modeling (for example LDA), but the text documents in this problem are very short (mostly between 1 and 10 words per document). Is this therefore an appropriate method? Or are there other models that are suitable, perhaps a clustering method?
Note: the text is in Dutch
No, clustering will work even worse.
It can't do magic.
You'll need to put in additional information, such as labels, to solve this problem: use classification.
Find the most common terms that clearly indicate one reason or another and begin labeling posts.
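Once you have hand-labeled a few hundred answers that way, the classification step itself can be very simple. A sketch of that route with scikit-learn; the Dutch texts and category names below are invented placeholders, and a Dutch stop-word list (e.g. NLTK's) could optionally be passed to the vectorizer:

```python
# TF-IDF + logistic regression on short, hand-labeled Dutch answers.
# Texts and categories are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["goede werksfeer", "veel doorgroeimogelijkheden", "leuke collega's"]
labels = ["sfeer", "ontwikkeling", "sfeer"]          # hand-assigned categories

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),   # short texts: unigrams + bigrams
    LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["fijne collega's en goede sfeer"]))
```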

How to use secondary user actions to improve recommendations with Spark ALS?

Is there a way to use secondary user actions derived from the user click stream to improve recommendations when using Spark MLlib ALS?
I have gone through the explicit and implicit feedback examples mentioned here: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html which use the same ratings RDD for the train() and trainImplicit() methods.
Does this mean I need to call trainImplicit() on the same model object with an RDD of (user, item, action) for each secondary user action? Or should I train multiple models, retrieve recommendations based on each action, and then combine them linearly?
For additional context, the crux of the question is whether Spark ALS can model secondary actions the way Mahout's spark item similarity job does. Any pointers would help.
Disclaimer: I work with Mahout's Spark Item Similarity.
ALS does not work well for multiple actions in general. First, an illustration. The way we consume multiple actions in ALS is to weight one above the other, for instance buy = 5, view = 3. ALS was designed in the days when ratings seemed important and predicting them was the question. We now know that ranking is more important. In any case, ALS uses predicted ratings/weights to rank results. This means that a view is really telling ALS nothing, since what does a rating of 3 mean? Like? Dislike? ALS tries to get around this by adding a regularization parameter, which helps in deciding whether 3 is a like or not.
But the problem is more fundamental than that, it is one of user intent. When a user views a product (using the above ecom type example) how much "buy" intent is involved? From my own experience there may be none or there may be a lot. The product was new, or had a flashy image or other clickbait. Or I'm shopping and look at 10 things before buying. I once tested this with a large ecom dataset and found no combination of regularization parameter (used with ALS trainImplicit) and action weights that would beat the offline precision of "buy" events used alone.
So if you are using ALS, check your results before assuming that combining different events will help. Using two models with ALS doesn't solve the problem either because from buy events you are recommending that a person buy something, from view (or secondary dataset) you are recommending a person view something. The fundamental nature of intent is not solved. A linear combination of recs still mixes the intents and may very well lead to decreased quality.
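If you do want to run that offline check yourself, here is a rough sketch using pyspark.mllib's RDD-based API from the linked docs. The data, weights, and hyperparameters are invented; the point is only the structure of training a buys-only model next to a weighted buy+view model so their ranking quality can be compared on held-out buys:

```python
# Compare implicit-feedback ALS trained on buys alone vs. a weighted
# buy+view combination. Data, weights, and hyperparameters are placeholders.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-secondary-actions")

buys = sc.parallelize([Rating(1, 10, 1.0), Rating(2, 20, 1.0)])
views = sc.parallelize([Rating(1, 20, 1.0), Rating(2, 10, 1.0)])

# Variant A: buy events only
model_buy = ALS.trainImplicit(buys, rank=10, iterations=10, alpha=40.0)

# Variant B: weighted combination (e.g. buy = 5, view = 1), summed per (user, item)
weighted = (buys.map(lambda r: ((r.user, r.product), 5.0 * r.rating))
            .union(views.map(lambda r: ((r.user, r.product), 1.0 * r.rating)))
            .reduceByKey(lambda a, b: a + b)
            .map(lambda kv: Rating(kv[0][0], kv[0][1], kv[1])))
model_combined = ALS.trainImplicit(weighted, rank=10, iterations=10, alpha=40.0)

# Evaluate both against a held-out set of buy events (precision@k or similar)
# before deciding whether the extra view data actually helps.
```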
What Mahout's Spark Item Similarity does is correlate views with buys--actually it correlates a primary action, one where you are clear about user intent, with other actions or information about the user. It builds a correlation matrix that in effect scrubs out the views that did not correlate with buys. We can then use that data. This is a very powerful idea because now almost any user attribute or action (virtually the entire clickstream) may be used in making recs, since the correlation is always tested. Often there is little correlation, but that's ok; dropping such data from the calculation is just an optimization, since it would add very little to the recs.
BTW if you find integration of Mahout's Spark Item Similarity daunting compared to using MLlib ALS, I'm about to donate an end-to-end implementation as a template for Prediction.io, all of which is Apache licensed open source.

Question about Latent Dirichlet Allocation (MALLET)

Honestly, I'm not familiar with LDA, but am required to use MALLET's topic modeling for one of my projects.
My question is: given a set of documents from a specific time range as the training data for the topic model, how appropriate is it to use the model (via the inferencer) to track topic trends for documents before or after the training data's time range? In other words, are the topic distributions provided by MALLET a suitable metric for tracking the popularity of topics over time if, during the model-building stage, we only provide a subset of the dataset I need to analyze?
Thanks.
Are you familiar with Latent Semantic Indexing? Latent Dirichlet Allocation is just a different way of doing the same kind of thing, so LSI or pLSI may be an easier starting point for understanding the goals of LDA.
All three techniques lock on to topics in an unsupervised fashion (you tell it how many topics to look for), and then assume that each document covers each topic in varying proportions. Depending on how many topics you allocate, they may behave more like subfields of whatever your corpus is about, and may not be as specific as the "topics" that people think about when they think about trending topics in the news.
Somehow I suspect that you want to assume that each document represents a particular topic. LSI/pLSI/LDA don't do this -- they model each document as a mixture of topics. That doesn't mean you won't get good results, or that this isn't worth trying, but I suspect (though I don't have a comprehensive knowledge of LSI literature) that you'd be tackling a brand new research problem.
(FWIW, I suspect that clustering methods like k-means more readily model the assumption that each document has exactly one topic.)
You should check out the topic-models mailing list at Princeton. They discuss theoretical and practical issues relating to topic models.
I'm aware of three approaches to tracking the popularity of topics over time.
It sounds like you might benefit from a dynamic topic modeling approach, which looks at how topics change over time. There's a nice video overview of Blei's work on that here and a bunch of PDFs on his home page. He has a package in C that does it.
A related approach is Alice Oh's topic string approach, where she obtains topics by LDA for texts from time-slices and then uses a topic similarity metric to link topics from different time slices into strings (video, PDF). Looks like MALLET could be part of a topic string analysis, but she doesn't mention how she did the LDA analysis.
The simplest approach might be what David Mimno does in his paper, where he calculates the mean year of a topic from the chronological distribution of the words in the topic. He's involved in the development of MALLET, so it's probably entirely done with that package.
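For a sense of how that simplest approach could look in code, here is a rough sketch of averaging topic proportions per time slice from MALLET's --output-doc-topics file. It assumes the newer MALLET format (doc index, document name, then one proportion column per topic) and that each document name starts with a YYYY-MM date; both are assumptions about your data, not something MALLET guarantees.

```python
# Track topic popularity over time from MALLET's doc-topics output.
# Assumes one proportion column per topic and a YYYY-MM prefix in each
# document name -- adjust the parsing to your own file layout.
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(float))   # month -> topic -> summed proportion
counts = defaultdict(int)                          # month -> number of docs

with open("doc-topics.txt") as f:
    for line in f:
        if line.startswith("#"):                   # skip MALLET's header line
            continue
        parts = line.split()
        name, proportions = parts[1], [float(p) for p in parts[2:]]
        month = name[:7]                           # e.g. "2014-03" from "2014-03-05_article.txt"
        counts[month] += 1
        for topic, p in enumerate(proportions):
            totals[month][topic] += p

for month in sorted(totals):
    top = max(totals[month], key=totals[month].get)
    print(month, "most prominent topic:", top,
          "avg proportion:", round(totals[month][top] / counts[month], 3))
```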

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and one from MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.
Thanks, in advance for the help!
This problem breaks down into a few subproblems from a machine learning standpoint.
First, you are going to want to figure out what properties of the news stories you want to group on. A common technique is to use 'word bags': just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that provide no meaning, such as "the" and "because". You can even do Porter stemming to remove redundancies from plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove HTML markup.
Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together based on the similarity metric.
In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
Check out Google Scholar; there have probably been some papers on this specific topic in the recent literature. A lot of the things I just discussed are implemented in natural language processing and machine learning modules for most major languages.
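To make that pipeline concrete, here is a small sketch with scikit-learn; the headlines and the number of clusters are placeholders, and in a real app you would pick the cluster count (or a different algorithm) based on your data:

```python
# Bag of words -> TF-IDF vectors -> clustering, as described above.
# Headlines and cluster count are invented placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "XYZ announces quarterly earnings beat",      # e.g. CNN story
    "Earnings at XYZ top analyst expectations",   # e.g. MSNBC story on the same event
    "Severe storm expected along the coast",
]

# Feature vectors: word counts with stop-word removal, weighted by TF-IDF
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Cosine similarity on TF-IDF vectors is closely related to Euclidean
# k-means on the normalized vectors, so plain KMeans is a workable start.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)   # stories sharing a label belong to the same group
```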
The problem can be broken down into:
How to represent articles (features, usually a bag of words with TF-IDF)
How to calculate similarity between two articles (cosine similarity is the most popular)
How to cluster articles together based on the above
There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
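One simple sequential scheme, sketched below under the assumption that each article has already been turned into a dense vector (e.g. a TF-IDF row converted to an array): compare the new article to the existing cluster centroids by cosine similarity and either join the best cluster or start a new one. The 0.3 threshold is an invented placeholder you would tune.

```python
# Threshold-based sequential clustering for articles arriving over time.
# `vec` is assumed to be a dense numpy vector for the new article.
import numpy as np

def assign(vec, centroids, threshold=0.3):
    """Return the cluster index for vec, appending a new centroid if needed."""
    if centroids:
        sims = [np.dot(vec, c) / (np.linalg.norm(vec) * np.linalg.norm(c))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            # Fold the new vector into the winning centroid (running mean).
            centroids[best] = (centroids[best] + vec) / 2.0
            return best
    centroids.append(vec.copy())        # no close cluster: open a new one
    return len(centroids) - 1
```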
You can also try http://www.similetrix.com; a quick Google search turned them up, and they claim to offer this service via an API.
One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.
You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.
