How to group / compare similar news articles - fuzzy-comparison

In an app that i'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in the same group. I am guessing its some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.
Thanks, in advance for the help!

This problem breaks down into a few subproblems from a machine learning standpoint.
First, you are going to want to figure out what properties of the news stories you want to group based on. A common technique is to use 'word bags': just a list of the words that appear in the body of the story or in the title. You can do some additional processing such as removing common English "stop words" that provide no meaning, such as "the", "because". You can even do porter stemming to remove redundancies with plural words and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove html markup.
Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together, based on the similarity metric.
In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
Check out Google scholar, there probably have been some papers on this specific topic in the recent literature. A lot of these things that I just discussed are implemented in natural language processing and machine learning modules for most major languages.

The problem can be broken down to:
How to represent articles (features, usually a bag of words with TF-IDF)
How to calculate similarity between two articles (cosine similarity is the most popular)
How to cluster articles together based on the above
There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
You can also try http://www.similetrix.com, a quick Google search popped them up and they claim to offer this service via API.

One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.
You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.

Related

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There is definitely a lot of books and articles you can read about it but I will try to provide a short introduction. I am not a big expert but I worked on some of this stuff.
First you need to decide whether you are want to classify essays into predefined topics/categories (classification problem) or you want the algorithm to decide on different groups on its own (clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
The TF-IDF is a short for TermFrequence - InverseDocumentFrequency. Basically the idea is for any given document to find a number of documents in training set that are most similar to it, and then figure out it's category based on that. For example if document D is similar to T1 which is physics and T2 which is physics and T3 which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is you apply the most importance to rare words and no importance to common words. For instance 'nuclei' is rare physics word, but 'work' is very common non-interesting word. (That's why it's called inverse term frequency). If you can work with Java, there is a very very good Lucene library which provides most of this stuff out of the box. Look for API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf

associated words

I am developing a program but stuck on a particular hurdle. I need to find words associated with other words. EG "green" might be associated with "environment", "leaf", "earth", "wind", "electric", "hybrid", etc. All I can find is Google Sets. Is there any other resource that is better?
If you have a large text collection (say Wikipedia, Project Gutenberg) you can use co-occurrence scores extract this kind of data. See e.g. Padó and Lapata and the references therein.
I recently built a tool that mines this kind of associations from Wikipedia database dumps by another method. It requires a lot of memory though; other folks have tried to do the same using randomized methods.
If you're still looking for a resource of semantically related words, I've just recently developed an API that takes a query and returns semantically related words. It offers parts of speech, relationships to the query word, and a word similarity measurement.
https://kiingo.co/rapid-associations-api
Disclaimer: I'm the developer of this API.

blindly classifying new trends in incoming data

how do news outlets like google news automatically classify and rank documents about emerging topics, like "obama's 2011 budget"?
i've got a pile of articles tagged with baseball data like player names and relevance to the article (thanks, opencalais), and would love to create a google news-style interface that ranks and displays new posts as they come in, especially emerging topics. i suppose that a naive bayes classifier could be trained w/ some static categories, but this doesn't really allow for tracking trends like "this player was just traded to this team, these other players were also involved."
No doubt, Google News may use other tricks (or even a combination thereof), but one relatively cheap trick, computationally, to infer topics from free-text would exploit the NLP notion that a word gets its meaning only when connected to other words.
An algorithm susceptible of discovering new topic categories from multiple documents could be outlined as follow:
POS (part-of-speech) tag the text
We probably want to focus more on nouns and maybe even more so on named entities (such as Obama or New England)
Normalize the text
In particular replace inflected words by their common stem. Maybe even replace some adjectives by a corresponding Named Entity (ex: Parisian ==> Paris, legal ==> law)
Also, remove noise words and noise expressions.
identify some words from a list of manually maintained "current / recurring hot words" (Superbowl, Elections, scandal...)
This can be used in subsequent steps to provide more weight to some N-grams
Enumerate all N-grams found in each documents (where N is 1 to say 4 or 5)
Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram
The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the Topics.
Identify the existing topics (from a list of known topics)
[optionally] Manually review the new topics
This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example the document origin (say cnn/sports vs. cnn/politics ...) can be used to select domain specific lexicons. Another example the process can more or less heavily emphasize the words/expressions from the document title (or other areas of the text with a particular mark-up).
The main algorithms behind Google News have been published in the academic literature by Google researchers:
Original paper.
Talk: Google News Personalization: Scalable Online Collaborative Filtering
Blog discussion.

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Preprocess:
Store flat word list per article (do not remove duplicate words).
"Cross join" the articles: count number of words in article A that match same words in article B. You now have a matrix int word_matches[narticles][narticles] (you should not store it like that, similarity of A->B is same as B->A, so a sparse matrix saves almost half the space).
Normalize the word_matches counts to range 0..1! (find max count, then divide any count by this) - you should store floats there, not ints ;)
Find similar articles:
select the X articles with highest matches from word_matches
This is a typical case of Document Classification which is studied in every class of Machine Learning. If you like statistics, mathematics and computer science, I recommend that you have a look at the unsupervised methods like kmeans++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what are you looking for, their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
On a more practical and less theoretical approach, I recommend that you have a look a this and this other great code examples.
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we count the occurrence of words in each document and then compute the cosine between these vectors (each terms has a fixed index, if it appears there is a 1 at that index, if not a zero). Cosine will be 1.0 if two documents have all terms common, and 0.0 if they have no common terms. You can directly translate that to % values.
terms = Hash.new{|h,k|h[k]=h.size}
docs = DATA.collect { |line|
name = line.match(/^\d+/)
words = line.downcase.scan(/[a-z]+/)
vector = []
words.each { |word| vector[terms[word]] = 1 }
{:name=>name,:vector=>vector}
}
current = docs.first # or any other
docs.sort_by { |doc|
# assume we have defined cosine on arrays
doc[:vector].cosine(current[:vector])
}
related = docs[1..5].collect{|doc|doc[:name]}
puts related
__END__
0 Human machine interface for Lab ABC computer applications
1 A survey of user opinion of computer system response time
2 The EPS user interface management system
3 System and human system engineering testing of EPS
4 Relation of user-perceived response time to error measurement
5 The generation of random, binary, unordered trees
6 The intersection graph of paths in trees
7 Graph minors IV: Widths of trees and well-quasi-ordering
8 Graph minors: A survey
the definition of Array#cosine is left as an exercise to the reader (should deal with nil values and different lengths, but well for that we got Array#zip right?)
BTW, the example documents are taken from the SVD paper by Deerwester etal :)
Some time ago I implemented something similiar. Maybe this idea is now outdated, but I hope it can help.
I ran a ASP 3.0 website for programming common tasks and started from this principle: user have a doubt and will stay on website as long he/she can find interesting content on that subject.
When an user arrived, I started an ASP 3.0 Session object and recorded all user navigation, just like a linked list. At Session.OnEnd event, I take first link, look for next link and incremented a counter column like:
<Article Title="Cookie problem A">
<NextPage Title="Cookie problem B" Count="5" />
<NextPage Title="Cookie problem C" Count="2" />
</Article>
So, to check related articles I just had to list top n NextPage entities, ordered by counter column descending.

Resources