I'm doing an NLP project, and identifying near-duplicate documents is part of it. Can anyone with experience in this area suggest tools (implementations, like Weka) available for near-duplicate detection?
The project is about generating a statistical report on crimes after analyzing news articles from some local English newspapers. The crime articles are classified first; then duplicate articles should be detected and merged. The data collection may contain about 1,000 crime-related articles for near-duplicate detection.
I define near duplicates here as articles covering the same crime incident. Different newspapers may report the same incident, and the same newspaper may report it in articles on different days.
The time taken for duplicate detection is not a problem, as this is not online processing. Accuracy is very important here.
Thank you in advance.
Although the notion of duplicate content is pretty straightforward, the notion of near-duplicate content might be problematic.
For instance, do you consider documents relating to the same event (e.g. news articles from different sources) as NDC?
Or do you consider documents exhibiting the same syntactic patterns (e.g. weather forecasts) as NDC?
Considering your objective, I think you are more interested in the former definition of NDC; however, it should be stated more explicitly.
As a first experiment you might want to try onion (https://code.google.com/p/onion/), a tool dedicated to DC/NDC detection, but considering the size of your corpus (which is small), you might want to implement your own NDC removal system based on your definition of NDC.
Here I would suggest reading the seminal paper by Broder et al. (http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf) to give you some ideas.
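To make the shingling idea from Broder's paper concrete, here is a minimal sketch: each article becomes a set of word n-grams (shingles), and two articles are compared by the Jaccard similarity ("resemblance") of those sets. The shingle size and the 0.3 threshold are illustrative values you would have to tune, and since your near duplicates are event-level rather than textual, you may want to add features such as named entities and dates on top of this.

```python
# Minimal sketch of shingle-based near-duplicate detection (after Broder et al.).
# The 4-word shingle size and 0.3 threshold are illustrative values, not tuned.
import re

def shingles(text, size=4):
    """Return the set of lowercase word n-grams ("shingles") of the given size."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def resemblance(doc_a, doc_b, size=4):
    """Jaccard similarity between the shingle sets of two documents."""
    a, b = shingles(doc_a, size), shingles(doc_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Usage: flag article pairs whose resemblance exceeds the threshold.
articles = ["Two men were arrested after a robbery in Kandy on Monday.",
            "Police arrested two men after a robbery in Kandy on Monday."]
for i in range(len(articles)):
    for j in range(i + 1, len(articles)):
        score = resemblance(articles[i], articles[j])
        if score > 0.3:
            print(f"articles {i} and {j} look like near duplicates (resemblance={score:.2f})")
```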
I have a number of articles, roughly from 1000 to 10000 words each, that have been written by a number of authors. I don't know the author of any article, but I know some authors wrote more than one article.
I want to detect the likelihood, given a pair of articles, that they were written by the same author.
My best guess would be to look for the choice of words and expressions in every article and compute a similarity from that.
I am sure there are more advanced methods that I'm failing to find! Any help?
You may need to do a literature review on "Authorship attribution":
A Survey of Modern Authorship Attribution Methods and
Authorship Attribution
A recent study also analyzed the authorship of some books of the Bible.
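To illustrate the word-choice idea from the question, here is a very rough stylometric sketch (not taken from either survey above): represent each article by the relative frequencies of common English function words and compare the resulting vectors with cosine similarity. The function-word list and the sample texts are arbitrary; real authorship-attribution systems use much richer feature sets (character n-grams, POS patterns, etc.).

```python
# Toy stylometric comparison: function-word frequency vectors + cosine similarity.
import math
import re

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is",
                  "was", "for", "with", "as", "but", "not", "on", "by", "this"]

def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return [words.count(w) / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Invented placeholder articles; in practice these would be the full texts.
article_one = "The committee met on Tuesday and, after a long debate, it was decided that the plan would go ahead."
article_two = "On Tuesday the committee met, and after a long debate they decided that the plan should proceed."
print(cosine(style_vector(article_one), style_vector(article_two)))
```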
I've been working on document-level sentiment analysis for the past year. Document-level sentiment analysis provides the sentiment of the complete document. For example, the text "Nokia is good but Vodafone sucks big time" would have a negative polarity associated with it, as the analysis is agnostic to the entities Nokia and Vodafone. How would it be possible to get entity-level sentiment, like positive for Nokia but negative for Vodafone? Are there any research papers providing a solution to such problems?
You can try aspect-level or entity-level sentiment analysis. Good efforts have already been made to find the opinions about the aspects in a sentence. You can find some of this work here. You can also go further and review the papers on feature (aspect) extraction. What does that mean? Let me give you an example:
"The quality of screen is great, however, the battery life is short."
Document-level sentiment analysis may not give us the real sense of this document, because we have one positive and one negative clause in it. However, with aspect-based (aspect-level) opinion mining, we can figure out the polarities towards the different entities in the document separately. In feature extraction, the first step, you try to find the features (aspects) in the different sentences (here "quality of screen", or simply "quality", and "battery life"). Afterwards, when you have these aspects, you try to extract the opinions related to them ("great" for "quality" and "short" for "battery life"). In academic papers, the features (aspects) are also called target words (the words or entities on which users comment), and the opinions are called opinion words (the comments stated about the target words).
By searching the keywords that I have just mentioned, you can become more familiar with these concepts.
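As a toy illustration of the aspect-then-opinion pairing described above (not a real aspect extractor), the sketch below hard-codes a small aspect list and opinion lexicon and pairs them within clauses:

```python
# Toy aspect -> opinion-word pairing. The aspect list and opinion lexicon are
# hand-written for the example; real aspect-based systems extract aspects and
# opinion words automatically (e.g. frequent noun phrases + a sentiment lexicon).
import re

ASPECTS = ["screen", "battery life"]
OPINIONS = {"great": "positive", "good": "positive", "short": "negative", "poor": "negative"}

def aspect_sentiments(text):
    results = {}
    # Split into clauses so an opinion is matched with the aspect in the same clause.
    for clause in re.split(r",|\bbut\b|\bhowever\b|\.", text.lower()):
        aspect = next((a for a in ASPECTS if a in clause), None)
        opinion = next((w for w in clause.split() if w in OPINIONS), None)
        if aspect and opinion:
            results[aspect] = (opinion, OPINIONS[opinion])
    return results

print(aspect_sentiments("The quality of screen is great, however, the battery life is short."))
# e.g. {'screen': ('great', 'positive'), 'battery life': ('short', 'negative')}
```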
You could look for entities and their coreferents, and use a simple heuristic like giving each entity the sentiment of the closest sentiment term, where "closest" is measured by distance in a dependency parse tree rather than linear distance in the text. Each of those steps is an open research topic in its own right.
http://scholar.google.com/scholar?q=entity+identification
http://scholar.google.com/scholar?q=coreference+resolution
http://scholar.google.com/scholar?q=sentiment+phrase
http://scholar.google.com/scholar?q=dependency+parsing
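A rough sketch of that heuristic, assuming spaCy and its small English model (en_core_web_sm) are installed: each recognized entity receives the polarity of the sentiment word nearest to it in the dependency tree. The two-word lexicon is a stand-in for a real sentiment lexicon, and whether Nokia and Vodafone are picked up depends on the NER model.

```python
# Assign each entity the polarity of the nearest sentiment word in the
# dependency parse. Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import deque
import spacy

nlp = spacy.load("en_core_web_sm")
SENTIMENT = {"good": 1, "sucks": -1}   # toy lexicon for this example only

def tree_distance(tok_a, tok_b):
    """Number of dependency edges between two tokens (BFS over the parse tree)."""
    seen, queue = {tok_a.i}, deque([(tok_a, 0)])
    while queue:
        tok, dist = queue.popleft()
        if tok.i == tok_b.i:
            return dist
        for nxt in list(tok.children) + [tok.head]:
            if nxt.i not in seen:
                seen.add(nxt.i)
                queue.append((nxt, dist + 1))
    return float("inf")

doc = nlp("Nokia is good but Vodafone sucks big time.")
sentiment_tokens = [t for t in doc if t.lower_ in SENTIMENT]
for ent in doc.ents:
    closest = min(sentiment_tokens, key=lambda t: tree_distance(ent.root, t))
    print(ent.text, "->", "positive" if SENTIMENT[closest.lower_] > 0 else "negative")
```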
This can be achieved using the Google Cloud Natural Language API, which supports entity-level sentiment analysis.
I also tried to find research articles on this but haven't found any. I would suggest trying aspect-based sentiment analysis algorithms. The similarity I see is that there we recognize the aspects of a single entity in a sentence and then find the sentiment of each aspect. Similarly, we could train a model with the same algorithm to detect entities, as it does for aspects, and find the sentiment of those entities. I haven't tried this yet, but I am going to; let me know whether it works for you. There are various ways to do this. The following are links to a few articles:
http://arxiv.org/pdf/1605.08900v1.pdf
https://cs224d.stanford.edu/reports/MarxElliot.pdf
I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably in the realm of Natural Language Processing (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK, and there's so much domain-specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet (http://sentiwordnet.isti.cnr.it/), which is like a dictionary with a sentiment grade for each word. It tells you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some NLTK code that looks through your sentences for the words you want to associate the sentiment with. You would write a script to extract meaningful chunks of text surrounding the words you are looking at, maybe at the sentence or clause level. Then another routine runs through the surrounding words and grabs their sentiment scores from SentiWordNet.
I have some old code that did this and can put it on GitHub if you'd like, but you'd still need to request SentiWordNet yourself.
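For reference, here is a minimal sketch of that idea using the SentiWordNet copy bundled with NLTK (it needs the 'wordnet' and 'sentiwordnet' data downloads). Averaging over all senses of a word and using a fixed word window around the attribute are crude simplifications.

```python
# Minimal SentiWordNet scoring sketch.
# First run: import nltk; nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def word_polarity(word):
    """Average positive-minus-negative score over all SentiWordNet entries for the word."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

# Score the words surrounding each attribute mention, as suggested above.
sentence = "The financing was easy but the housing is flimsy."
words = sentence.lower().strip(".").split()
for attribute in ("financing", "housing"):
    idx = words.index(attribute)
    context = words[max(0, idx - 3): idx + 4]   # small window around the attribute
    score = sum(word_polarity(w) for w in context)
    print(attribute, "->", round(score, 2))
```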
I guess your problem is more about association than just classification. Moving forward with that assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entropy
Are there alternatives I have not considered?
An in-depth study of automata theory with respect to NLP will help you a lot; it helped me in grasping implementations like OpenNLP.
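If you want to see maximum entropy in action on this kind of data: scikit-learn's LogisticRegression is a maximum-entropy classifier over bag-of-words features, and the sketch below trains one on a few invented statements just to show the shape of the approach.

```python
# Toy maximum-entropy sentiment classifier (logistic regression over bag-of-words).
# The four training statements and labels are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["the financing was quick and easy",
               "great build quality, very durable",
               "the housing is flimsy and broke quickly",
               "terrible sales experience, pushy staff"]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["the financing was easy but the housing is flimsy"]))
```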
Yes, this is an NLP problem, known as sentiment analysis. Sentiment analysis is an active research area with different approaches, and a task where a lot of other NLP methods have to work together, so it is certainly not the easiest field to get started with in NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).
So, some background: I'm trying to train a ML system to answer questions about events, where both the event descriptions and questions are posed in natural language; the event descriptions are constrained to being single sentences.
So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).
While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of "Charlie", "Tango", etc.). However, if the format is as I've described, then the actual topic is irrelevant, so long as there is one.
Since I plan on building my own corpus out of this text, no tagging is necessary.
The Reuters corpus has fairly monotonous content (commercial news: CEO appointments, mergers and acquisitions, major deals, etc.); I am more familiar with the multilingual v2, but IIRC the v1 corpus was monolingual English. These are multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference
Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own, but something like rec.food.cooking would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sport.hockey or rec.games.video.arcade as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/
(Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)
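If the 20 Newsgroups suggestion is appealing, scikit-learn can fetch it directly; the snippet below pulls only rec.sport.hockey as an example of a topically narrow, event-heavy slice.

```python
# Fetch just one topically narrow newsgroup from 20 Newsgroups via scikit-learn.
from sklearn.datasets import fetch_20newsgroups

hockey = fetch_20newsgroups(subset="train",
                            categories=["rec.sport.hockey"],
                            remove=("headers", "footers", "quotes"))
print(len(hockey.data), "posts")
print(hockey.data[0][:300])   # peek at the first post
```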
In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in the same group. I am guessing it's some sort of fuzzy comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.
Thanks in advance for the help!
This problem breaks down into a few subproblems from a machine learning standpoint.
First, you are going to want to figure out which properties of the news stories you want to group on. A common technique is to use bags of words: just the list of words that appear in the body or title of the story. You can do some additional processing, such as removing common English stop words that carry no meaning on their own ("the", "because"). You can even apply Porter stemming to collapse plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may also have to do some preprocessing to remove HTML markup.
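As a concrete example of those preprocessing steps, here is a small NLTK sketch (it needs the 'punkt' and 'stopwords' data packages); the sample sentence is invented.

```python
# Tokenize, drop English stop words, and apply Porter stemming.
# First run: import nltk; nltk.download('punkt'); nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def word_bag(text):
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(word_bag("Two suspects were arrested because of the robberies reported on Monday."))
```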
Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together, based on the similarity metric.
In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
Check out Google Scholar; there have probably been papers on this specific topic in the recent literature. A lot of what I've just discussed is implemented in natural language processing and machine learning libraries for most major languages.
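To make that pipeline concrete, here is a compact scikit-learn sketch: TF-IDF bag-of-words vectors followed by k-means. The four sample headlines and the choice of two clusters are purely illustrative.

```python
# Feature vectors (TF-IDF) -> similarity (implicit in k-means) -> clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

stories = ["CNN: Senate passes XYZ budget bill after long debate",
           "MSNBC: XYZ budget bill clears the Senate",
           "Local team wins championship in overtime thriller",
           "Championship decided in overtime as local team triumphs"]

vectors = TfidfVectorizer(stop_words="english").fit_transform(stories)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for story, label in zip(stories, labels):
    print(label, story)
```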
The problem can be broken down to:
How to represent articles (features, usually a bag of words with TF-IDF)
How to calculate similarity between two articles (cosine similarity is the most popular)
How to cluster articles together based on the above
There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
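One very simple incremental scheme, as a sketch only: keep a centroid per cluster and assign each incoming article to the most similar existing cluster if its cosine similarity beats a threshold, otherwise start a new cluster. The 0.5 threshold is a made-up value you would have to tune, and a HashingVectorizer is used so new articles can be vectorized without refitting a vocabulary.

```python
# Single-pass incremental clustering with a cosine-similarity threshold.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(stop_words="english", norm="l2")  # fixed-size vectors, no refitting
clusters = []   # list of (centroid, [article indices])

def add_article(index, text, threshold=0.5):
    vec = vectorizer.transform([text]).toarray()[0]
    if clusters:
        sims = [float(np.dot(vec, c) / (np.linalg.norm(vec) * np.linalg.norm(c) + 1e-12))
                for c, _ in clusters]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            centroid, members = clusters[best]
            members.append(index)
            # Update the running mean of the cluster's member vectors.
            clusters[best] = ((centroid * (len(members) - 1) + vec) / len(members), members)
            return best
    clusters.append((vec, [index]))
    return len(clusters) - 1

for i, story in enumerate(["XYZ bill passes Senate", "Senate passes the XYZ bill", "Team wins title"]):
    print(i, "-> cluster", add_article(i, story))
```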
You could also try http://www.similetrix.com; a quick Google search turned them up, and they claim to offer this service via an API.
One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.
You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.