Where can I find texts that describe topic-specific events?

So, some background: I'm trying to train a ML system to answer questions about events, where both the event descriptions and questions are posed in natural language; the event descriptions are constrained to being single sentences.
So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).
While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, if the format is as I've described, the actual topic of focus is irrelevant, so long as it has one.
Since I plan on building my own corpus out of this text, no tagging is necessary.

The Reuters corpus has fairly monotonous content (commercial news: CEO appointments, mergers and acquisitions, major deals, etc.); I am more familiar with the multilingual v2, but IIRC the v1 corpus was monolingual English. These are multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference

Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own but something like rec.food.cooking would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sports.hockey or rec.games.video.arcade as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/
(Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)


NLP: retrieve vocabulary from text

I have some texts in different languages, potentially with some typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so I may use some terms improperly.
By vocabulary I mean a collection of words of a single language in which every word is unique and inflections for gender, number, or tense are not considered (e.g. think, thinks, and thought are all considered think).
That is the full problem, so let's reduce it to retrieving the vocabulary of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search a database of words stored in relation to each other. I could then look up thought (as a verb) and read the associated information that thought is an inflection of think
compute the "base form" (the word without inflections) by processing the inflected form. Maybe this can be done with stemming?
use a service through some API. Yes, I accept this approach too, but I'd prefer to do it locally
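To make the first approach concrete, here is a toy sketch of the lookup idea. The tiny hand-made table is a hypothetical stand-in for a real lexical database such as WordNet:

```python
# Toy sketch of the lookup approach. LEMMA_DB is a hypothetical
# stand-in for a real lexical database such as WordNet.
LEMMA_DB = {
    "think": "think", "thinks": "think", "thought": "think",
    "goose": "goose", "geese": "goose",
}

def vocabulary(words):
    """Collect the unique base forms of the given words."""
    return {LEMMA_DB.get(w.lower(), w.lower()) for w in words}

print(vocabulary(["Think", "thinks", "thought", "geese"]))
```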
For a first approximation, the algorithm need not distinguish between nouns and verbs. For instance, if the text contained the word thought as both a noun and a verb, it could be considered already present in the vocabulary at the second match.
We have thus reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the words' POS tags.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try the nltk lemmatizer for Python, or Stanford NLP / Clear NLP for Java. nltk actually uses WordNet, so it is really a combination of your 1st and 2nd approaches.
To cope with mistakes, run spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part-of-speech tags - unfortunately, nltk's lemmatizer doesn't work out the POS tag (or context in general) on its own, so you should provide it with a tag obtained via nltk's POS tagging. Again, this is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here - I'd guess it should consider context. As far as I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
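If you go the nltk route, you will need to map the Penn Treebank tags returned by nltk's POS tagger onto WordNet's coarse POS letters before passing them to the lemmatizer. A small helper (the function name is my own invention) could look like:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag (as returned by nltk.pos_tag)
    to the single-letter POS codes WordNet uses:
    'n' noun, 'v' verb, 'a' adjective, 'r' adverb."""
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"  # noun is also the lemmatizer's default

print(penn_to_wordnet("VBD"))  # -> v
```

With nltk installed, one would then call something like WordNetLemmatizer().lemmatize(word, pos=penn_to_wordnet(tag)).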
About other languages - Google for lemmatization models: the algorithm is almost the same for most languages (at least within the same family); the differences are in the training data. Take a look here for an example for German; it is a wrapper around several lemmatizers, as far as I can see.
However, you can always use a stemmer at the cost of precision, and stemmers are more readily available for different languages.
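To illustrate the precision trade-off, here is a deliberately crude suffix-stripping stemmer (not the real Porter algorithm, just a sketch of the idea):

```python
def naive_stem(word):
    """Strip a few common English suffixes. Deliberately crude:
    unlike a lemmatizer, it cannot relate 'thought' to 'think'."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("thinking"))  # -> think
print(naive_stem("thought"))  # -> thought (irregular form is missed)
```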

Possible approach to sentiment analysis (I apologize, I'm very new to NLP)

So I have an idea for classifying sentiments of sentences talking about a given brand product (in this case, pepsi). Basically, let's say I wanted to figure out how people feel about the taste of pepsi. Given this problem, I want to construct abstract sentence templates, basically possible sentence structures that would indicate an opinion about the taste of pepsi. Here's one example for a three word sentence:
[Pepsi] [tastes] [good, bad, great, horrible, etc.]
I then look through my database of sentences, and try to find ones that match this particular structure. Once I have this, I can simply extract the third component and get a sentiment regarding this particular aspect (taste) of this particular entity (pepsi).
The application for this would be looking at tweets, so this might yield a few tweets from the past year or so, but it wouldn't be enough to get an accurate read on the general sentiment, so I would create other possible structures, like:
[I] [love, hate, dislike, like, etc.] [the taste of pepsi]
[I] [love, hate, dislike, like, etc.] [the way pepsi tastes]
[I] [love, hate, dislike, like, etc.] [how pepsi tastes]
And so on and so forth.
Of course, most tweets won't be this simple: there would be words meaning the same as pepsi, words in between the major components, etc. - deviations that it would not be practical to account for.
What I'm looking for is just a general direction, or a subfield of sentiment analysis that discusses this particular problem. I have no problem coming up with a large list of possible structures, it's just the deviations from the structures that I'm worried about. I know this is something like a syntax tree, but most of what I've read about them has just been about generating text - in this case I'm trying to match a sentence to a structure, and pull out the entity, sentiment, and aspect components to get a basic three word answer.
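For what it's worth, here is roughly how I imagine the template matching working in code (the sentiment word list is illustrative, not exhaustive):

```python
import re

# One abstract template: [pepsi] [tastes] [sentiment word].
SENTIMENTS = {"good": 1, "great": 1, "bad": -1, "horrible": -1}
TEMPLATE = re.compile(r"\bpepsi\s+tastes\s+(\w+)", re.IGNORECASE)

def match_taste_opinion(sentence):
    """Return (entity, aspect, polarity) if the template fires, else None."""
    m = TEMPLATE.search(sentence)
    if m and m.group(1).lower() in SENTIMENTS:
        return ("pepsi", "taste", SENTIMENTS[m.group(1).lower()])
    return None

print(match_taste_opinion("I think Pepsi tastes great!"))
# -> ('pepsi', 'taste', 1)
```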
This templates approach is the core idea behind my own sentiment mining work. You might find study of EBMT (example-based machine translation) interesting, as a similar (but under-studied) approach in the realm of machine translation.
Get familiar with Wordnet, for automatically generating rephrasings (there are hundreds of papers that build on WordNet, some of which will be useful to you). (The WordNet book is getting old now, but worth at least a skim read if you can find it in a library.)
I found Bing Liu's book a very useful overview of all the different aspects of and approaches to sentiment mining, and a good introduction to further reading. (The Amazon UK reviews are so negative I wondered if it was a different book! The Amazon US reviews are more positive, though.)

Tools for identifying near duplicate documents

I'm doing an NLP project, and identifying near duplicate documents is part of it. Can anyone with experience in this area suggest tools (implementations like Weka) available for near duplicate detection?
The project is about generating a statistical report on crimes after analyzing news articles from some local English newspapers. The crime articles are first classified; then duplicate articles should be detected and merged. The data collection may contain about 1000 crime-related articles for near duplicate detection.
I define near duplicates here as articles covering the same crime incident. Different newspapers may report the same incident, and the same newspaper may run articles about it on different days.
The time taken for duplicate detection is not a problem, as this is not online processing. Accuracy is very important here.
Thank you in advance.
Although the notion of duplicate content is pretty straightforward, the notion of near-duplicate content might be problematic.
For instance, do you consider documents relating to the same event (e.g. news articles from different sources) as NDC?
Or do you consider documents exhibiting the same syntactic patterns (e.g. weather forecasts) as NDC?
Considering your objective, I think you are more interested in the former definition of NDC; however, it should be expressed more clearly.
As a first experiment you might want to try onion (https://code.google.com/p/onion/), a tool dedicated to DC/NDC detection, but given the size of your corpus (which is small) you might want to implement your own NDC removal system, based on your definition of NDC.
Here I would suggest you read the seminal paper by Broder et al. (http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf) to give you some ideas.
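Broder's method essentially compares sets of word n-grams ("shingles") via Jaccard similarity. A bare-bones sketch of the idea (the example documents, and whatever similarity threshold you settle on, are purely illustrative):

```python
def shingles(text, k=3):
    """Set of k-word shingles from whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "police arrested two suspects in the downtown robbery case"
doc2 = "two suspects in the downtown robbery case were arrested today"
print(jaccard(shingles(doc1), shingles(doc2)))  # -> 0.5
```

Two documents whose similarity exceeds a chosen threshold would be flagged as near duplicates and merged.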

How to detect if an event/action occurred from a text?

I was wondering if there's an NLP/ML technique for this.
Suppose we are given a set of sentences:
I watched the movie.
Heard the movie is great, have to watch it.
Got the tickets for the movie.
I am at the movie.
If I had to assign a probability to each of these sentences that the writer has "actually" watched the movie, I would rank them in decreasing order as 1, 4, 3, 2.
Is there a way to do this automatically, using some classifier or rules? Any paper/link would help.
These are common issues in textual entailment. I'll refer you to some papers. While their motivation is for textual entailment, I believe your problem should be easier than that.
Determining Modality and Factuality for Textual Entailment
Learning to recognize features of valid textual entailments
Some of these suggestions should help you decide on some features/keywords to consider when ranking.
Except for 1, none of the other statements necessarily implies that the person has watched the movie. They could have bought the tickets for somebody else (3), or might be the person who sells popcorn outside the halls (4). I don't think there is any clever system out there that will read between the lines of each sentence and return an answer that exactly agrees with your intuitions (which might differ from other people's for the same sentence, btw).
If this strangely is the only case you care about (which is possible if you are explicitly working with movie reviews), then it might be worth your time to come up with a large number of heuristics patched together into a function that near-exactly agrees with your intuitions about this.
Otherwise, look for context in the other sentences these sentences originate from. Somebody who has actually watched the movie may comment on how they liked it, or express opinions about specific scenes, characters, and actors from the movie. So if the text contains a lot of high-sentiment sentences and refers to words and phrases from the movie, then the person has probably watched it. If a lot of it is in the future tense, then maybe not.
If you are working within a specific domain, such as "watched the movie or not", or maybe more generally "attended an event or not", it's basically a case of the text classification task.
The common approach in NLP is to use a large number of sentences tagged as watched or didn't watch to train a machine-learning-based classifier. The most commonly used features are the presence/absence of keywords, bigrams (sequences of 2 words), and maybe trigrams (sequences of 3 words).
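For illustration, such presence/absence features can be extracted in a few lines (a sketch, not a full pipeline):

```python
def ngram_features(sentence, n_max=2):
    """Binary presence features: unigrams and bigrams of a sentence."""
    words = sentence.lower().split()
    feats = set(words)  # unigrams
    for n in range(2, n_max + 1):
        feats |= {" ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)}
    return feats

print(sorted(ngram_features("I watched the movie")))
```

Each feature would become one input dimension for the classifier.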
Since you talked about probability, things may get a little more complex. As adi92 noted, for 3 of your sentences the answer is not clear. One way to represent that in the training data would be for a sentence with 0.3 probability of watched to appear 3 times tagged as watched and 7 times as didn't watch. Most classifiers can have their output easily turned into probabilities.
Anyway, I believe that the main difficulty would be creating a training dataset for the task.

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved with a Bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or should I just be looking at keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There are definitely a lot of books and articles you can read about it, but I will try to provide a short introduction. I am not a big expert, but I have worked on some of this stuff.
First you need to decide whether you want to classify essays into predefined topics/categories (a classification problem) or you want the algorithm to decide on different groups on its own (a clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
TF-IDF is short for Term Frequency - Inverse Document Frequency. Basically, the idea is, for any given document, to find the documents in the training set most similar to it, and then figure out its category based on those. For example, if document D is similar to T1 (physics), T2 (physics), and T3 (chemistry), you guess that D is most likely about physics, with a little chemistry.
The way it's done is to give the most importance to rare words and no importance to common words. For instance, 'nuclei' is a rare physics word, but 'work' is a very common, uninteresting word. (That's why it's called inverse document frequency.) If you can work with Java, the very good Lucene library provides most of this out of the box. Look for the 'similar documents' API and look into how it is implemented, or just Google 'TF-IDF' if you want to implement your own.
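If you do implement your own, a bare-bones version of the TF-IDF weighting looks like this (Lucene does all of this properly; the toy corpus here is purely illustrative):

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights: rare terms score high,
    terms occurring in every document score zero."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    return [{t: c / len(doc) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in tokenized]

docs = ["nuclei decay in the reactor",
        "work on the reactor report",
        "work on the essay"]
w = tfidf(docs)
print(w[0]["nuclei"] > w[0]["the"])  # -> True: rare beats ubiquitous
```

Documents can then be compared by, e.g., cosine similarity of these weight vectors.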
I've done something similar in the past (though for short news articles) using a vector-space clustering algorithm. I don't remember the details right now; it was what Google used in its infancy.
Using their paper I was able to get a prototype running in PHP in one or two days; I then ported it to Java for speed.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
