How to extract keywords from a block of text in Haskell

So I know this is a kind of a large topic, but I need to accept a chunk of text, and extract the most interesting keywords from it. The text comes from TV captions, so the subject can range from news to sports to pop culture references. It is possible to provide the type of show the text came from.
I have an idea to match the text against a dictionary of terms I know to be interesting somehow.
Which libraries for Haskell can help me with this?
Assuming I do have a dictionary of interesting terms, and a database to store them in, is there a particular approach you'd recommend to matching keywords within the text?
Is there an obvious approach I'm not thinking of?

I'd stem the words in the chunks and then search for all the terms in the dictionary.
Two libraries that might help:
stemming: http://hackage.haskell.org/packages/archive/stemmer/0.2/doc/html/NLP-Stemmer-C.html
search: http://hackage.haskell.org/packages/archive/sphinx/0.2.1/doc/html/Text-Search-Sphinx.html
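As a rough sketch of that approach: stem the dictionary of interesting terms and the words in each caption chunk with the same stemmer, then keep the tokens whose stems appear in the dictionary. This assumes the stemmer package's high-level NLP.Stemmer interface (stem :: Stemmer -> String -> String and an English constructor); if the API of whatever stemmer you pick differs, swap in its function. The dictionary and caption text below are made up.

-- A minimal sketch of the stem-and-match idea. It assumes the stemmer
-- package's high-level NLP.Stemmer interface (stem :: Stemmer -> String
-- -> String); if your stemmer's API differs, swap in its function here.
import           Data.Char   (isAlpha, toLower)
import qualified Data.Set    as Set
import           NLP.Stemmer (Stemmer (English), stem)

-- Lower-case the text, drop punctuation, split into words.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- Stem the dictionary of interesting terms once, up front.
mkDictionary :: [String] -> Set.Set String
mkDictionary = Set.fromList . map (stem English . map toLower)

-- Keywords in a caption chunk: tokens whose stems appear in the dictionary.
keywords :: Set.Set String -> String -> [String]
keywords dict = filter (\w -> stem English w `Set.member` dict) . tokenize

main :: IO ()
main = do
  let dict = mkDictionary ["election", "touchdown", "celebrity"]  -- made-up terms
  print (keywords dict "The quarterback scored two touchdowns before the election coverage.")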

To expand on bpgergo's answer (though I don't have any Haskell-specific information): it's pretty straightforward to load documents into a relational database and index them with Solr/Lucene or Sphinx, either of which should include a stemmer in its default or suggested configuration. You can then search for which documents contain pairs, triples, etc. of your list of "interesting terms".
You might look at named entity recognition, statistically unusual phrase detection, auto-tag generation, and similar topics. LingPipe is a good place to start; see also these books:
http://alias-i.com/lingpipe/demos/tutorial/read-me.html
http://www.manning.com/marmanis/excerpt_contents.html
http://www.manning.com/alag/excerpt_contents.html

Related

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with names of countries and cities, and then comparing every token extracted from the text against that hash table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first word whenever your binary search leads you to a (possible!) multi-word result. Then, if the full comparison fails, possibly several words later, you simply revert to the word following the one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
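Here is a small Haskell sketch of the approach just described (Haskell being the language of the original question). Data.Set stands in for the sorted list and binary search, and multi-word locations are handled by trying the longest candidate phrase at each position first. The gazetteer entries and the sample sentence are invented for illustration.

-- Sketch of dictionary-based location matching with multi-word support.
-- Data.Set plays the role of the sorted list / binary search; at each
-- position we try the longest candidate phrase first ("New York City"
-- before "New York"). Matches starting inside earlier matches are not
-- suppressed; filter them out if that matters for your use.
import           Data.Char  (isAlphaNum, toLower)
import           Data.List  (tails)
import           Data.Maybe (listToMaybe)
import qualified Data.Set   as Set

type Phrase = [String]

-- Keep only characters that may appear inside locations (letters, digits,
-- apostrophes here), lower-case, and split into words.
normalize :: String -> Phrase
normalize = words . map (\c -> if isAlphaNum c || c == '\'' then toLower c else ' ')

-- Build the lookup set once and remember the longest phrase length, so we
-- know how many words ahead we ever need to look.
mkGazetteer :: [String] -> (Set.Set Phrase, Int)
mkGazetteer locs = (Set.fromList phrases, maximum (1 : map length phrases))
  where phrases = map normalize locs

findLocations :: (Set.Set Phrase, Int) -> String -> [Phrase]
findLocations (gaz, maxLen) text =
  [ phrase
  | rest <- tails (normalize text)
  , not (null rest)
  , Just phrase <- [longestMatch rest]
  ]
  where
    longestMatch rest = listToMaybe
      [ candidate
      | n <- [upper, upper - 1 .. 1]
      , let candidate = take n rest
      , candidate `Set.member` gaz
      ]
      where upper = min maxLen (length rest)

main :: IO ()
main = do
  let gaz = mkGazetteer ["New York", "Paris", "People's Republic of China"]
  mapM_ (putStrLn . unwords)
        (findLocations gaz "Flights from New York to Paris were delayed.")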
How fast are the tweets coming in? Is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few handle Twitter well because of all the leetspeak. The NLP can be tuned for precision or recall depending on your needs, to limit the lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Given a list of dozens of words, how do I find the best matching sections from a corpus of hundreds of texts?

Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
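Whatever backend ends up storing the sections, the core of the ranking can be as simple as counting, per section, how many distinct words from the list it contains. Below is a minimal sketch in Haskell (the language of the original question), with made-up data and no stemming; for a real word list mixing grammatical forms you would want to normalize or stem both sides first.

-- Rank corpus sections by how many distinct words from the target list they
-- contain. The word list and sections below are placeholders; plug in your
-- own data, and add stemming if your list mixes grammatical forms.
import           Data.Char (isAlpha, toLower)
import           Data.List (sortBy)
import           Data.Ord  (Down (..), comparing)
import qualified Data.Set  as Set

tokenSet :: String -> Set.Set String
tokenSet = Set.fromList . words . map (\c -> if isAlpha c then toLower c else ' ')

-- Score = number of distinct target words appearing in the section.
score :: Set.Set String -> String -> Int
score targets section = Set.size (targets `Set.intersection` tokenSet section)

bestSections :: Int -> [String] -> [String] -> [(Int, String)]
bestSections k wordList sections = take k (sortBy (comparing (Down . fst)) scored)
  where
    targets = Set.fromList (map (map toLower) wordList)
    scored  = [ (score targets s, s) | s <- sections ]

main :: IO ()
main = do
  let wordList = ["walk", "walked", "garden", "tree"]
      sections = [ "She walked through the garden and sat under a tree."
                 , "The stock market closed higher on Tuesday." ]
  mapM_ print (bestSections 1 wordList sections)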

Proper approach to get words like "dentistry", "dentist" from query like "dental" (and vice versa)

I'm somewhat familiar with stemming, but the stemming library I've been given to use for a project doesn't work very well for cases where I want to find related words. For example, if I do a query for any of these:
"dental", "dentist", "dentistry"
I should get a match for the others. I've been looking into this and I'm learning about parts of speech I didn't even know existed, like pertainyms and troponyms, so I'm wondering: is there a library out there with a mapping between all of these different parts of speech that could give back the sort of match I'm looking for?
I've been searching on this and haven't found a whole lot that I can make sense of. I probably don't know the right terminology, etc and I would greatly appreciate if anyone can point me in the right direction.
One approach common in IR is to stem all the words in the index and in the query itself. That is, documents containing the word 'dentistry' are stemmed and stored in the index under 'dentist'; the query keyword 'dental' is also stemmed to 'dentist', thereby matching it in the index.
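A toy sketch of that "stem the index and stem the query" idea in Haskell (the language of the original question). The stemWord function here is a deliberately crude placeholder that collapses a few dent* forms just to show the mechanism; note that a plain Porter stemmer typically will not conflate 'dental' with 'dentistry', which is where the WordNet suggestion below comes in.

-- Toy inverted index where both the indexed terms and the query go through
-- the same stemmer. stemWord is a deliberately crude placeholder; a real
-- system would plug in a Snowball/Porter stemmer or a lemmatizer here.
import qualified Data.Map.Strict as Map

type DocId = Int

stemWord :: String -> String
stemWord w
  | w `elem` ["dental", "dentist", "dentists", "dentistry"] = "dent"
  | otherwise = w

-- Inverted index: stemmed term -> documents containing it.
buildIndex :: [(DocId, [String])] -> Map.Map String [DocId]
buildIndex docs = Map.fromListWith (++) [ (stemWord w, [d]) | (d, ws) <- docs, w <- ws ]

-- The query is stemmed with the same function before lookup.
search :: Map.Map String [DocId] -> String -> [DocId]
search idx q = Map.findWithDefault [] (stemWord q) idx

main :: IO ()
main = do
  let idx = buildIndex [ (1, ["dentistry", "clinic"])
                       , (2, ["dental", "hygiene"]) ]
  print (search idx "dentist")  -- both documents match via the shared stem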
Have a look at WordNet. WordNet is an organized ontology of words and concepts with links for various types of relations between words. I'm not sure if it will have exactly the relationships you want, but it's probably a good start. There are many interfaces in various programming languages (Java and Python that I've used; presumably many more).

Finding words from a dictionary in a string of text

How would you go about parsing a string of free-form text to detect things like locations and names based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries, if not more, in my dictionaries, so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken this falls within the field of natural language processing, and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.

Synonym style text lookup and parsing

We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized, and it's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be to match the words against keywords held for each category, and if that was unsuccessful, to use some kind of synonym lookup instead. So for example, if a particular record contained the word "automobile", a synonym lookup could match that word to the word "car", which would be held against the category "vehicle" (see the sketch after this question).
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this, but from what I can make out, that doesn't offer what these guys are looking for.
Any suggestions for other ways of getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into WordNet.
Do you know of any other textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't find any real-world examples of it.
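Here is a small Haskell sketch of the two-step lookup described in the question: try the per-category keyword table directly, and fall back to a synonym table if that fails. Both tables are invented for illustration; in practice the synonym side could be populated from WordNet, as the answers below suggest.

-- Two-step categorization: direct keyword match first, then a synonym
-- lookup as a fallback. Both tables are invented for illustration; the
-- synonym table could be populated from WordNet.
import           Data.Char       (isAlpha, toLower)
import           Data.Maybe      (listToMaybe, mapMaybe)
import qualified Data.Map.Strict as Map

type Category = String

keywordTable :: Map.Map String Category
keywordTable = Map.fromList [ ("car", "vehicle"), ("truck", "vehicle"), ("apple", "food") ]

synonymTable :: Map.Map String String
synonymTable = Map.fromList [ ("automobile", "car"), ("lorry", "truck") ]

tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- Try the keyword table directly; if that fails, map the word through the
-- synonym table and try again.
categorizeWord :: String -> Maybe Category
categorizeWord w =
  case Map.lookup w keywordTable of
    Just c  -> Just c
    Nothing -> Map.lookup w synonymTable >>= \syn -> Map.lookup syn keywordTable

-- First category suggested by any word in the description field.
categorize :: String -> Maybe Category
categorize = listToMaybe . mapMaybe categorizeWord . tokenize

main :: IO ()
main = print (categorize "A vintage automobile in excellent condition")
-- Just "vehicle": "automobile" -> synonym "car" -> category "vehicle"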
The first thing that comes to mind is WordNet. WordNet is a human-generated database of words and related words, including synonyms. The Wikipedia WordNet entry lists several interfaces to WordNet. I believe some of them are web services.
You can also roll your own. Chapter 5 of Manning and Schütze (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? A tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms matter less if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
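To make the "sets of words" suggestion concrete (and since Bayesian examples were asked about above), here is a minimal multinomial Naive Bayes sketch in Haskell, the language of the original question. The training data is invented and there is no real smoothing or feature selection beyond add-one counts; treat it as an illustration of the idea, not a production classifier.

-- Minimal multinomial Naive Bayes over bags of words. Training data is
-- invented; a real deployment would use many labelled descriptions plus
-- proper feature selection. Add-one (Laplace) smoothing only.
import           Data.Char       (isAlpha, toLower)
import           Data.List       (maximumBy)
import           Data.Ord        (comparing)
import qualified Data.Map.Strict as Map

type Category = String

tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

data Model = Model
  { wordCounts :: Map.Map (Category, String) Int  -- count of each word per category
  , docCounts  :: Map.Map Category Int            -- documents per category
  , vocabSize  :: Int
  }

train :: [(Category, String)] -> Model
train examples = Model wc dc (Map.size vocab)
  where
    wc    = Map.fromListWith (+) [ ((c, w), 1) | (c, d) <- examples, w <- tokenize d ]
    dc    = Map.fromListWith (+) [ (c, 1) | (c, _) <- examples ]
    vocab = Map.fromListWith (+) [ (w, 1 :: Int) | (_, d) <- examples, w <- tokenize d ]

-- Log-probability (up to a constant) of a category given a bag of words.
scoreCat :: Model -> [String] -> Category -> Double
scoreCat m ws c = logPrior + sum (map logLikelihood ws)
  where
    totalDocs = fromIntegral (sum (Map.elems (docCounts m)))
    catDocs   = fromIntegral (Map.findWithDefault 0 c (docCounts m))
    catWords  = fromIntegral (sum [ n | ((c', _), n) <- Map.toList (wordCounts m), c' == c ])
    logPrior  = log (catDocs / totalDocs)
    logLikelihood w =
      let n = fromIntegral (Map.findWithDefault 0 (c, w) (wordCounts m))
      in  log ((n + 1) / (catWords + fromIntegral (vocabSize m)))

classify :: Model -> String -> Category
classify m doc = maximumBy (comparing (scoreCat m (tokenize doc))) (Map.keys (docCounts m))

main :: IO ()
main = do
  let model = train [ ("vehicle", "car truck automobile engine wheels tyres")
                    , ("food",    "apple bread cheese dinner recipe kitchen") ]
  putStrLn (classify model "a used automobile with new wheels")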
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating with it in lots of languages.
Go to their online tool to see it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word and then click on "S" next to each definition, you'll get a list of words semantically related to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.

Resources