I am currently working on a predictive text SMS system. I want to implement it using TST data structure and bi-gram (Predicting the next probable word based on current key sequence 12-keypad).
Currently I have a corpus and have used the available applications to come up with a dictionary, bi-gram and frequencies. Currently have the following questions in mind:
Can I find a J2ME TST implementation or a suitable Trie on this case? (More detailed explanation on the available TST trie can be great)
A general guidance on this project approach
NB:I Have looked at similar Trie implementations but still unable to figure out a way forward
You may look at this: http://algs4.cs.princeton.edu/52trie/
Alternatives can be found at Algorithms, 4th Edition in general...
Related
I am working on a natural language processing application. I have a text describing 30 domains. Each domain is defined with a short paragraph that explains it. My aim is to build a thesaurus from this text so I can determine from an input string which domains are concerned. The text is about 5000 words and each domains is described by 150 words. My questions are :
Do I have a long enough text to create a thesaurus from ?
Is my idea of building a thesaurus legit or should I just use NLP libraries to analyse my corpus and the input string ?
At the moment, I have calculated the number total of occurrence of each words grouped by domains because I first thought of a indexed approach. But I am really not sure which method is the best. Does someone have experience in both NLP and thesaurus building ?
I think what you are looking for is topic modeling. Given a word, you want to get the probability of which domain the word belongs to. I would recommend using off the shelf algorithms that implement LDA (Latent Dirichlet Algorithm).
Alternatively, you can visit David Blei's website. He has written some great software that implements LDA, and topic modeling in general. He also has presented several tutorials for topic modeling for beginners.
If your goal is to build a thesaurus then build a thesaurus; if your goal is not to build a thesaurus, then you better use stuff available out there.
More generally, for any task in NLP - from data acquisition to machine translation - you're gonna face numerous problems (both technical and theoretical), and it is very easy to stray from the path, as these problems are - most of the time - fascinating.
Whatever the task is, build a system using existing resources. Then you get the big picture; then you can start thinking about improving component A or B.
Good luck.
I'm trying to build a local version of the freebase search api using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types to allow for the use case where your target topic doesn't hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was based on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive
Probably they use an inverted Index over selected fields, such as the English name, aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
Most likely it's a trie with lexicographical order.
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth Morriss-Pratt etc. You might also want to check up on Edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the Simmetrics library by the University of Sheffield.
Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There is definitely a lot of books and articles you can read about it but I will try to provide a short introduction. I am not a big expert but I worked on some of this stuff.
First you need to decide whether you are want to classify essays into predefined topics/categories (classification problem) or you want the algorithm to decide on different groups on its own (clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
The TF-IDF is a short for TermFrequence - InverseDocumentFrequency. Basically the idea is for any given document to find a number of documents in training set that are most similar to it, and then figure out it's category based on that. For example if document D is similar to T1 which is physics and T2 which is physics and T3 which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is you apply the most importance to rare words and no importance to common words. For instance 'nuclei' is rare physics word, but 'work' is very common non-interesting word. (That's why it's called inverse term frequency). If you can work with Java, there is a very very good Lucene library which provides most of this stuff out of the box. Look for API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
I want to do a riddle AI chatbot for my AI class.
So i figgured the input to the chatbot would be :
Something like :
"It is blue, and it is up, but it is not the ceiling"
Translation :
<Object X>
<blue>
<up>
<!ceiling>
</Object X>
(Answer : sky?)
So Input is a set of characteristics (existing \ not existing in the object), output is a matched, most likely object.
The domain will be limited to a number of objects, i could input all attributes myself, but i was thinking :
How could I programatically build a database of characteristics for a word?
Is there such a database available? How could i tag a word, how could i programatically find all it's attributes? I was thinking on crawling wikipedia, or some forum, but i can't see it build any reliable word tag database.
Any ideas on how i could achieve such a thing? Any ideas on some literature on the subject?
Thank you
This sounds like a basic classification problem. You're essentially asking; given N features (color=blue, location=up, etc), which of M classifications is the most likely? There are many algorithms for accomplishing this (Naive Bayes, Maximum Entropy, Support Vector Machine), but you'll have to investigate which is the most accurate and easiest to implement. The biggest challenge is typically acquiring accurate training data, but if you're willing to restrict it to a list of manually entered examples, then that should simplify your implementation.
Your example suggests that whatever algorithm you choose will have to support sparse data. In other words, if you've trained the system on 300 features, it won't require you to enter all 300 features in order to get an answer. It'll also make your training and testing files smaller, because you'll be omit features that are irrelevant for certain objects. e.g.
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism=true
cat | has_fur:true,eats_mice:true,is_an_animal=true,is_an_organism=true
It might not be terribly helpful, since it's proprietary, but a commercial application that's similar to what you're trying to accomplish is the website 20q.net, albeit the system asks the questions instead of the user. It's interesting in that it's trained "online" based on user input.
Wikipedia certainly has a lot of data, but you'll probably find extracting that data for your program will be very difficult. Cyc's data is more normalized, but its API has a huge learning curve. Another option is the semantic dictionary project Wordnet. It has reasonably intuitive APIs for nearly every programming language, as well as an extensive hypernym/hyponym model for thousands of words (e.g. cat is a type of feline/mammal/animal/organism/thing).
The Cyc project has very similar aims: I believe it contains both inference engines to perform the AI, and databases of facts about commonsense knowledge (like the colour of the sky).
We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized and it's been suggested that the easiest way to to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be matching the words to key words held against each category and if that was unsuccessful then to use some kind of synonym look up to see if this could be used instead. So for example, if a particular record had the word "automobile" in it then a synonym look up could match that word to the word "car" which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any suggestions of other getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there. I see there's some discussion of using Bayasian algorithms for this but I can't see any real world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating against them in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.