iOS: Search on a main word (noun), not its article

I am writing a TableView app where people can search for a word in a foreign language. In this language, the article is important as it tells the word's gender.
A reasonable English example is "The Book".
I want to search for "Book", not "The".
Any ideas on the best way to do this?
Many thanks

You need a secondary index that is free of noise words (articles and other stop words) and search against that. There are also some full-text search libraries for iOS, or you can build your own copy of SQLite with the full-text search module enabled.
You might also consider preprocessing the query, for example using a stemming algorithm to reduce each word to its root and then searching on that with a wildcard (e.g. 'consideration' >> 'consider*').
Locayta search for iOS: http://www.locayta.com/iOS-search-engine/locayta-search-mobile/register-for-download
Building SQLite with full-text search on iOS: http://longweekendmobile.com/2010/06/16/sqlite-full-text-search-for-iphone-ipadyour-own-sqlite-for-iphone-and-ipad/
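As a rough illustration of both ideas (a noise-word-free index plus a stemmed, wildcarded query), here is a minimal sketch using Python's sqlite3; it assumes an SQLite build with the FTS4 module enabled, as in the second link, and the table and column names are invented for the example. The same SQL carries over to whatever SQLite wrapper you use on iOS.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # assumes this SQLite build has FTS4 compiled in
    # Index only the headword; the article lives in a separate column that the
    # query never searches, so "The" can never match.
    conn.execute("CREATE VIRTUAL TABLE words USING fts4(headword, article)")
    conn.execute("INSERT INTO words (headword, article) VALUES ('Book', 'The')")

    # The trailing '*' gives prefix matching, so a query stemmed to its root still hits.
    rows = conn.execute(
        "SELECT article, headword FROM words WHERE words MATCH 'headword:book*'"
    ).fetchall()
    print(rows)  # [('The', 'Book')]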

Are you talking about looking something up in a database, e.g. SQLite? SQLite can be built with the Full Text Search (FTS) extension, which lets you search for individual words in text. Even without FTS you can use a LIKE match in SQLite to find a word in a phrase, though FTS is much faster and more flexible.
You can also implement your own poor man's Key Word In Context (KWIC) scheme: for an N-word phrase, enter the item in the database N times, each time rotated by one word, as sketched below.
There are also variations on the KWIC scheme that handle large numbers of phrases with less duplication, using a tree structure to access the data. With such approaches it's practical to implement search without a keyboard at all, just by successively refining the table contents.
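To make the rotation idea concrete, here is a tiny sketch (Python, purely illustrative) of generating the N rows you would store for an N-word phrase:

    def kwic_rotations(phrase):
        # One rotation per word: each rotation starts with a different word,
        # so a prefix search on the stored text finds the phrase by any of its words.
        words = phrase.split()
        return [" ".join(words[i:] + words[:i]) for i in range(len(words))]

    print(kwic_rotations("The Book"))
    # ['The Book', 'Book The']  -- store both rows pointing at the same entry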

Related

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
I can also think of having a lookup hash table with names of countries and cities, and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the sheer volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
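For example, the Stanford model can be driven from Python through NLTK's wrapper. This is a hedged sketch: the jar and model file names are placeholders for wherever you unpack the Stanford NER download.

    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(
        "english.all.3class.distsim.crf.ser.gz",  # model shipped with Stanford NER (placeholder path)
        "stanford-ner.jar",                       # path to the Stanford NER jar (placeholder path)
    )
    tokens = "Heavy rain expected in New York and Buffalo tonight".split()
    print([tok for tok, tag in tagger.tag(tokens) if tag == "LOCATION"])
    # ['New', 'York', 'Buffalo']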
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word when your binary search leads you to a (possible!) multi-word result. Then, if the full comparison eventually fails (possibly several words later), you simply fall back to the word immediately after the one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only sequences made up of characters from this list can be considered a valid 'word'.
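A minimal sketch of that scheme (Python, using bisect for the binary search; the location list is a toy example):

    import bisect

    def find_locations(text, sorted_locations):
        # Walk word by word; at each word start a binary search, keep extending the
        # candidate while it is still a prefix of some list entry, and record exact hits.
        words = text.lower().split()
        found = []
        for start in range(len(words)):
            for end in range(start, len(words)):
                candidate = " ".join(words[start:end + 1])
                i = bisect.bisect_left(sorted_locations, candidate)
                if i < len(sorted_locations) and sorted_locations[i] == candidate:
                    found.append(candidate)
                if i == len(sorted_locations) or not sorted_locations[i].startswith(candidate):
                    break  # nothing in the list can start with this candidate
        return found

    locations = sorted(["new york", "london", "people's republic of china"])
    print(find_locations("flying from New York to London tomorrow", locations))
    # ['new york', 'london']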
How fast are the tweets coming in? As in, is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few handle tweets well because of all the leetspeak. The NLP can be tuned for precision or recall depending on your needs, to limit how many lookups you perform against the gazetteer.
I recommend looking at Rosoka (also available as Rosoka Cloud through Amazon AWS) and GeoGravy.

what algorithm does freebase use to match by name?

I'm trying to build a local version of the freebase search api using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for it by playing with the service for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
Probably they use an inverted index over selected fields, such as the English name, aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
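The core idea is small enough to sketch (a toy inverted index in Python over the example names above; Lucene implements the same structure at scale, with scoring on top):

    from collections import defaultdict

    index = defaultdict(set)  # term -> set of document ids

    def add(doc_id, name):
        for term in name.lower().split():
            index[term].add(doc_id)

    def search(query):
        # AND semantics: documents containing every query term
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    add(1, "Apo Hiking Society")
    add(2, "Hiking Georgia")
    add(3, "Hiking trail")
    print(search("hiking"))          # {1, 2, 3}
    print(search("hiking georgia"))  # {2}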
Most likely it's a trie with lexicographical order.
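A bare-bones version of that idea (Python, toy data; note that a plain prefix trie only finds names that start with the query, so matching "Apo Hiking Society" on "Hiking" would additionally need word-level entries):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.entries = []  # names that pass through this node

    root = TrieNode()

    def insert(name):
        node = root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.entries.append(name)

    def complete(prefix, limit=5):
        node = root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.entries[:limit]

    for name in ["Hiking", "Hiking Georgia", "Hiking trail"]:
        insert(name)
    print(complete("hik"))  # ['Hiking', 'Hiking Georgia', 'Hiking trail']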
There are a number of string-matching algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to look at edit-distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
One implementation of such algorithms is the SimMetrics library from the University of Sheffield.
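For reference, the Levenshtein distance mentioned above is a short dynamic program (sketched in Python):

    def levenshtein(a, b):
        # Edit distance counting insertions, deletions and substitutions.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("hikng", "hiking"))  # 1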

How to extract keywords from a block of text in Haskell

So I know this is kind of a large topic, but I need to accept a chunk of text and extract the most interesting keywords from it. The text comes from TV captions, so the subject can range from news to sports to pop-culture references. It is possible to provide the type of show the text came from.
My idea is to somehow match the text against a dictionary of terms I know to be interesting.
Which libraries for Haskell can help me with this?
Assuming I do have a dictionary of interesting terms, and a database to store them in, is there a particular approach you'd recommend to matching keywords within the text?
Is there an obvious approach I'm not thinking of?
I'd stem the words in the chunks and then look up all the terms in the dictionary.
Two libraries to start with:
stemming: http://hackage.haskell.org/packages/archive/stemmer/0.2/doc/html/NLP-Stemmer-C.html
search: http://hackage.haskell.org/packages/archive/sphinx/0.2.1/doc/html/Text-Search-Sphinx.html
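The pipeline itself is language-agnostic: stem both the dictionary and the caption text, then look each stemmed token up. Here is the shape of it sketched in Python rather than Haskell (the toy stem() just strips a few suffixes and is only for illustration; the linked Haskell stemmer and Sphinx bindings give you proper versions of both pieces):

    def stem(word):
        # Crude suffix stripping, purely illustrative
        for suffix in ("ing", "ers", "er", "es", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    interesting = {stem(w) for w in ["election", "touchdown", "hiking"]}

    def keywords(caption):
        tokens = [w.strip(".,!?\"'").lower() for w in caption.split()]
        return [w for w in tokens if stem(w) in interesting]

    print(keywords("The quarterback threw a late touchdown before election coverage"))
    # ['touchdown', 'election']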
To expand on bpgergo's answer (though I don't have any Haskell-specific info): it's pretty straightforward to enter documents into a relational database and index them with Solr/Lucene or Sphinx, either of which should have a stemmer in its default/suggested configuration. You can then search on which docs have pairs, triples, etc. of your list of "interesting terms".
You might look at named entity recognition, statistically unusual phrase detection, auto-tag generation, topics like that. LingPipe is a good place to start; also these books:
http://alias-i.com/lingpipe/demos/tutorial/read-me.html
http://www.manning.com/marmanis/excerpt_contents.html
http://www.manning.com/alag/excerpt_contents.html

smart search by first/last name

I have to build a search facility capable of searching members by their first name/last name and maybe some other search parameters (e.g. address).
The search should provide a list of match candidates so that the user can select whatever he/she deems the "correct" match.
The search should be smart enough that the "correct" result is among the first few items on the list. The search should also be tolerant of typos and misspellings and, maybe, even be aware of name shortcuts, i.e. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and its family (like Elasticsearch) as tools for the job. While it has an impressive array of features addressing similar problems for full-text search, I am not so sure how to use them for my task, to the point that maybe Lucene is not the right tool here at all.
What do you guys think: how can I harness Elasticsearch to solve my problem? Or should I look elsewhere?
Lucene supports fuzzy (edit-distance) queries, so your search can tolerate some typos; you specify the required similarity for a term.
For instance:
name:johnni~0.8
would return "johnny".
Solr also provides a wide array of ready-made search filters and analyzers you can use.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for matching sounds-like queries, e.g. kris -> chris
Look at the documentation for those filter factories; it is pretty straightforward to set up a new Solr instance with an analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
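For a sense of what the trim/lowercase/accent-folding steps do (the phonetic step would need a metaphone or soundex encoder on top), here is a rough Python equivalent; it only illustrates the normalization, it is not Solr's actual analyzer code:

    import unicodedata

    def normalize(name):
        name = name.strip().lower()
        # Decompose accented characters and drop the combining marks: "Gómez" -> "gomez"
        decomposed = unicodedata.normalize("NFKD", name)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(normalize("  GÓMEZ "))  # gomez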
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
In addition to what @Asaf mentioned, you might try using n-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.
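A quick sketch of the n-gram idea (Python, toy data): index names by their character trigrams and rank candidates by trigram overlap, which tolerates spelling variants without any language knowledge.

    def trigrams(s):
        s = "  " + s.lower() + " "  # pad so word starts produce grams too
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def similarity(a, b):
        ga, gb = trigrams(a), trigrams(b)
        return len(ga & gb) / len(ga | gb)  # Jaccard overlap

    names = ["Jonathan Smith", "John Smyth", "Joan Smithers"]
    for name in sorted(names, key=lambda n: similarity("Jon Smith", n), reverse=True):
        print(round(similarity("Jon Smith", name), 2), name)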

Recommendations for a simple search engine for bag of words?

Any recommendations for a small, lightweight, bag-of-words search engine?
I have a set of 'documents' that are each basically a small bag of arbitrary words.
Given a new document, I need to get a list of 'similar' documents along with some weight for how similar they might be. Documents are likely to be small.. a couple paragraphs at most.
Stemming would be great but is not strictly required.
Word expansion with WordNet is not required.
Open source or freeware preferred, as this is a prototype, not a full-blown project.
Unix/Linux platform preferred.
I'd be using it as a subcomponent: I expect only to feed it documents with an ID, and later search for documents 'similar' to one I currently have.
Whoosh is a pure Python (no C, no external database) indexer / search engine. Check out the documentation for more information. It does support stemming.
I tried it out on an XML dump of a mediawiki instance and it seemed to work pretty well!
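A hedged sketch of how that could look for the 'similar documents' use case (the documents and IDs here are made up): index each bag of words under an ID, then query with the words of the new document using OR semantics and treat the scores as similarity weights.

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.analysis import StemmingAnalyzer
    from whoosh.qparser import OrGroup, QueryParser

    os.makedirs("indexdir", exist_ok=True)
    schema = Schema(doc_id=ID(stored=True, unique=True),
                    body=TEXT(analyzer=StemmingAnalyzer()))
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(doc_id="a", body="apple banana cherry")  # toy documents
    writer.add_document(doc_id="b", body="banana cherry date")
    writer.commit()

    with ix.searcher() as searcher:
        # OrGroup: a document sharing any word matches; the BM25 score is the weight.
        parser = QueryParser("body", ix.schema, group=OrGroup)
        for hit in searcher.search(parser.parse("cherry date elderberry"), limit=10):
            print(hit["doc_id"], round(hit.score, 3))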
Solr or Sphinx. They aren't exactly lightweight, but I wouldn't recommend anything smaller: if the project turns out to be successful and needs to grow, switching the search engine later might be painful.
I think that Lucene is an option. It should allow you to build a custom bag of words search engine.
I wonder about MongoDB http://www.mongodb.org/display/DOCS/Home
It seems like 'full-text search' may be what I'm after, and having additional fields to search on may be handy.
