Build an index for substring search? - full-text-indexing

I want to do general substring search over billions of strings. The requirement differs a little from ordinary full-text search because I want the query "ubst" to also match "substr".
Is Lucene or Sphinx capable of this? If not, what do you think is the best way to do it?

The best index structure for this case is a suffix tree.
Lucene does not implement this type of index, so its substring search is slow. Lucene does have a prefix-tree index, though, which means searches are fast when you look up terms by their prefix.
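
To make the idea concrete, here is a minimal Python sketch of a suffix-based index. It uses a sorted list of suffixes (a naive suffix array) rather than a real suffix tree, and the names build_suffix_index and search are made up for illustration. It keeps every suffix in memory, so it only demonstrates the lookup principle and would not scale to billions of strings as written.

```python
from bisect import bisect_left


def build_suffix_index(strings):
    """Collect every suffix of every string, tagged with the string's position."""
    entries = []
    for doc_id, text in enumerate(strings):
        for start in range(len(text)):
            entries.append((text[start:], doc_id))
    entries.sort()
    return entries


def search(entries, query):
    """Return the ids of all strings that contain query as a substring."""
    # A substring match is a prefix match on some suffix, so binary-search
    # for the first suffix >= query and scan while the prefix still matches.
    i = bisect_left(entries, (query,))
    hits = set()
    while i < len(entries) and entries[i][0].startswith(query):
        hits.add(entries[i][1])
        i += 1
    return hits


strings = ["substr", "prefix", "substring search"]
index = build_suffix_index(strings)
print(sorted(search(index, "ubst")))  # [0, 2]
```

Because the suffixes are sorted, each lookup is a binary search plus a short scan, which is why substring queries become as cheap as prefix queries.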

Lucene is one of the best available options. Lucene supports substring search, so "ubst" will match "substr".
Check out http://wiki.apache.org/lucene-java/LuceneImplementations for a suitable language implementation.

Sphinx has supported efficient substring search since version 2.0.1-beta (22 Apr 2011). Unfortunately, as of today this support exists only in beta versions, as mentioned here.
I tried the 2.1.1 beta version, and it seems to work correctly. See the manual entry for the dictionary type (dict) and read about the keywords type.
When I tried the 2.0.6 release version, it fell back to the inefficient crc index and gave the following warning during indexing:
WARNING: min_infix_len is not supported yet with dict=keywords; using dict=crc
My minimal configuration file:
source sour
{
    type = xmlpipe2
    xmlpipe_command = type C:\Temp\1\sphinx\input.xml
}
index inde
{
    source = sour
    path = testpa
    enable_star = 1
    dict = keywords
    charset_type = utf-8
    min_infix_len = 1
}

Related

Retrieving the span of a fuzzy match

I'm trying to fuzzy-search for a short text in a larger text.
Common python libs, such as fuzzywuzzy and rapidfuzz, support the "partial_ratio" function, but those only return a score, not the location of the match.
Is there some library or function which I can use to also obtain where the fuzzy match was, (something like the span method of regex match)?
I looked at fuzzywuzzy and noted that finding the index of a match is an open issue. The same is true for RapidFuzz.
Your remark "(something like the span method of regex match)" prompted me to do some research around that method. During my research I found the Python package regex, whose Readme talks about fuzzy matching. I haven't used this package, but it seems that it might be useful for your use case.
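
As a rough illustration of what "score plus location" could look like, here is a brute-force sketch that uses only Python's standard difflib module. It is not how fuzzywuzzy, RapidFuzz or regex work internally, and best_fuzzy_span is a hypothetical helper name. It assumes the match has the same length as the search text, so it will miss matches that are noticeably longer or shorter.

```python
from difflib import SequenceMatcher


def best_fuzzy_span(needle, haystack):
    """Return (start, end, score) of the haystack window most similar to needle.

    Only windows of exactly len(needle) characters are compared.
    Returns None when no such window exists.
    """
    width = len(needle)
    if width == 0 or width > len(haystack):
        return None
    best_start, best_score = 0, -1.0
    for start in range(len(haystack) - width + 1):
        window = haystack[start:start + width]
        score = SequenceMatcher(None, needle, window).ratio()
        if score > best_score:  # strict '>' keeps the earliest best window
            best_start, best_score = start, score
    return best_start, best_start + width, best_score
```

The returned (start, end) pair can be used like the result of a regex match's span(), for example haystack[start:end]. The cost grows with the product of both lengths, so this is only practical for short texts.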

Looking up a list of words into a String in Scala

I need to find if any word from a word list (which could be a Set or List or another structure) is contained (as a sub-string) in another String and I need the best possible performance.
This could be an example:
val query = "update customer set id=rand() where id=1000000009;"
val wordList = Set("NOW(", "now(", "LOAD_FILE(", "load_file(", "UUID(", "uuid(", "UUID_SHORT(",
"uuid_short(", "USER(", "user(", "FOUND_ROWS(", "found_rows(", "SYSDATE(", "sysdate(", "GET_LOCK(", "get_lock(",
"IS_FREE_LOCK(", "is_free_lock(", "IS_USED_LOCK(", "is_used_lock(", "MASTER_POS_WAIT(", "master_pos_wait(",
"RAND(", "rand(", "RELEASE_LOCK(", "release_lock(", "SLEEP(", "sleep(", "VERSION(", "version(")
What is the best option to achieve the best performance? I have read about the contains method, but it doesn't work for sub-strings. Is the only option to iterate through the list and use the indexOf method, or is there a better option?
For Scala collections, the method to use in order to answer a question like "is there an item in this collection that satisfies my condition?" is exists (scroll up slightly when you get there because the scaladoc pages are weird about linking directly to methods).
Your condition is "does the string (query) contain this item (word)?" For this, you can use String's contains method, which comes from Java.
Putting it together, you'll get
wordList.exists { word => query.contains(word) }
// or, with some syntax sugar
wordList exists { query.contains }
You can also use .find instead of .exists, which will return an Option containing the first match that was found, instead of just a Boolean indicating whether or not something was found.
scala> wordList.exists(query.contains)
res1: Boolean = true
scala> wordList.find(query.contains)
res2: Option[String] = Some(rand()
Some advice toward a solution:
First check that you need to optimize at all. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Array is the collection with the fastest element access. Use it to speed up access.
Using a ParArray may sometimes improve performance.
If it's acceptable, for best performance first convert the query string to lower case and remove all the UPPER_CASE variants from the set.
You can write your own "contains" method that looks for any of the substrings. E.g., you can group words by their prefixes (or suffixes) and skip a whole group as soon as the next (or previous) character differs.
Use native Java to increase performance (Scala can wrap arrays).
First find all positions of "(", because every variant ends with it. Then you can check the characters before it, starting with the word's last letter.
Sorry for my English. This is not the best advice, but I know only a small number of people (e.g. on acm.timus.ru) who can write faster functions in Scala.
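
The lower-casing advice and the idea of looking for all words in one pass can be sketched as follows. The sketch is in Python rather than Scala, purely to illustrate the approach; WORDS, PATTERN and first_match are illustrative names, and whether one compiled alternation is actually faster than a plain exists loop is something to measure, not assume.

```python
import re

# Lower-case variants only: the query is lower-cased once before matching,
# so the UPPER_CASE duplicates from the original set are no longer needed.
WORDS = [
    "now(", "load_file(", "uuid(", "uuid_short(", "user(", "found_rows(",
    "sysdate(", "get_lock(", "is_free_lock(", "is_used_lock(",
    "master_pos_wait(", "rand(", "release_lock(", "sleep(", "version(",
]

# One compiled pattern that matches any of the words; re.escape makes
# the "(" in each word a literal character.
PATTERN = re.compile("|".join(re.escape(word) for word in WORDS))


def first_match(query):
    """Return the first listed word found in query, or None if there is none."""
    match = PATTERN.search(query.lower())
    return match.group(0) if match else None


print(first_match("update customer set id=rand() where id=1000000009;"))  # rand(
```

Halving the word list and scanning the query once are the two savings here; the Scala equivalent would lower-case the query with toLowerCase before calling exists.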

Neo4j - index lookup issue

I was trying to change the index type from exact to fulltext in the neo4j shell so that I can do case-insensitive searches with Lucene queries. I used this command:
index --set-config Destination type fulltext
but it didn't work. I still couldn't do a case-insensitive search, so I played around and changed some other values, such as _blueprints:type and to_lower_case.
That didn't do any good.
Now it somehow ignores the first character of the name value (weird!). For example, if I search for "London" and type "Lon", it returns nothing, but if I type "ond", it returns the node. The same happens for every node.
I tried setting everything back to normal. That didn't help.
What did I mess up? What am I missing?
I am using the Everyman PHP library to communicate with the database.
I created a new index with the "to_lower_case" property.
I think that will solve my problem; I just have to convert the string to lower case before inserting it into the query. It seems to work.
Setting configuration afterwards doesn't update already indexed values (as the shell notes, I think). If you've created your index with "to_lower_case=true" then additions as well as queries will have the values converted to lower case. Calling Index#get will still require you to lower-case it yourself.

Solr sort by min of two fields?

I want to sort a result set by the minimum of several fields.
So after reading the functionquery documentation this is what I came up with:
sort={!func}min(dvd_available_from_tdt,dto_available_from_tdt)%20desc
I also tried:
sort=_val_:min(dvd_available_from_tdt,dto_available_from_tdt)%20desc
sort=_val_:"min(dvd_available_from_tdt,dto_available_from_tdt)"%20desc
sort=_val_:"min(dvd_available_from_tdt,dto_available_from_tdt)%20desc"
sort="{!func}min(dvd_available_from_tdt,dto_available_from_tdt)"%20desc
sort={!func}min(dvd_available_from_tdt,dto_available_from_tdt)%20desc
sort="min(dvd_available_from_tdt,dto_available_from_tdt)"%20desc
and also some other placements of the quotes. But no matter what I always get this error:
HTTP ERROR: 400
Missing sort order.
Can anybody point me in the right direction?
Try using a query that matches all documents, with a constant score, plus a function.
http://localhost:8983/solr/select/?q=*%3A*+_val_:price&version=2.2&start=0&rows=10&indent=on&debugQuery=true
Also, upgrading to Solr 3.3 is not that painful, and there's all sorts of cool new toys like sorting by function.
Sorting by function seems to be available only from Solr 3.1 on. I am running 1.4.1.
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
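
For a Solr version that does support sorting by function, the request could be built roughly like this. This is a sketch under assumptions: the host and handler are taken from the example URL earlier in this thread, build_sort_url is a made-up helper, and whether min() accepts two field arguments depends on the Solr version, so check the FunctionQuery page for your release. Letting the library do the URL encoding avoids hand-placing %20 and quotes.

```python
from urllib.parse import urlencode


def build_sort_url(base, fields, order="desc"):
    """Build a select URL that sorts by the minimum of the given fields."""
    # The sort value is "<function> <order>"; the space between them is
    # what separates the sort expression from its direction.
    sort_expr = "min({}) {}".format(",".join(fields), order)
    params = {"q": "*:*", "sort": sort_expr}
    return base + "?" + urlencode(params)


url = build_sort_url(
    "http://localhost:8983/solr/select",
    ["dvd_available_from_tdt", "dto_available_from_tdt"],
)
print(url)
```

On Solr 1.4.1 this request would still fail with "Missing sort order", since the function is not recognized as a sort key there.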

In R, how do I replace a string that contains a certain pattern with another string?

I'm working on a project that involves cleaning a list of data on college majors. I find that a lot of them are misspelled, so I was looking to use the function gsub() to replace the misspelled ones with their correct spelling. For example, say 'biolgy' is a misspelling in a list of majors called Major. How can I get R to detect the misspelling and replace it with the correct spelling? I've tried gsub('biol', 'Biology', Major), but that only replaces the first four letters in 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but it doesn't detect other misspellings of 'biology'.
Thank you!
You should either define some nifty regular expression or use agrep from the base package. The stringr package is another option. I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me.
Anyway, agrep should do the trick:
> agrep("biol", "biology")
[1] 1
> agrep("biolgy", "biology")
[1] 1
EDIT:
You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:
biologySp = c("biolgy","biologee","bologee","bugs")
for(sp in biologySp){
  Major = gsub(sp, "Biology", Major)
}
If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching.
The Wikipedia page on approximate string matching might be useful, and try searching R-help for some of the key terms.
http://en.wikipedia.org/wiki/Approximate_string_matching
You could first match the majors against a list of available majors; any that don't match are the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if a value is similar to a correct one you will get a match).
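
The two-step idea above (exact match first, approximate match for the leftovers) can be sketched like this. The sketch is in Python and uses difflib.get_close_matches in place of R's agrep; the KNOWN_MAJORS list, the correct_major name and the 0.6 cutoff are assumptions for illustration and would need tuning on real data.

```python
from difflib import get_close_matches

# Hypothetical list of valid majors; in practice this comes from your data.
KNOWN_MAJORS = ["biology", "physics", "geography"]


def correct_major(value, known=KNOWN_MAJORS, cutoff=0.6):
    """Map a possibly misspelled major to the closest known major.

    Values that match exactly (ignoring case and surrounding spaces) are
    returned as-is from the known list; values with no sufficiently close
    candidate are returned unchanged so they can be reviewed by hand.
    """
    key = value.strip().lower()
    if key in known:
        return key
    candidates = get_close_matches(key, known, n=1, cutoff=cutoff)
    return candidates[0] if candidates else value


print(correct_major("biolgy"))  # biology
```

Leaving unmatched values untouched keeps the "bookkeeping by hand" step small: only the entries that are far from every known major still need a manual look.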
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:
> levenshtein.distance("physcs",c("biology","physics","geography"))
  biology   physics geography
        7         1         9
If you get identical minima then flip a coin:
> levenshtein.distance("biolsics",c("biology","physics","geography"))
  biology   physics geography
        4         4         8
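
For readers who want to see what the distance computes, here is a small Python sketch of the classic dynamic-programming Levenshtein distance together with a "pick the closest candidate" helper. It is not the vwr implementation; levenshtein and closest are illustrative names, and ties are resolved by taking the first candidate rather than flipping a coin.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest(word, candidates):
    """Return the candidate with the smallest distance to word (first one on ties)."""
    return min(candidates, key=lambda candidate: levenshtein(word, candidate))


print(closest("physcs", ["biology", "physics", "geography"]))  # physics
```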
example 1a) perl/linux regex: 's/oldstring/newstring/'
example 1b) R equivalent of 1a: srcstring=sub(oldstring, newstring, srcstring)
example 2a) perl/linux regex: 's/oldstring//'
example 2b) R equivalent of 2a: srcstring=sub(oldstring, "", srcstring)

Resources