I am pretty new to Solr and I am looking for a way to port the search features of my web application, which currently uses a regular database, to Solr indexes. My problem so far is that I have to customize the wildcard behaviour: for example, "?" should mean "0 or 1 characters" rather than any single character as it does now, "+" should mean any whitespace, "#" should mean any digit, and so on. Any good pointers?
Thanks!
There is no simple answer that I know of, I am afraid.
For 0 or 1 characters, you can replace the original query with an 'OR' query. E.g. mp? in your db search use case becomes 'mp OR mp?' in Solr.
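A minimal Python sketch of this rewrite, assuming the application builds the Solr query string itself. The function name `expand_optional` is hypothetical, and every "?" in the incoming term is treated as "0 or 1 characters", so each one is expanded into a variant without it and a variant that keeps Solr's single-character wildcard:

```python
from itertools import product


def expand_optional(term: str) -> str:
    """Rewrite a term whose '?' means '0 or 1 characters' as a Solr OR query.

    Hypothetical helper: each '?' is either dropped (0 characters) or kept
    as Solr's '?' wildcard (exactly 1 character).
    """
    pieces = term.split("?")
    optional_count = len(pieces) - 1
    variants = []
    # Try every combination of "drop this '?'" and "keep this '?'".
    for choice in product(("", "?"), repeat=optional_count):
        variant = pieces[0] + "".join(c + p for c, p in zip(choice, pieces[1:]))
        if variant and variant not in variants:
            variants.append(variant)
    return " OR ".join(variants)


print(expand_optional("mp?"))
```

The number of variants doubles with every "?" in the term, so this only stays practical for terms with a few optional positions.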
Whitespace is tokenized by default in text fields. So you can look at using a whitespace tokenizer as part of your custom 'text' field. There are several examples: text_ws in the sample schema does only whitespace tokenizing. You'd want to read up on tokenizers.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
There is no digit equivalent - you can do term1* OR term2* OR term3* ... etc. You can also use function queries that support numerical functions. http://wiki.apache.org/solr/FunctionQuery
It looks like the best choice in this case is to use regular expressions in the search. More details can be found here: http://1opensourcelover.wordpress.com/2013/09/29/solr-regex-tutorial/
It's not exactly what I was looking for, as I will have to build my own Solr query on the back end, and I have a feeling that heavy use of regular expressions will create a bit more overhead on my server. In the tests I did, though, it looks pretty fast.
I will leave the question open for a while; maybe someone can come up with a better answer.
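As a rough illustration of building such a query on the back end, here is a Python sketch that translates the custom wildcards from the question into a Lucene-style regular expression. The mapping table and both function names are assumptions for illustration, as is treating "*" as "any sequence of characters". Lucene regular expressions are matched against whole indexed terms, so the whitespace mapping is only meaningful on a field that is not split on whitespace:

```python
# Characters with special meaning in Lucene regular expressions, plus '/',
# which delimits a regex in the Solr query syntax.
LUCENE_REGEX_SPECIALS = set('.?+*|{}[]()"\\#@&<>~/')

# Hypothetical mapping from the application's wildcards to regex fragments.
WILDCARD_MAP = {
    "?": ".?",      # 0 or 1 characters
    "+": "[ \t]",   # one whitespace character (space or tab)
    "#": "[0-9]",   # one digit
    "*": ".*",      # assumption: any sequence of characters
}


def to_lucene_regex(pattern: str) -> str:
    """Translate an application wildcard pattern into a Lucene-style regex."""
    parts = []
    for ch in pattern:
        if ch in WILDCARD_MAP:
            parts.append(WILDCARD_MAP[ch])
        elif ch in LUCENE_REGEX_SPECIALS:
            parts.append("\\" + ch)  # escape literal special characters
        else:
            parts.append(ch)
    return "".join(parts)


def to_solr_regex_query(field: str, pattern: str) -> str:
    """Wrap the translated regex in Solr's field:/regex/ syntax."""
    return f"{field}:/{to_lucene_regex(pattern)}/"


print(to_solr_regex_query("title", "mp?"))
```

The resulting string still has to be URL-encoded before it is sent to Solr.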
I use Azure Search and have some documents with a field like this: {"Nr": "123.334.93"}.
If I search for querytype=full&search=123.334.93, it finds multiple documents, and if I search for querytype=full&search="123.334.93", it finds one document. This is as expected.
But if I search for querytype=full&search=123.334.9*, I expect multiple documents starting with 123.334.9, but no results are returned.
Am I missing something?
The same happens when I use a regular expression like this: querytype=full&search=/123\.334\.9.*/
Your query looks correct to me and should work.
A couple of things you might look into.
1) Sometimes you need to escape the * like this:
querytype=full&search=123.334.9\*
Usually, this is only necessary if you have more search terms after the *.
2) You can also narrow the fields searched down to only the field you need (for better efficiency) like this:
querytype=full&search=Nr:123.334.9\*
Hope this helps.
Based on the comment from Yahnoosh:
The analyzer of the field was set to "de.microsoft". I changed that to "standard.lucene", recreated and refilled the index, and it works as expected.
It seems that I have to be more careful when setting the analyzer and use language-specific ones only for fields with language-specific content.
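For illustration, the changed field definition might look roughly like the following, written here as a Python dict and serialized to JSON. Only the field name "Nr" and the two analyzer names come from this thread; the type and the `searchable` flag are assumptions about the index schema:

```python
import json

# Sketch of the "Nr" field definition after the change.
# "type" and "searchable" are assumed; the thread only mentions the analyzer.
nr_field = {
    "name": "Nr",
    "type": "Edm.String",
    "searchable": True,
    "analyzer": "standard.lucene",  # previously "de.microsoft"
}

field_json = json.dumps(nr_field, indent=2)
print(field_json)
```

As noted above, the index has to be recreated and refilled for the new analyzer to take effect.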
Thanks for your help.
Suppose I am searching using one of the cts:query APIs. I am looking for documents containing the phrase "John and Jane". Some of my documents have "John & Jane" (actually John &amp; Jane) in them. I want them to be returned as well. Also consider the reverse situation.
Does MarkLogic provide any options to do that?
Queries expressed as cts:query items or XML are easy to rewrite with XQuery typeswitch expressions. The discussion list thread at http://markmail.org/message/6hxmuqnpnfm73j4n has an example of something similar.
Mike gives a good suggestion, but it might be worth taking a step back and looking at your problem first. From your comment on Mike's answer, I take it that you are looking for something like thesaurus expansion, but for 'and' and '&' instead of other words.
I may be wrong, but to my knowledge MarkLogic doesn't provide features to take care of something like that automatically. Functions like search:search and search:parse are powerful, but don't go that far. It is up to you to take a search string like yours, break it into parts manually to wrap it in a cts:query (or use something like search:parse for that), and then use a trick like Mike's to walk through your query tree and expand any particular query node you would like to expand in a particular way.
The markmail thread to which Mike points gives an example of how to walk a query tree and manipulate it. That is a little heavy for this particular case, but there is a thesaurus module that can help in various general cases. The following chapter of the Search Dev Guide explains its features, and ends with a small example of how to apply it:
http://docs.marklogic.com/guide/search-dev/thesaurus#chapter
HTH!
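To make the "walk the query tree and expand a node" idea concrete, here is a language-neutral sketch in Python. It does not use MarkLogic's API: the dict-based tree, the node types, and both function names are hypothetical stand-ins for a parsed cts:query. Word nodes containing " and " or " & " are replaced by an "or" node holding every variant:

```python
def phrase_variants(text):
    """Return the phrase plus its 'and' / '&' alternatives."""
    variants = [text]
    if " and " in text:
        variants.append(text.replace(" and ", " & "))
    if " & " in text:
        variants.append(text.replace(" & ", " and "))
    return variants


def expand(node):
    """Recursively expand word nodes of a toy query tree."""
    if node["type"] == "word":
        variants = phrase_variants(node["text"])
        if len(variants) == 1:
            return node  # nothing to expand
        return {
            "type": "or",
            "children": [{"type": "word", "text": v} for v in variants],
        }
    # Composite node (e.g. "and", "or"): rebuild it with expanded children.
    return {"type": node["type"], "children": [expand(c) for c in node["children"]]}


query = {"type": "and", "children": [
    {"type": "word", "text": "John and Jane"},
    {"type": "word", "text": "wedding"},
]}
expanded = expand(query)
print(expanded)
```

In XQuery the same recursion would typically be written with a typeswitch over the cts:query node types, as in the thread linked above.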
Assume your search term is "John & Jane".
To search for it, you can use the following lines:
let $inputSearchDetails ="John & Jane"
let $InputXML := xdmp:unquote($inputSearchDetails, "", ("format-xml", "repair-full"))
Is there a way to use find() to search for barcodes that end with the digits 365478? I don't care what is in front.
I thought something like this.
$db.find(array('barcode'=>*365478))
But that does not seem to work. Am I missing something?
A regular expression of /365478$/ would provide the right filter.
Regex will work but you won't be able to use an index for the query. I'd suggest storing the last 6 digits in a separate field, put an index on it and use that.
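This works well: /365478$/ (the $ anchor has to come after the digits so that it matches the end of the barcode). Of course you can write more elaborate regexes, but keeping it simple pays off.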
You can use a regex for this but you have to store the barcode reversed if you want to be able to use an index (only rooted regex can use an index, see http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions)
If that is impractical for you, you will have to store the last X digits separately and do exact matching. If your database is going to be relatively small, you might be able to get away without indexing.
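A small Python sketch of the reversed-barcode idea, using plain dicts so it runs without a database. The field name `barcode_rev` and both helper names are hypothetical. The filter dict uses MongoDB's `$regex` operator with a pattern anchored at the start of the reversed field, which is the rooted form that can use an index:

```python
import re


def with_reversed_barcode(doc):
    """Return a copy of the document with the barcode also stored reversed."""
    return {**doc, "barcode_rev": doc["barcode"][::-1]}


def suffix_filter(suffix):
    """Build a filter matching barcodes that end with `suffix`.

    A suffix of the barcode is a prefix of the reversed barcode, so the
    regex can be anchored with '^'.
    """
    return {"barcode_rev": {"$regex": "^" + re.escape(suffix[::-1])}}


doc = with_reversed_barcode({"barcode": "000365478"})
query = suffix_filter("365478")
print(doc, query)
```

The filter dict could then be passed to a driver's find call, with an index created on `barcode_rev`.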
The regex should look like this: 365478$
Using PHP, for example:
$regex = new MongoDB\BSON\Regex('365478$');
I'm using solr to search for articles. I created 2 test "body" sentences which have the common word "tall", but there is no match.
The Query---> Body:"There are tall people outside" AND !UserId:2
Does not match a post with:
Body: the KU tower is really tall
UserId:3
Is this simply a very low matching score, or is there something else going on here? In the case of a low matching score, should it really be that low? The body sentences are very short and share a common word, so I would have expected some match.
EDIT: I think the matching isn't happening as a result of having the !UserId:2 condition. If I try to match body sentences without that, it's very liberal. Can anyone explain this, and perhaps how best to structure a query to avoid this type of behavior?
Thanks!
I have seen some funky behavior with the ! operator with Solr. I would suggest you use the - (negative indicator) instead as shown in the SolrQuerySyntax Wiki Page. Try changing your original query to Body:"There are tall people outside" AND -UserId:2 to see if that works as you are expecting.
For those who come after me: I found a solution, though not necessarily an explanation for its behavior.
The Solr query:
(PostBody:There are tall people outside) AND !UserId:2
worked as I desired above. Note that if the quotes are added around the body, it does not match. I believe Solr attempts to match such a query as a single string rather than individual words.
I am using the Lucene search engine but it only seems to find matches that occur at the beginning of terms.
For example:
Searching for "one" would match "onematch" or "one day a time" but not "loneranger".
The Lucene docs say it doesn't support wildcards at the front of a search string, so I am not sure whether Lucene can even match text inside a term or can only match terms that start with the search term.
Is this a problem with how I have created my index, how I am building my search query or just a limitation of Lucene?
Found some info in another post here on Stack Overflow: "[LUCENE.NET] Leading wildcard throws an error".
You can call SetAllowLeadingWildcard(true) on your QueryParser (the AllowLeadingWildcard property in newer Lucene.NET versions) to allow leading wildcards in your search. This will of course have an obvious, large performance impact, but it will allow users to find matches within a term.
Lucene will find a document if the search term appears anywhere within it, but by default it doesn't allow wildcard queries where the wildcard is at the front of the search term, because they perform horribly. If that is functionality you care about, you will have to either change a config flag (thanks for the interesting link; no low-level Lucene hacking is needed), find a third-party library that has already done the work, or find a different search implementation (for small enough datasets, the built-in search of many RDBMS engines is sufficient).
Your query should be
Query query = new WildcardQuery(new Term("contents", "*one*"));
where contents is the name of the field you are searching.
"one" should be enclosed in asterisks, with no space between the term and either asterisk.