Solr not matching. Threshold setting, or something weird? - search

I'm using Solr to search for articles. I created two test "body" sentences that share the word "tall", but there is no match.
The query: Body:"There are tall people outside" AND !UserId:2
Does not match a post with:
Body: the KU tower is really tall
UserId:3
Is this simply a very low matching score, or is something else going on? If it is a low score, should it really be that low? The body sentences are very short and share a common word, so I would have expected some match.
EDIT: I think the match fails because of the !UserId:2 condition. If I match body sentences without it, the matching is very liberal. Can anyone explain this, and perhaps how best to structure a query to avoid this behavior?
Thanks!

I have seen some funky behavior with the ! operator with Solr. I would suggest you use the - (negative indicator) instead as shown in the SolrQuerySyntax Wiki Page. Try changing your original query to Body:"There are tall people outside" AND -UserId:2 to see if that works as you are expecting.

For those who come after me: I found a solution, though not necessarily an explanation for the behavior.
The Solr query:
(PostBody:There are tall people outside) AND !UserId:2
worked as I wanted. Note that if quotes are added around the body text, it does not match. I believe that is because Solr treats quoted text as a phrase query, which must match the whole word sequence, rather than as individual words.
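To illustrate the difference between a quoted (phrase) query and an unquoted (per-term) query, here is a minimal Python sketch. It is not Solr code: tokenize, phrase_match and any_term_match are hypothetical helpers, and the whitespace tokenizer is only a stand-in for whatever analyzer the field really uses.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a crude stand-in for a Solr text analyzer."""
    return text.lower().split()


def phrase_match(query, body):
    """True only if all query tokens appear consecutively, in order, in body."""
    q, b = tokenize(query), tokenize(body)
    return any(b[i:i + len(q)] == q for i in range(len(b) - len(q) + 1))


def any_term_match(query, body):
    """True if at least one query token appears anywhere in body."""
    return bool(set(tokenize(query)) & set(tokenize(body)))


query = "There are tall people outside"
body = "the KU tower is really tall"

print(phrase_match(query, body))    # quoted behaviour: the whole sequence must appear
print(any_term_match(query, body))  # unquoted behaviour: one shared word is enough
```

One more thing that may be worth checking: in Lucene query syntax a field prefix applies only to the term that immediately follows it, so grouping the terms as PostBody:(There are tall people outside) is one way to keep every word on the same field.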

Wikipedia wildcard search not working?

I'm trying to do a wildcard search on Wikipedia but the search is not behaving the way the instructions say it should. Here's the advanced search help page:
https://en.wikipedia.org/wiki/Help:Advanced_search
As an example, it says this regarding a Wildcard search:
the query *stan will match Kazakhstan or Afghanistan or Stan Kenton.
However, when I attempt to do that search (or even click on the embedded link to that search), I only get
the page *stan does not exist
and it just lists a bunch of "Stan" entries starting with "Stan Laurel filmography."
Why would this feature not work? Am I missing something?
It does work. However, because direct matches for "stan" score higher than longer words that merely contain it, Kazakhstan is a long way down the results. You can narrow the results slightly with intitle:*stan, but the ranking is still poor. A quick check with k*stan shows that the wildcard itself works.
Conclusion: user-written help page has a bad example.
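A rough way to see why both patterns behave as described is to try them against a few titles with Python's fnmatch module. This is only a sketch: title_matches is a hypothetical helper, the title list is made up from the names mentioned above, and nothing here models Wikipedia's actual search engine or its ranking.

```python
from fnmatch import fnmatchcase


def title_matches(pattern, title):
    """True if any whitespace-separated word of title matches the wildcard pattern."""
    return any(fnmatchcase(word.lower(), pattern.lower()) for word in title.split())


titles = ["Kazakhstan", "Afghanistan", "Stan Kenton", "Stan Laurel filmography"]

all_stan = [t for t in titles if title_matches("*stan", t)]
k_stan = [t for t in titles if title_matches("k*stan", t)]

print(all_stan)  # every title qualifies, so ranking decides what you see first
print(k_stan)    # the narrower pattern leaves only Kazakhstan
```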

{exp:search:keywords} filtering out insignificant words

With my current project I need to display the exact search phrase entered on the results/no-results pages.
However the {exp:search:keywords} variable seems to have insignificant keywords removed.
“Who am I?” becomes “who am”
I understand why this is the case but for the purposes of this particular website I need the exact phrase.
Does anyone know how I can achieve this? Please let there be a workaround...
I am using the Simple Search module in 2.7.0
Thanks.
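In /system/expressionengine/modules/search/mod.search.php, line 236 (v2.7.0), use the raw posted keywords instead of the filtered ones:
// Original assignment, which appears to hold the already-filtered keywords:
//$original_keywords = $this->keywords;
// Use the raw submitted value instead; escape it (e.g. with htmlspecialchars) before output.
$original_keywords = $_POST['keywords'];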

Solr custom wildcard

I am pretty new to Solr and I am looking for a way to port the search features of my web application, which currently uses a regular database, to Solr indexes. My problem so far is that I have to customize the wildcard behaviour: for example, "?" should mean "0 or 1 characters" rather than any single character as it does now, "+" should mean any white space, "#" should mean any digit, and so on. Any good pointers?
Thanks!
There is no simple answer that I know of, I am afraid.
For "0 or 1 characters", you can rewrite the original query as an OR query. E.g. mp? in your database search becomes 'mp OR mp?' in Solr.
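The rewrite can be done mechanically before the query is sent to Solr. A minimal Python sketch, assuming the only special character in the term is "?" (expand_optional is a hypothetical helper, not part of any Solr client):

```python
from itertools import product


def expand_optional(term):
    """Rewrite each '?' as "absent or exactly one character" and OR the variants together."""
    parts = term.split("?")
    gaps = len(parts) - 1  # number of '?' markers in the term
    variants = []
    # For every '?', either drop it ("") or keep it as Solr's single-character wildcard.
    for choice in product(("", "?"), repeat=gaps):
        pieces = [parts[0]]
        for mark, part in zip(choice, parts[1:]):
            pieces.append(mark + part)
        variants.append("".join(pieces))
    return " OR ".join(variants)


print(expand_optional("mp?"))
```

Note that the number of variants doubles with every "?" in the term, so this is probably only practical for short patterns.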
Text fields are tokenized on white space by default, so you can look at using a whitespace tokenizer as part of your custom 'text' field. There are several examples: text_ws in the sample schema does only whitespace tokenizing. You'd want to read up on tokenizers.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
There is no digit equivalent - you can do term1* OR term2* OR term3* ... etc. You can also use function queries that support numerical functions. http://wiki.apache.org/solr/FunctionQuery
It looks like the best choice in this case is to use regular expressions in the search. More details can be found here: http://1opensourcelover.wordpress.com/2013/09/29/solr-regex-tutorial/
It's not exactly what I was looking for, as I will have to build my own Solr query on the back end, and I have a feeling that heavy use of regular expressions will add some overhead on my server. In the test I did, though, it looked pretty fast.
I will leave the question open for a while; maybe someone can come up with a better answer.
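For anyone building the query on the back end in the same way, the translation from the custom wildcards to a regular expression could look roughly like this. It is a sketch only: WILDCARDS and to_regex are hypothetical names, and the pattern is checked here with Python's re module rather than with Solr. Lucene's regex dialect is not identical to Python's (in particular, I am not sure \s is supported, and whitespace can only match on fields that are not split on white space), so treat the output as a starting point.

```python
import re

# Custom wildcard -> regex fragment ("+" means whitespace and "#" means a digit here).
WILDCARDS = {"?": ".?", "+": r"\s", "#": "[0-9]"}


def to_regex(pattern):
    """Translate the custom wildcard syntax; every other character is matched literally."""
    return "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)


rx = re.compile(to_regex("mp?#+x"))
print(rx.pattern)
print(bool(rx.fullmatch("mp3 x")))    # zero characters for "?"
print(bool(rx.fullmatch("mpa3 x")))   # one character for "?"
print(bool(rx.fullmatch("mpab3 x")))  # two characters: should not match
```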

Marklogic search with ampersand

Suppose I am searching using one of the cts:query APIs. I am looking for documents containing the phrase "John and Jane". Some of my documents have "John & Jane" (actually John &amp; Jane) in them. I want them to be returned as well. Also consider the reverse situation.
Does MarkLogic provide any options to do that?
Queries expressed as cts:query items or XML are easy to rewrite with XQuery typeswitch expressions. The discussion list thread at http://markmail.org/message/6hxmuqnpnfm73j4n has an example of something similar.
Mike gives a good suggestion, but it might be worth taking a step back and looking at your problem first. From your comment on Mike's answer I take it that you are looking for something like thesaurus expansion, but for 'and' and '&' rather than other words.
I may be wrong, but to my knowledge MarkLogic doesn't provide features that take care of something like that automatically. Functions like search:search and search:parse are powerful, but don't go that far. You are on your own: take a search string like yours, break it into parts manually and wrap them in a cts:query (or use something like search:parse for that), and then use a trick like Mike's to walk through your query tree and expand any particular query node the way you want.
The markmail thread that Mike points to gives an example of how to walk a query tree and manipulate it. That is a little heavy for this particular case, but there is a thesaurus module that can help in various general cases. The following chapter of the Search Dev Guide explains its features, and ends with a small example of how to apply it:
http://docs.marklogic.com/guide/search-dev/thesaurus#chapter
HTH!
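If the search string is assembled outside MarkLogic before being handed to a query, the and/& expansion itself is small. Here is a language-neutral sketch in Python; phrase_variants is a hypothetical helper, and it only handles the two tokens discussed in this thread.

```python
def phrase_variants(phrase):
    """Return the phrase plus, if applicable, a variant with 'and' and '&' swapped."""
    tokens = phrase.split()
    normalized = " ".join(tokens)
    swap = {"and": "&", "&": "and"}
    swapped = " ".join(swap.get(tok.lower(), tok) for tok in tokens)
    variants = [normalized]
    if swapped != normalized:
        variants.append(swapped)
    return variants


print(phrase_variants("John and Jane"))
print(phrase_variants("John & Jane"))
```

Each variant could then become its own cts:word-query, with the set combined in a cts:or-query, which is essentially the hand-rolled version of the expansion described above.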
Assume the term you want to search for is "John & Jane".
To search for it, you can use the following lines:
let $inputSearchDetails := "John &amp; Jane"
let $InputXML := xdmp:unquote($inputSearchDetails, "", ("format-xml", "repair-full"))

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely on the fact that it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing each query parameter one at a time, and the only ones that seem to make any difference are qf and q: if I remove qf, I get zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry whether 'particular phrase' or 'piecrusts' appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is listed in the stopwords file, so we can probably ignore its effect (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.
As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm ("minimum should match") param, which is more or less configured to affect rankings by how closely the keywords match the dataset. Looking through the docs, there are other ways you can use the parameter to affect rankings as well. The results produced by Apache Solr Search Integration are therefore weighted towards the AND operator, even though more results are returned as you add more keywords. The benefit of this param is that results are still returned when the user enters keywords that are too restrictive. Displaying no results is a really quick way to guide people away from your site.
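If the client really does need every word to match, one thing to try is sending mm=100% with the request. The sketch below only builds the query string in Python; build_params is a hypothetical helper, and how (or whether) the Drupal module lets you override mm is not something I have checked. mm is only honoured by the dismax-style query parsers; the qf and bf parameters in the logged request suggest one of those is already in use.

```python
from urllib.parse import urlencode, parse_qs


def build_params(keywords, mm="100%"):
    """Build a Solr query string that asks for all optional terms to match."""
    params = {
        "q": keywords,
        "mm": mm,      # minimum should match: 100% means every term is required
        "wt": "json",
        "rows": 10,
    }
    return urlencode(params)


qs = build_params("piecrusts in london")
print(qs)
print(parse_qs(qs)["mm"])
```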
How are you displaying the search?
Maybe you could use Solr Views to limit the search range?
http://drupal.org/project/apachesolr_views
thanks
Nick
