Lucene/AzureSearch closest matches - search

A client is asking for a feature that I'm not really sure how to develop.
The task is this: how to find the closest matches to a query and tell which terms are missing from them. It's kind of what Google does when it doesn't find your exact query.
So the site has some services that have tags in them:
- Tag1
- Tag2
- Tag3
Then we want to run a query that returns all the services with all the tags. I can do that with something like a grouped AND. But let's say I also want to return the closest 5 services to what the user is looking for. For instance, there can be services which only match Tag2 and Tag3.
I guess I could run a grouped OR query, but how can I order the results by the number of matched terms? Is there any way I can create a custom scoring to do that? And how can I get the terms that are not in the results (Tag1 in the example)?
Thanks.

When you issue a simple query with all the tags, the documents that matched any of the tags will be returned, and the ones that matched more of the tags will be promoted higher in the search result list.
search=Tag1 Tag2 Tag3
If you want to promote the documents that matched all of the tags even higher, you can use the term boosting feature of the Lucene query language:
search=(Tag1 AND Tag2 AND Tag3)^3 OR Tag1 OR Tag2 OR Tag3&queryType=full
You can learn more about the default scoring function in Azure Search here: How full text search works in Azure Search - scoring.
To see which terms matched you can use hit highlighting.
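As a rough sketch of the two steps above, the boosted query string can be built from the tag list, and the missing terms can be worked out on the client once you know which tags matched a given result. This assumes you have already extracted the matched tags from the hit highlights yourself. The function names are hypothetical and not part of any Azure Search SDK.

```python
def build_query(tags, boost=3):
    """Build a full Lucene query: the all-tags group boosted, OR'd with each tag."""
    all_terms = " AND ".join(tags)
    any_terms = " OR ".join(tags)
    return f"({all_terms})^{boost} OR {any_terms}"


def missing_tags(requested, matched):
    """Return the requested tags that did not match, in the requested order.

    `matched` is assumed to be the list of tags you pulled out of the
    hit highlights for one result (case-insensitive comparison).
    """
    matched_lower = {m.lower() for m in matched}
    return [t for t in requested if t.lower() not in matched_lower]


requested = ["Tag1", "Tag2", "Tag3"]
print(build_query(requested))
print(missing_tags(requested, ["Tag2", "Tag3"]))
```

The query string still has to be sent with queryType=full, as in the example above.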

Related

wildcard searches on specific elements only

I am looking for a way to do wildcard search only on specific elements when executing a search:search. Specifically, I might have documents that look like the following:
<pdbe:person-envelope xmlns:pdbe="http://schemas.abbvienet.com/people-db/envelope">
<person xmlns="http://schemas.abbvienet.com/people-db/model">
<costcenter>
<code>0000601775</code>
<name>DISC-PLAT INFORM</name>
</costcenter>
<displayName>Tj Tang</displayName>
<upi>10025613</upi>
<firstName>
<preferred>TJ</preferred>
<given>Tze-John</given>
</firstName>
<lastName>
<preferred>Tang</preferred>
<given>Tang</given>
</lastName>
<title>Principal Research Scientist</title>
</person>
<pdbe:raw/>
</pdbe:person-envelope>
When searches happen, I want the search text to be automatically wildcarded, but only for certain elements like displayName, firstName, lastName, and NOT for upi or code. As I understand it, I would have certain wildcard-related indexes enabled in the database, but then I would need a custom query parser that rewrites the query into multiple cts:element-query and cts:element-value-query statements for each element that I want to wildcard search on, OR'd with the originally parsed search query. Or I can create field constraints, and rewrite the query to use field constraints.
Is there another way to conditionally search using wildcards on some elements but not others when the user is entering a simple search query? I.e. partial first and last name hits for "TJ Tan", but no partial hits when I search "100256".
You are on the right track. Let's take an element (or maybe field) query on "TS Tan".
With cts:tokenize, you can break this up (read about cts:tokenize - it is not just a normal tokenizer).
Then I have "TS" and "Tan"
You can then do things like apply business rules on which word should be wildcarded and which not, and build the appropriate cts query (probably individual word queries in an and statement - or a near query - tuning depends on your need).
Now, with the search phrase tokenized, you may also find that building your results relies not on a wildcard index but on an element word lexicon, where you do your term expansion with word-matches and those terms are then sent to the query.
We sometimes take that further and combine the query building with xdmp:estimate and make the query less restrictive if we don't get enough results early on.
Where to put this logic?
You mention search:search, so in this case, I would suggest you package this into a custom constraint.
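To illustrate only the business-rule step, here is a small sketch in plain Python. It is not MarkLogic code, and cts:tokenize does considerably more than the simple word split used here. The rule itself (never wildcard purely numeric words such as a upi or code) is an assumption made for the example, and plan_terms is a hypothetical name.

```python
import re


def plan_terms(query):
    """Split a query into words and decide, per word, whether to wildcard it.

    Returns a list of (term, wildcarded) pairs.
    """
    plan = []
    for word in re.findall(r"\w+", query):
        # Assumed rule: purely numeric words (upi, code) are matched exactly.
        wildcard = not word.isdigit()
        plan.append((word + "*" if wildcard else word, wildcard))
    return plan


print(plan_terms("TJ Tan"))
print(plan_terms("100256"))
```

Each pair would then be turned into the corresponding word query or exact value query inside the custom constraint.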

Honoring previous searches in the next search result in solr

I am using Solr for searching. I want to improve my search result quality based on previously searched terms. Suppose I have two products in my index with the names 'Jewelry Crystal' (say it belongs to Group 1) and 'Compound Crystal' (say it belongs to Group 2). Now, if we query for 'Crystal', both products will come back.
Let's say I had previously searched for 'Jewelry Ornament' and then I search for 'Crystal'. I expect only one result ('Jewelry Crystal') to come back. There is no point in showing the 'Compound Crystal' product to a person looking for jewelry-type products.
Is there any way in Solr to honour this kind of behavior, or is there any other method to achieve this?
First of all, there's nothing built into Solr to achieve this. What you need is some kind of user session, which is not supported by Solr, or client-side storage like a cookie for the preceding query.
But to promote the matching results you can use a runtime Boost Query.
Assuming you're using the edismax QueryParser, you can add the following to your Solr query:
q=Crystal&defType=edismax&bq=(Jewelry Ornament)
See http://wiki.apache.org/solr/ExtendedDisMax#bq_.28Boost_Query.29
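A minimal sketch of the client side, assuming the previous search is kept somewhere outside Solr (session, cookie) and passed in as a plain string. It only builds the request parameters, sending the earlier search as the edismax bq parameter. boosted_query is a hypothetical helper name.

```python
from urllib.parse import urlencode


def boosted_query(current, previous):
    """Build Solr query parameters that boost documents matching the previous search."""
    params = {"q": current, "defType": "edismax"}
    if previous:
        # bq only influences the score; it does not filter out other documents.
        params["bq"] = f"({previous})"
    return urlencode(params)


print(boosted_query("Crystal", "Jewelry Ornament"))
```

Note that a boost query only reorders the results, so 'Compound Crystal' would still be returned, just ranked lower.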

Lower case and Upper case in Solr keyword

I'm using Solr 3.5.0, and in the schema I have enabled the LowerCaseFilterFactory in all needed fields. When I search for "shirts", for example, I get the results, and when I search for "SHIRTS" I also get the expected results, but when I search for "shiRTs" it does not return any results. I know I'm missing something in the schema.
Please help me on this.
Thanks
Jeyaprakash.
Apply the same analysers and filters at both index and query time, so that the queries you search for match the indexed tokens.
In your case:
If you apply the lower case filter at index time but not at query time, the indexed token will be shirts. However, as the search query is not analyzed, SHIRTS or even Shirts will not match the indexed shirts token.
The same would apply if you are using stemmers, stopwords or other filters.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
Analyzers are components that pre-process input text at index time
and/or at search time. It's important to use the same or similar
analyzers that process text in a compatible manner at index and query
time. For example, if an indexing analyzer lowercases words, then the
query analyzer should do the same to enable finding the indexed words.
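The mismatch can be shown with a toy model in Python. This is not Solr code: analyze is a hypothetical stand-in for a whitespace tokenizer followed by an optional lower case filter.

```python
def analyze(text, lowercase=True):
    """Toy analyzer: whitespace tokenizer plus an optional lower case filter."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens


# Index time: the lower case filter is applied, so the stored token is "shirts".
index = set(analyze("Blue Shirts", lowercase=True))


def matches(query, lowercase_at_query_time):
    """True if every query token is found among the indexed tokens."""
    return all(t in index for t in analyze(query, lowercase=lowercase_at_query_time))


print(matches("shiRTs", lowercase_at_query_time=True))
print(matches("shiRTs", lowercase_at_query_time=False))
```

With the filter applied on both sides, any casing of the query is reduced to the same token that was indexed.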

Modify haystack query syntax?

Is it possible to modify or extend how haystack understands a query?
For example, I'm looking at integrating haystack with an OSQA-based site to get SO-style search -- a search where regular keywords search question/answer/comment text, but where syntax like "[tag]" is understood to be restricted to the question's tags field. At some point we might want to add other goodies like "user:eternicode" and "score:0", but for now keywords and tags are the must-haves.
Unfortunately, it's not as simple as regexing the tags out of the query string and using that to filter on the tags field, because we want all the complexity of AND, OR, NOT, and arbitrary grouping to apply.
Is this possible with haystack? Better yet, has anyone done it before?
It seems there is no way to customize how Haystack's auto_query works, so what we ended up doing was preparsing the search query to extract tags and other custom syntaxes, performing the auto_query with the leftovers, and then applying the custom syntaxes as extra filters on the auto_query results.
In order to do this, though, we had to simplify our requirements and drop the OR requirement, so all terms are only ANDed now -- that simplified a lot of things (for example, grouping is now unnecessary).
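A minimal sketch of the preparsing step, assuming tags are written as "[tag]" and everything is ANDed as described above. The regex and the preparse name are illustrative and not the code we actually shipped.

```python
import re

# Matches "[tag]" and captures the text between the brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def preparse(query):
    """Split a raw query into (tags, leftover keywords)."""
    tags = TAG_PATTERN.findall(query)
    # Remove the tag syntax and collapse the whitespace that is left behind.
    leftover = " ".join(TAG_PATTERN.sub(" ", query).split())
    return tags, leftover


tags, leftover = preparse("django [search] haystack [solr]")
print(tags, leftover)

# With Haystack, the pieces could then be applied roughly like this
# (the "tags" field name is an assumption about the search index):
#   results = SearchQuerySet().auto_query(leftover)
#   for tag in tags:
#       results = results.filter(tags=tag)
```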

How to find related items by tags in Lucene.NET

My indexed documents have a field containing a pipe-delimited set of ids:
a845497737704e8ab439dd410e7f1328|
0a2d7192f75148cca89b6df58fcf2e54|
204fce58c936434598f7bd7eccf11771
(ignore line breaks)
This field represents a list of tags. The list may contain 0 to n tag Ids.
When users of my site view a particular document, I want to display a list of related documents.
This list of related document must be determined by tags:
Only documents with at least one matching tag should appear in the "related documents" list.
Document with the most matching tags should appear at the top of the "related documents" list.
I was thinking of using a WildcardQuery for this but queries starting with '*' are not allowed.
Any suggestions?
Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.
Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.
In some sort of pseudo code mockup, this would then become
for each ID in tags
    get all documents referencing this tag
    for each document found
        if master-list contains document; increase usage-count
        else; add document to master list
sort master-list by usage-count descending
Forgive me for not being more precise, but I am unable to mock up a fully working example right now.
You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html
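The pseudo code mockup above can be sketched in Python as follows. The referrers_of callable is a hypothetical stand-in for the LinkDatabase lookup (it is not a Sitecore API), and a plain dictionary plays that role in the example.

```python
from collections import Counter


def related_documents(tag_ids, referrers_of):
    """Rank documents by how many of the given tags they reference.

    `referrers_of(tag_id)` must return the documents referencing that tag.
    """
    counts = Counter()
    for tag_id in tag_ids:
        for doc in referrers_of(tag_id):
            counts[doc] += 1  # first sighting adds the document, later ones bump its count
    # most_common() sorts by usage count, descending.
    return [doc for doc, _ in counts.most_common()]


# Example data standing in for the link lookup.
links = {"t1": ["a", "b"], "t2": ["b", "c"], "t3": ["b", "a"]}
print(related_documents(["t1", "t2", "t3"], lambda tag: links.get(tag, [])))
```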
Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.
You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.
Try this query on the tag field.
+(tag1 OR tag2 OR ... tagN)
where tag1, .. tagN are the tags of a document.
This query will return documents with at least one tag match. The scoring will automatically bring the documents with the highest number of matches to the top, as the final score is the sum of the individual scores.
Also, you need to realize that if you want to find documents similar to the tags of Doc1, you will find Doc1 itself at the top of the search results. So, handle this case accordingly.
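As a sketch of how the query string could be assembled, the helper below splits the pipe-delimited value and builds the OR query over a field assumed to be named "tag". It is plain Python string handling rather than Lucene.NET code, and both function names are hypothetical.

```python
def related_query(tag_field_value, field="tag"):
    """Build '+(tag:a OR tag:b ...)' from a pipe-delimited list of tag ids."""
    tags = [t.strip() for t in tag_field_value.split("|") if t.strip()]
    if not tags:
        return None  # a document without tags has no related documents
    return "+(" + " OR ".join(f"{field}:{t}" for t in tags) + ")"


def drop_self(result_ids, current_id):
    """Remove the document being viewed from its own related list."""
    return [doc_id for doc_id in result_ids if doc_id != current_id]


print(related_query("a845497737704e8ab439dd410e7f1328|0a2d7192f75148cca89b6df58fcf2e54"))
```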
