The trouble with searching for a single-word sentence - search

I have a text field for tags. For example some entities:
{"tags": "apple. fruits. eat."}
{"tags": "green apple."}
{"tags": "banana. apple."}
I want to select entities with the tag apple, not green apple or something apple something. Every variant I tried comes down to the same thing: it selects any sentence that contains the expression, no matter what the rest of the sentence looks like. But in this case the rest does matter.
How can I do this using Lucene syntax or Azure Search tools? Or, more generally, how can I search for an exactly matching sentence?

I presume that the "." is a delimiter between the different tags. There may be a way to express this in Lucene, but you may need to add a custom analyzer to preserve the "." characters during tokenization.
A better strategy in this case would be to use a field of type Collection(Edm.String). This will better preserve the structure of the tag phrases, and you can use a filter to select the specific value "apple". Collection(Edm.String) also lets you enable faceting on the tags, which is useful.
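A minimal sketch of that approach, assuming the field is still called tags and using helper names invented here for illustration: the period-delimited string is split into a list before indexing, and the exact tag is then selected with an OData any filter on the collection field.

```python
from typing import List


def split_tags(raw: str) -> List[str]:
    """Split a period-delimited tag string into a list of clean tags."""
    return [part.strip() for part in raw.split(".") if part.strip()]


def exact_tag_filter(tag: str, field: str = "tags") -> str:
    """Build an OData $filter matching documents whose collection contains exactly `tag`."""
    escaped = tag.replace("'", "''")  # OData escapes a single quote by doubling it
    return f"{field}/any(t: t eq '{escaped}')"


# The three example entities from the question.
docs = ["apple. fruits. eat.", "green apple.", "banana. apple."]
collections = [split_tags(d) for d in docs]

# Local illustration of what the filter selects: whole-value equality only,
# so "green apple" is not a match for "apple".
matches = [tags for tags in collections if "apple" in tags]

print(exact_tag_filter("apple"))
print(matches)
```

The filter string would be passed as the $filter parameter of the query; because filters compare whole values and bypass analysis, "green apple" is never tokenized into "apple".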

Related

Can I "Exact Search" for targeted field(s) and Search across other fields as well?

The "Exact Search" fields use their own custom analyzer, while the Search fields use a language specific custom analyzer (built on MicrosoftStemmingTokenizerLanguage.French, for example).
I can't seem to use $filter for the "Exact Search" field, because $filter considers the entire field, and doesn't use the custom analyzer of the field.
Azure Search docs indicate this about field scoped queries.
"You can specify a fieldname:searchterm construction to define a fielded query operation, where the field is a single word, and the search term is also a single word"
There is no clear way to do this in Azure. We know we can use the searchFields parameter in our Azure Search REST API calls to target specific fields, but how do we search ALL fields for one term while searching some fields for specific terms, basically doing an “AND” between them?
This is possible using the Lucene query syntax.
Construct your query like this, where "chair" is the term to search for in all fields, and field1 and field2 are fields where you want to search for specific terms:
chair AND field1:something AND field2:else
In terms of how you use this in the REST API, just embed it in your search parameter. If you're using GET it looks like this (imagine it URL-encoded):
search=chair AND field1:something AND field2:else
If you're using POST, it goes in the request body and looks like this:
{
"search": "chair AND field1:something AND field2:else",
... (other parameters)
}
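For illustration, here is one way the two request shapes could be built in Python. The service name, index name, and API version are placeholders, not values from the question. One detail not spelled out above: fielded queries are part of the full Lucene syntax, which Azure Search enables with queryType=full, so the sketch includes it.

```python
import json
from urllib.parse import urlencode

# Hypothetical values: replace with your own service, index, and API version.
SERVICE = "my-service"
INDEX = "my-index"
API_VERSION = "2020-06-30"

query = "chair AND field1:something AND field2:else"

# GET: the query travels in the URL; urlencode takes care of the escaping.
get_url = (
    f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs?"
    + urlencode({"api-version": API_VERSION, "queryType": "full", "search": query})
)

# POST: the same parameters go in the JSON body sent to .../docs/search.
post_url = (
    f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search?"
    + urlencode({"api-version": API_VERSION})
)
post_body = json.dumps({"search": query, "queryType": "full"})

print(get_url)
print(post_url)
print(post_body)
```

Both requests also need the api-key header, which is omitted here.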

Solr, managing entities

I have the following situation when using Solr. My document contains "entities" for example "peanut butter". I have a list of such entities. These are items that go together and are not to be treated as two individual words. During indexing, I want solr to realize this and treat "peanut butter" as an entity. For example if someone searches for
"peanut"
then documents that have the word peanut should rank higher than documents that have the word "peanut butter". However if someone searches for
"peanut butter"
then the document that has peanut butter should show up higher than ones that have just peanut. Is there a config setting somewhere which can be modified such that the entity list can be specified in a file and Solr would do the needful?
Configure that field to use a StrField type instead of a TextField. TextField is designed to handle tokenization and full-text search on textual content. StrField treats its contents as a single keyword, and so does not tokenize.
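To make the difference concrete, here is a toy model in Python. It is not Solr's actual analysis chain, only a rough stand-in under the assumption that a TextField lowercases and splits on whitespace while a StrField keeps the value as one unmodified token.

```python
from typing import List


def text_field_tokens(value: str) -> List[str]:
    """Rough stand-in for a tokenized TextField: lowercase, split on whitespace."""
    return value.lower().split()


def str_field_tokens(value: str) -> List[str]:
    """Stand-in for a StrField: the whole value is one unmodified token."""
    return [value]


def matches(query: str, tokens: List[str]) -> bool:
    """A term query matches only if the query equals one indexed token."""
    return query in tokens


value = "peanut butter"

# Tokenized: "peanut" alone matches the entity.
print(matches("peanut", text_field_tokens(value)))
# Keyword: only the full entity matches.
print(matches("peanut", str_field_tokens(value)))
print(matches("peanut butter", str_field_tokens(value)))
```

With the keyword-style field, a search for "peanut" no longer hits the "peanut butter" entity, which is the behaviour the question asks for.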

Does Solr have an equivalent of "strict order operator" that Sphinx has?

I'm choosing between Solr and Sphinx.
The Sphinx doc page has a section called "5.3. Extended query syntax" which describes the following search parameters (among others):
strict order operator (example: aaa << bbb << ccc) - match keywords only when they occur in the given order
NEAR, generalized proximity operator (example: hello NEAR/3 world NEAR/4 "my test") - search according to distance between words
SENTENCE/PARAGRAPH (example: "Bill Gates" PARAGRAPH "Steve Jobs") - search inside a sentence/paragraph
Does Solr have any similar functionality?
strict order operator: you would need to use SpanQueries for this. To use them from Solr, you could try SurroundQParser, or else see this other question
NEAR, generalized proximity operator: yes, this is supported, see Proximity search
SENTENCE/PARAGRAPH: not directly. You could try several approaches:
Map somehow those to documents (and maybe use Join functionality in 4.0 to link Paragraph documents to parent documents etc)
Try to insert information about paragraphs with special tokens/gaps, see this
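As a rough sketch of the first two points, the query strings could be built like this in Python. The proximity form is standard Lucene/Solr syntax. The Surround form (Nw for ordered, within N positions) is how I understand that parser's syntax, so check it against the reference guide for your Solr version; the function names are made up for this example.

```python
def proximity_query(phrase: str, distance: int) -> str:
    """Standard proximity search: the words of the phrase within `distance` positions."""
    return f'"{phrase}"~{distance}'


def ordered_within(first: str, second: str, distance: int) -> str:
    """Surround parser query: `first` before `second`, within `distance` positions."""
    # Doubled braces produce the literal {!surround} local-params prefix.
    return f"{{!surround}} {distance}w({first}, {second})"


print(proximity_query("hello world", 3))
print(ordered_within("aaa", "bbb", 3))
```

Either string would be sent as the q parameter. Unlike Sphinx's << operator, the ordered Surround form always carries a distance limit.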

Modify haystack query syntax?

Is it possible to modify or extend how haystack understands a query?
For example, I'm looking at integrating haystack with an OSQA-based site to get SO-style search -- a search where regular keywords search question/answer/comment text, but where syntax like "[tag]" is understood to be restricted to the question's tags field. At some point we might want to add other goodies like "user:eternicode" and "score:0", but for now keywords and tags are the must-haves.
Unfortunately, it's not as simple as regexing the tags out of the query string and using that to filter on the tags field, because we want all the complexity of AND, OR, NOT, and arbitrary grouping to apply.
Is this possible with haystack? Better yet, has anyone done it before?
It seems there is no way to customize how Haystack's auto_query works, so what we ended up doing was preparsing the search query to extract tags and other custom syntaxes, performing the auto_query with the leftovers, and then applying the custom syntaxes as extra filters on the auto_query results.
In order to do this, though, we had to simplify our requirements and drop the OR requirement, so all terms are only ANDed now -- that simplified a lot of things (for example, grouping is now unnecessary).
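The preparsing step described above might look something like the following. This is a sketch, not the code we actually shipped: the regex, the function names, and the tags field name are assumptions, and only the "[tag]" syntax is handled.

```python
import re

# Matches "[tag]" and captures the text between the brackets.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def split_query(raw: str):
    """Return (leftover keywords, list of tags) from a query like 'django [python] orm'."""
    tags = TAG_RE.findall(raw)
    # Remove the bracketed tags, then collapse the whitespace they leave behind.
    leftovers = " ".join(TAG_RE.sub(" ", raw).split())
    return leftovers, tags


def search(sqs, raw: str):
    """Run auto_query on the leftovers, then AND each tag on as a filter.

    `sqs` is expected to be a Haystack SearchQuerySet; `tags` is an assumed field name.
    """
    leftovers, tags = split_query(raw)
    if leftovers:
        sqs = sqs.auto_query(leftovers)
    for tag in tags:
        sqs = sqs.filter(tags=tag)
    return sqs


print(split_query("django [python] orm [web-dev]"))
```

Because every tag is applied with a chained filter, all terms end up ANDed, which matches the simplified requirements.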

One word phrase search to avoid stemming in Solr

I have stemming enabled in my Solr instance. I had assumed that, to perform an exact word search without disabling stemming, it would be as simple as putting the word in quotes. However, this does not appear to be the case.
Is there a simple way to achieve this?
There is a simple way, if what you're referring to is the "slop" (required similarity) as part of a fuzzy search (see the Lucene Query Syntax here).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc. If I then modify the query like so:
q=field_name:determine~1
I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". I can specify this value anywhere from 0 to 1.
Another thing you can do is index the same text without stemming in one field, and with stemming in another. Boost the non-stemmed field & that should prefer exact versions of words to stemmed versions. Of course you could also write your own query parser that directs quoted phrases to the non-stemmed field only.
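A sketch of the two-field idea, assuming the same text has been copied into a non-stemmed field and a stemmed field (the names text_exact and text_stemmed are hypothetical) and that the edismax parser is used to weight them:

```python
from urllib.parse import urlencode


def exact_preferred_params(user_query: str,
                           exact_field: str = "text_exact",
                           stemmed_field: str = "text_stemmed",
                           boost: float = 2.0) -> dict:
    """Build Solr params that search both fields but boost the non-stemmed one."""
    return {
        "defType": "edismax",
        "q": user_query,
        # qf lists the fields to search; ^boost raises the weight of exact matches.
        "qf": f"{exact_field}^{boost} {stemmed_field}",
    }


params = exact_preferred_params("determine")
query_string = urlencode(params)

print(params["qf"])
print(query_string)
```

Documents containing the literal word then score in both fields and rise above those that match only through stemming. The boost value is arbitrary and would need tuning.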

Resources