Azure Search returning inconsistent results

I have a fairly simple index where all 10 or so fields are searchable strings and my searchMode is "all".
For the sake of simplicity, let's say I issue the following search:
-(x|y|z)
And I get all documents that do not have x, y or z in them.
Let's say I issue the following search:
(i+j)
And I get all docs that contain the terms i and j.
And let's say there is a decent overlap between the docs that are returned by the two searches.
I would have thought that in "all" searchMode if I issue the following:
(i+j) -(x|y|z)
I would receive the subset of documents containing i and j that do not contain x, y or z. In other words, the results of the combined query would contain no entries outside the results of the individual query -(x|y|z).
But that's not the case.
Either I am misunderstanding the functionality or I am receiving wrong results.
Can someone help explain this to me?
Thanks

Azure Search should give consistent answers for this; if not, let us know.
In this case it was an issue with escaping "+" in URLs (see comments). Search text in the URL query string needs to be escaped (e.g. + should show up as %2B), but it's best to use a library function to escape the entire input search text instead of special-casing any particular character; there are functions for this in most environments, and they know which characters need escaping.
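For illustration, here is a minimal sketch of escaping the whole search expression before building the request URL. The service name, index name, API key, and api-version below are placeholders, not values from the question:

import urllib.parse
import requests

SERVICE = "https://my-service.search.windows.net"   # hypothetical service
INDEX = "my-index"                                   # hypothetical index
API_KEY = "<query-key>"

search_text = '(i+j) -(x|y|z)'

# Escape the entire search expression rather than special-casing "+":
# quote() turns "+" into "%2B", "(" into "%28", spaces into "%20", and so on.
escaped = urllib.parse.quote(search_text, safe="")

url = (f"{SERVICE}/indexes/{INDEX}/docs"
       f"?api-version=2020-06-30&searchMode=all&search={escaped}")
response = requests.get(url, headers={"api-key": API_KEY})
print(response.json())

Letting a library build the query string (for example, passing the search text through the params argument of requests.get) achieves the same thing, since it escapes every query-string value for you.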

Related

How can I easily get search context around a search term with Typesense?

I currently use Typesense to search in an HTML database. When I search for a term, I would like to retrieve N characters before and N characters after the term found in search.
For example, I search for "query" and this is the sentence that matches:
Let's repeat the query we made earlier with a group_by parameter
I would like an easy way to retrieve a fixed number of letters (or words) before and after the term, to show it in a presumably small area where the search results are displayed, without breaking any words.
For this particular example, I would be showing:
..repeat the query we made earlier..
Is there a feature like this in Typesense?
I have checked Typesense's documentation, without any luck.
The feature you're referring to is called snippets/highlights and it's enabled by default. You can control how many words are returned on either side of the matched text using the highlight_affix_num_tokens search parameter, documented under the table here: https://typesense.org/docs/0.23.1/api/search.html#results-parameters
highlight_affix_num_tokens
The number of tokens that should surround the highlighted text on each side. This controls the length of the snippet.
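As a rough sketch, assuming a local Typesense server, an API key of xyz, a collection named pages, and a searchable field named content (all hypothetical), the parameter is passed along with an ordinary search request:

import requests

params = {
    "q": "query",
    "query_by": "content",             # the field holding the page text
    "highlight_affix_num_tokens": 2,   # words kept on each side of the match
}
resp = requests.get(
    "http://localhost:8108/collections/pages/documents/search",
    params=params,
    headers={"X-TYPESENSE-API-KEY": "xyz"},
)

# Each hit carries a "highlights" entry whose "snippet" is the matched token
# plus the surrounding tokens, e.g. "..repeat the <mark>query</mark> we made.."
for hit in resp.json()["hits"]:
    print(hit["highlights"][0]["snippet"])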

Can someone help me understand Solr search behaviour in this case?

The query is: (Profisee)
The indexed field has the exact same token as in the above input query, but the Solr search gives zero results.
If the query is: (Profisee
then I am able to find the document in the results.
P.S.: I am also able to get the document for queries like (Pro, (Profi, (Profise, etc.
Attached screenshots show the exact query returning no results and the inexact query returning the document.
Here is my schema.xml definition for the fieldtype
First, please include the relevant details as text in your question next time: images are hard to search, make it hard to get an overview of your question, and are hard to read for those who don't have perfect vision.
For your actual question, the problem is that you have a WhitespaceTokenizer. This will only break words on whitespace, such as spaces. The indexed document contains your term as (foo), which means that only (foo) will match (since the tokenizer only breaks on whitespace, and ( or ) isn't whitespace).
foo (bar) will be indexed as two tokens, foo and (bar). Searching for (bar will match neither.
Use the StandardTokenizer to get the behaviour you want, or use a WordDelimiterGraphFilterFactory to break the word into further tokens.
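If you want to see the difference for yourself, Solr's field analysis handler shows how a value is tokenized at index and query time. A small sketch, assuming a local core named mycore and a StandardTokenizer-based field type named text_general (both placeholders):

import requests

resp = requests.get(
    "http://localhost:8983/solr/mycore/analysis/field",
    params={
        "analysis.fieldtype": "text_general",   # the field type to test
        "analysis.fieldvalue": "foo (bar)",     # what gets indexed
        "analysis.query": "(bar",               # what gets searched
        "wt": "json",
    },
)

# The response lists the token stream after each stage of the index-time and
# query-time analyzers, so you can see whether "(bar)" survives as a single
# token (WhitespaceTokenizer) or is reduced to "bar" (StandardTokenizer).
print(resp.json()["analysis"])

The Analysis screen in the Solr Admin UI shows the same information interactively.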

Solr exact search with a hyphen

I am trying to search for a term in Solr in the Title that contains only the string 1604-04. But the results come back with anything that contains 1604 or 04. What would the syntax be to force solr to search on the exact string of 1604-04?
You can also use the Classic Tokenizer. The Classic Tokenizer preserves the same behavior as the Standard Tokenizer, with the following exception:
words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
This means that if someone searches for 1604-04, this tokenizer won't break the search string into two tokens.
If you want exact matches only, use a string field or a text field with a KeywordTokenizer as the tokenizer. These will keep your tokens intact as one single entry, and won't break it up into multiple tokens.
The difference is that if you use a Textfield with a KeywordTokenizer, you can still apply other filters, such as a LowercaseFilter, while a string field will store anything verbatim without any further processing possible.
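As an example, here is a rough sketch of adding such a field type through Solr's Schema API, assuming a local core named mycore with the managed schema enabled (the field type name is made up):

import requests

field_type = {
    "add-field-type": {
        "name": "text_exact_ci",
        "class": "solr.TextField",
        "analyzer": {
            # KeywordTokenizer keeps "1604-04" as one token; the filter only
            # lowercases it, so matching stays case-insensitive.
            "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    }
}
resp = requests.post("http://localhost:8983/solr/mycore/schema", json=field_type)
print(resp.json())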
Your analyzer is splitting "1604-04" into two terms, "1604" and "04". You've received answers on how to change your analysis to stop doing that.
Changing your analysis may not be the best solution (I can't be entirely sure based on what you've written). Using a phrase query would be the usual way to do this. You can use a phrase query by wrapping the term in quotes:
field:"1604-04"
This will still analyze and split it into two terms, but it will look for those terms in sequence. So, that query would match "1604-04" and "1604 04", but not "1604 some other stuff 04".
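As a quick illustration, assuming a local core named mycore and that the field is named Title (both placeholders), the phrase query is issued like any other query:

import requests

resp = requests.get(
    "http://localhost:8983/solr/mycore/select",
    params={
        # Quoting makes this a phrase query: the analyzer may still split
        # "1604-04" into "1604" and "04", but both terms must then appear
        # adjacent and in order, so "1604 some other stuff 04" won't match.
        "q": 'Title:"1604-04"',
        "wt": "json",
    },
)
print(resp.json()["response"]["numFound"])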

FTSearch that looks for '-'

Does anybody know if it is possible to search for '-' using FTSearch?
Set col = db.FTSearch({[services] = "-"}, 0)
That request does not work; instead it says:
Notes error: Full text error; see log for more information (
[services] = "-")
Short answer is no.
The full text search treats most symbol characters as white space. The exception is if the search term itself is wrapped in quotes.
The FT search engine also uses 3-grams for searching. This means that a term of fewer than 3 characters will not return the results you expect. White space is taken into account in that search, but only in the context of the found text.
For example: "ce " would find "space " but not "space." or "space" or "spaced".
If you are looking for the field that only contains "-", then a better solution is to create a view with a column containing that field value, and/or filter by that field being that value.
It looks like you are trying to do a full text search in a view? You would probably get better response time and less server impact by using @Formula language if you are working with a view.
I try to stay away from doing full text searches on the entire database. You can use a search on a view collection for faster results. There is no restriction on how many views you can have in a db, but there is a cost for everything. There are so many little tricks that can be used to get better results. Please give us more details on what you are trying to do.

ElasticSearch incorrectly indexing and querying on non-alphanumeric characters

My ElasticSearch index is not correctly indexing and querying non-alphanumeric characters. Specifically, dots and dashes are causing problems.
If I index a document with the name "O.K. Corral," it should match queries for "OK Corral". Similarly, if I index "Whiskey A Go-Go," I'd like it to match "Whiskey A GoGo" and "Whiskey A Go Go".
Right now, only queries with the correct dots and dashes will return these documents.
I'm hoping the solution will also solve any potential problems with other non-alphanumeric characters, like commas and apostrophes.
It sounds like a job for ElasticSearch token filters, but I haven't been able to find one that does what I'm looking for. Also, I would like to do this within ElasticSearch -- I don't want to write custom string manipulations to normalize data before it gets to my ES index.
Thanks for your help!
You might want to have a look at the Word Delimiter Token Filter. It will at least do what you want with "Whiskey A GoGo" and "Whiskey A Go-Go". You can check its behaviour in advance using the analyze api.
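As a sketch of how to try this before touching any mappings, assuming a local cluster: the _analyze API accepts an inline filter definition, so you can inspect the tokens the Word Delimiter Token Filter produces. The whitespace tokenizer is used here so the hyphens and dots reach the filter intact.

import requests

body = {
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "generate_word_parts": True,   # "Go-Go" -> "Go", "Go"
            "catenate_words": True,        # "Go-Go" -> "GoGo"
            "preserve_original": True,     # keep "Go-Go" as well
        }
    ],
    "text": "Whiskey A Go-Go at the O.K. Corral",
}
resp = requests.post("http://localhost:9200/_analyze", json=body)
print([t["token"] for t in resp.json()["tokens"]])

Once the token output looks right, the same filter settings go into a custom analyzer in the index settings and onto the relevant fields in the mapping.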
