Azure Search - regex search - azure

I am trying to configure Azure Search to find some strings that have special characters, for example
ABC*DEF
When I look for a the full term using "ABC*DEF", it works perfectly.
The problem comes if I want to use a regex term:
When I use a partial term, like /(.*)ABC(.*)/, the result has no problem
When I use a partial term, like /(.*)DEF(.*)/, the result has no problem
But when I try to look for something like /(.*)C\*D(.*)/, the result is empty.
I am using a standard analyzer. I tried also the keyword analyzer but that way the regex search doesn't work at all.
Any suggestions?

You won't be able to create a regex expression that matches ABC*DEF using the standard analyzer.
If you run "ABC\*DEF" through the analyzer api using "standard" analyzer, you will see that ABC*DEF gets divided into 2 tokens at indexing time -> "ABC" and "DEF". Regex expression are not analyzed, however, they need to match a token that exist in the index.
Since ABC\*DEF does not exist in the index (only "ABC" and "DEF" exist), you won't be able to find it using the expression you are searching for.
Using the "keyword" analyzer will keep the whole field as a single token, so if the field "only" contained the expression ABC\*DEF, then the regex expression would work on it, however, if ABC\*DEF is part of a larger paragraph of text, then that's probably not what you want to use.
Your best bet is to create a custom analyzer that tokenizes your text in the way that preserves the special characters that are relevant to your use case.

If you're searching for special chars, why don't you discard normal chars?
[^\w]

Related

Returning accented as well as normal result set via azure search filters

Does anyone know how to ensure we can return normal result as well as accented result set via the azure search filter. For e.g the below filter query in Azure search returns a name called unicorn when i check for record with name unicorn.
var result= searchServiceClient.Documents.SearchAsync<myDto>("*",new SearchParameters
{
SearchFields = new List<string> {"Name"},
Filter = "Name eq 'unicorn'"
});
This is all good but what i want is i want to write a filter such that it returns record named unicorn as well as record named únicorn (please note the first accented character) provided that both record exist.
This can be achieved when searching for such name via the search query using language or Standard ASCII folding search analyzer as mentioned in this link. What i am struggling to find out is how can we implement the same with azure filters?
Please let me know if anyone has got any solutions around this.
Filters are applied on the non-analyzed representation of the data, so I don’t think there’s any way to do any type of linguistic analysis on filters. One way to work around this is to manually create a field which only do lowercasing + asciifloding (no tokenization) and then search lucene queries that look like this:
"normal search query terms" AND customFilterColumn:"filtérValuèWithÄccents"
Basically the document would both need to match the search terms in any field AND also match the filter term in the “customFilterColumn”. This may not be sufficient for your needs, but at least you understand the art of the possible.
Using filters it won't work unless you specify in advance all the possibilities:
for example:
$filter=name eq 'unicorn' or name eq 'únicorn'
You'd better work with a different analyzer that will change accents to it's root form. As another possibility, you can try fuzzy search:
search=unicorn~&highlight=Name

Required number of characters in azure search

I've made an azure search service and it's up and working. I would like for users to be able to search with 3 characters or more.
I have the following texts in different documents:
Paracet 200mg
Paracet 150mg
Kodein/paracetamol SA
When I search for 'par' I get no results. I have to type 5 characters (parac) and I get 1 & 2 as a result. I want this result for 'par' as well. Is this possible? I can't find anything in the documentation on setting the required number of characters for a search.
For the best performance, you could enable the "fast" prefix analyzer in your index, which will break down every token into a list of prefixes at indexing time. Here's some additional information on how to do that : https://azure.microsoft.com/en-us/blog/custom-analyzers-in-azure-search/
This would require you to re-index your data, so if you are creating a brand new index, this is an option.
If re-indexing is not an option, you can instead use the suffix operation '*' in your query. Here's more information on the suffix operator : https://learn.microsoft.com/en-us/rest/api/searchservice/Simple-query-syntax-in-Azure-Search?redirectedfrom=MSDN
I suspect searching using the suffix operator (or re-indexing while using fast prefix analyzer) will also work with the 3rd document you listed (Kodein/paracetamol SA). If it still does not work, it might be due to you using a tokenizer that does not split on the '/' character. The default analyzer should correctly split on '/', but if you are using a custom analyzer it's possible the whole "Kodein/paracetamol" expression get tokenized into a single term, which would explain why a search for parace* does not return the document, since the prefix of the document is "kode…".

Solr exact search with a hyphen

I am trying to search for a term in Solr in the Title that contains only the string 1604-04. But the results come back with anything that contains 1604 or 04. What would the syntax be to force solr to search on the exact string of 1604-04?
You can also use Classic Tokenizer.The Classic Tokenizer preserves the same behavior as the Standard Tokenizer with the following exceptions:-
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
This means if someone searches for 1604-04 then this Tokenizer won't break search string into two tokens.
If you want exact matches only, use a string field or a text field with a KeywordTokenizer as the tokenizer. These will keep your tokens intact as one single entry, and won't break it up into multiple tokens.
The difference is that if you use a Textfield with a KeywordTokenizer, you can still apply other filters, such as a LowercaseFilter, while a string field will store anything verbatim without any further processing possible.
Your analyzer is splitting "1604-04" into two terms, "1604" and "04". You've received answer on how to change your analysis to stop doing that.
Changing your analysis my not be the best solution (can't be entirely sure based on what you've written). Using a phrase query would be the usual way to do this. You can use a phrase query by wrapping it in quotes:
field:"1604-04"
This will still analyze and split it into two terms, but it will look for those terms in sequence. So, that query would match "1604-04" and "1604 04", but not "1604 some other stuff 04".

ElasticSearch incorrectly indexing and querying on non-alphanumeric characters

My ElasticSearch index is not correctly indexing and querying non-alphanumeric characters. Specifically, dots and dashes are causing problems.
If I index a document with the name "O.K. Corral," it should match queries for "OK Corral". Similarly, if I index "Whiskey A Go-Go," I'd like it to match "Whiskey A GoGo" and "Whiskey A Go Go".
Right now, only queries with the correct dots and dashes will return these documents.
I'm hoping the solution will also solve any potential problems with other non-alphanumeric characters, like commas and apostrophes.
It sounds like a job for ElasticSearch token filters, but I haven't been able to find one that does what I'm looking for. Also, I would like to do this within ElasticSearch -- I don't want to write custom string manipulations to normalize data before it gets to my ES index.
Thanks for your help!
You might want to have a look at the Word Delimiter Token Filter. It will at least do what you want with "Whiskey A GoGo" and "Whiskey A Go-Go,". You can check its behaviour in advance using the analyze api.

One word phrase search to avoid stemming in Solr

I have stemming enabled in my Solr instance, I had assumed that in order to perform an exact word search without disabling stemming, it would be as simple as putting the word into quotes. This however does not appear to be the case?
Is there a simple way to achieve this?
There is a simple way, if what you're referring to is the "slop" (required similarity) as part of a fuzzy search (see the Lucene Query Syntax here).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc.. If I then modify the query like so:
q=field_name:determine~1
I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". I can specify this value anywhere from 0 to 1.
Another thing you can do is index the same text without stemming in one field, and with stemming in another. Boost the non-stemmed field & that should prefer exact versions of words to stemmed versions. Of course you could also write your own query parser that directs quoted phrases to the non-stemmed field only.

Resources