I am using Azure Search and trying to perform a search against documents:
It seems as though doing this: /indexes/blah/docs?api-version=2015-02-28&search=abc\-1003
returns the same results as this: /indexes/blah/docs?api-version=2015-02-28&search=abc-1003
Shouldn't the first one return different results than the second because of the escaping backslash? From what I understand, the backslash should allow for an exact search on the whole string "abc-1003" instead of treating the dash as a "not" operator.
(more info here: https://msdn.microsoft.com/en-us/library/azure/dn798920.aspx)
The only way I can get it to work is by doing this (note the double quotes): /indexes/blah/docs?api-version=2015-02-28&search="abc-1003"
I would rather not do that, because it would mean making the user enter the quotes, which they will not know how to do.
Am I expecting something I shouldn't or is it possibly a bug with Azure Search?
First, a dash not preceded by whitespace acts like a dash, not a negation operator.
As per the MSDN docs for the simple query syntax:
"-" only needs to be escaped if it's the first character after whitespace, not if it's in the middle of a term. For example, "wi-fi" is a single term.
Second, unless you are using a custom analyzer for your index, the dash will be treated by the analyzer almost like whitespace and will break abc-1003 into two tokens, abc and 1003.
Then, when you put it in quotes ("abc-1003"), it is treated as a search for the phrase abc 1003, thus returning what you expect.
If you want an exact match on abc-1003, consider using a filter instead. It is faster and can match things like GUIDs or text with dashes.
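For example, a filter like this should match the document exactly (a sketch, assuming the field is called name and is marked filterable in the index, both of which are assumptions on my part):
/indexes/blah/docs?api-version=2015-02-28&$filter=name eq 'abc-1003'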
The documentation says that a hyphen "-" is treated as a special character that must be escaped.
In reality a hyphen is treated as a token separator, and the words on both sides of it are searched, as Sean Saleh pointed out.
After a small investigation, I found that you do not need a custom analyzer; the built-in whitespace analyzer will do.
Here is how you can use it:
{
  "name": "example-index-name",
  "fields": [
    {
      "name": "name",
      "type": "Edm.String",
      "analyzer": "whitespace",
      ...
    },
  ],
  ...
}
You use this endpoint to update your index:
https://{service-name}.search.windows.net/indexes/{index-name}?api-version=2017-11-11&allowIndexDowntime=true
Do not forget to include the api-key in the request header.
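As a rough sketch of that call in Python (the service name, index name, key, and abbreviated field list below are placeholders, not values from your index):

import requests

service = "your-service-name"      # placeholder
index_name = "example-index-name"  # placeholder
api_key = "YOUR-ADMIN-API-KEY"     # placeholder

url = ("https://{0}.search.windows.net/indexes/{1}"
       "?api-version=2017-11-11&allowIndexDowntime=true").format(service, index_name)

index_definition = {
    "name": index_name,
    "fields": [
        {"name": "name", "type": "Edm.String", "analyzer": "whitespace"},
        # ...the rest of your fields...
    ],
}

response = requests.put(
    url,
    json=index_definition,
    headers={"api-key": api_key, "Content-Type": "application/json"},
)
print(response.status_code)  # expect 204 on update (201 if the index was newly created)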
You can also test this and other analyzers through the analyzer test endpoint:
{
  "text": "Text to analyze",
  "analyzer": "whitespace"
}
Adding to Sean's answer, a custom analyzer configuration with the keyword tokenizer and a lowercase token filter will address the issue (a sketch of such a configuration is below). It appears that you are using the default standard analyzer, which breaks words on special characters during lexical analysis at indexing time. At query time, this lexical analysis applies to regular queries, but not to wildcard search queries. As a result, with your example, you have abc and 1003 in the search index, while the wildcard search query, which wasn't tokenized the same way, looks for terms that start with abc-1003 and doesn't find any, because no term in the index starts with abc-1003. Hope this makes sense. Please let me know if you have any additional questions.
Nate
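For reference, a custom analyzer along those lines might be defined like this in the index (a sketch; the analyzer name keyword_lowercase is just a placeholder, and the field layout follows the earlier example):

{
  "name": "example-index-name",
  "fields": [
    {
      "name": "name",
      "type": "Edm.String",
      "analyzer": "keyword_lowercase",
      ...
    },
  ],
  "analyzers": [
    {
      "name": "keyword_lowercase",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "keyword_v2",
      "tokenFilters": [ "lowercase" ]
    }
  ],
  ...
}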
Related
I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads, which solved the issue. The regex:
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' & 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the use of positive lookaheads above seemed like a starting point, but I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regexes that did not work:
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex, since those letters are already required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
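A quick way to check this with Python's re module (a sketch; the word list is just a handful of candidates from the question):

import re

# Lookaheads require s, l, a, t, e somewhere; s[lt]a[lt]e fixes the overall shape.
pattern = re.compile(r"(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e")
words = ["slate", "stale", "state", "steal", "tesla"]
print([w for w in words if pattern.fullmatch(w)])  # ['slate', 'stale']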
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
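As a small Python sketch of that idea, here requiring two t's on top of the s[lt]a[lt]e shape (the word list is again just illustrative):

import re

# (?:[^t]*t){2} inside the lookahead demands (at least) two t's in the word.
pattern = re.compile(r"(?=(?:[^t]*t){2}[^t]*)s[lt]a[lt]e")
words = ["slate", "stale", "state"]
print([w for w in words if pattern.fullmatch(w)])  # ['state']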
I tried to make a matcher which could detect words like
'all-purpose'
I was trying to make a pattern like
pattern=[{'POS':'NOUN'}, {'ORTH':'-'},{'POS':'NOUN'}]
However, I realized that it only finds matches like
'all - purpose', with whitespace between tokens, instead of 'all-purpose'.
How could I make a matcher like this?
It has to be a generalized pattern like noun-noun instead of
specific words like 'Barack Obama', as in the example in the spaCy documentation.
Best,
What exactly are you trying to match? Using en_core_web_sm, "all-purpose" is three tokens and all has the ADV POS tag for me. So that might be the issue with your match pattern. If you just want hyphenated words this might be a better match:
pattern = [{'IS_ALPHA': True}, {'ORTH':'-'}, {'IS_ALPHA': True}]
More generally, you are correct that your pattern will only match three tokens, though that doesn't require whitespace; it depends on how the tokenizer works. For example, "that's" has no spaces but is two tokens.
If you are finding hyphenated words that occur as one token and want to match them, you can use regular expressions in Matcher rules. Here's an example of how that would work, from the docs:
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
In your case it could just look like this:
pattern = [{"TEXT": {"REGEX": "-"}}]
I want to find all records containing the pattern "170629-2" in Azure Search explorer, did try with
query string : customOfferId eq "170629-2*"
which only gives one result back, the exact match "170629-2", but I do not get the records which have the patterns "170629-20", "170629-21" or "170629-201".
Two things.
1 - You can't use the standard analyzer, as it will break your "words" into two parts:
e.g. 170629-20 will be broken into one entry 170629 and another entry 20.
2 - You can use a regex and specify the pattern you want:
170629-2+.*
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_regex
PS: use &queryType=full to allow regex
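For example, a POST query body along these lines might work (a sketch: in the full Lucene syntax a regular expression is written between forward slashes, customOfferId needs to be searchable, and per point 1 the field should use an analyzer that keeps the whole value as one token):

{
  "queryType": "full",
  "searchFields": "customOfferId",
  "search": "/170629-2+.*/"
}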
I've been trying to create a filter matching the end of the whole field text.
For example, taking a text field with the text: the brown fox jumped over the lazy dog
I would like it to match with a query that searches for fields with values ending with g. Something like:
{
  "search": "*",
  "queryType": "full",
  "searchMode": "any",
  ...
  "filter": "search.ismatchscoring('/g$/', 'MyField')"
}
The result is only records where MyField contains values with words composed of a single g character anywhere in the string.
Using the filter directly also produces no results:
{
  "search": "*",
  "queryType": "full",
  "searchMode": "any",
  ...
  "filter": "MyField eq '*g'"
}
As far as I can see, the tokenization will always be the base for the search and filter, which means that on the above query, $ is completely ignored and matches will be by word, not by field.
Probably I could use the keyword_v2 analyzer on this field, but then I would lose the tokenization that I use when searching normally.
One possible solution could be defining a second field in your index, with the same value as ‘MyField’, but with a different analyzer (e.g. keyword_v2). That way you may still search over the original field while filtering over the other.
Regardless, you might have simplified the filter for the sake of the example, but otherwise it seems redundant to use search.ismatchscoring() when not combining it with another filter clause via 'or'; one can use the search parameter directly.
Moreover, the regex might not be working because the default queryType for search.ismatchscoring() is simple, not full; please see the docs for details.
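For illustration, the second-field idea might look something like this in the index definition (a sketch; the field name MyFieldKeyword and the attribute choices are placeholders):

{
  "name": "example-index-name",
  "fields": [
    { "name": "MyField", "type": "Edm.String", "searchable": true },
    { "name": "MyFieldKeyword", "type": "Edm.String", "searchable": true, "analyzer": "keyword_v2" },
    ...
  ]
}

A full-Lucene query could then target that field directly; since Lucene regexes are matched against whole tokens, and keyword_v2 keeps the entire field value as one token, something along these lines is intended to behave as an ends-with match:

{
  "search": "/.*g/",
  "queryType": "full",
  "searchFields": "MyFieldKeyword",
  "searchMode": "any"
}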
We're using tag boosting on a scoring profile in Azure Search to boost results based on the number of intersecting strings.
ie.
Doc1 has { id: 1, name: "thing", stuff:["1 stuff","2 stuff","3,4,5 stuff"] }
Doc2 has { id: 2, name: "thing2", stuff:["1 stuff","2 stuff"] }
Searching with the scoring parameter as stuffParam:1 stuff,2 stuff is fine.
But it falls apart when trying to boost for stuffParam:1 stuff,3,4,5 stuff, as the comma separation in the query string breaks it.
Is there a way to escape commas, or is punctuation ignored, or is this not possible?
This was due to a bug in Azure Search that has now been fixed. Instead of the old syntax with the colon separator, you can now use a new syntax with a dash separator and use quotes to escape any tags that contain commas. For example, this:
stuffParam:1 stuff,3,4,5 stuff
Can now be written like this:
stuffParam-1 stuff,'3,4,5 stuff'
If you have tags that contain quotes, you can double them up to escape them. For example:
stuffParam-'Hello, O''Brien'
Will match the tag "Hello, O'Brien".
If you use version 1.1.2 or newer of the Azure Search .NET SDK, the ScoringParameter class now does all this for you.