I have looked for a long time for a way to escape special characters like #, {, }, [, ], ... in a wildcard search in Lucene.NET 3.0.3.0, but I can't find any possible solution.
I have indexed my documents using StandardAnalyzer. The field "title" has the attributes Field.Store.YES and Field.Index.ANALYZED.
While searching, I call MultiFieldQueryParser.Escape on my search term. The escaped query looks right, but parsing the term removes the escape characters, so my search cannot find any results.
Search term: Klammer[affe]
Escaped search term: *Klammer\\[affe\\]*
After parsing: title:*Klammer[affe]*
So, how can I escape special characters in a wildcard search?
You could also use the Lucene implementation QueryParser.Escape(searchQuery).
From the Lucene documentation:
Escaping Special Characters
Lucene supports escaping special characters that are part of the query
syntax. The current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these characters, use the \ before the character. For example,
to search for (1+1):2 use the query:
\(1\+1\)\:2
So your query should be *Klammer\[affe\]*
But the standard analyzer deletes those characters, so you need to index the original content differently.
See this related question's answer: https://stackoverflow.com/a/17628127/956658. Another question with some info on changing the analysis method: How to perform a lucene query containing special character using QueryParser?
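For illustration, the escape step itself is easy to sketch. This is a Python sketch of the rule quoted above (backslash-prefix each special character), not the Lucene.NET implementation:

```python
# Sketch of Lucene's QueryParser.Escape rule: prefix each query-syntax
# special character with a backslash. Illustrative only, not library code.
LUCENE_SPECIALS = set('+-!(){}[]^"~*?:\\/') | {'&', '|'}

def lucene_escape(term: str) -> str:
    """Backslash-escape every Lucene query-syntax character in term."""
    return ''.join('\\' + ch if ch in LUCENE_SPECIALS else ch for ch in term)

print(lucene_escape('Klammer[affe]'))  # Klammer\[affe\]
print(lucene_escape('(1+1):2'))        # \(1\+1\)\:2
```

Note that escaping alone does not fix the original problem, since the analyzer still strips those characters at indexing time as described above.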
Related
I have an html text. With my regex:
r'(http[\S]?://[\S]+/favicon\.ico[\S^,]+)"'
and with re.findall(), I get this result from it:
['https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196', 'https://stackoverflow.com/favicon.ico,https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196']
But I don't want this second result in the list. I understand that it has a comma inside, but I have no idea how to exclude the comma from my regex. I use re.findall() in order to find the necessary link in any place in the HTML text, because I don't know where it could be.
Note that [\S]+ contains a redundant character class; it is the same as \S+. In http[\S]?://, the [\S]? is most likely a human error, as [\S]? matches any optional non-whitespace char. I doubt you meant to match an http§:// protocol. Just use s to match s, or s? if the character is optional, as in https?://.
You can use
https?://[^\s",]*/favicon\.ico[^",]+
See the regex demo.
Details:
https?:// - http:// or https://
[^\s",]* - zero or more chars other than whitespace, " and , chars
/favicon\.ico - a fixed /favicon.ico string
[^",]+ - one or more chars other than " and , chars.
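A quick Python check of this pattern against a string modeled on the question's data (the markup around the URLs is invented for the demo):

```python
import re

pattern = r'https?://[^\s",]*/favicon\.ico[^",]+'

# Sample text modeled on the question; the attributes are made up.
html = ('href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico'
        '?v=ec617d715196" data="https://stackoverflow.com/favicon.ico,'
        'https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico'
        '?v=ec617d715196"')

links = re.findall(pattern, html)
# No match can span a comma, so the comma-joined blob from the question
# can no longer appear in the results:
print(links)
```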
I am using Azure Search and trying to perform a search against documents:
It seems as though doing this: /indexes/blah/docs?api-version=2015-02-28&search=abc\-1003
returns the same results as this: /indexes/blah/docs?api-version=2015-02-28&search=abc-1003
Shouldn't the first one return different results than the second, due to the escaping backslash? From what I understand, the backslash should allow for an exact search on the whole string "abc-1003" instead of treating the dash as a "not" operator.
(more info here: https://msdn.microsoft.com/en-us/library/azure/dn798920.aspx)
The only way I can get it to work is by doing this (note the double quotes): /indexes/blah/docs?api-version=2015-02-28&search="abc-1003"
I would rather not do that, because that would mean making the user enter the quotes, which they will not know how to do.
Am I expecting something I shouldn't or is it possibly a bug with Azure Search?
First, a dash not preceded by whitespace acts like a dash, not a negation operator.
As per the MSDN docs for the simple query syntax:
- Only needs to be escaped if it's the first character after whitespace, not if it's in the middle of a term. For example, "wi-fi" is a single term
Second, unless you are using a custom analyzer for your index, the dash will be treated by the analyzer almost like whitespace and will break abc-1003 into two tokens, abc and 1003.
Then, when you put it in quotes, "abc-1003" is treated as a search for the phrase abc 1003, thus returning what you expect.
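Roughly, what happens to the term at indexing time can be sketched like this (a deliberate simplification in Python, not Azure's actual analyzer):

```python
import re

def standard_like_tokens(text):
    """Very rough stand-in for a standard analyzer: split on
    non-alphanumeric characters and lowercase the pieces."""
    return [t.lower() for t in re.split(r'[^0-9A-Za-z]+', text) if t]

print(standard_like_tokens('abc-1003'))  # ['abc', '1003']
print(standard_like_tokens('Wi-Fi'))     # ['wi', 'fi']
```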
If you want an exact match on abc-1003, consider using a filter instead. It is faster and can match things like GUIDs or text with dashes.
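A sketch of what such a filter request could look like, using the OData $filter parameter; the field name productCode is hypothetical, so adjust it to your schema:

```python
from urllib.parse import urlencode

# Build an Azure Search request URL that uses an OData $filter for an
# exact match instead of a full-text search. 'productCode' is a made-up
# field name for illustration.
params = {
    'api-version': '2015-02-28',
    '$filter': "productCode eq 'abc-1003'",
}
url = '/indexes/blah/docs?' + urlencode(params)
print(url)
```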
The documentation says that a hyphen "-" is treated as a special character that must be escaped.
In reality, a hyphen is treated as a token split, and the words on both sides are searched, as Sean Saleh pointed out.
After a small investigation, I found that you do not need a custom analyzer; the built-in whitespace analyzer will do.
Here is how you can use it:
{
  "name": "example-index-name",
  "fields": [
    {
      "name": "name",
      "type": "Edm.String",
      "analyzer": "whitespace",
      ...
    },
  ],
  ...
}
You use this endpoint to update your index:
https://{service-name}.search.windows.net/indexes/{index-name}?api-version=2017-11-11&allowIndexDowntime=true
Do not forget to include the api-key in the request header.
You can also test this and other analyzers through the analyzer test endpoint:
{
  "text": "Text to analyze",
  "analyzer": "whitespace"
}
Adding to Sean's answer: a custom analyzer configuration with a keyword tokenizer and a lowercase token filter will address the issue. It appears that you are using the default standard analyzer, which breaks words on special characters during lexical analysis at indexing time. At query time, this lexical analysis applies to regular queries, but not to wildcard search queries. As a result, with your example, you have <abc> and <1003> in the search index, and the wildcard search query, which wasn't tokenized the same way and looks for terms that start with abc-1003, doesn't find them, because neither term in the index starts with abc-1003. Hope this makes sense. Please let me know if you have any additional questions.
Nate
We're using tag boosting on a scoring profile in Azure Search to boost results based on the number of intersecting strings.
ie.
Doc1 has { id: 1, name: "thing", stuff:["1 stuff","2 stuff","3,4,5 stuff"] }
Doc2 has { id: 2, name: "thing2", stuff:["1 stuff","2 stuff"] }
Searching with the scoring parameter as stuffParam:1 stuff,2 stuff is fine.
But it falls apart when trying to boost for stuffParam:1 stuff,3,4,5 stuff, as the comma separation in the query string breaks it.
Is there a way to escape commas, or is punctuation ignored, or is this not possible?
This was due to a bug in Azure Search that has now been fixed. Instead of the old syntax with the colon separator, you can now use a new syntax with a dash separator and use quotes to escape any tags that contain commas. For example, this:
stuffParam:1 stuff,3,4,5 stuff
Can now be written like this:
stuffParam-1 stuff,'3,4,5 stuff'
If you have tags that contain quotes, you can double them up to escape them. For example:
stuffParam-'Hello, O''Brien'
Will match the tag "Hello, O'Brien".
If you use version 1.1.2 or newer of the Azure Search .NET SDK, the ScoringParameter class now does all this for you.
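The quoting rules above are simple enough to sketch; this mirrors what the SDK's ScoringParameter does for you (an illustrative re-implementation, not the SDK source):

```python
def escape_tag(tag):
    """Quote a tag if it contains a comma or quote, doubling any
    embedded single quotes."""
    if ',' in tag or "'" in tag:
        return "'" + tag.replace("'", "''") + "'"
    return tag

def scoring_parameter(name, tags):
    """Build a scoring parameter string in the dash-separated syntax."""
    return name + '-' + ','.join(escape_tag(t) for t in tags)

print(scoring_parameter('stuffParam', ['1 stuff', '3,4,5 stuff']))
# stuffParam-1 stuff,'3,4,5 stuff'
```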
A username for a website can contain the space character, and yet it cannot be composed only of space characters. It can contain some symbols (like underscore and dash), but starting with certain symbols would look weird. Non-latin letters should be allowed, preferably for all languages, but tab and newline characters shouldn't. And definitely no Zalgo.
The rules composing what should and shouldn't be allowed in a reasonable naming system are complicated, however they are virtually the same for every website. Reimplementing them is probably a bad idea. Where can I find an implementation? I'm using PHP.
You should validate the username entered by the new user against a regular expression that matches it against the allowed character set.
Example: The following allows only English alphanumeric characters, dashes, underscores, and dots.
function isNewUsernameValid($name, $filter = "[^a-zA-Z0-9\-\_\.]") {
    return preg_match("~" . $filter . "~iU", $name) ? false : true;
}

if (!isNewUsernameValid($name)) {
    print "Not a valid name.";
}
For your particular case, you'll have to come up with and test the regular expression.
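For comparison, here is the same check in Python, an illustrative translation of the PHP above with the same allowed set:

```python
import re

# Reject any name containing a character outside the allowed set
# (English letters, digits, dash, underscore, dot) -- mirrors the PHP
# $filter above. Length and emptiness checks are left to the caller.
DISALLOWED = re.compile(r'[^a-zA-Z0-9\-_.]')

def is_new_username_valid(name):
    return not DISALLOWED.search(name)

print(is_new_username_valid('john_doe-99'))  # True
print(is_new_username_valid('john doe'))     # False: space not in the set
```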
I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to replace, but I wondered: is there also an elegant way to return just the text between the slashes? I know I could just trim it, but I wondered if there was a pattern.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same texts. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier, it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/, since the two delimiters are the same character.
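For readers more at home with regexes, the same three captures can be written in Python terms (Lua patterns are not regexes, but the greediness trade-off is analogous; Lua's .- corresponds to the lazy .*?):

```python
import re

s = 'hello /world today/'
# With a single pair of slashes, all three behave the same:
print(re.search(r'/([^/]*)/', s).group(1))  # world today
print(re.search(r'/(.*?)/', s).group(1))    # world today (lazy)
print(re.search(r'/(.*)/', s).group(1))     # world today (greedy)

# The difference only shows up with several slash pairs:
s2 = 'a /b/ c /d/'
print(re.search(r'/(.*?)/', s2).group(1))   # b
print(re.search(r'/(.*)/', s2).group(1))    # b/ c /d
```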
Edit: Added a pattern suggested by lhf, and more explanations.