LDAP Queries using Pattern Matching - search

Is it possible with LDAP queries to filter on patterns similar to Regular Expressions? For example, to find all computer objects with names that match "ABC-nnnnnn" where "n" is a numeric digit and only those with 6-digits?

To my knowledge LDAP only supports wildcards, like:
(CN=ABC-*)
That'll grab anything that starts with ABC-. You would probably have to filter the results further using something else, like PowerShell or the programming language of your choice.
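As a rough sketch of that second filtering step, assuming the wildcard query has already returned a list of CN values (the sample names below are made up for illustration), a regular expression in Python could narrow the results to exactly six digits:

```python
import re

# "ABC-" followed by exactly six ASCII digits; fullmatch() anchors both ends.
NAME_PATTERN = re.compile(r"ABC-\d{6}", re.ASCII)

def filter_names(names):
    """Keep only the names of the form ABC-nnnnnn."""
    return [name for name in names if NAME_PATTERN.fullmatch(name)]

# Hypothetical CN values, as they might come back from (CN=ABC-*).
results = ["ABC-123456", "ABC-12345", "ABC-1234567", "ABC-12A456"]
print(filter_names(results))
```

The match here is case-sensitive. If the directory treats CN as case-insensitive, adding re.IGNORECASE to the flags would be the closer equivalent.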

There's no capability to do this aside from the wildcard suggestion.

LDAP search filters do not support the concept of pattern matching, but they do support the
concept of ordering. LDAP clients should consult the schema programmatically to determine which ordering rules
are used for attributes, and if an appropriate ordering rule is supported, a combination of
greaterOrEqual and lessOrEqual filter components in a compound filter might work. Whether or
not the results are as expected depends completely on the ordering rules.
For example:
ldapsearch -h hostname -p port \
-b basedn -s scope \
'(&(cn>=abc-000000)(cn<=abc-999999))' attribute_list
As above, whether this returns the expected results depends on the ordering rules. Consult your friendly neighborhood LDAP admin for help with ordering rules and schema.
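To see why the ordering rule matters, here is a small Python sketch that uses plain case-insensitive string comparison as a stand-in for the server's ordering rule. That stand-in is an assumption, since a real directory may order values differently. Under this ordering the range also admits names that do not have exactly six digits:

```python
def in_range(name, low="abc-000000", high="abc-999999"):
    """Emulate (&(cn>=low)(cn<=high)) with case-insensitive lexicographic order."""
    key = name.lower()
    return low <= key <= high

# Hypothetical names: only the first has the intended ABC-nnnnnn shape.
candidates = ["ABC-123456", "ABC-5", "ABC-12345678", "ABC-12x", "ABD-000000"]
for name in candidates:
    print(name, in_range(name))
```

In this model only ABD-000000 falls outside the range, so a range filter of this kind may still need a client-side check on the exact format.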
see also
LDAP: Protocol search request
LDAP: Mastering Search Filters
LDAP: Search best practices
LDAP: Programming practices

Related

Azure Cognitive Search - how to prevent searching for plural form also returning singular matches

We have an Azure Cognitive Search index that we use for full text searches.
Currently, when the user searches for a plural word (e.g. Buildings), the singular forms are also being matched (building).
We want to restrict this behaviour so that only the plural matches are returned.
I've read through the odata documentation but cannot find any reference to how we could accomplish this either through parameters in the search.ismatch in the filter or in the index config.
Plural and singular forms are likely both matching because the field is configured with the default language analyzer, which performs stemming of terms. If you're looking for an exact match, you can use the 'eq' operator in a filter. If you want a case-insensitive (but otherwise exact) match, you can try normalizers (note that this feature is in preview at the time of this writing.)
If you need matching behaviour that is somewhat more sophisticated than a case-insensitive match, you should look into custom analyzers. They allow you to customize the behaviour of tokenization, as well as selectively use (or not use) stemming and other lexical analysis techniques.
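As a rough illustration of the custom analyzer idea, the sketch below builds an index definition as a Python dictionary and serializes it to JSON. The index, field and analyzer names are hypothetical, and the property names reflect my understanding of the REST index schema, so check them against the current documentation before use. The analyzer tokenizes and lowercases but applies no stemming filter:

```python
import json

# Hypothetical index definition: names are illustrative only.
index_definition = {
    "name": "documents-no-stemming",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "content",
            "type": "Edm.String",
            "searchable": True,
            # Refers to the custom analyzer declared below.
            "analyzer": "lowercase_no_stem",
        },
    ],
    "analyzers": [
        {
            "name": "lowercase_no_stem",
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "tokenizer": "standard_v2",
            # Lowercasing only; no stemmer is listed, so "buildings"
            # and "building" stay distinct terms.
            "tokenFilters": ["lowercase"],
        }
    ],
}

payload = json.dumps(index_definition, indent=2)
print(payload)
```

Changing the analyzer of an existing field generally means rebuilding the index, so this is easiest to try on a new index.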
To add onto Bruce's answer,
Custom normalizers support many of the same token and character filters as custom analyzers do. In order to decide which one best fits your needs, consider the following questions:
Will this plural/singular matching behaviour be used in filtering/sorting/faceting operations? If so, pre-configured or custom normalizers will enable you to refine what results are returned by your search filter. You can build your own or choose from the list of pre-configured ones, and they support more than case insensitivity. See the list of supported char and token filters.
Will you need this plural/singular matching behaviour in full-text search, regardless of whether a filter is used? If so, consider using the custom analyzer Bruce suggested above.
As far as I know, normalizers only affect filtering/sorting/faceting results. They are also the only way to apply this kind of "normalization" to filter/sort/facet queries; setting an analyzer will not affect these types of queries.

Does CloudSpanner support Fuzzy Search or Wild Card Search?

I'm doing research on CloudSpanner as part of a spike for work, and comparing it to BigTable/Elastic Search. My team wanted to find out whether or not CloudSpanner supports FuzzySearch and/or WildCard queries. I could find this neither in articles nor in YouTube live demos, and I can't access a demo/free trial either. I know that CloudSpanner utilizes NewSQL, but I couldn't find anything on NewSQL supporting those either.
Cloud Spanner doesn't support FuzzySearch directly and, as far as I know, it does not support WildCard queries directly either.
The closest things it supports that may work for you are:
Regular expressions:
With the REGEXP_CONTAINS function you can query with a regular expression that matches what you want. For example, the character class [úuü] matches any of those variants of u.
LIKE operator:
The LIKE operator lets you match a section of a string. To see its documentation you can check it here.
If none of these alternatives work for you, then I would suggest doing something like this.
Yes, Cloud Spanner supports wild cards for some SQL queries (see documentation).
https://cloud.google.com/spanner/docs/query-syntax
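To make the semantics of the two features concrete, here is a small Python model of them. This is not Spanner client code: the helper names are hypothetical, Python's re module stands in for Spanner's regular expression engine, and escape sequences in LIKE patterns are not handled. It only illustrates that % matches any run of characters, _ matches a single character, and REGEXP_CONTAINS looks for a partial match:

```python
import re

def like_to_regex(pattern):
    """Translate a simple LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")       # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Model of: value LIKE pattern (the whole value must match)."""
    return like_to_regex(pattern).fullmatch(value) is not None

def regexp_contains(value, regex):
    """Model of: REGEXP_CONTAINS(value, regex) (a partial match is enough)."""
    return re.search(regex, value) is not None

print(like("ABC-123456", "ABC-%"))
print(regexp_contains("Müller", "M[úuü]ller"))
```

Neither feature is fuzzy search in the edit-distance sense; they only cover wildcard-style and pattern-style matching.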

In Azure Search, how do you run a "contains" search with multiple terms in searchtext?

I am using Azure Search in full query mode on top of CosmosDB and I want to run a query for any documents with a field that contains the string "azy do". This should match, for example, a document containing "lazy dog".
Reading the Azure Search documentation, it looks like this is impossible due to the term-based indexes it uses.
Rejected solutions
0 matches, since it looks for whole words: "azy do"
Doesn't work, since regexes are not allowed to span multiple terms: /.*azy do.*/
This "works" to the extent that it matches "lazy dog", but it does not respect the ordering of the query and will also match "dog lazy", for example: /.*azy.*/ AND /.*do.*/
Is there any way of doing this correctly in Azure Search?
If you cannot achieve that via a regular expression in the Lucene query syntax, then it is not possible. You may want to vote for supporting contains here.
It should be /.*lazy|dog.*/
So, split the terms based on whitespaces and add a pipe(|) delimiter which stands for OR.
In short, Azure Search is not designed to support this scenario. You might be better off using the CONTAINS function in Cosmos DB or its equivalent, depending on what query language you use.
Azure Search is designed for finding terms or phrases that occur in unstructured content (documents) and returning the most relevant documents. The process of extracting and indexing those searchable terms is customizable and described here: How full text search works in Azure Search.
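A toy model in Python may help show why a term-based index struggles here. The tokenizer and helper names are hypothetical and far simpler than a real analyzer; the point is only that a regex is tested against one term at a time, so it cannot span the space between "lazy" and "dog", while a plain substring test on the raw text can:

```python
import re

def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase word terms."""
    return re.findall(r"\w+", text.lower())

def term_regex_match(text, regex):
    """True if any single indexed term fully matches the regex."""
    return any(re.fullmatch(regex, term) for term in tokenize(text))

def and_of_term_regexes(text, regexes):
    """True if every regex matches some term, in any order."""
    return all(term_regex_match(text, regex) for regex in regexes)

for doc in ["lazy dog", "dog lazy"]:
    print(
        doc,
        term_regex_match(doc, r".*azy do.*"),            # cannot span terms
        and_of_term_regexes(doc, [r".*azy.*", r".*do.*"]),  # ignores order
        "azy do" in doc,                                 # substring on raw text
    )
```

In this model the single regex matches neither document, the AND of two regexes matches both, and only the substring test separates "lazy dog" from "dog lazy", which is the behaviour a CONTAINS-style function provides.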

Elasticsearch custom stemming algorithm

I am in the process of moving an application from dtSearch to elasticsearch, and wanted to keep the same features without changing the end user's process. The main one I'm having trouble with is stemming. We allow the user to specify their own stemming rules in the dtSearch format:
3+ies -> y
3+ing ->
Where the 3 is the number of preceding characters, the ies is the suffix and the y is what to replace it with. Is it possible to specify a custom algorithm to Elasticsearch (well... the Lucene engine) so that the user won't have to update their stemming rules to conform to a new search service? Or are the two methods mutually exclusive?
For a painful, extremely dirty solution, you can use regular expressions.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern_replace-tokenfilter.html
Otherwise, you'll have to create your own Elasticsearch analysis plugin (with a token-filter implementation that does what you want, in Java).
https://www.elastic.co/guide/en/elasticsearch/plugins/2.4/plugin-authors.html
It'll perform best if you can express your stemming rules as a DFA in memory. There are several Java automata libraries out there you can use (e.g. http://www.brics.dk/automaton/faq.html).
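For the regular-expression route, a sketch of how the user's rules might be translated is shown below in Python. It assumes the leading number is the minimum count of characters that must precede the suffix, which is my reading of the rule format rather than something the question confirms, and the function names are hypothetical. The resulting pattern and replacement pairs are the kind of input a pattern-replace style token filter would need:

```python
import re

def parse_rule(rule):
    """Parse a rule such as '3+ies -> y' into (compiled_pattern, replacement)."""
    left, _, right = rule.partition("->")
    min_chars, _, suffix = left.strip().partition("+")
    # Assumption: the number is the minimum count of characters before the suffix.
    pattern = re.compile(r"(.{%d,})%s" % (int(min_chars), re.escape(suffix.strip())))
    return pattern, right.strip()

def stem(token, rules):
    """Apply the first rule whose pattern matches the whole token."""
    for pattern, replacement in rules:
        match = pattern.fullmatch(token)
        if match:
            return match.group(1) + replacement
    return token

rules = [parse_rule("3+ies -> y"), parse_rule("3+ing ->")]
for word in ["parties", "running", "ties", "sing"]:
    print(word, "->", stem(word, rules))
```

Rules are tried in order and the first match wins. Whether dtSearch applies rules the same way is not stated in the question, so that ordering is also an assumption.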

How do I handle word forms in sphinx search

I have a Sphinx server indexing a MySQL database for a Django app. My search is working fine, but my content includes medical words and phrases. So, for example, I need a search for "dvt" to also match "deep venous thrombosis" and even "deep vein thrombosis". I looked through the documentation and see options for "wordforms" and "morphology". Which of these (or something else) should I use? Also, what will work in reverse, i.e., so that a search for "deep venous thrombosis" or "deep vein thrombosis" matches "dvt"?
Also, I would appreciate some advice on how to set these up since I'm new to sphinx in general.
You will need to provide your own list of word/term synonyms to be used in query expansion.
Since Sphinx does not currently support synonym expansion in queries, you'll need to massage the query based on your list of synonyms before submitting it to the search engine.
So, using your example:
User queries for: 'dvt remediation procedures'.
Server receives query and checks each term against its list of synonyms.
Server finds a match and adds 'deep vein thrombosis' to query.
Server submits newly expanded query 'dvt deep vein thrombosis remediation procedures' to search engine.
Finally, if the stemmer built into Sphinx is doing its job, you shouldn't have to support both 'venous' and 'vein' as separate terms since they both should stem to the same term. If this is not the case, you might need to do additional pre-stemming to handle words specific to your corpora (medical terms).
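The query-massaging steps described above could be sketched in Python roughly as follows. The synonym table and function name are hypothetical, and the sketch only expands single terms; mapping a multi-word phrase back to an abbreviation would need phrase detection on top of this:

```python
# Hypothetical synonym list; in practice this would come from your own data.
SYNONYMS = {
    "dvt": ["deep vein thrombosis"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Append known synonyms after each matching term of the query."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        # Look the term up case-insensitively and add any expansions.
        expanded.extend(synonyms.get(term.lower(), []))
    return " ".join(expanded)

print(expand_query("dvt remediation procedures"))
# prints: dvt deep vein thrombosis remediation procedures
```

The expanded string is what would then be submitted to the search engine in place of the user's original query.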
