Elasticsearch custom stemming algorithm

I am in the process of moving an application from dtSearch to elasticsearch, and wanted to keep the same features without changing the end user's process. The main one I'm having trouble with is stemming. We allow the user to specify their own stemming rules in the dtSearch format:
3+ies -> y
3+ing ->
Where the 3 is the minimum number of preceding characters, the ies is the suffix, and the y is what to replace it with. Is it possible to specify a custom algorithm to Elasticsearch (well... the Lucene engine) so that the user won't have to update their stemming rules to conform to a new search service? Or are the two methods mutually exclusive?

For a painful, extremely dirty solution, you can use regular expressions.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern_replace-tokenfilter.html
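For illustration, the two rules from the question could be approximated with pattern_replace token filters in the index settings. This is only a sketch (the filter and analyzer names are made up), and a long rule list quickly becomes an unmanageable regex chain:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "rule_3plus_ies": {
              "type": "pattern_replace",
              "pattern": "^(.{3,})ies$",
              "replacement": "$1y"
            },
            "rule_3plus_ing": {
              "type": "pattern_replace",
              "pattern": "^(.{3,})ing$",
              "replacement": "$1"
            }
          },
          "analyzer": {
            "user_stemmer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "rule_3plus_ies", "rule_3plus_ing"]
            }
          }
        }
      }
    }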
Otherwise, you'll have to create your own Elasticsearch analysis plugin (with a token-filter implementation, in Java, that does what you want).
https://www.elastic.co/guide/en/elasticsearch/plugins/2.4/plugin-authors.html
It'll perform best if you can express your stemming rules as a DFA in memory. There are several Java automata libraries out there you can use (e.g. http://www.brics.dk/automaton/faq.html).
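For reference, the core of such a plugin is a Lucene TokenFilter. Here is a minimal sketch of what the filter class might look like (plugin registration boilerplate omitted, and the two example rules hardcoded; a real implementation would compile the user's rule file into a lookup table or automaton):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class UserRuleStemFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public UserRuleStemFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.toString();
            // Rule "3+ies -> y": at least 3 chars must precede the "ies" suffix.
            if (term.length() >= 6 && term.endsWith("ies")) {
                termAtt.setLength(term.length() - 3).append('y');
            // Rule "3+ing -> ": at least 3 chars must precede the "ing" suffix.
            } else if (term.length() >= 6 && term.endsWith("ing")) {
                termAtt.setLength(term.length() - 3);
            }
            return true;
        }
    }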

Related

Azure Cognitive Search - how to prevent searching for plural form also returning singular matches

We have an Azure Cognitive Search index that we use for full text searches.
Currently, when the user searches for a plural word (e.g. Buildings), the singular forms are also being matched (building).
We want to restrict this behaviour so that only the plural matches are returned.
I've read through the OData documentation but cannot find any reference to how we could accomplish this, either through parameters to search.ismatch in the filter or in the index config.
Plural and singular forms are likely both matching because the field is configured with the default language analyzer, which performs stemming of terms. If you're looking for an exact match, you can use the 'eq' operator in a filter. If you want a case-insensitive (but otherwise exact) match, you can try normalizers (note that this feature is in preview at the time of this writing).
If you need matching behaviour that is somewhat more sophisticated than a case-insensitive match, you should look into custom analyzers. They allow you to customize the behaviour of tokenization, as well as selectively use (or not use) stemming and other lexical analysis techniques.
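As a sketch (the field and analyzer names here are made up), a custom analyzer that tokenizes and lowercases but applies no stemmer keeps 'building' and 'buildings' as distinct terms:

    {
      "fields": [
        {
          "name": "description",
          "type": "Edm.String",
          "searchable": true,
          "analyzer": "lowercase_no_stem"
        }
      ],
      "analyzers": [
        {
          "name": "lowercase_no_stem",
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "tokenizer": "standard_v2",
          "tokenFilters": ["lowercase"]
        }
      ]
    }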
To add onto Bruce's answer,
Custom normalizers support many of the same token and character filters as custom analyzers do. In order to decide which one best fits your needs, consider the following questions:
Will this plural/singular matching behaviour be used in filtering/sorting/faceting operations? If so, pre-configured or custom normalizers will let you refine what results your search filter returns. You can build your own or choose from the list of pre-configured ones, and they support more than case insensitivity. See the list of supported char and token filters.
Will you need this plural/singular matching behaviour in full-text search, regardless of whether a filter is used? If so, consider using the custom analyzer Bruce suggested above.
Note that, as far as I know, normalizers only affect filtering/sorting/faceting results, and they are the only way to apply this kind of normalization to filter/sort/facet queries. Setting an analyzer will not affect these types of queries.
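For the filter/sort/facet side, attaching one of the pre-configured normalizers to a field is a one-liner in the field definition (a sketch; the field name is made up):

    {
      "name": "buildingType",
      "type": "Edm.String",
      "filterable": true,
      "sortable": true,
      "normalizer": "lowercase"
    }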

Confused about main difference among query parsers in Solr

I am pretty new to Solr. When I met the concept of query parsers such as the basic Lucene one, dismax, edismax, etc., I wondered: what is the main difference among them? Is it the way they score fields? What should I pay the most attention to when I just want a simple keyword search (or a Boolean combination, possibly specifying fields)?
I think the standard parser will suffice for simple Boolean manipulation and querying by fields. If you need some special features, do look at the others just to see if they perform better for your needs.
About parsers
There are a number of differences (allowed syntax, features, error handling), but first let's state what a parser does: it takes the query text as submitted and converts it into a Lucene Query object that the Solr/Lucene combo can understand and use when searching.
The Lucene parser is the default one. It has a fairly intuitive syntax, but lacks power. It works with the Solr-recommended operators (+ and -) as well as with the not-recommended Boolean ones (AND, OR, NOT).
The Dismax parser was aimed at simple consumer apps: "more Google-like" and "less stringent". It rarely throws errors (compared to the standard one), works with + and - to avoid inner-query-building problems, and its name stems from Maximum Disjunction:
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie-breaking increment for any additional matching subqueries.
EDisMax is DisMax Extended, with more power and features added. See the links at the bottom for the full feature list, but it basically lets you use the full Lucene syntax while keeping DisMax's forgiving behaviour (a combination neither the standard parser nor DisMax alone offers).
More? Sure! There are LOTS of parsers, some by the Lucene team, some by the Solr team, some by others. See the third link below for a full list and some info, plus a quick syntax comparison after the links.
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
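To make the differences concrete, here is the same search expressed for each parser (request parameters only; field names are made up):

    # standard (lucene) parser: full field syntax, strict about malformed input
    q=title:(pizza AND dough)

    # dismax: plain user keywords, spread across the fields listed in qf
    defType=dismax&q=pizza dough&qf=title^2 ingredients

    # edismax: dismax behaviour, but full lucene syntax is also accepted
    defType=edismax&q=title:pizza AND dough&qf=title^2 ingredients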

MongoDB approximate string matching

I am trying to implement a search engine for my recipes website using MongoDB.
I want to display search suggestions to users in a type-ahead widget box.
I am also trying to support misspelled queries (Levenshtein distance).
For example: whenever a user types 'pza', the type-ahead should display 'pizza' as one of the suggestions.
How can I implement such functionality using MongoDB?
Please note, the search should be instantaneous, since the results will be fetched by the type-ahead widget. The collections I would run search queries over have at most 1 million entries.
I thought of implementing the Levenshtein distance algorithm myself, but this would hurt performance, as the collections are huge.
I read that full text search (FTS) in Mongo 2.6 is quite stable now, but my requirement is approximate matching, not FTS. FTS won't return 'pizza' for 'pza'.
Please recommend an efficient way.
I am using node js mongodb native driver.
The text search feature in MongoDB (as at 2.6) does not have any built-in features for fuzzy/partial string matching. As you've noted, the use case currently focuses on language & stemming support with basic boolean operators and word/phrase matching.
There are several possible approaches to consider for fuzzy matching depending on your requirements and how you want to qualify "efficient" (speed, storage, developer time, infrastructure required, etc):
Implement support for fuzzy/partial matching in your application logic using some of the readily available soundalike and similarity algorithms. Benefits of this approach include not having to add any extra infrastructure and being able to closely tune matching to your requirements.
For some more detailed examples, see: Efficient Techniques for Fuzzy and Partial matching in MongoDB.
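For reference, the heart of that approach is an edit-distance function like the classic dynamic-programming Levenshtein below (shown in Java as a language-neutral sketch; in practice you would narrow the candidate set by prefix or n-grams first, rather than scoring the whole collection on every keystroke):

    // Classic two-row dynamic-programming Levenshtein distance.
    // levenshtein("pza", "pizza") == 2, so 'pza' is a near match for 'pizza'.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }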
Integrate with an external search tool that provides more advanced search features. This adds some complexity to your deployment and is likely overkill just for typeahead, but you may find other search features you would like to incorporate elsewhere in your application (e.g. "like this", word proximity, faceted search, ..).
For example see: How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search. Note: ElasticSearch's fuzzy query is based on Levenshtein distance.
Use an autocomplete library like Twitter's open source typeahead.js, which includes a suggestion engine and query/caching API. Typeahead is actually complementary to any of the other backend approaches, and its (optional) suggestion engine Bloodhound supports prefetching as well as caching data in local storage.
The best fit for this would be using the Elasticsearch fuzzy query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
It supports the Levenshtein distance algorithm out of the box and has additional features which can be useful for your requirements, e.g.:
- more like this
- powerful facets / aggregations
- autocomplete
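A minimal example against a hypothetical recipes index ('pza' is within edit distance 2 of 'pizza', so it matches):

    GET /recipes/_search
    {
      "query": {
        "fuzzy": {
          "name": {
            "value": "pza",
            "fuzziness": 2
          }
        }
      }
    }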

Optimal indexing strategy for a multilingual requirement using Solr

We use IBM WCS v7 for one of our e-commerce based requirements, in which Apache Solr is embedded for the search implementation.
As per a new requirement, there will be multiple language support for a website, e.g. the France version of the site can have support for English, French, etc. (en_FR, fr_FR and so on). In order to configure Solr with this interface, what should be the optimal indexing strategy using a single Solr core?
I have some ideas: 1) using multiple fields in schema.xml for multiple languages, 2) using different Solr cores for different languages.
But these approaches don't seem to fit the current requirement best, as there will be support for 18 languages on the e-commerce website. Using different fields for every language would be very complicated, and using different Solr cores is also not a great approach, since any configuration change would have to be applied to every core.
Are there any other approaches, or is there any way I can associate the localeId with the indexed data and process the search results with respect to the detected language?
Any help on this topic will be highly appreciated.
This post has already been answered by the original poster and others; just summarizing that as an answer:
The recommended solution is to create one index core per locale/language. This is especially important if either the catalog or the content (such as product name, description, keywords) will differ and the business prefers to manage it separately for each locale. It has the added benefit that Solr can apply stemming and tokenization specific to that locale, where applicable.
I have been part of solutions where this approach was preferred over maintaining multiple fields or documents in the same core for each locale/language. The largest number of index cores I have worked with is six.
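As an illustration of the per-locale analysis benefit, each core's schema.xml can declare an analysis chain for its own language; a sketch for a French core (type, field, and file names are illustrative):

    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ElisionFilterFactory" articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="name" type="text_fr" indexed="true" stored="true"/>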
One must also remember that adding an index core will require updates to the supporting processes: Product Information Management system updates, catalog load, workspace management, stage propagation, reindexing, and cache invalidation.

Does a B Tree work well for auto suggest/auto complete web forms?

Auto suggest/complete fields are used all over the web. Google appears to have mastered it: as soon as one types a search query, suggestions are returned almost instantaneously.
I'm assuming the framework for achieving this involves a fast, in-memory data store on the web tier. We're building a Grails app based around retail products, so a user may search for Can, which should suggest things like Canon, Cancun, etc., and we're wondering if a Java B-tree cached in memory would suffice for quick autocompletes returned as JSON over AJAX. Outside of the jQuery AutoComplete field, do any frameworks and/or libraries exist to facilitate the development of this solution?
Autocomplete is a text-matching, information-retrieval problem. Implementing your own B-tree and writing your own logic to match words to other words is something you could do. But then you would have to implement Porter stemming, a vector space model, and a string-edit-distance calculation.
...or you could use Lucene and its derivatives, which do a lot of this stuff already. If you really care about the data structures used to store this stuff, you could dive into its source. But I highly doubt writing your own and doing it all yourself would be more maintainable and efficient in the long run.
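To make that trade-off concrete: an ordered in-memory structure does handle the plain prefix case well. A sketch using java.util.TreeSet (a red-black tree, but a B-tree supports the same ordered range scan) shows how 'can' finds 'cancun' and 'canon'; what it cannot do is stemming or fuzzy matching, which is where Lucene earns its keep:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    class PrefixSuggester {
        private final NavigableSet<String> terms = new TreeSet<>();

        void add(String term) {
            terms.add(term.toLowerCase());
        }

        // Ordered structures answer prefix queries as a range scan: start at
        // the first term >= prefix, stop once the prefix no longer matches.
        List<String> suggest(String prefix, int limit) {
            String p = prefix.toLowerCase();
            List<String> out = new ArrayList<>();
            for (String t : terms.tailSet(p, true)) {
                if (!t.startsWith(p) || out.size() >= limit) break;
                out.add(t);
            }
            return out;
        }
    }
    // Usage: add("Canon"); add("Cancun"); suggest("can", 10) -> [cancun, canon]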
One of the more popular Grails ecosystem plugins for this is Searchable, which was mentioned in Ledbrook & Smith's Grails in Action. It uses Lucene under the covers and makes it pretty easy to add full-text search to your domain classes. (For example, check out chapter 8 in GinA or the Searchable docs.)
The Grails RichUI plugin has an autocomplete that I've used in the past. We had it hooked up to hit the database on every keystroke (which I would not suggest, but our data changed often enough that real-time results were required). If your list of things is fairly static, though, it could probably work well for you.
http://grails.org/plugin/richui#AutoComplete
