I'm trying to understand the purpose of configuring a different analyzer for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything, but the search analyzer doesn't lower-case the query, wouldn't that mean you'd never get matches for queries with upper-case characters? What if the search analyzer doesn't split tokens on whitespace? Wouldn't you then fail to get a match the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between the index and search analyzers is correct. An example scenario where that's valuable is using ngrams for indexing but not for search terms. This would allow a document with "cat" to produce "c", "ca", "cat", but you wouldn't necessarily want to apply ngrams to the search term, as that would make the query less performant and isn't necessary since the documents already produced the ngrams. Hopefully that makes sense!
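To make this concrete, here is a rough sketch using the Azure.Search.Documents .NET SDK (the REST API equivalent sets indexAnalyzer and searchAnalyzer on the field definition). The names my_ngram_analyzer, my_edge_ngrams and the description field are made up for illustration, and the exact model and property names may vary slightly across SDK versions:

using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var index = new SearchIndex("demo-index");

// Custom token filter that turns "cat" into "c", "ca", "cat" at indexing time.
index.TokenFilters.Add(new EdgeNGramTokenFilter("my_edge_ngrams") { MinGram = 1, MaxGram = 20 });

// Custom analyzer: standard tokenizer + lowercasing + edge ngrams.
var ngramAnalyzer = new CustomAnalyzer("my_ngram_analyzer", LexicalTokenizerName.Standard);
ngramAnalyzer.TokenFilters.Add(TokenFilterName.Lowercase);
ngramAnalyzer.TokenFilters.Add("my_edge_ngrams");
index.Analyzers.Add(ngramAnalyzer);

// Different analyzers for indexing and for search terms: ngrams are generated
// only for document content, while the query text is tokenized normally.
index.Fields.Add(new SearchableField("description")
{
    IndexAnalyzerName = "my_ngram_analyzer",
    SearchAnalyzerName = LexicalAnalyzerName.StandardLucene
});

var client = new SearchIndexClient(new Uri("https://<service>.search.windows.net"), new AzureKeyCredential("<admin-key>"));
client.CreateOrUpdateIndex(index);

With a setup like this, a query for "ca" can still match a document containing "cat" without the query itself being expanded into ngrams.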
We have an Azure Cognitive Search index that we use for full text searches.
Currently, when the user searches for a plural word (e.g. "Buildings"), the singular form ("building") is also matched.
We want to restrict this behaviour so that only the plural matches are returned.
I've read through the OData documentation but cannot find any reference to how we could accomplish this, either through parameters in the search.ismatch in the filter or in the index config.
Plural and singular forms are likely both matching because the field is configured with the default language analyzer, which performs stemming of terms. If you're looking for an exact match, you can use the 'eq' operator in a filter. If you want a case-insensitive (but otherwise exact) match, you can try normalizers (note that this feature is in preview at the time of writing).
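For example, an exact-match filter would look like this (Title is just a placeholder field name here, and the field must be marked as filterable):

$filter=Title eq 'Buildings'

This only returns documents whose Title is exactly 'Buildings', with no stemming and no case folding (unless a normalizer is assigned to the field).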
If you need matching behaviour that is somewhat more sophisticated than a case-insensitive match, you should look into custom analyzers. They allow you to customize the behaviour of tokenization, as well as selectively use (or not use) stemming and other lexical analysis techniques.
To add to Bruce's answer:
Custom normalizers support many of the same token and character filters as custom analyzers do. In order to decide which one best fits your needs, consider the following questions:
Will this plural/singular matching behaviour be used in filtering/sorting/faceting operations? If so, pre-configured or custom normalizers will enable you to refine what results are returned by your search filter. You can build your own or choose from the list of pre-configured ones, and they support more than case insensitivity. See the list of supported char and token filters.
Will you need this plural/singular matching behaviour in full-text search, regardless of whether a filter is used? If so, consider using the custom analyzer Bruce suggested above.
As far as I know, normalizers only affect filtering/sorting/faceting results. They are also the only way to apply this kind of normalization to filter/sort/facet queries; setting an analyzer will not affect these types of queries.
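As a rough sketch of what that looks like with the Azure.Search.Documents .NET SDK, assuming an SDK version that exposes the preview normalizer support (Title is a placeholder field name):

using Azure.Search.Documents.Indexes.Models;

// Case-insensitive but otherwise exact behaviour for filter/sort/facet on this field.
// Note: this does not change how full-text search matches or scores the field.
var titleField = new SearchableField("Title")
{
    IsFilterable = true,
    IsSortable = true,
    NormalizerName = LexicalNormalizerName.Lowercase // pre-configured "lowercase" normalizer
};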
Our search service uses Azure Cognitive Search in the following way:
Search non-fuzzy (i.e. with full match of query string).
Search fuzzy (i.e. it's allowed to change 1-2 letters in the query string).
Join the results by a certain rule.
This way, we want to ensure that full-match results always appear at the top.
But now we want to introduce pagination, and doing that with two separate queries is difficult and inefficient.
An alternative would be to somehow create a single query which combines both fuzzy and non-fuzzy search, but with different scoring profiles: one with higher weights for the full-match part and another with lower weights for the fuzzy part.
Like
search=rabbit&scoringProfile=highWeightsProfile | search=rabbit~&scoringProfile=lowWeightsProfile
Is there any way to do this, either in API or in SDK?
Are there any other solutions to the problem of fuzzy search with higher scores for full matches?
Boosting individual subqueries with the Lucene query syntax worked for me as a good solution. Maybe not as flexible as separate scoring profiles for the fuzzy and non-fuzzy parts, but still good.
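In case it helps others, the shape of such a combined query in the full Lucene syntax is roughly as follows (rabbit is just an example term, and the boost value is something to tune):

search=rabbit^10 OR rabbit~1&queryType=full

The ^10 boost pushes exact matches of the term above the fuzzy matches produced by rabbit~1, so full matches end up at the top of a single, pageable result set.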
I am using Azure Search in full query mode on top of CosmosDB and I want to run a query for any documents with a field that contains the string "azy do". This should match, for example, a document containing "lazy dog".
Reading the Azure Search documentation, it looks like this is impossible due to the term-based indexes it uses.
Rejected solutions
"azy do": 0 matches, since it is looking for whole words.
/.azy do./: doesn't work, since regexes are not allowed to span multiple terms.
/.azy./ AND /.do./: this "works", to the extent that it will match "lazy dog", but it does not respect the ordering of the query and will also match "dog lazy", for example.
Is there any way of doing this correctly in Azure Search?
If you cannot achieve that via a regular expression in the Lucene query syntax, then it is not possible. You may want to vote for supporting contains here.
It should be /.lazy|dog./
So, split the terms on whitespace and add a pipe (|) delimiter, which stands for OR.
In short, Azure Search is not designed to support this scenario. You might be better off using the CONTAINS function in Cosmos DB or its equivalent, depending on what query language you use (see the example below).
Azure Search is designed for finding terms or phrases that occur in unstructured content (documents) and returning the most relevant documents. The process of extracting and indexing those searchable terms is customizable and described here: How full text search works in Azure Search.
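For the Cosmos DB route mentioned above, a substring match like this is straightforward in the SQL (Core) API; c.description is just a placeholder property name:

SELECT * FROM c WHERE CONTAINS(c.description, "azy do")

Note that CONTAINS does a plain substring comparison, so it respects the ordering and spacing of "azy do", but it typically cannot take full advantage of the index and can be expensive on large containers.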
I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but with how the indexer and searcher handle them. So I just don't want a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" while the searcher is trying to search for that particular phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using Lucene's standard analyzer and query parser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved than just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details).
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.
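Here's a minimal Lucene.NET 4.8 sketch of that setup (field names and content are illustrative); the key point is that the same analyzer is used for both the IndexWriter and the QueryParser:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

using var dir = new RAMDirectory(); // use FSDirectory.Open(...) for an on-disk index
using (var writer = new IndexWriter(dir, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
{
    var doc = new Document();
    // Index the large contents for search, but don't store the full text.
    doc.Add(new TextField("contents", "Planes, trains, and automobiles ...", Field.Store.NO));
    // Store only what you need to show in results, e.g. a file path.
    doc.Add(new StringField("path", "docs/file1.txt", Field.Store.YES));
    writer.AddDocument(doc);
    writer.Commit();
}

using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var parser = new QueryParser(LuceneVersion.LUCENE_48, "contents", analyzer);

// Default OR semantics: any of the terms matches; documents containing more of them rank higher.
var anyTerms = parser.Parse("planes trains automobiles");
// Quoted phrase: the terms must appear together in order; stop words are handled
// consistently because the same analyzer was used at index time.
var phrase = parser.Parse("\"planes and trains\"");

var hits = searcher.Search(anyTerms, 10).ScoreDocs;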
Hi, I am building a search application using Lucene. Some of my queries are complex. For example, my documents contain the fields location and population, where location is a non-analyzed field and population is a numeric field. Now I need to return all the documents that have location "san-francisco" and population between 10000 and 20000. If I combine these two fields and build a query like this:
location:san-francisco AND population:[10000 TO 20000]
I am not getting the correct results. Any suggestions on why this could be happening and what I can do?
Also, while building complex queries, some of the fields I am including are analyzed while others are not. For instance, the location field is not analyzed and contains terms like chicago, san-francisco and so on, while the summary field is analyzed and generally contains a descriptive paragraph.
Consider this query:
location:san-francisco AND summary:"great restaurants"
Now if I use a StandardAnalyzer while searching, I do not get the correct results when the location field contains a term like san-francisco or los-angeles (i.e. it cannot handle the hyphen in between), but if I use a keyword analyzer for the query, I do not get correct results either, because it cannot search for the phrase "great restaurants" in the summary field.
First, I would recommend tackling this one problem at a time. From my reading of your post, it sounds like you have multiple issues:
You're unsure why a particular query is not returning any results.
You're unsure why some fields are not being analyzed.
You're having problems with the built-in analyzers dealing with hyphens.
That's how your post reads. If that's correct, I would suggest you post each question separately. You'll get better answers if the question is precise. It's overwhelming trying to answer your question in the current format.
Now, let me take a stab in the dark at some of your problems:
For your first problem, if you're getting into really complex queries in Lucene, ask yourself whether it makes sense to be doing these queries here, rather than in a proper database. For a more generic answer, I'd try isolating the problem by removing parts of the query until you get results back. Once you find out what part of the query is causing no results, we can debug that further.
For the second problem, check the document you're adding to Lucene. Lucene provides options to store data but not index it. Make sure you've got the right option specified when adding fields to the document.
For the third problem, if the built-in analyzers don't work out for you because of how they break on hyphens, just build your own analyzer. I ran into a similar issue with the '#' symbol, and to solve it, I wrote a custom analyzer that dealt with it properly. You could do the same for hyphens.
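Just to illustrate the idea (this is not the analyzer described above, only a sketch in Lucene.NET 4.8 terms): an analyzer that tokenizes on whitespace only and lower-cases, so hyphenated terms like san-francisco stay intact:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Sketch: keep hyphenated terms whole by splitting on whitespace only.
public sealed class WhitespaceLowercaseAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
}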
You should use PerFieldAnalyzerWrapper. As the name suggests, it lets you use different analyzers for different fields. In this case, you can use KeywordAnalyzer for the city name field and StandardAnalyzer for the text.
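For example, in Lucene.NET terms (the Java API is analogous; the field name comes from the question):

using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// StandardAnalyzer for everything except "location", which keeps values like
// "san-francisco" as a single untokenized term via KeywordAnalyzer.
var perField = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(LuceneVersion.LUCENE_48),
    new Dictionary<string, Analyzer> { ["location"] = new KeywordAnalyzer() });

Pass this wrapper to both the IndexWriter and the QueryParser so that location:san-francisco is matched as a single term while summary:"great restaurants" still goes through standard tokenization.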