How to search only inside one string of a Collection in Azure Search? - azure

I have a collection field with values like:
["city of god"]
["god of war", "city of war"]
I want to perform a search on the field with 'city' AND 'god' and I want only 'city of god' to be returned.
Yet the second document is also returned, even though the terms are in two different strings within the collection.
Is there any way to restrict the search to individual strings rather than the entire collection?

Each searchable field in the index is treated as a bag of terms, so for “city AND god” you’re matching on all terms of that field in the whole document, not only the terms within sub-documents (in this case individual strings in the collection).
One way to get around this is to pick a reasonable estimate of the distance between these terms within a single string of the collection and use it in a proximity search query. For your specific example, assuming that the terms would be within 5 words of each other, the following query should work:
&queryType=full&search=fieldName:"city god"~5
Proximity search has another benefit: with a large enough proximity value, the words don’t have to appear in the order given in the phrase query. For example, “city god”~5 would also match “god bless the city”.
Make sure to include queryType=full in your query string as proximity search is part of the full query syntax and would not work otherwise. You can check some other examples here.
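If you build the query string in code, here is a minimal Python sketch of assembling it. The helper name proximity_query is hypothetical and fieldName is the same placeholder used above; only the queryType and search parameters come from the answer.

```python
from urllib.parse import urlencode, quote, parse_qs

def proximity_query(field, terms, distance):
    """Build a Lucene-style proximity clause such as fieldName:"city god"~5."""
    phrase = " ".join(terms)
    return f'{field}:"{phrase}"~{distance}'

# "fieldName" is a placeholder for your collection field.
search_clause = proximity_query("fieldName", ["city", "god"], 5)

# queryType=full is required, since proximity search is part of the full syntax.
params = {"queryType": "full", "search": search_clause}
query_string = urlencode(params, quote_via=quote)
print(query_string)

# Round-trip check: decoding the query string gives back the original clause.
assert parse_qs(query_string)["search"] == [search_clause]
```

Percent-encoding the clause matters because the quotes, colon and space are not safe in a raw URL.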

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand what is the purpose of configuring a different analyzer for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything, but the search analyzer doesn't lower-case the query, wouldn't that mean you'll never get matches for queries with upper-case characters? What if the search analyzer doesn't split tokens on whitespace? Wouldn't you fail to get a match the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between index and search analyzer is correct. An example scenario where that's valuable is using ngrams for indexing but not for search terms. So this would allow a document with "cat" to produce "c", "ca", "cat" but you wouldn't necessarily want to apply ngrams on the search term as that would make the query less performant and isn't necessary since the documents already produced the ngrams. Hopefully that makes sense!
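To make the ngram scenario concrete, here is a small Python sketch that simulates it with a toy inverted index. It is not Azure Search code; the analyzer functions and the sample documents are made up for illustration. The index analyzer emits edge ngrams, while the search analyzer only lower-cases and splits.

```python
def edge_ngrams(token, min_len=1, max_len=10):
    """Return the prefixes of a token, e.g. "cat" -> ["c", "ca", "cat"]."""
    return [token[:n] for n in range(min_len, min(len(token), max_len) + 1)]

def index_analyzer(text):
    """Index-time analysis: lower-case, split on whitespace, expand to edge ngrams."""
    return {gram for token in text.lower().split() for gram in edge_ngrams(token)}

def search_analyzer(text):
    """Query-time analysis: lower-case and split only, no ngram expansion."""
    return text.lower().split()

# Toy inverted index: term -> set of document ids.
docs = {1: "Cat", 2: "Dog"}
index = {}
for doc_id, text in docs.items():
    for term in index_analyzer(text):
        index.setdefault(term, set()).add(doc_id)

def search(query):
    """Return ids of documents containing every analyzed query term."""
    terms = search_analyzer(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

print(search("ca"))  # matches document 1 through its indexed prefix "ca"
print(search("at"))  # no match: only prefixes were indexed
```

Both analyzers still agree on lower-casing and tokenization, which is why matches are not broken; they differ only in the ngram expansion step.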

ArangoDB: Querying multiple fields at the same time for partial match

I have a database containing product information (SKU, model number, descriptions, etc.) and I'd like to have a relatively quick search function where a user can just type in a few letters or a word from any of the text fields and then get a list of products that contain that phrase in any of those fields.
The number of items in the database will probably not be more than 100,000.
What would be the easiest way to accomplish this, without creating complex queries?
It sounds like you're looking for an autocomplete. There are numerous ways to do this.
Indexing
No matter the solution you choose, you'll want to put some indices on your data. I recommend adding a skiplist to everything you're going to be searching, and an additional fulltext index on any long-form text (such as product description). String comparison uses skiplists, while only a FULLTEXT search will leverage a fulltext index.
Querying
You have some choices here.
LIKE
https://docs.arangodb.com/3.1/AQL/Functions/String.html#like
You could run your search something like:
// @searchTerm is a bind parameter; include wildcards for substring matches, e.g. "%abc%"
FOR product IN warehouse
  FILTER LIKE(product.model, @searchTerm, true) OR
         LIKE(product.sku, @searchTerm, true)
  RETURN product
Advantage: simple query syntax, multiple attributes in one search, supports substrings, can search the middle of a body of text.
Disadvantage: relatively slow.
Fulltext
This is a lot more complex for querying, but is very responsive, and is the approach my application uses for its autocomplete.
// @searchTerm is a bind parameter holding the prefix to search for
LET sku = (FOR result IN FULLTEXT("warehouse", "sku", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET model = (FOR result IN FULLTEXT("warehouse", "model", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET description = (FOR result IN FULLTEXT("warehouse", "description", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET resultsMatch = UNION(sku, model, description)
RETURN resultsMatch
Advantage: Very fast, extremely responsive, can handle very long bodies of text with ease, searches anywhere in a text body.
Disadvantage: Complex query structure as you need one variable for every attribute you're searching, a fulltext index created on each of those attributes you're searching, and a union at the end. You may need to do a union of the unioned results depending on how advanced your search needs to be. Doesn't support substring searching.
Raw string comparison
Simply create a query that filters for results to be greater than or equal to your search term, but less than your search term with the last letter incremented by 1. Example is in the link under the Foxx portion of my answer. This leverages skiplists.
Advantage: Very fast as long as the field is not tremendously long. Extremely easy to implement.
Disadvantage: Doesn't support substring searches. Only searches the first part of a string. I.e. you must know the beginning of the field you're searching.
This will work very well for quickly searching something like a model number where your users will probably know the beginning of it, but poorly for something like a description in which your users are probably searching for words somewhere in the middle of a body of text.
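The range trick is easy to see outside the database. Below is a minimal Python sketch of the same idea on a sorted list, which stands in for the skiplist; the model numbers are invented for illustration. The lower bound is the search term itself and the upper bound is the term with its last character incremented by one.

```python
import bisect

def prefix_range(sorted_keys, prefix):
    """Return all keys starting with prefix, using two binary searches."""
    if not prefix:
        return list(sorted_keys)
    # Upper bound: the prefix with its last character incremented by one.
    # (Assumes the last character is not the highest possible code point.)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]

# Hypothetical model numbers, kept sorted like an ordered index.
models = sorted(["AB-100", "AB-200", "AC-100", "B-1", "XAB-1"])

print(prefix_range(models, "AB"))  # ['AB-100', 'AB-200']
print(prefix_range(models, "C"))   # []
```

Note that "XAB-1" is not returned for "AB", which is the limitation described above: only the beginning of the field is searched.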
Foxx
Jan's little Cookbook example is a good place to start:
https://docs.arangodb.com/cookbook/UseCases/PopulatingAnAutocompleteTextbox.html
I would recommend abstracting whatever you do into a Foxx service. It is especially liberating if you need to build up AQL queries dynamically inside the database, for example when you have a huge number of fields and collections to search and need to generate a Fulltext search on the fly.
Bottom line
Experiment and see which of these works best for you. My best guess is that you will find the Fulltext solution the best if you need to search on product descriptions. If you expect your users to always search the first few letters of a field, just use the comparison with a skiplist as it is very very fast.

How to Index and Search multiple terms and phrases with Lucene

I am using Lucene.NET to index the contents of a set of documents. My index contains several fields, but I'm mainly concerned with querying the "contents" field. I'm trying to figure out the best way of indexing, as well as creating the query, to meet the requirements.
Here are the current requirements:
Able to search multiple keywords, such as "planes trains automobiles" (minus the quotes). This should give me all documents that contain ANY of the terms, but the documents that contain all three should be at the top
Able to search for phrases, such as "planes, trains, and automobiles" (with quotes) which would only match if they were together in that order.
As for stop words, I would be ok with either ignoring them altogether, or including them.
As for punctuation or special characters, same deal. I can either ignore them completely, or include them.
The last two just need to be consistent, not necessarily with each other, but with how the indexer and searcher handles them. So I just don't want to have a case where the user searches for "planes and trains" but it doesn't match a document that does contain that phrase, because the indexer took out the "and" but the searcher is trying to search for that particular phrase.
Some of the documents are large, so I think we don't want to do Field.Store.Yes, right? Unless we have to for what we need to do.
The requirements you've listed should be handled just fine by using Lucene's standard analyzer and QueryParser. Make sure to use the same analyzer in the IndexWriter and the QueryParser. Stop words are eliminated. Punctuation is generally ignored, though the rules are a bit more involved than just ignoring every punctuation character (see UAX #29, section 4, if you are interested in the details).
If you try running the Lucene demo, you should find it works just about as you've specified here.
As far as storing the field, you have it right, yes. Store the field if you need to retrieve it from the index. Large fields that you don't need to retrieve do not need to be stored.

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
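To illustrate the kind of ranking this gives you, here is a rough Python sketch that scores each list by how many query items it contains and optionally applies a minimum-match fraction, in the spirit of the mm parameter. It is a simplification: Solr's real scoring also weights terms by rarity, and the function name rank_by_overlap is made up.

```python
def rank_by_overlap(query_items, candidate_lists, min_match=0.0):
    """Rank candidates by number of shared items, dropping weak matches.

    min_match is the fraction of query items a candidate must contain,
    loosely mirroring a percentage value for mm.
    """
    query = set(query_items)
    needed = min_match * len(query)
    scored = []
    for candidate in candidate_lists:
        overlap = len(query & set(candidate))
        if overlap > 0 and overlap >= needed:
            scored.append((overlap, candidate))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]

mine = ["potato", "rice", "carrot", "corn"]
stored = [
    ["beans", "potato", "oranges", "lettuce"],
    ["carrot", "rice", "corn", "apple"],
    ["onion", "garlic", "radish", "eggs"],
]

print(rank_by_overlap(mine, stored))
print(rank_by_overlap(mine, stored, min_match=0.4))
```

With the sample lists from the question, the carrot/rice/corn list comes first, and a 40% threshold drops the list that shares only potato.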

SOLR/Lucene weighting by user-centric criteria

We are switching from SQL Fulltext Search to Lucene (SOLR stack) search in the next few months. One last wrinkle in figuring out our strategy here has to do with replicating one current part of our search platform.
First, some nomenclature to describe the problem: Our site has a bunch of documents. People might "add" those documents, they might "favorite" those documents, they might "read" those documents, etc. Let's call that union of such documents for a given user their "personal documents". Some documents are public, and some are private so that only the logged-in-user can see them.
Currently, we have a weighting function that will always show a given user's "personal" documents FIRST in the search list, for any search. This outranks the normal order (but a document must be valid in the result set -- it just ranks above any other less important document). In SQL, we are able to achieve this by having a user-defined-function that returns a score, and it varies by user.
An analogy is Facebook -- where, when you type "Joe", it will first find all the Joes that you know, followed by any other Joe that meets the criteria. My search for "Joe" will return a different ordered set than your search for Joe.
In the world of Lucene/SOLR, as I understand it, I cannot figure out how to have such user-centric weighting of documents without two separate queries that are then effectively UNIONed together (I know, it's not relational, but you get the idea). We have millions of users, and hundreds of thousands of documents. If a user is logged in, we want "their documents" to show up first in any search, then the rest of all documents. And in each case, we want the search results to show only those documents that match the original search -- we're just talking about rank-order.
Can you think of any strategies here to reproduce this user-defined-function feature?
Can you afford to have a field in each document telling this particular document belongs to Jim (e.g. user123Doc:1)? If yes, you could solve it by sorting the result set by {user123Doc, score, ...}.
Or, if you don't want to store this information in Lucene, you can store this elsewhere (e.g. in the database) and implement FieldComparator so it works with these values. More on this is available here.
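The ordering itself amounts to a two-level sort: the per-user flag first, then the relevance score. Here is a small Python sketch of that comparison logic applied to an already matched result set. It is not Lucene code; the hit dictionaries and the rank_for_user name are invented, and the set of personal document ids stands in for whatever you store in the index or the database.

```python
def rank_for_user(hits, personal_ids):
    """Order hits so the user's personal documents come first, then by score."""
    return sorted(
        hits,
        key=lambda hit: (hit["id"] in personal_ids, hit["score"]),
        reverse=True,
    )

# Hypothetical search hits that already match the query.
hits = [
    {"id": "a", "score": 2.0},
    {"id": "b", "score": 5.0},
    {"id": "c", "score": 1.0},
]
personal = {"a", "c"}  # documents this user added, favorited or read

ordered = rank_for_user(hits, personal)
print([hit["id"] for hit in ordered])  # ['a', 'c', 'b']
```

Document b has the highest score but ranks last because it is not one of the user's documents, which matches the behaviour described in the question.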

Resources