Azure Search - return only certain number of words from a field - azure

We have a database that stores news stories from many websites. The entire text of each article is stored in one field of unstructured data as nvarchar(max). Our clients query on a person's name to see whether it appears anywhere in any of our articles. To stay compliant with requirements in our industry, and to avoid infringing any copyrights, we're only allowed to return the 25 words that surround the queried name. Along with that, we give the client the URL of the article and they can take it from there.
Is this something that can be accomplished in Azure Search? Specifically, can it display only a subset of words from the field being queried?

You can use the hit highlighting feature in Azure Search. See the highlight= parameter in https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents.
Azure Search returns up to five text fragments per field, usually sentences, that contain the search hits. It currently does not let you configure the size of the window/fragment, so you can't get a specific number of words surrounding the search terms.
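Since the fragment size can't be configured, one possible workaround (not something the service does for you) is to trim the text yourself after the search returns. Below is a minimal Python sketch of that idea. The function name snippet_around and the interpretation of "25 words" as 25 words in total, including the name, are assumptions made for illustration.

```python
import re

def snippet_around(text, name, total_words=25):
    """Return up to `total_words` words around the first occurrence of `name`.

    Assumption: the limit counts the words of the name itself.
    Returns None when the name does not occur in the text.
    """
    words = text.split()
    name_words = name.lower().split()
    n = len(name_words)
    if n == 0:
        return None
    # Strip leading/trailing punctuation so "Doe," still matches "doe".
    normalized = [re.sub(r"^\W+|\W+$", "", w).lower() for w in words]
    for i in range(len(words) - n + 1):
        if normalized[i:i + n] == name_words:
            before = max(total_words - n, 0) // 2
            start = max(i - before, 0)
            end = min(start + total_words, len(words))
            # Re-extend to the left when the hit is close to the end of the text.
            start = max(end - total_words, 0)
            return " ".join(words[start:end])
    return None

# Hypothetical article text: 100 filler words with the name in the middle.
filler = [f"w{i}" for i in range(100)]
article = " ".join(filler[:50] + ["Jane", "Doe,"] + filler[50:])
snippet = snippet_around(article, "Jane Doe")
```

The snippet would then be returned to the client together with the article URL. Whether this runs in your application layer or in a small service in front of the search index is a design choice the original answer does not cover.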

Related

Why does Azure Search give higher score to less relevant document?

I have two documents indexed in Azure Search (among many others):
Document A contains only one instance of "BRIG" in the whole document.
Document B contains 40 instances of "BRIG".
When I do a simple search for "BRIG" in the Azure Search Explorer via Azure Portal, I see Document A returned first with "@search.score": 7.93229 and Document B returned second with "@search.score": 4.6097126.
There is a scoring profile on the index that adds a boost of 10 for the "title" field and a boost of 5 for the "summary" field, but this doesn't affect these results as neither document has "BRIG" in either of those fields.
There's also a "freshness" scoring function with a boost of 15 over 365 days with a quadratic function profile. Again, this shouldn't apply to either of these documents as both were created over a year ago.
I can't figure out why Document A is scoring higher than Document B.
It's possible that Document A is newer than Document B, which would explain why it is displayed first (with a higher score). Besides term relevance, freshness can also affect the score.
EDIT:
After some research, it looks like newly created Azure Cognitive Search services use the BM25 algorithm by default. (source: https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring#scoring-algorithms-in-search)
Document length and field length also play a role in the BM25 algorithm. Longer documents and fields are given less weight in the relevance score calculation. Therefore, a document that contains a single instance of the search term in a shorter field may receive a higher relevance score than a document that contains the search term multiple times in a longer field.
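To make the length effect concrete, here is a small Python sketch of the textbook BM25 formula for a single query term. It is not Azure's actual implementation, and every number below (document lengths, corpus size, document frequency) is made up for illustration. k1=1.2 and b=0.75 are the commonly cited default values.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Hypothetical corpus: 10,000 documents, 20 of which contain the term.
# Document A: short, one occurrence. Document B: very long, 40 occurrences.
score_a = bm25_term_score(tf=1, doc_len=50, avg_doc_len=1000, n_docs=10000, doc_freq=20)
score_b = bm25_term_score(tf=40, doc_len=200000, avg_doc_len=1000, n_docs=10000, doc_freq=20)

# With length normalization switched off (b=0), term frequency wins again.
score_b_no_norm = bm25_term_score(tf=40, doc_len=200000, avg_doc_len=1000,
                                  n_docs=10000, doc_freq=20, b=0.0)
score_a_no_norm = bm25_term_score(tf=1, doc_len=50, avg_doc_len=1000,
                                  n_docs=10000, doc_freq=20, b=0.0)
```

With these made-up numbers, the short document with one hit outscores the long document with 40 hits, and the order flips once b is set to 0. That is the same kind of result the question describes, although the real cause there can only be confirmed against the actual index.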
Test your scoring profile configurations. Perhaps try issuing queries without scoring profiles first and see if that meets your needs.
The "searchMode" parameter controls precision and recall. If you want more recall, use the default "any" value, which returns a result if any part of the query string is matched. If you favor precision, where all parts of the string must be matched, change searchMode to "all". Try the above query both ways to see how searchMode changes the outcome. See Simple Query Examples.
If you are using the BM25 algorithm, you also may want to tune your k1 and b values. See Set BM25 Parameters.
Lastly, you may want to explore the new Semantic search preview feature for enhanced relevance.

Information search in user defined (restricted or limited) set of search space (feasible region)

Is there any particular term/phrase for the following:
Searching information where the search space (feasible region) is restricted or limited to a particular set of user-defined documents only.
Example: I am searching for an answer from a set of 100 documents only, in a given space of 10000 documents. These 100 documents are user defined. The search engine cannot refer to any other documents.
Please let me know. Thanks.
I am NOT looking for proximity search. A good reference is here. If I can reframe the question, let's say that I want to restrict my search to "STOP SEARCH AFTER number DOCUMENT | DOCUMENTS". Does this kind of search have a particular term/phrase?

Azure Search Suggestions don't catch missing prefix

When sending a phrase to the Azure Search service using the Suggest method, the results are only phrases that start with the search term, even when using "FuzzyMatching".
For example, "ap" will return "apple" and "april" but not "rap".
Is it possible to get any phrase that contains the search term?
You are correct that Azure Search does not support this type of contains (or wildcard) search for suggestions. However, one thing that we will be releasing (hopefully towards the end of next week) is something called custom analyzers. Custom analyzers allow you to do not only this, but other types of analysis on your data. For example, you can create a field and tell us that it should allow for prefix or suffix matching. You can also choose to do regex style queries against your field.
I do want to add a bit of a warning, though. If you set your field to allow for prefix or suffix search, we can get results quite quickly: because we know that you want us to look at either the start or the end of the word, we can build our inverted index to handle this very efficiently. However, a generic contains (or even regex) search is more of a brute-force search, and if you have significant content, this could have an impact on the latency of your queries.
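To illustrate the trade-off described above, here is a toy Python sketch. It is not how Azure Search is implemented internally; it only shows why prefix and suffix matches can be answered with a single lookup in a precomputed index, while a generic contains match has to inspect every term. All names are hypothetical.

```python
from collections import defaultdict

def build_prefix_index(terms):
    """Map every prefix of every term to the set of terms that start with it."""
    index = defaultdict(set)
    for term in terms:
        for end in range(1, len(term) + 1):
            index[term[:end]].add(term)
    return index

def contains_scan(terms, fragment):
    """Brute-force 'contains' match: every term has to be inspected."""
    return sorted(t for t in terms if fragment in t)

terms = ["apple", "april", "rap"]

# Prefix search: index the terms as they are.
prefix_index = build_prefix_index(terms)
# Suffix search: index the reversed terms and reverse the query as well.
suffix_index = build_prefix_index([t[::-1] for t in terms])

prefix_hits = sorted(prefix_index["ap"])                          # one lookup
suffix_hits = sorted(t[::-1] for t in suffix_index["ap"[::-1]])   # one lookup
contains_hits = contains_scan(terms, "ap")                        # full scan
```

Here the prefix lookup finds "apple" and "april", the suffix lookup finds "rap", and only the full scan finds all three. The index costs extra storage up front, which is the price paid for the fast lookups.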
Hopefully that will help you do what you need here and if you want to keep an eye out for this, we will be posting content on this at our documentation page: https://azure.microsoft.com/en-us/documentation/services/search/
Liam

Improving type-ahead suggestions in Search using SOLR

What are the possible ways you could improve the type-ahead (auto-complete) suggestions that appear in a free-form search?
From my understanding, all the suggestions that appear for keywords are stored in a SOLR table.
How do you ensure that it covers all the relevant industry-specific type-ahead suggestions?
Can you automate the inclusion of recent user-generated queries that currently return no search results, so that they lead to relevant ones?
In preprocessing, the documents fed into the search engine need to be enriched with whatever is sensible and helps to find them. E.g. a document containing the string paris may be enriched by french capital, capital of france, ile-de-france, … You will need a dictionary to do so. You can take data from dbpedia.org or, for English only, WordNet. To avoid over-generalizing, you will need to implement some disambiguation (meaning discovery) as a first step, since paris, for example, could equally be expanded with alexandros, alaksandu of wilusa, king of troy, depending on the context.
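A minimal Python sketch of this enrichment step might look like the following. The tiny dictionaries are hypothetical stand-ins for data you would pull from dbpedia.org or WordNet, and the disambiguation is a deliberately naive clue-word overlap, just to show where that step fits.

```python
import re

# Hypothetical expansion dictionary, keyed by (term, sense).
EXPANSIONS = {
    ("paris", "city"): ["french capital", "capital of france", "ile-de-france"],
    ("paris", "mythology"): ["alexandros", "king of troy"],
}

# Hypothetical clue words used to guess which sense a document means.
CONTEXT_CLUES = {
    "city": {"france", "seine", "eiffel", "metro"},
    "mythology": {"troy", "helen", "achilles", "iliad"},
}

def disambiguate(term, tokens):
    """Pick the sense whose clue words overlap most with the document tokens.

    Returns None when no sense has any supporting clue word.
    """
    best_sense, best_overlap = None, 0
    for (candidate, sense) in EXPANSIONS:
        if candidate != term:
            continue
        overlap = len(CONTEXT_CLUES[sense] & tokens)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

def enrich(text):
    """Return extra phrases to index alongside the document text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    extra = []
    for term in sorted({t for t, _ in EXPANSIONS}):
        if term in tokens:
            sense = disambiguate(term, tokens)
            if sense is not None:
                extra.extend(EXPANSIONS[(term, sense)])
    return extra

city_terms = enrich("Paris lies on the Seine in France")
myth_terms = enrich("Paris abducted Helen of Troy")
```

A document with no supporting context gets no expansion at all in this sketch, which errs on the side of not over-generalizing.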

Free text (natural language) query parsing with solr

I'm trying to build a query parsing algorithm for a local search site that can classify a free text search query (single input text box) into the various types of searches possible on the site.
For example, the user could type chinese restaurants near xyz. How should I go about breaking it down to Cuisine:"chinese", locality:"xyz" given that
- there could be spelling mistakes
- keywords may match in different columns e.g. a restaurant may have "chinese" in its name
This is not really a natural language parsing problem since we're trying to search in a very limited set of possibilities.
My initial thoughts are to dump all values of a particular type into a field from the database and use the user's query to match in all those fields. Then, based on the score (and a predefined confidence level), divide the query into the 3-4 search fields like name/cuisine/locality.
Is there a better/standard way of doing this?
For spelling mistakes, you have to work with a dictionary/thesaurus. This can be part of your pre-processing and normalization.
For querying in multiple columns, you can do: cuisine:chinese OR restaurant_name:chinese
You can boost one of the two: cuisine:chinese^0.8 OR restaurant_name:chinese
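Putting those two ideas together, here is a rough Python sketch. The vocabularies, the similarity cutoff, and the decision to join clauses with AND are all assumptions for illustration. Spelling correction uses difflib from the standard library as a stand-in for a real dictionary/thesaurus, and the output is just a Solr-style query string in the format shown above.

```python
import difflib

# Hypothetical vocabularies, e.g. dumped from the database per field.
VOCAB = {
    "cuisine": ["chinese", "italian", "mexican"],
    "locality": ["xyz", "downtown"],
}

def correct(token, candidates, cutoff=0.8):
    """Return the closest known value for a token, or None if nothing is close."""
    matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def build_query(user_input):
    """Turn free text into a fielded query string; unknown words are dropped."""
    clauses = []
    for token in user_input.lower().split():
        for field, values in VOCAB.items():
            fixed = correct(token, values)
            if fixed is None:
                continue
            if field == "cuisine":
                # A cuisine word may also appear in a restaurant name.
                clauses.append(f"(cuisine:{fixed}^0.8 OR restaurant_name:{fixed})")
            else:
                clauses.append(f"{field}:{fixed}")
            break  # first matching field wins for this token
    return " AND ".join(clauses)

query = build_query("chinse restaurants near xyz")
```

With the misspelled input above, this produces (cuisine:chinese^0.8 OR restaurant_name:chinese) AND locality:xyz. Words such as "restaurants" and "near" are simply ignored here; a real parser would probably treat them as intent or stop words.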
