Can Azure Cognitive Search provide aggregate sums? - aggregation

Suppose I have many documents with facets and text fields, plus a numeric field such as amount: 3. I want to support a query that returns the sum of amount across all documents matching the search criteria.
Is this supported in Azure Cognitive Search? I can't find documentation on a query that returns an aggregate sum, but I know Elasticsearch can do it.
Can I run a raw Lucene query?

Related

Customize Azure Search scoring in a specific way

Consider a scenario where all documents have the following fields: email, first name, last name, and title.
The requirement is that for email the score should be either 100 (exact match) or 0.
For the remaining fields, the score ranges from 0 to 100 based on edit distance.
Suppose the index contains records like the following:
1. abcd@gmail.com, Peterr, Parker, Developer
2. xyz@yahoo.com, Steve, Smith, Manager
The query is a fuzzy search across all fields, with parameters like
abcd@gmail.com, Pet, Par, Devl
The search result should give the first record a score like
score for email + score for first name + score for last name + score for title
= 100 + 50 (approx. edit distance of 'Pet' and 'Peterr') + 50 (approx. edit distance of 'Par' and 'Parker') + 44 (approx. edit distance of 'Devl' and 'Developer')
= 244
The other records should be scored the same way.
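The arithmetic above can be reproduced outside the search service. The following is a minimal sketch, not an Azure Search feature: it assumes the per-field score is 100 × (1 − Levenshtein distance / length of the longer string), which happens to give the approximate 50, 50 and 44 quoted above. The function names and the field order (email, first name, last name, title) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete a character of a
                            curr[j - 1] + 1,       # insert a character of b
                            prev[j - 1] + cost))   # substitute or keep
        prev = curr
    return prev[-1]


def field_score(query: str, value: str) -> float:
    """Assumed formula: 100 * (1 - distance / longer length), case-insensitive."""
    q, v = query.lower(), value.lower()
    longest = max(len(q), len(v))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(q, v) / longest)


def record_score(query: list[str], record: list[str]) -> float:
    """First field is the email (100 or 0); the rest are scored by edit distance."""
    email_score = 100.0 if query[0].lower() == record[0].lower() else 0.0
    return email_score + sum(
        field_score(q, r) for q, r in zip(query[1:], record[1:])
    )


query = ["abcd@gmail.com", "Pet", "Par", "Devl"]
records = [
    ["abcd@gmail.com", "Peterr", "Parker", "Developer"],
    ["xyz@yahoo.com", "Steve", "Smith", "Manager"],
]
scores = [record_score(query, r) for r in records]
print([round(s) for s in scores])
```

With this assumed formula the first record rounds to 244, matching the hand calculation.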
I checked, and Azure Search scoring has weights, but I don't think they would be of much help in a scenario like this. The main thing we are looking for is a way to make the search score that Azure Search returns for each record follow the scheme described above.
To clarify, it seems what you need is the scoring formula to be a function of the edit distance between the query term and the indexed term - the shorter the distance, the higher the score. Unfortunately, this is not possible in Azure Search.
The Azure Search engine executes a search query in two phases: retrieval and scoring.
During retrieval, the query terms processed by the lexical analyzer are looked up in the inverted index, and documents containing those terms are returned. When you use fuzzy search, we expand your query by adding terms from the inverted index that are within the edit distance of a given query term (fuzzy expansion). This way your query can match more documents.
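To make the retrieval phase concrete, here is a toy sketch of an inverted index with fuzzy expansion. It only illustrates the idea and is not how the service is implemented; the whitespace tokenizer, the sample documents, and all function names are assumptions.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased term to the set of document IDs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def fuzzy_expand(term: str, index: dict[str, set[int]], max_edits: int = 1) -> list[str]:
    """Indexed terms within max_edits of the query term."""
    return sorted(t for t in index if edit_distance(term, t) <= max_edits)


def retrieve(query: str, index: dict[str, set[int]], max_edits: int = 1) -> set[int]:
    """Union of the postings of every expanded term."""
    matched: set[int] = set()
    for term in query.lower().split():
        for candidate in fuzzy_expand(term, index, max_edits):
            matched |= index[candidate]
    return matched


docs = {
    1: "peter parker developer",
    2: "steve smith manager",
    3: "petr novak tester",
}
index = build_inverted_index(docs)
print(fuzzy_expand("peter", index))  # expanded terms for the query term
print(retrieve("peter", index))      # documents matched after expansion
```

Here the query term "peter" is expanded with the indexed term "petr", so document 3 is retrieved as well as document 1.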
During scoring, we assign a relevance score to the retrieved documents using the Lucene scoring formula, which is based on TF/IDF. In practice, this means that documents matching rare terms are ranked higher in the result set.
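The effect of rare terms can be seen in a deliberately simplified TF/IDF sketch. This is not the exact Lucene formula (which has additional normalization and boost factors); the sqrt(tf) and 1 + ln(N/df) terms, the sample documents, and the function name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query: str, docs: dict[str, str]) -> dict[str, float]:
    """Score each matching document as sum(sqrt(tf) * idf) over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df: Counter[str] = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores: dict[str, float] = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                idf = 1.0 + math.log(n_docs / df[term])  # rarer term, larger idf
                score += math.sqrt(counts[term]) * idf
        if score > 0:
            scores[doc_id] = score
    return scores


docs = {
    "a": "azure search guide",
    "b": "azure search scoring",
    "c": "azure lucene scoring",
}
scores = tf_idf_scores("azure lucene", docs)
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 3))
```

All three documents match the common term "azure", but only document "c" matches the rare term "lucene", so it ranks first.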
It's important to know that the Lucene scoring formula only applies to documents that matched the original query terms or terms added through fuzzy expansion. Documents that matched terms added through prefix expansion or regex/wildcard expansion are given a constant score of 1. This way those documents appear in the result set but have no impact on the ranking, which is based on term frequency.
Hope that helps

The implication of @search.score in Azure Search Service

I understand the reason for having a scoring profile and boosting results based on fields such as distance or rating. To me, that is most applicable to structured documents such as JSON files. The scenario I cannot make sense of is when an indexer has the search service index, say, an MS Word or PDF document in Azure Blob storage. We then have two fields, "id" and "content", and I don't know how the search score applies to them.
For example, there are two documents with different contents. I searched for a keyword that appears in both, and the two MS Word documents came back with different scores. Why should the scores differ when both documents contain the same keyword?
The score is determined by many factors, for example the number of terms in each document and the number of searchable fields in which query terms were found. In your example the documents have different lengths, so naturally they get different scores. HTH.
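A small sketch shows how document length alone can change the score of the same keyword. It uses a simplified classic-Lucene-style formula, sqrt(term frequency) / sqrt(number of terms), as a stand-in; the service's actual formula may differ, and the function name and sample texts are made up.

```python
import math


def keyword_score(keyword: str, text: str) -> float:
    """sqrt(tf) scaled by a length norm of 1 / sqrt(number of terms)."""
    terms = text.lower().split()
    if not terms:
        return 0.0
    freq = terms.count(keyword.lower())
    return math.sqrt(freq) / math.sqrt(len(terms))


short_doc = "invoice overdue"
long_doc = "please find attached the invoice for last month"

# Both documents contain "invoice" exactly once, yet the scores differ.
print(round(keyword_score("invoice", short_doc), 3))
print(round(keyword_score("invoice", long_doc), 3))
```

The keyword makes up a larger share of the short document, so the short document scores higher.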

How does Elasticsearch rank filter queries (rather than text queries)?

I know that Elasticsearch uses relevance ranking algorithms such as Lucene's TF/IDF, length normalization, and a couple more to rank term queries on text fields (e.g. searching for the words "medical" AND "journal" in the "title" and "body" fields).
My question is: how does Elasticsearch rank and retrieve the results of a filter or range query (e.g. age=25 or weight>60)?
I know these queries just filter documents based on the condition(s). But let's say I have 200 documents whose age field value is 25. Which of those documents will be retrieved as the top 10 results?
Does Elasticsearch retrieve them in the order it indexed them?
From the Elasticsearch documentation:
Filters: As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
Queries: As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
So when running a search such as "age=25, or weight>60" you should be using a filter.
However, filters do not affect scoring: if you use only a filter, all of your search results will have the same score.
There is also a range query, which runs in query context. As far as I know, it does not rank by how close a value is to the bounds; every document inside the range gets the same constant score.
You'd need to explore the Elasticsearch and Lucene documentation further to understand exactly how and why a document got its score, but as noted above, you may be better off using filters, which don't affect scoring.
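Because a filter-only search gives every hit the same score, the top 10 is only well defined if you sort explicitly. Below is a sketch of a request body that does this, written as a Python dictionary. It assumes a recent Elasticsearch version where filters are expressed through the `filter` clause of a `bool` query; `age` and `weight` come from the question, while `created_at` and the function name are hypothetical.

```python
import json


def build_filter_search(age: int, min_weight: float, size: int = 10) -> dict:
    """Request body for "age = X OR weight > Y" with no relevance scoring."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Clauses under "filter" select documents but do not score them.
                "filter": [
                    {
                        "bool": {
                            "should": [
                                {"term": {"age": age}},
                                {"range": {"weight": {"gt": min_weight}}},
                            ],
                            "minimum_should_match": 1,
                        }
                    }
                ]
            }
        },
        # Explicit sort on a hypothetical field, so the top hits are deterministic.
        "sort": [{"created_at": {"order": "desc"}}],
    }


body = build_filter_search(age=25, min_weight=60)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a `_search` request with whichever client you use.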

Lucene is not finding results that are present in the index

I'm inspecting a Lucene index with Luke.
All documents have a field 'Title' and I would like to do a search for the search expression Title:Power, by which I want to find all documents with a title containing the word Power.
In Luke, I go to the tab "Search" and enter +Title:Power
When searching, there are no results. However, when I search by another field, I do find the document: +ContentType:MyContentType
In the column Title, I can clearly see the value of the document being: Power Quality Guide.
What could be the reasons I'm not finding this document when searching on Title?
There can be a number of reasons. Most common ones:
The Title field could be stored in the index but not indexed for search (Field.Store.YES, Field.Index.NO), unlike the field for which you do get results (ContentType);
the document(s) could have been indexed with one analyzer while the query uses a different one;
the document could have been indexed with the NOT_ANALYZED option, which stores the whole field value as a single term.
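The last two causes can be illustrated with toy analyzers. This sketch does not use Lucene; the two analyzer functions only mimic, under simplifying assumptions, a lowercasing tokenizer and a keyword-style analyzer that keeps the whole value as one term.

```python
def standard_like_analyzer(text: str) -> list[str]:
    """Lowercase and split on whitespace (rough stand-in for a standard analyzer)."""
    return text.lower().split()


def keyword_like_analyzer(text: str) -> list[str]:
    """Keep the whole value as one unmodified term (like NOT_ANALYZED)."""
    return [text]


def index_field(value: str, analyzer) -> set[str]:
    """Terms that would be stored in the inverted index for this field."""
    return set(analyzer(value))


def matches(query: str, indexed_terms: set[str], query_analyzer) -> bool:
    """A term query matches only if an analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in query_analyzer(query))


title = "Power Quality Guide"
analyzed = index_field(title, standard_like_analyzer)      # {"power", "quality", "guide"}
not_analyzed = index_field(title, keyword_like_analyzer)   # {"Power Quality Guide"}

# Same analyzer at index and query time: the term is found.
print(matches("Power", analyzed, standard_like_analyzer))
# Analyzer mismatch: "Power" is not lowercased, but the index holds "power".
print(matches("Power", analyzed, keyword_like_analyzer))
# NOT_ANALYZED: only the full original value matches.
print(matches("Power", not_analyzed, keyword_like_analyzer))
print(matches("Power Quality Guide", not_analyzed, keyword_like_analyzer))
```

In Luke, the analyzer selected on the Search tab plays the role of `query_analyzer` here, so it is worth checking that it matches the one used at index time.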

How does Lucene order data in composite queries?

I need to know how Lucene orders the records in a result set if I use composite queries.
It looks like it sorts by the "score" value for exact queries and lexicographically for range queries. But what if I have a query that looks like
q = type:TAG OR type:POST AND date:[111 TO 999]
You are mixing logical search with scoring. When you pass a query like date:[111 TO 999], Lucene finds all documents with a date in the specified range. But you give it no hint about how to rank them: is date 111 preferable to 555? Is 701 better than 398? Lucene has no idea, so the score is the same for all matching documents. To impose some order, Lucene breaks the tie in a fixed way (in practice by internal document ID, which can look lexicographic if the documents were indexed in that order), but that is mostly an implementation detail, not a key idea.
On the other hand, if you pass other parameters with the query, be it keywords or tags, Lucene can apply its similarity algorithm and assign different scores to different documents in the results. You may find more on Lucene's scoring here.
So, to give you a short answer: Lucene sorts results by score, and only if two documents have the same score does it fall back to another ordering.
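The ordering rule can be sketched in a few lines. This is an illustration rather than Lucene code: it assumes ties are broken by ascending internal document ID, and the hit dictionaries and scores are invented.

```python
def rank(hits: list[dict]) -> list[dict]:
    """Sort by score descending; break ties by ascending document ID."""
    return sorted(hits, key=lambda hit: (-hit["score"], hit["doc_id"]))


# A pure range query: every matching document gets the same score.
range_only = [
    {"doc_id": 7, "score": 1.0},
    {"doc_id": 2, "score": 1.0},
    {"doc_id": 5, "score": 1.0},
]

# A query with keyword clauses: scores differ, so score decides first.
with_keywords = [
    {"doc_id": 7, "score": 2.4},
    {"doc_id": 2, "score": 0.9},
    {"doc_id": 5, "score": 2.4},
]

print([hit["doc_id"] for hit in rank(range_only)])
print([hit["doc_id"] for hit in rank(with_keywords)])
```

In the composite query from the question, the type clauses contribute differing scores, so the tie-breaker only matters among documents that end up with equal scores.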
