Get all `Facets` from Azure Search without item results

Hi all, I'm facing performance issues with Azure Cognitive Search. Currently I have 956 facet fields.
When I load documents from the Azure server it takes almost 30 to 35 seconds.
But when I remove facets from the Azure search request, documents load in 2 to 3 seconds.
So for this, I have created two APIs:
The first API loads the document results from the Azure server.
The second API loads all facets from the Azure server.
Is there any way to load only facets?
Code to get the documents from the Azure server:
// Resolve the container that will hold the filtered results.
DocumentSearchResult<AzureSearchItem> results = null;
ISearchFilterResult searchResult = DependencyResolver.Current.GetService<ISearchFilterResult>();
WriteToFile("Initiate request call for search result ProcessAzureSearch {0}");
// The facet definitions travel in 'parameters'; this is the call that slows down.
results = searchServiceClient.Documents.Search<AzureSearchItem>(searchWord, parameters);
WriteToFile("Response received for search result {0}");

Faceting is an aggregation operation that's performed over the matching results and is quite intensive when there are a lot of distinct buckets. I can't comment on the specific increase in latency but adding facets to the query definitely has a performance impact.
Since faceting computes an aggregation over the matching documents, it still has to run the query in the backend, but as Gaurav mentioned, specifying top = 0 prevents the actual document retrieval since the documents don't need to be included in the response. This could improve performance, especially if the individual docs are large.
Another possibility is to run just the query first and then use an identifier field to filter the docs with facets. Since filtering is faster than querying, the overall latency should improve. This only works if you're able to identify the id groups for the resultant docs from the first API call (sketched below).
In general I'd recommend using facets judiciously and re-evaluating the design if there is a need to run faceting queries on a field with high cardinality. Here's a document on optimizing search performance that you can take a look at:
https://learn.microsoft.com/en-us/azure/search/search-performance-optimization
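To illustrate the two-step idea, here is a rough sketch using the same Microsoft.Azure.Search client as the question. The filterable key field id, the facet field names, and the Id property on AzureSearchItem are assumptions for illustration, not part of the original code.

// Step 1: fetch documents only, no facets.
SearchParameters docParameters = new SearchParameters
{
    Top = 50
};
DocumentSearchResult<AzureSearchItem> docResults =
    searchServiceClient.Documents.Search<AzureSearchItem>(searchWord, docParameters);

// Step 2: collect the matched keys and request facets only, filtered to those documents.
// (Assumes the keys contain no commas; escape or batch them if they can.)
IEnumerable<string> ids = docResults.Results.Select(r => r.Document.Id);

SearchParameters facetParameters = new SearchParameters
{
    Facets = new List<string> { "category", "brand" },   // illustrative facet fields
    Filter = $"search.in(id, '{string.Join(",", ids)}', ',')",
    Top = 0                                               // no documents in the response
};
DocumentSearchResult<AzureSearchItem> facetResults =
    searchServiceClient.Documents.Search<AzureSearchItem>(searchWord, facetParameters);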

SearchParameters has a property called Top which instructs the Search Service to return that number of documents.
Gets or sets the number of search results to retrieve. This can be used in conjunction with $skip to implement client-side paging of search results. If results are truncated due to server-side paging, the response will include a continuation token that can be used to issue another Search request for the next page of results.
One possible solution would be to set this value to 0 in your Facets API and in that case no documents will be returned by the Search Service.
I am not sure about the performance implication of this approach though. I just tried it with a very small set of data and it worked just fine for me.
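For the facets-only API, a minimal sketch of that suggestion, reusing the Microsoft.Azure.Search client from the question (the facet field names are placeholders):

SearchParameters facetOnlyParameters = new SearchParameters
{
    Facets = new List<string> { "category", "brand" },  // fields to aggregate on
    Top = 0                                             // return no documents, only facet counts
};

DocumentSearchResult<AzureSearchItem> facetOnlyResults =
    searchServiceClient.Documents.Search<AzureSearchItem>(searchWord, facetOnlyParameters);

// facetOnlyResults.Facets holds the facet buckets; facetOnlyResults.Results stays empty.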

Related

What is the most performant and scalable way to paginate Cosmos DB results with the SQL API?

I have more questions based on this question and answer, which is now quite old but still seems to be accurate.
The suggestion of storing results in memory seems problematic in many scenarios:
A web farm where the end-user isn't locked to a specific server.
Very large result sets.
Smaller result sets with many different users and queries.
I see a few ways of handling paging based on what I've read so far:
Use OFFSET and LIMIT at potentially high RU costs.
Use continuation tokens and caches with the scaling concerns.
Save the continuation tokens themselves to go back to previous pages.
This can get complex since there may not be a one-to-one relationship between tokens and pages.
See Understanding Query Executions
In addition, there are other reasons that the query engine might need to split query results into multiple pages. These include:
The container was throttled and there weren't available RUs to return more query results
The query execution's response was too large
The query execution's time was too long
It was more efficient for the query engine to return results in additional executions
Are there any other, maybe newer options for paging?
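For reference, the continuation-token pattern the question mentions looks roughly like this with the current .NET SDK (Microsoft.Azure.Cosmos v3). This is a sketch only; the container, the MyItem type, and the query are illustrative.

// Page through a query 10 items at a time using continuation tokens.
QueryDefinition query = new QueryDefinition("SELECT * FROM c ORDER BY c.createdAt DESC");
string continuationToken = null;  // token from the previous page, or null for the first page

FeedIterator<MyItem> iterator = container.GetItemQueryIterator<MyItem>(
    query,
    continuationToken,
    new QueryRequestOptions { MaxItemCount = 10 });

if (iterator.HasMoreResults)
{
    FeedResponse<MyItem> page = await iterator.ReadNextAsync();
    // 'page' holds up to 10 items; page.ContinuationToken identifies the next page
    // and can be handed back to the client (or cached) instead of keeping results in memory.
    continuationToken = page.ContinuationToken;
}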

How to reliably determine when an Azure Cognitive Search index is up to date?

Azure Cognitive Search is eventually consistent - writes to the service return successfully but the writes are not materialized in the search index for a short period of time.
We are using Azure Cognitive Search in an eventually consistent event sourced CQRS architecture, where an Azure Search index is used as a projection of the event stream. We use websockets to notify connected clients when a projection has been updated, so that they can re-query it to fetch the latest data.
This presents a challenge with Azure Search, because when we notify a client that the index has been updated, the client may query the index before it can provide the most up to date data.
Does Azure Cognitive Search provide any built in ability to determine when a given write will be queryable?
If not, what patterns can be used to achieve what we want?
I am not aware of any functionality in Azure Search that allows you to submit content with a callback that confirms that the content is indexed. However, I have used search engines before where this was an option. When submitting content you could choose between different wait options:
Fire and forget (quickest, return immediately)
Wait for confirmed storage (reasonably quick)
Confirmed indexed (potentially very slow depending on overall indexing load)
It seems like you want the last option. You could create a function that submits a batch of content and then queries until that content is available. For this to work, you would need an indicator on each record that you can use to confirm via a query that the new records are in fact indexed. I always include a timestamp property on all my records and it would do the job in this case.
Use case: You have an index with 500 items. You then have a batch of 10 updated items where 5 are updates and 5 are new records. You add a timestamp to all of these records, submit the batch of 10 and then query in a loop. Once the query confirms that you have 10 or more records with a timestamp greater than or equal to the time you submitted your batch, you know the index is updated.
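A rough sketch of that polling loop, reusing the Microsoft.Azure.Search style from the first question. The filterable lastUpdated field, the batch size of 10, and the variable names are assumptions for illustration.

DateTimeOffset batchTime = DateTimeOffset.UtcNow;
// ... stamp each of the 10 documents with batchTime and submit the batch ...

SearchParameters pollParameters = new SearchParameters
{
    Filter = $"lastUpdated ge {batchTime.UtcDateTime:o}",  // records written at or after batchTime
    IncludeTotalResultCount = true,
    Top = 0  // only the count is needed, not the documents
};

long visible = 0;
while (visible < 10)
{
    DocumentSearchResult<AzureSearchItem> poll =
        searchServiceClient.Documents.Search<AzureSearchItem>("*", pollParameters);
    visible = poll.Count ?? 0;
    if (visible < 10) await Task.Delay(TimeSpan.FromMilliseconds(500));
}
// The batch is now queryable; notify the connected clients.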

Azure CosmosDB SQL Record counts

I have a CosmosDB Collection which I'm querying using the REST API.
I'd like to access the total number of documents which match my query. I know I can do a count, but that means two calls, one for the count and a subsequent one to retrieve the actual records.
I would assume this is not possible in a single call, but the Data Explorer in the Azure Portal seems to manage it, so I'm just wondering if anyone has figured out what calls it makes to get this:
Showing Results 1 - 10
Retrieved document count 342
Retrieved document size 2868425 bytes
Output document count 10
It's the Retrieved Document Count I need - if the portal can do it, there ought to be a way :)
I've tried the Java SDK as well as REST but can't see any useful options in there either.
As so often is the case in this game, asking a question triggers the answer... so apologies in advance.
The answer is to send the x-ms-documentdb-populatequerymetrics header in the request.
The response then gives a whole bunch of useful stuff in x-ms-documentdb-query-metrics.
What I would like to understand still is whether this has any performance impact?
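For reference, a request that asks for the metrics over REST looks roughly like this with HttpClient. This is a sketch only: the account/database/collection URI is a placeholder, and the authorization and partition-key headers (required in a real call) are omitted.

// Issue a SQL query and ask Cosmos DB to populate query metrics in the same response.
var request = new HttpRequestMessage(HttpMethod.Post,
    "https://myaccount.documents.azure.com/dbs/mydb/colls/mycoll/docs");

request.Headers.Add("x-ms-documentdb-isquery", "True");
request.Headers.Add("x-ms-version", "2018-12-31");
request.Headers.Add("x-ms-documentdb-populatequerymetrics", "true");  // the header in question
// request.Headers.Add("authorization", authToken);  // generated per the REST auth docs

request.Content = new StringContent("{\"query\": \"SELECT * FROM c\", \"parameters\": []}", Encoding.UTF8);
request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/query+json");

HttpResponseMessage response = await httpClient.SendAsync(request);

// The metrics (retrieved document count, retrieved document size, etc.) come back in a response header.
if (response.Headers.TryGetValues("x-ms-documentdb-query-metrics", out var metrics))
{
    Console.WriteLine(metrics.First());
}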

How do I resolve RequestRateTooLargeException on Azure Search when indexing a DocumentDB source?

I have a DocumentDB instance with about 4,000 documents. I just configured Azure Search to search and index it. This worked fine at first. Yesterday I updated the documents and indexed fields along with one UDF to index a complex field. Now the indexer is reporting that DocumentDB is reporting RequestRateTooLargeException. The docs on that error suggest throttling calls but it seems like Search would need to do that. Is there a workaround?
Azure Search code uses the DocumentDb client SDK, which retries internally with the appropriate timeout when it encounters a RequestRateTooLarge error. However, this only works if there are no other clients using the same DocumentDb collection concurrently. Check if you have other concurrent users of the collection; if so, consider adding capacity to the collection.
This could also happen because, due to some other issue with the data, the DocumentDb indexer isn't able to make forward progress - it will then retry on the same data and may potentially encounter the same data problem again, akin to a poison message. If you observe that a specific document (or a small number of documents) causes indexing problems, you can choose to ignore them. I'm pasting an excerpt from the documentation we're about to publish:
Tolerating occasional indexing failures
By default, an Azure Search indexer stops indexing as soon as even a single document fails to be indexed. Depending on your scenario, you can choose to tolerate some failures (for example, if you repeatedly re-index your entire datasource). Azure Search provides two indexer parameters to fine-tune this behavior:
maxFailedItems: The number of items that can fail indexing before an indexer execution is considered a failure. Default is 0.
maxFailedItemsPerBatch: The number of items that can fail indexing in a single batch before an indexer execution is considered a failure. Default is 0.
You can change these values at any time by specifying one or both of these parameters when creating or updating your indexer:
PUT https://service.search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
{
    "dataSourceName" : "mydatasource",
    "targetIndexName" : "myindex",
    "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 5 }
}
Even if you choose to tolerate some failures, information about which documents failed is returned by the Get Indexer Status API.
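If you manage the indexer through the .NET SDK rather than REST, the equivalent settings can be applied roughly as below. This is a sketch using the Microsoft.Azure.Search client; the indexer, data source and index names are placeholders, and serviceClient is assumed to be a SearchServiceClient created with an admin key.

// Rough SDK equivalent of the PUT request above.
Indexer indexer = new Indexer(
    name: "myindexer",
    dataSourceName: "mydatasource",
    targetIndexName: "myindex")
{
    Parameters = new IndexingParameters
    {
        MaxFailedItems = 10,        // tolerate up to 10 failures per indexer run
        MaxFailedItemsPerBatch = 5  // tolerate up to 5 failures per batch
    }
};

serviceClient.Indexers.CreateOrUpdate(indexer);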

How to paginate requests for large data sets in OData from PowerQuery?

I have an OData feed that contains a number of large tables (tens of millions of rows). I need to configure PowerQuery (or PowerPivot, whichever is the best tool for the job) to access this OData feed, but to do so in a paginated way so that a single request doesn't try to return 10 million rows all at once, but instead builds up the complete result of tens of millions of rows with multiple paginated queries. I don't want to have to manually submit many different URLs with different values of $top and $skip to do my own manual pagination, instead I need PowerQuery or PowerPivot to handle the pagination for me.
I was hoping that PQ/PP would just be smart enough to do pagination, perhaps by first issuing a "count" query to determine how many rows are present, but this appears not to be the case. When I give PQ/PP a URL to a large OData table, it just blindly issues a query to retrieve all rows (actually, it issues 2 such identical queries, which seems odd), which crashes the DB on the server.
In searching for an answer, I've seen hints that PQ/PP can do pagination, but no clue as to how to enable this behavior. So is there a way to tell PQ/PP to use some kind of pagination to access large data sets? If so, can I set the page size?
You can put the PageSize on the EnableQueryAttribute if you are using Web API:
[EnableQuery(PageSize = 10)]
public IHttpActionResult Get()
{
    return Ok(customers);
}
You can use recursion to fetch and append successive pages. Each successive fetch uses a higher start value in the URL, and the recursion ends when a fetch yields an empty list. In M, an if expression can check for the empty list; otherwise append the page, increment the start value, and use @ to self-reference your current function name.
