Why can't ContinuationToken be used for paging in Azure Search API? - azure

Reading the documentation for the Azure Search .NET SDK, I see that the ContinuationToken property is not supposed to be used for pagination (this is the same as the @odata.nextLink and @search.nextPageParameters properties in the REST API).
Note that this property is not meant to help you implement paging of search results. You can implement paging using the Top and Skip search parameters.
Source
Why can't I use it for pagination? I have a situation where I want to run a query and then step through a static copy of the results page by page. However, I don't want those query results to change beneath my feet as I navigate through them, as new documents are added to the underlying database. In my case, hundreds or thousands of results could be added in the minute or two between submitting the initial query and navigating to another page. How could I accomplish this?

Your question can be addressed in two parts:
Why is it not recommended to use ContinuationToken to implement pagination?
How can pagination be implemented such that results remain completely stable from page to page?
These are actually unrelated questions, since nothing about ContinuationToken guarantees the stability of the search results. Azure Search makes no consistency guarantees around paging, whether you use $top and $skip or ContinuationToken.
For question #1, the reason ContinuationToken is not recommended for paging is that Azure Search controls when the token is returned, not your application code. If you make assumptions about how and when Azure Search decides to return you a token, there's a chance those assumptions may break with a future service update. The intent of ContinuationToken is to prevent requests for too many documents from overwhelming the service, so you should assume that it is entirely at the service's discretion whether it will return a token.
For question #2, since Azure Search doesn't provide consistency guarantees, you can't completely avoid issues like the same document showing up in multiple pages, missing documents, or documents that are deleted by the time they are seen in results. Even if you wanted to build your own snapshot of the results and page over them in your application code, building a consistent snapshot isn't possible in the first place. However, if your only concern is to avoid showing new documents in the results, you can include a created timestamp field in your index and filter on that in every search request.
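As a rough sketch of that timestamp approach, assuming an index with a filterable DateTimeOffset field named created, and calling the REST API directly (the service URL, index name, key, and api-version below are placeholders):

    import datetime
    import requests

    # Placeholder service, index, key, and API version; adjust to your deployment.
    SERVICE_URL = "https://myservice.search.windows.net"
    INDEX = "myindex"
    API_KEY = "<query-key>"

    def search_page(text, snapshot_time, page, page_size=50):
        # Exclude documents created after the snapshot so new ones never show up.
        params = {
            "api-version": "2020-06-30",
            "search": text,
            "$filter": "created le " + snapshot_time,
            "$orderby": "created asc",  # a deterministic sort order also helps
            "$top": page_size,
            "$skip": page * page_size,
        }
        resp = requests.get(SERVICE_URL + "/indexes/" + INDEX + "/docs",
                            params=params, headers={"api-key": API_KEY})
        resp.raise_for_status()
        return resp.json()["value"]

    # Capture the snapshot time once, when the user first submits the query.
    snapshot = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    page_one = search_page("hotel", snapshot, page=0)
    page_two = search_page("hotel", snapshot, page=1)

Note that this only keeps new documents out of the results; updates and deletes to existing documents can still shift results between pages.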
Frankly, unless you're trying to export the entire contents of your index, I would question the need for such strong consistency guarantees around paging. Google and Bing make no such guarantees, so arguably user expectations are already set around this. If you are trying to export your data, this is unfortunately not easy with Azure Search today. In that case, please vote on this User Voice item to help the team prioritize this scenario.

Using the Graph API for accessing SharePoint Lists

TLDR of the question:
Is it possible to use Graph to query a SharePoint list, which contains lookups that would need to be fetched from a different SharePoint list?
The "old" SharePoint API can do that in one request.
Follow up question as a result of my attempts to work around that limitation:
Why does Graph not allow me to ask for multiple list entries by ID?
This makes literally no sense to me.
Background for the question:
I've been given the task to move a small SharePoint app from the normal SharePoint API over to the Graph API, so the features could be expanded into incorporating Exchange too. I've never worked with either prior to this, so I didn't really have any idea what I was getting into.
And while I did succeed in finding equivalent Graph queries for everything that was needed so far, I am also starting to doubt that Graph is seriously intended to be used for SharePoint access.
Lists are the best example. The SharePoint API offers resolving LookupId values when requesting multiple items.
Graph doesn't even offer that when requesting an item directly, let alone multiple.
To make things worse, after I wrote my own lookup routine that picks the columns that are lookups, and having to manually tell it where to find the values for that, I discovered that Graph won't even let me request multiple items by ID...
At first I tried to chain id eq '<id>' requests, because even $batch requests are limited to 20 individual requests, limiting the number of items I could look up to 20 at most.
But filtering 'id' is apparently unintended.
https://graph.microsoft.com/v1.0/sites/{site}/lists/{list}/items?$filter=id+eq+'67'
results in "General exception while processing", which I've never even seen as a response until this.
I then tried the in keyword:
https://graph.microsoft.com/v1.0/sites/{site}/lists/{list}/items?$filter=id+in+('67')
which results in "Invalid request".
After that I thought I could be smart and add a calculated column which copies the item's id, and index on that, but guess what: you can't set an index on that column in the first place, AND it also refuses filtering on it on top.
Not even offering the header fix for indexing on unindexed columns, nope. It outright complains that the field isn't usable.
With all this, I feel like I will have to settle for a hybrid approach, unless I'm seriously missing something here.
I thought that having to write my own LookupId resolver was bad, but being unable to even optimize the requests to return all matching items from a list in one request at least, and instead having to request every single item EACH, because filtering by id is forbidden and the ONLY access by id is singular, just gives me the feeling that Graph was never meant to be used for SharePoint lists at all.
Is there an actual question here?
Microsoft has been recommending Graph for working with SharePoint Online for a while now. Though it is true there are shortcomings at present, Microsoft is constantly investing in improving Graph whereas older methods like CSOM are deprecated and are no longer updated.
If there is a sufficient reason why you have to filter by id instead of using GET .../items/{id} as many times as needed, you may try filtering by fields/ID, but don't use $ before the filter parameter in that case - it won't work. So try
GET https://graph.microsoft.com/v1.0/sites/{site}/lists/{list}/items?filter=fields/ID eq 67
while setting the header Prefer: HonorNonIndexedQueriesWarningMayFailRandomly
Also, getting multiple items from large lists may cause issues, so for that case I'd add &top=5000
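Put together, a minimal Python sketch of that workaround might look like this (the site ID, list ID, and token are placeholders; the plain filter parameter, without the $ prefix, and the Prefer header are the two non-obvious parts):

    import requests

    # Placeholder site/list IDs and OAuth token; adjust for your tenant.
    SITE_ID = "<site-id>"
    LIST_ID = "<list-id>"
    TOKEN = "<access-token>"

    url = "https://graph.microsoft.com/v1.0/sites/%s/lists/%s/items" % (SITE_ID, LIST_ID)
    resp = requests.get(
        url,
        params={
            "filter": "fields/ID eq 67",  # note: plain 'filter', not '$filter'
            "top": 5000,                  # large lists may otherwise cause issues
            "expand": "fields",           # include the column values in the response
        },
        headers={
            "Authorization": "Bearer " + TOKEN,
            # Allows filtering on non-indexed columns; the header name is honest
            # about the risk that such queries may fail on very large lists.
            "Prefer": "HonorNonIndexedQueriesWarningMayFailRandomly",
        },
    )
    resp.raise_for_status()
    items = resp.json().get("value", [])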

Efficient pagination in Cosmos DB

I need to implement efficient pagination for Cosmos DB with nodejs api. There are many examples about the implementation with .NET and LINQ but I could not find anything good for nodejs. The idea is to send the pageSize and pageIndex and get the relevant result.
I already know we can always use dbClient.queryDocuments, get the queryIterator and perform the pagination, but this requires always iterating from the first document in the DB. An example can be found here.
Any idea how to do it in an efficient way?
Unfortunately CosmosDB as an engine doesn’t have skip and take pagination support yet.
It is, however, a planned feature.
The blogs you’ve read provide one of the few viable workarounds for now which of course comes with a cost.
You could write something smarter: instead of iterating through every document from the beginning, you could keep the request's continuation token and use it with your next request. That way you can have previous and next button logic.
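A rough sketch of that continuation-token pattern, shown here with the Python SDK (azure-cosmos v4) since the flow translates directly to the Node.js SDK; the account endpoint, key, and database/container names are placeholders, and support for continuation on cross-partition queries depends on your SDK version:

    from azure.cosmos import CosmosClient

    # Placeholder endpoint, key, and names; the token-passing pattern is the point.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    container = client.get_database_client("mydb").get_container_client("mycoll")

    def get_page(continuation_token=None, page_size=20):
        # Fetch one page; passing the previous token resumes where we left off.
        pager = container.query_items(
            query="SELECT * FROM c ORDER BY c._ts",
            enable_cross_partition_query=True,
            max_item_count=page_size,
        ).by_page(continuation_token)
        page = list(next(pager))
        return page, pager.continuation_token

    page1, token = get_page()        # first page, no token yet
    page2, token = get_page(token)   # "next" button: resume from the stored token

Since tokens only move forward, a "previous" button means caching the tokens (or the pages themselves) you have already seen.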

REST API: Infinite scroll pagination in the GUI, but allow searching through all entries

I have Express running in a Node.js server, which serves as a backend for my React frontend application.
The frontend application fetches data from the backend (which is stored in Mongo) through a REST call, and displays this data in a table.
The amount of data is growing by the day, so I thought I should look into reducing the amount of data transferred to the frontend application, to avoid unnecessary strain on the backend.
I'm not sure if this is the right way to approach this, but I've been thinking I would look into having the backend fetch a limited amount of entries, so that only those entries are displayed in the frontend table.
The problem arises with searching - when the user wants to search the data in the table, I'll need to be able to search through all entries, not just the data loaded into the table.
I guess one option would be to have the search function actually query the REST API, instead of searching the table itself.
If I'm on the right track, I guess I could implement REST API pagination, along the lines of the example found at https://refactoringfactory.wordpress.com/2012/09/08/pagination-in-node-js-and-express/. Other suggestions on how to implement pagination are welcome.
I'd very much like some input on the approach I described, and suggestions for smarter ways to implement this.
EDIT: I changed the title somewhat to include "Infinite scroll pagination". This is what I'm looking to implement. At the moment I have a click on pages pagination setup, but would like to replace this for the infinite scroll pagination.
I've been thinking I would look into having the backend fetch a limited amount of entries, so that only those entries are displayed in the frontend table.
This is common practice in my experience. The term for it is "pagination." Have a look at this SO question regarding best practices for pagination in REST API's: API pagination best practices.
The problem arises with searching - when the user wants to search the data in the table, I'll need to be able to search through all entries, not just the data loaded into the table.
I guess one option would be to have the search function actually query the REST API, instead of searching the table itself.
Again, you got it. Doing small filters/searches on the client is fine for a limited number of entries, but if you need to only retrieve items matching search criteria in the first place, then adding that functionality to your REST API is the right choice.
Right, you should do
pagination: you might implement it by exposing 2 arguments in the rest endpoint for the listing
?p=<number>: page number, defaults to 1
?l=<number>: number of items per page / page length, defaults to a number maybe from 10 to 100
search: implement it by exposing 1 argument in the rest endpoint for the listing
/?q=<string>: you can define to be what you want, maybe a string that matches with one or multiple fields of the data
If you want to minimize the network traffic, you might also add one more parameter to explicitly select the fields you want to be returned, like this
/?f=<string>: string could be something like id,name,age, and so the api should return only those three fields per record.
All these parameters should be accepted by a list endpoint in your RESTful API
Example:
http://example.com/api/cars/?p=2&l=15&q=toyota&f=id,brand,model,color
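For illustration, a minimal sketch of such a list endpoint, written here in Python with Flask and pymongo (the database/collection names, defaults, and the text-index assumption are all placeholders; the same logic maps directly onto Express and the Node Mongo driver):

    from flask import Flask, jsonify, request
    from pymongo import MongoClient

    app = Flask(__name__)
    coll = MongoClient()["mydb"]["cars"]  # placeholder database/collection

    @app.route("/api/cars/")
    def list_cars():
        p = max(int(request.args.get("p", 1)), 1)     # page number, defaults to 1
        l = min(int(request.args.get("l", 25)), 100)  # page length, capped at 100
        q = request.args.get("q")                     # search string
        f = request.args.get("f")                     # comma-separated field list

        query = {"$text": {"$search": q}} if q else {}  # assumes a Mongo text index
        projection = {"_id": 0}                         # keep responses JSON-safe
        if f:
            projection.update({name: 1 for name in f.split(",")})
        cursor = coll.find(query, projection).skip((p - 1) * l).limit(l)
        return jsonify(list(cursor))

A request like the example URL above (?p=2&l=15&q=toyota&f=id,brand,model,color) would then return the second page of 15 matching records, with only those four fields per record.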

Applying "tag" to millions of documents, using bulk/update methods

We have about 55,000,000 documents in our ElasticSearch instance. We have CSV files with user_ids; the biggest CSV has 9M entries. Our documents have user_id as the key, so this is convenient.
I am posting the question because I want to discuss this and find the best option to get it done, as there are different ways to address the problem. We need to add the new "label" to a document if the user document doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds way too slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which provides better performance, but is limited to 1000-5000 documents per call, and knowing when a batch is too large is know-how we would need to learn on the go.
Then there is the official open issue for /update_by_query endpoint which has lots of traffic, but no confirmation it was implemented in the standard release.
On this open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old, open issues where users complain of performance problems and memory issues.
I am not sure if it's doable in ES, but I thought I would load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? And if some of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally I store the tag data (your CSV) in a separate doc type, and query from that and tag all new docs as they are created, i.e., to not have to first index and then update.
Python snippet to illustrate the approach:
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()  # assumes a local node; pass your hosts as needed
    myindex = 'users'     # the index holding the documents to tag
    myquery = {'query': {'match_all': {}}}  # narrow this to the user_ids from the CSV
    tags = ['github']     # the tag(s) to set

    def actiongen():
        # Scan over every matching document, yielding one partial-update action each.
        docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
        for doc in docs:
            yield {
                '_op_type': 'update',
                '_index': doc['_index'],
                '_type': doc['_type'],
                '_id': doc['_id'],
                'doc': {'tags': tags},
            }

    helpers.bulk(es, actiongen(), index=myindex, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
    curl -XPOST localhost:9200/index/type/_update_by_query -d '{
        "query": {
            "filtered": {
                "filter": {
                    "not": {"term": {"tag": "github"}}
                }
            }
        },
        "script": "ctx._source.tag = \"github\""
    }'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds and leave behind the deleted document as a tombstone until it can be merged out.
Sending a groovy script to execute the update probably makes the most sense here so you don't have to fetch the document first.
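As an illustration of that script-based route, here is one possible bulk update action in the same Python helpers style as the snippet above (the index, type, id, and field names are made up, the script is ES 1.x-era Groovy, and whether your client version forwards script/params in bulk update bodies is worth verifying):

    action = {
        '_op_type': 'update',
        '_index': 'users',  # illustrative index/type/id
        '_type': 'user',
        '_id': '42',
        # The script runs on the shard, so there is no get-then-reindex round
        # trip, and the tag is appended only when it is not already present.
        'script': 'if (!ctx._source.tags.contains(tag)) { ctx._source.tags += tag }',
        'params': {'tag': 'github'},
    }
    # Feed actions like this into helpers.bulk, exactly as in the snippet above.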
You could create a Parent/Child relationship whereby you add a 'tags' type which references your 'posts' type as its parent. This way you wouldn't need to perform a full reindex of your data - simply index each of the appropriate tags against the appropriate post ID.
A very old thread, but I landed here through the GitHub page for "update by query" to see if it was implemented in 2.0; unluckily it is not. Thanks to the plugin from Teka, if the update is small, that's very much doable from Sense, but our use case was updating millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. Although the infrastructure is a big overhead here, parallelizing the fetching/updating/inserting of documents through Spark helped us anyway. If anyone has discovered any other suggestion in the past year, I would love to hear about it.

How does solr work with data split into different services and therefore not synchronously available?

Take, for instance, an ecommerce store with catalog and price data in different web services. Now, we know that Solr does not allow partial updates to a document field (JIRA bug), so how do you index these two services?
I had three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - have price and catalog in separate indexes and join them in Solr. You can't join them in your client-side code without screwing up pagination and facet counts. I don't know if this is possible in pre-Solr 4.0.
have some sort of intermediate indexing service, which composes an entire document based on the results from both these services and sends this for indexing. however there are two problems with this approach:
3.1 You can still compose documents partially, and then when a document is complete, set a flag indicating that it is a complete document. However, to do this, each time a document has to be indexed it must first be checked for existence in the index, edited, and pushed back. So, big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not, it silently drops the item and hopes that when it appears in the other service, the first service will already be populated. This is OK, but it means that an item is not available in search until all fields are available (not always desirable - if you don't have a price, you can simply set it to out-of-stock and still have it available).
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now you have two different entry points (2 different web services) into indexing, and each has to check the other.
The usual way to solve this is close to your 3.2: write code that creates the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices while indexing. Whether you want to include items from the catalog that don't have prices available in the search depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1000 products and then the prices for all of those products at the same time, as sketched below.
There is no reason why you should drop an item from the index if it doesn't have price, unless you don't want items without prices in your index. It's up to you and your particular need what kind of information you need to have available before indexing the document.
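A minimal sketch of that intermediate indexing flow, with placeholder service URLs and field names (the batch-of-1000 price lookup is the speedup mentioned above, and items without a price are indexed as out-of-stock rather than dropped):

    import requests

    # Placeholder endpoints; the point is composing one complete Solr doc per product.
    CATALOG_URL = "http://catalog.internal/api/products"
    PRICE_URL = "http://prices.internal/api/prices"
    SOLR_UPDATE = "http://localhost:8983/solr/products/update?commit=true"

    def index_batch(offset, size=1000):
        products = requests.get(CATALOG_URL,
                                params={"offset": offset, "limit": size}).json()
        ids = [p["id"] for p in products]
        # One bulk price lookup for the whole batch, not one call per product.
        prices = requests.get(PRICE_URL, params={"ids": ",".join(ids)}).json()

        docs = []
        for p in products:
            doc = {"id": p["id"], "name": p["name"]}
            if p["id"] in prices:
                doc["price"] = prices[p["id"]]
                doc["in_stock"] = True
            else:
                doc["in_stock"] = False  # index without a price rather than drop it
            docs.append(doc)
        # Solr's update handler accepts a JSON array of documents.
        requests.post(SOLR_UPDATE, json=docs)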
As far as I remember 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it'll make your situation that much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
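A rough sketch of that display-time hydration, again with placeholder URLs (Solr returns only the stable fields, and a hypothetical live price service fills in the rest before rendering):

    import requests

    PRICE_URL = "http://prices.internal/api/prices"  # hypothetical live price service

    def search_and_display(q):
        # Ask Solr only for ids and stable display fields.
        solr = requests.get("http://localhost:8983/solr/products/select",
                            params={"q": q, "fl": "id,name", "wt": "json"}).json()
        hits = solr["response"]["docs"]
        ids = [h["id"] for h in hits]
        # Hydrate current prices from the live service so the display is never stale.
        prices = requests.get(PRICE_URL, params={"ids": ",".join(ids)}).json()
        return [dict(h, price=prices.get(h["id"])) for h in hits]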
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. It gets a lot of its speed from not having record-level locking and update.
