Golang / CosmosDB Pagination - Azure

I'm trying to implement pagination while selecting records from CosmosDB using the cosmosapi package.
The Azure documentation states that continuation tokens never expire, and I'm trying to understand the semantics of that.
In "How does Cosmos DB Continuation Token work?" there is agreement that
Documents created after serving the first page are observable on
subsequent pages
I tried to validate that point by running some experiments from a Go application, and something is not quite right. As a very high-level example, if we insert three records into CosmosDB:
Insert record #1
Insert record #2
Insert record #3
Then if we try to select from the table (query = SELECT * FROM c ORDER BY c.dateField DESC) using these options:
opts := cosmosapi.QueryDocumentsOptions{
    IsQuery:           true,
    ContentType:       cosmosapi.QUERY_CONTENT_TYPE,
    ConsistencyLevel:  cosmosapi.ConsistencyLevelStrong,
    Continuation:      "",
    PartitionKeyValue: partitionKeyValue,
    MaxItemCount:      2,
}
it returns:
record #1
record #2
continuation token = "cont-token-1"
Now when selecting again with the same options, but with the continuation token returned from the first page:
opts := cosmosapi.QueryDocumentsOptions{
    IsQuery:           true,
    ContentType:       cosmosapi.QUERY_CONTENT_TYPE,
    ConsistencyLevel:  cosmosapi.ConsistencyLevelStrong,
    Continuation:      "cont-token-1",
    PartitionKeyValue: partitionKeyValue,
    MaxItemCount:      2,
}
It returns
record #3
Which is fairly logical.
Now when I insert record #4 (which lands right after record #3) and fetch again using "cont-token-1", record #4 does not show up. It only shows up when I regenerate the continuation tokens by selecting again with an empty opts.Continuation field.
If I try to select using an empty continuation token, then it fetches record #1 and record #2, and leads to a new token that fetches record #3 and record #4.
Is this the expected behavior? Or am I missing anything?
From my understanding, it should show up. The continuation token is like a bookmark, and the new record should be visible even when reusing the same continuation token.

A continuation token can only be used with the exact same query, and it will return the exact same answer every time, regardless of how you change the underlying data. You need to get a new token if the underlying data changes in a way that would have been included in the earlier pages.
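To make that concrete, here is a minimal Go sketch of the pattern under discussion. queryPage is a hypothetical helper standing in for whatever cosmosapi call you use (its options are the ones from the question); the point is that a token only walks the pages of the query it came from, so picking up record #4 means starting over with an empty Continuation:

package main

import "fmt"

// Record stands in for whatever document type the query returns.
type Record map[string]interface{}

// queryPage is a hypothetical wrapper around the cosmosapi query call from the
// question: it runs the same SELECT ... ORDER BY query with the given
// continuation token and returns one page plus the token for the next page
// ("" when the result set is exhausted).
func queryPage(continuation string) (page []Record, nextToken string, err error) {
	// opts := cosmosapi.QueryDocumentsOptions{
	//     IsQuery:           true,
	//     ContentType:       cosmosapi.QUERY_CONTENT_TYPE,
	//     ConsistencyLevel:  cosmosapi.ConsistencyLevelStrong,
	//     Continuation:      continuation,
	//     PartitionKeyValue: partitionKeyValue,
	//     MaxItemCount:      2,
	// }
	// ... execute the query and return the documents plus the response continuation ...
	return nil, "", nil
}

// fetchAll starts from an empty continuation token and drains every page.
// Reusing an old token after new documents are inserted will not surface them;
// to see them, run fetchAll again so the query starts from a fresh result set.
func fetchAll() ([]Record, error) {
	var all []Record
	token := "" // empty token = brand new query
	for {
		page, next, err := queryPage(token)
		if err != nil {
			return nil, err
		}
		all = append(all, page...)
		if next == "" {
			return all, nil // no continuation returned: last page reached
		}
		token = next
	}
}

func main() {
	records, err := fetchAll()
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched", len(records), "records")
}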

Related

Azure CosmosDB. Continuation token length in stored procedure

I have a REST API which is intended to query the documents stored in CosmosDB with an OData-like syntax. I'm returning documents in chunks, i.e. I set $top=10 and get 10 documents back along with a continuation token. This continuation token is returned from a stored procedure:
var accepted = collection.queryDocuments(collection.getSelfLink(),
    sql, requestOptions,
    function (err, documents, responseOptions) {
        // ...
        // put responseOptions.continuation into response body
    });
The problem is that if the continuation token is long (e.g. 6k characters) and I pass it in the URL, the URL cannot be handled and I can't reach my endpoint (I get a 404). As far as I understand, the more complex the initial SQL query is, the longer the continuation token, and its length cannot be configured.
Is there a workaround for that?
I don't think there is an out-of-the-box solution for this issue. What you can try is to implement a tiny-URL kind of framework at your service layer.
https://www.geeksforgeeks.org/how-to-design-a-tiny-url-or-url-shortener/
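One way to implement that idea: store the full continuation token server-side and hand the client a short opaque key to put in the URL instead. A minimal sketch in Go, assuming an in-memory store (a real service would use a shared cache such as Redis with a TTL):

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// tokenStore maps short keys to full Cosmos DB continuation tokens. In a real
// service this would be a shared cache (e.g. Redis) with an expiry instead of
// an in-memory map.
type tokenStore struct {
	mu sync.Mutex
	m  map[string]string
}

func newTokenStore() *tokenStore {
	return &tokenStore{m: make(map[string]string)}
}

// Put saves the long continuation token and returns a short opaque key that
// is safe to embed in a URL.
func (s *tokenStore) Put(continuation string) (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	key := hex.EncodeToString(b)

	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = continuation
	return key, nil
}

// Get resolves a short key back to the original continuation token.
func (s *tokenStore) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	tok, ok := s.m[key]
	return tok, ok
}

func main() {
	store := newTokenStore()
	key, err := store.Put("<the 6k-character continuation token>")
	if err != nil {
		panic(err)
	}
	fmt.Println("short key for the URL:", key)
	if tok, ok := store.Get(key); ok {
		fmt.Println("resolved token length:", len(tok))
	}
}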

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns up to 2,000 records, you need to use .query_more() to get the next page of results.
From the simple-salesforce docs
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # initial API call

## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra metadata
    data.append(rec)             # add the record to the list

## check the 'done' attribute in the response to see if there are more records;
## while 'done' is False (more records to fetch), get the next page of records
while not results['done']:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data
## loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the salesforce api name for that value
    print(rec['my_field'])
Like the other answer says, though, this can start to use up a lot of resources. But it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
LIMIT and OFFSET aren't really meant to be used like that: what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more or you can go query_all. query_all will loop and keep calling query_more until you exhaust the cursor - but this can easily eat your RAM.
Alternatively, look into the bulk query stuff; there's some magic in the API but I don't know if it fits your use case. These would be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.
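For completeness, the same server-side cursor can also be consumed without simple_salesforce by following nextRecordsUrl on the raw REST API. A minimal Go sketch; the instance URL, access token, and API version are placeholders you would substitute for your org:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// queryResponse mirrors the fields Salesforce returns for a SOQL query
// (the same "done", "records" and "nextRecordsUrl" fields used above).
type queryResponse struct {
	Done           bool                     `json:"done"`
	NextRecordsURL string                   `json:"nextRecordsUrl"`
	Records        []map[string]interface{} `json:"records"`
}

// queryAll issues the initial query and then follows nextRecordsUrl until
// done is true, collecting every record. instanceURL, accessToken and the
// API version are placeholders for your org's values.
func queryAll(instanceURL, accessToken, soql string) ([]map[string]interface{}, error) {
	next := "/services/data/v26.0/query?q=" + url.QueryEscape(soql)
	var all []map[string]interface{}
	for next != "" {
		req, err := http.NewRequest("GET", instanceURL+next, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+accessToken)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		var page queryResponse
		if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
			resp.Body.Close()
			return nil, err
		}
		resp.Body.Close()
		all = append(all, page.Records...)
		if page.Done {
			break
		}
		next = page.NextRecordsURL // server-side cursor to the next batch of records
	}
	return all, nil
}

func main() {
	records, err := queryAll("https://example.my.salesforce.com", "<access token>",
		"SELECT my_field FROM my_model")
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched", len(records), "records")
}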

Ambiguous result shows in count(1) in documentdb query Explorer

I have a collection with 90,000 records, and data is continually added to it. But whenever I query 'select count(c.id) from c', it shows an inconsistent result, sometimes 20,190 or 19,916 or 22,897. It won't give the exact count.
[
    {
        "$1": 21687
    }
]
It's most likely that the query cannot finish in one shot; 20,000 is roughly the batch size we experience. To confirm this, look for a continuation token in the response headers. If it's there, you'll need to resubmit with that token until the response comes back without a continuation token, and sum all the partial counts on the caller side.
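A minimal Go sketch of that resubmit-and-sum loop. countPage is a hypothetical helper that runs the count query with a given continuation token and reads the token returned in the response headers:

package main

import "fmt"

// countPage is a hypothetical helper: it executes SELECT COUNT(c.id) FROM c
// with the given continuation token and returns the partial count together
// with the continuation token from the response headers ("" when exhausted).
func countPage(continuation string) (partial int, nextToken string, err error) {
	// ... run the query against the collection and read the continuation header ...
	return 0, "", nil
}

// totalCount keeps resubmitting the count query until no continuation token
// is returned, summing the partial counts on the caller side.
func totalCount() (int, error) {
	total := 0
	token := ""
	for {
		partial, next, err := countPage(token)
		if err != nil {
			return 0, err
		}
		total += partial
		if next == "" {
			return total, nil
		}
		token = next
	}
}

func main() {
	total, err := totalCount()
	if err != nil {
		panic(err)
	}
	fmt.Println("total count:", total)
}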

Using Lexical Filtering of Azure Table on range of RowKey values

Problem: no results are returned.
I'm using the following code to get a range of objects from a partition with only 100 or so rows:
var rangeQuery = new TableQuery<StorageEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, from)
        ),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, to)
    )
);
var results = table.ExecuteQuery(rangeQuery);
foreach (StorageEntity entity in results)
{
    storageEntities.Add(entity);
}
NOTE: it doesn't seem to matter how I combine the 3 terms, no results are returned. An example of one that I am expecting is this (partitionKey, rowKey):
"10005678", "PL7NR_201503170900"
The ranged filter code generates this expression:
((PartitionKey eq '10005678') and (RowKey ge 'PL7NR_201503150000'))
and (RowKey lt 'PL7NR_201504082359')
But I have also tried this (which is my preferred approach for performance reasons, i.e. partition scan):
(PartitionKey eq '10005678') and ((RowKey ge 'PL7NR_201503150000') and
(RowKey lt 'PL7NR_201504082359'))
My understanding is that the Table storage performs a lexical search and that these row keys should therefore encompass a range that includes a row with the following keys:
"10005678", "PL7NR_201503170900"
Is there something fundamentally wrong with my understanding?
Thanks for looking at this.
UPDATE: question updated thanks to Gaurav's answer. The code above implicitly handles continuation tokens (i.e. the foreach loop) and there are only 100 or so items in the partition, so I do not see the continuation tokens as being an issue.
I have tried removing the underscores ('_') from the key and even tried moving the prefix from the rowKey and adding it as a suffix to the partitionKey.
NOTE: This is all running on my local machine using storage emulation.
From Query Timeout and Pagination:
A query against the Table service may return a maximum of 1,000 items
at one time and may execute for a maximum of five seconds. If the
result set contains more than 1,000 items, if the query did not
complete within five seconds, or if the query crosses the partition
boundary, the response includes headers which provide the developer
with continuation tokens to use in order to resume the query at the
next item in the result set. Continuation token headers may be
returned for a Query Tables operation or a Query Entities operation.
Please check if you're getting back Continuation Token in response.
Now coming to your filter expressions:
((PartitionKey eq '10005678') and (RowKey ge 'PL7NR_201503150000'))
and (RowKey lt 'PL7NR_201504082359')
This one is definitely doing a Full Table Scan because (RowKey lt 'PL7NR_201504082359') is a clause in itself. To execute this particular piece, it basically starts from the top of the table and finds entities where RowKey < 'PL7NR_201504082359' without taking PartitionKey into consideration.
(PartitionKey eq '10005678') and ((RowKey ge 'PL7NR_201503150000') and
(RowKey lt 'PL7NR_201504082359'))
This one is doing a Partition Scan, and you may not get results back if you have too much data in the specified partition or if the query takes more than five seconds to execute, as mentioned above.
So, check if your query is returning any continuation tokens and make use of them to get the next set of entities if no entities are returned.
A few resources that you may find useful:
How to get most out of Windows Azure Tables: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
Azure Storage Table Design Guide: Designing Scalable and Performant Tables: http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

OrientDB ORecordHookAbstract is firing onRecordAfterRead twice if no index is present

I'm using the hook functionality in OrientDB to implement encryption of some document fields "transparently" to the caller.
Basically the process is:
1 : when the onRecordBeforeCreate or onRecordBeforeUpdate event fires, we apply encryption to some data fields and change the document before it is created or updated
byte[] data = doc.field("data");
byte[] encrypted = encrypt(data);
doc.field("data", encrypted);
2 : when onRecordAfterRead fires we get the encrypted data from the document fields, decrypt them, and change the document fields again with the decrypted data.
byte[] encrypted = doc.field("data");
byte[] decrypted = decrypt(encrypted);
doc.field("data", decrypted);
The problem is that the onRecordAfterRead event fires twice. The first time the data decrypts correctly (because it is encrypted), but the second time the decryption fails because we have already decrypted it, and so the document "load" fails.
It happens if the query that I execute to load the document uses some field of the document in the filter (where clause).
So for example, the following query does not trigger the problem:
select count from Data;
but the following query triggers the problem:
select from Data where status ="processed";
This is because I don't have an index on the status field. If I add an index then it fires the event only once. So this is related to whether or not indexes are used.
Should the events be fired if OrientDB is "scanning" the documents when executing a query? Shouldn't it only fire when the matching documents are in fact loaded? Is there a way around this?
onRecordAfterRead should be called once; if it is called multiple times it should be a bug. You can report it on https://github.com/orientechnologies/orientdb/issues with a test case that reproduces the problem.
