Google Cloud Datastore Cursor with google.cloud.ndb - pagination

I am working with Google Cloud Datastore using the latest google.cloud.ndb library.
I am trying to implement pagination using a Cursor with the following code, but it is not fetching the data correctly.
[1] To Fetch Data:
query_01 = MyModel.query()
f = query_01.fetch_page_async(limit=5)
This code works fine and fetches 5 entities from MyModel
I want to implement pagination that can be integrated with a web frontend.
[2] To Fetch Next Set of Data
from google.cloud.ndb._datastore_query import Cursor
nextpage_value = "2"
nextcursor = Cursor(cursor=nextpage_value.encode()) # Converts to bytes
query_01 = MyModel.query()
f = query_01.fetch_page_async(limit=5, start_cursor=nextcursor)
[3] To Fetch Previous Set of Data
previouspage_value = "1"
prevcursor = Cursor(cursor=previouspage_value.encode())
query_01 = MyModel.query()
f = query_01.fetch_page_async(limit=5, start_cursor=prevcursor)
The [2] & [3] sets of code do not fetch paginated data, but returns results same as results of codebase [1].
Please note I'm working with Python 3 and using the
latest "google.cloud.ndb" Client library to interact with Datastore
I have referred to the following link https://github.com/googleapis/python-ndb
I am new to Google Cloud, and appreciate all the help I can get.

Firstly, it seems to me like you are expecting the wrong kind of pagination. You are passing numeric page indexes, whereas Datastore cursors provide cursor-based pagination.
Instead of byte-encoded integer values (like 1 or 2), Datastore expects opaque tokens that look similar to this: 'CjsSNWoIb3Z5LXRlc3RyKQsSBFVzZXIYgICAgICAgAoMCxIIQ3ljbGVEYXkiCjIwMjAtMTAtMTYMGAAgAA=='
You can obtain such a cursor from the first call to the fetch_page() method, which returns a tuple (results, cursor, more), where results is a list of query results, cursor points just after the last result returned, and more indicates whether there are (likely) more results after that.
Secondly, you should be using fetch_page() rather than fetch_page_async(): the async variant hands you a future rather than the (results, cursor, more) tuple itself, so it does not directly give you the cursor you need for pagination. Internally, fetch_page() calls fetch_page_async() to get your query results.
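To make that concrete, here is a minimal sketch of cursor-based paging with google.cloud.ndb, assuming a MyModel kind like yours (double-check the Cursor constructor and the return type of urlsafe() against your library version). The opaque token is what your web frontend sends back to request the next page:
from google.cloud import ndb

client = ndb.Client()

class MyModel(ndb.Model):
    name = ndb.StringProperty()

def get_page(urlsafe_token=None, page_size=5):
    """Fetch one page and return (entities, token_for_next_page, more)."""
    with client.context():
        # Rebuild the cursor from the token the frontend sent back, if any
        start = ndb.Cursor(urlsafe=urlsafe_token) if urlsafe_token else None
        results, cursor, more = MyModel.query().fetch_page(
            page_size, start_cursor=start)
        token = cursor.urlsafe() if (cursor and more) else None
        if isinstance(token, bytes):  # urlsafe() may return bytes
            token = token.decode()
        return results, token, more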
Thirdly and lastly, I am not entirely sure whether the "previous page" use-case is doable using the datastore-provided pagination. It may be that you need to implement that yourself manually, by storing some of the cursors.
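For that "previous page" case, one manual workaround (my suggestion, not something the library provides) is to remember the start tokens of pages you have already served, for example in the user's web session, and replay them. A sketch building on the get_page() helper above:
# Hypothetical bookkeeping: page_tokens[i] holds the start token of page i + 1
page_tokens = [None]  # the first page starts with no cursor

def fetch_page_number(page_number):
    """1-based page lookup; pages must have been visited forward at least once."""
    results, next_token, more = get_page(page_tokens[page_number - 1])
    if more and len(page_tokens) == page_number:
        page_tokens.append(next_token)  # remember where the next page starts
    return results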
I hope that helps and good luck!

Related

Let Alexa ask the user a follow up question (NodeJS)

Background
I have an intent that fetches some data from an API. This data contains an array, and I am iterating over the first 10 entries of said array and reading the results back to the user. However, the array is almost always bigger than 10 entries. I am using Lambda for my backend and NodeJS as my language.
Note that I am just starting out on Alexa and this is my first skill.
What I want to achieve is the following:
When the user triggers the intent and the first 10 entries have been read to the user, Alexa should ask "Do you want to hear the next 10 entries?" or something similar. The user should be able to reply with either yes or no. Then it should read the next entries, i.e. access the array again.
I am struggling with the Alexa implementation of this dialog.
What I have tried so far: I've stumbled across this post here; however, I couldn't get it to work and I didn't find any other examples.
Any help or further pointers are appreciated.
That tutorial gets the concept right, but glosses over a few things.
1: Add the yes and no intents to your model. They're "built in" intents, but you have to add them to the model (and rebuild it).
2: Add your new intent handlers to the list in the .addRequestHandlers(...) function call near the bottom of the base skill template. This is often forgotten and is not mentioned in the tutorial.
3: Use const sessionAttributes = handlerInput.attributesManager.getSessionAttributes(); to get your stored session attributes object and assign it to a variable. Make changes to that object's properties, then save it with handlerInput.attributesManager.setSessionAttributes(sessionAttributes);
You can add any valid property name and the values can be a string, number, boolean, or object literal.
So assume your launch handler greets the customer and immediately reads the first 10 items, then asks if they'd like to hear 10 more. You might store sessionAttributes.num_heard = 10.
Both the YesIntent and LaunchIntent handlers should simply pass a num_heard value to a function that retrieves the next 10 items and feeds it back as a string for Alexa to speak.
You just increment sessionAttributes.num_heard by 10 each time that yes intent runs and then save it with handlerInput.attributesManager.setSessionAttributes(sessionAttributes).
What you need to do is something called "Paging".
Let's imagine that you have a stock of data, and each page contains 10 entries.
page 1: 1-10, page 2: 11-20, page 3: 21-30 and so on.
When you fetch your data from the DB you can set those limits; in SQL it's implemented with LIMIT start, count. But how do you get those values based on the page index?
Well, a simple calculation can help you:
let page = 1   // page index, managed by your client frontend
let chunk = 10 // entries per page
let _start = page * chunk - (chunk - 1) // first entry of the page (1-based)
let _end = _start + (chunk - 1)         // last entry of the page
Hope this helped you :)

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns 2000 records, you need to use .query_more() to get the next page of results.
From the simple-salesforce docs
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # initial API call

## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra metadata
    data.append(rec)             # add the record to the list

## check the 'done' attribute in the response to see if there are more records
## while 'done' == False (more records to fetch) get the next page of records
while results['done'] is False:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data
## loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the salesforce api name for that value
    print(rec['my_field'])
Like the other answer says, though, this can start to use up a lot of resources, but it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
LIMIT and OFFSET aren't really meant to be used like that: what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more or you can go query_all. query_all will loop and keep calling query_more until you exhaust the cursor - but this can easily eat your RAM.
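If memory is a concern, a small generator that follows the cursor one page at a time is a middle ground between query_all and manual calls. A minimal sketch, assuming an authenticated simple_salesforce connection named sf:
def soql_pages(sf, soql):
    """Yield one page of records at a time instead of loading them all."""
    results = sf.query(query=soql)
    while True:
        yield results['records']
        if results['done']:
            break
        # follow the server-side cursor to the next page of results
        results = sf.query_more(results['nextRecordsUrl'], True)

for page in soql_pages(sf, "SELECT my_field FROM my_model"):
    for rec in page:
        print(rec['my_field'])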
Alternatively, look into the bulk query stuff; there's some magic in the API, but I don't know if it fits your use case. It'd be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.

How can I update expiration of a document in Couchbase using Python 3?

We have a lot of docs in Couchbase with expiration = 0, which means that documents stay in Couchbase forever. I am aware that INSERT/UPDATE/DELETE isn't supported by N1QL.
We have 500,000,000 such docs and I would like to do this in parallel using chunks/bulks. How can I update the expiration field using Python 3?
I am trying this:
bucket.touch_multi(('000c4894abc23031eed1e8dda9e3b120', '000f311ea801638b5aba8c8405faea47'), ttl=10)
However I am getting an error like:
_NotFoundError_0xD (generated, catch NotFoundError): <Key=u'000c4894abc23031eed1e8dda9e3b120'
I just tried this:
from couchbase.cluster import Cluster
from couchbase.cluster import PasswordAuthenticator

cluster = Cluster('couchbase://localhost')
authenticator = PasswordAuthenticator('Administrator', 'password')
cluster.authenticate(authenticator)
cb = cluster.open_bucket('default')

keys = []
for i in range(10):
    keys.append("key_{}".format(i))

for key in keys:
    cb.upsert(key, {"some": "thing"})

print(cb.touch_multi(keys, ttl=5))
and I get no errors, just a dictionary of keys and OperationResults. And they do in fact expire soon thereafter. I'd guess some of your keys are not there.
However, maybe you'd really rather set a bucket expiry? That will make all the documents expire in that time, regardless of what the expiry on the individual documents is. In addition to the above answer that mentions that, check out this for more details.
You can use the Couchbase Python (or any) SDK Bucket.touch() method, described here: https://docs.couchbase.com/python-sdk/current/document-operations.html#modifying-expiraton
If you don't know the document keys, you can use a N1QL covered index to get the document keys asynchronously from your Python SDK, and then use the bucket touch API above to set the expiration.
CREATE INDEX ix1 ON bucket(META().id) WHERE META().expiration = 0;
SELECT RAW META().id
FROM bucket WHERE META().expiration = 0 AND META().id LIKE "a%";
You can issue different SELECTs for different ranges and run them in parallel.
As for the update operation, you need to write one yourself. As you get each key, call bucket.touch() (instead of an update), which only updates the document expiration without modifying the actual document. That saves a get/put of the whole document (https://docs.couchbase.com/python-sdk/current/core-operations.html#setting-document-expiration).
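Putting those pieces together, here is a minimal sketch in the same SDK 2.x style as the snippet above (the bucket name, key-range filter, chunk size and TTL are assumptions to adapt; SDK 3 has a different API):
from couchbase.cluster import Cluster, PasswordAuthenticator
from couchbase.n1ql import N1QLQuery

cluster = Cluster('couchbase://localhost')
cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
cb = cluster.open_bucket('default')  # assumed bucket name

# Covered query: only keys of docs with no expiration, restricted to one range
query = N1QLQuery(
    'SELECT RAW META().id FROM `default` '
    'WHERE META().expiration = 0 AND META().id LIKE "a%"')

chunk = []
for key in cb.n1ql_query(query):  # SELECT RAW yields plain key strings
    chunk.append(key)
    if len(chunk) == 1000:        # touch in bulks of 1000 keys
        cb.touch_multi(chunk, ttl=10)
        chunk = []
if chunk:
    cb.touch_multi(chunk, ttl=10)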

Passing sets of properties and nodes as a POST statement with KOA-NEO4J or BOLT

I am building a REST API which connects to a NEO4J instance. I am using the koa-neo4j library as the basis (https://github.com/assister-ai/koa-neo4j-starter-kit). I am a beginner at all these technologies, but thanks to some help from this forum I have the basic functionality working. For example, the below code allows me to create a new node with the label "metric" and set the name and dateAdded properties.
URL:
/metric?metricName=Test&dateAdded=2/21/2017
index.js
app.defineAPI({
method: 'POST',
route: '/api/v1/imm/metric',
cypherQueryFile: './src/api/v1/imm/metric/createMetric.cyp'
});
createMetric.cyp:
CREATE (n:metric {
name: $metricName,
dateAdded: $dateAdded
})
return ID(n) as id
However, I am struggling to see how I can approach more complicated examples. How can I handle situations where I don't know beforehand how many properties will be added when creating a new node, or where I want to create multiple nodes in a single POST? Ideally I would like to be able to pass something like JSON as part of the POST which would contain all of the nodes, labels and properties that I want to create. Is something like this possible? I tried using the below Cypher query and passing a JSON string in the POST body, but it didn't work.
UNWIND $props AS properties
CREATE (n:metric)
SET n = properties
RETURN n
Would I be better off switching to the Neo4j REST API instead of the Bolt protocol and the koa-neo4j framework? From my research I thought it was better to use Bolt, but I want to have a REST API as the middle layer between my front end and back end, so I am willing to change over if this will be easier in the longer term.
Thanks for the help!
Your Cypher syntax is bad in a couple of ways.
UNWIND only accepts a collection as its argument, not a string.
SET n = properties is only legal if properties is a map, not a string.
This query should work for creating a single node (assuming that $props is a map containing all the properties you want to store with the newly created node):
CREATE (n:metric $props)
RETURN n
If you want to create multiple nodes, then this query (essentially the same as yours) should work (but only if $prop_collection is a collection of maps):
UNWIND $prop_collection AS props
CREATE (n:metric)
SET n = props
RETURN n
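koa-neo4j sends these parameters over Bolt through the official JavaScript driver, but just to illustrate the shape the queries above expect (a real collection of maps, not a JSON string), here is a sketch using the official neo4j Python driver as a stand-in; the URI and credentials are assumptions:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # assumed credentials

# A list of dicts (collection of maps), not a JSON string
prop_collection = [
    {"name": "Test", "dateAdded": "2/21/2017"},
    {"name": "Other", "dateAdded": "3/1/2017"},
]

with driver.session() as session:
    result = session.run(
        "UNWIND $prop_collection AS props "
        "CREATE (n:metric) SET n = props RETURN id(n) AS id",
        prop_collection=prop_collection)
    print([record["id"] for record in result])

driver.close()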
I too have faced difficulties when trying to pass complex types as arguments to Neo4j. This has to do with type conversions between JS and Cypher over Bolt, and there is not much one can do except file an issue in the official neo4j JavaScript driver repo; koa-neo4j uses the official driver under the hood.
One way to go about such scenarios in koa-neo4j is using JavaScript to manipulate the arguments before sending to Cypher:
https://github.com/assister-ai/koa-neo4j#preprocess-lifecycle
It is also possible to further manipulate the results of a Cypher query using the postProcess lifecycle hook:
https://github.com/assister-ai/koa-neo4j#postprocess-lifecycle

Referencing external doc in CouchDB view

I am scraping a 90K record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses external docs using the emit() function, which is too late to let me compare them. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== "_") {
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = doc.value.keys();
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent; otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities:
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option better, but only if you don't plan to use the database as-is in production, i.e. you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.
