Cassandra Python driver doesn't page large queries

The documentation says that cassandra-driver does automatic paging when result sets are large enough (with default_fetch_size being 5000 rows) and will return a PagedResult.
I tested reading data from my local Cassandra, which contains 9999 rows, using a SimpleStatement with my own fetch size, but it returned the full ResultSet (9999 rows) instead of pages (an instance of PagedResult). I also tried changing Session.default_fetch_size, but that didn't work either.
Here's my code.
My first attempt: using a SimpleStatement to set the fetch size.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster()
session = cluster.connect(keyspace_name)
query = "SELECT * FROM user"
statement = SimpleStatement(query, fetch_size=10)
rows = list(session.execute(statement))
print(len(rows))
It prints 9999 (all rows), not the 10 rows I set with fetch_size.
My second attempt: changing the session's default fetch size via Session.default_fetch_size.
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect(keyspace_name)
session.default_fetch_size = 10
query = "SELECT * FROM user"
rows = list(session.execute(query))
print(len(rows))
It also prints 9999 rows instead of 10.
My goal is not to limit the rows returned by the query, as in SELECT * FROM user LIMIT 10. What I want is to fetch the rows page by page to avoid overloading memory.
So what actually happened?
Note: I am using cassandra-driver 3.25 with Python 3.7.
I'm sorry if my additional information still doesn't make this a good question; I have never asked one before. So... any suggestions are welcome :)

Your test is invalid because your code is faulty.
When you call list() on the result, you are in fact "materialising" all the result pages. Your code is not iterating over the rows; it is retrieving all of them.
The driver automatically fetches the next page in the background until there are no more pages to fetch. It may not seem like it, but each page only contains fetch_size rows.
Retrieving the next page happens transparently, so to you it looks as though the results are not paged at all, but that automatic behaviour of the driver is working as designed. Cheers!
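To see the paging for yourself, iterate the ResultSet lazily instead of materialising it, or drive the pages manually with paging_state. A minimal sketch, assuming the same keyspace and table as above (process() is a placeholder for your own row handling):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster()
session = cluster.connect(keyspace_name)
statement = SimpleStatement("SELECT * FROM user", fetch_size=10)

# Lazy iteration: the driver pulls in pages of fetch_size rows as needed.
for row in session.execute(statement):
    process(row)  # placeholder; only about fetch_size rows are held at a time

# Manual paging: handle exactly one page per execute() call.
result = session.execute(statement)
while True:
    for row in result.current_rows:  # just the current page
        process(row)
    if result.paging_state is None:  # no more pages left
        break
    result = session.execute(statement, paging_state=result.paging_state)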

Related

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce Python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the Salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K, but I receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns up to 2000 records, you need to use .query_more to get the next page of results.
From the simple-salesforce docs:
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this:
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # initial API call
## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra metadata
    data.append(rec)  # add the record to the list
## check the 'done' attribute in the response to see if there are more records
## while 'done' == False (more records to fetch), get the next page of records
while results['done'] == False:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data:
## loop through the records and get their field values
for rec in data:
    # the key will always be the same as the Salesforce API name for that field
    print(rec['my_field'])
As the other answer says, though, this can start to use up a lot of resources, but it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
LIMIT and OFFSET aren't really meant to be used like that: what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
The docs for "Queries" at https://pypi.org/project/simple-salesforce/ say that you can either call query and then query_more, or you can go with query_all. query_all will loop and keep calling query_more until you exhaust the cursor, but this can easily eat your RAM.
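If RAM is the concern, a small generator keeps only one page in memory at a time. A minimal sketch built from the same query/query_more calls shown above:

def iter_records(sf, soql):
    """Yield records one page at a time; sf is a simple_salesforce connection."""
    results = sf.query(soql)
    while True:
        for rec in results['records']:
            rec.pop('attributes', None)  # drop Salesforce metadata
            yield rec
        if results['done']:
            break
        results = sf.query_more(results['nextRecordsUrl'], True)

for rec in iter_records(sf, "SELECT my_field FROM my_model"):
    print(rec['my_field'])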
Alternatively, look into the bulk query stuff; there's some magic in the API, but I don't know if it fits your use case. It would involve asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.

NodeJS - azure-storage-node- , how to retrieve addition of two columns, and apply filtering condition

Sorry for being a newbie with Node.js and table queries. My question is:
How can I create a query using the Node.js package "azure-storage-node" that selects the sum of two columns, 'start' and 'period', and returns the whole row if that sum is greater than a threshold? My attempts, which didn't work, look something like this:
var query = new azure.TableQuery();
total = query.select(['start']) + query.select(['period']);
query.where('total > ?' , 50000);
or maybe something like this:
var query = new azure.TableQuery()
.where('start + period gt 50000');
but it throws an error on the '+'.
Thanks
What you're trying to accomplish is not possible with Azure Tables, at least as of today: Azure Tables has limited querying support, and support for computed columns (if I may say so) is not there.
There are two possible solutions:
1. Have an attribute called total in your entities that contains the value of start + period. You calculate this value when you insert or update the entity and store it at that time (see the sketch after this list).
2. Do the filtering on the client side. For this you will need to download all related entities and then apply the filter to the fetched data.
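For illustration, a minimal sketch of the first option, written with the legacy Python azure-storage SDK since the idea carries over directly to azure-storage-node; the account, table, and entity values are hypothetical:

from azure.storage.table import TableService

table_service = TableService(account_name='myaccount', account_key='mykey')

entity = {
    'PartitionKey': 'jobs',
    'RowKey': 'job-001',
    'start': 49000,
    'period': 2000,
    'total': 49000 + 2000,  # computed at write time, stored as a real column
}
table_service.insert_entity('mytable', entity)

# The server-side filter is now a plain comparison on a stored column:
rows = table_service.query_entities('mytable', filter="total gt 50000")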

Paging through Cassandra using QueryBuilder

The DataStax documentation says that to page through all data, the following CQL query is useful:
SELECT * FROM test WHERE token(k) > token(42);
Is it possible to build this query using the QueryBuilder? It provides a token method, but that seems to work only on column names, not on values.
Ideally, the value (in the example: 42) is of type Object, just like in the eq/gte/lte functions.
Try using automatic paging with the .setFetchSize method. It uses token under the hood:
Automatic paging was introduced in Cassandra 2.0. It allows the developer to iterate over an entire ResultSet without having to care about its size: extra rows are fetched as the client code iterates over the results, while the old ones are dropped. The number of rows to retrieve at a time can be set at query time. In the Java driver this looks like:
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100);
ResultSet rs = session.execute(stmt);
Source: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
QueryBuilder.fcall("token", value) can also solve the problem!

node.js mongodb cursor looping on client request

I know how to query a collection. But I have a collection with 100,000 records and I want to show only 100 items per page. The user can then select the next 100 records, and so on...
Since this request comes from the user, how do I keep the cursor open in node.js to loop over the next 100 items when the client requests them?
What is the standard practice?
Thanks!
The standard practice for what you are referring to is something like pagination.
You don't need to keep the cursor open all the time. All you need to make sure is that you continue from the same place you left off.
The client would retain the number of records that has already been displayed and use that number inside the skip() function of the cursor.
For example:
1. Client is provided with 10 records; record_count = 10.
2. Client requests more records and includes record_count in the request.
3. Server uses the supplied record_count as the skip value in another query.
4. Server returns another 10 records to the client.
5. Client updates record_count to 20.
6. Rinse, repeat...
Keep in mind that you'd want your results to be sorted somehow so that your query will always return different results (the next 10 records).
I'm not too familiar with the node drivers for mongo, but in the mongo shell, you would execute the query as follows:
db.collection.find().sort( { "time": 1 } ).skip( record_count ).limit( 10 )
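For what it's worth, the same pattern in Python with pymongo; the database and collection names here are hypothetical:

from pymongo import MongoClient

client = MongoClient()
collection = client.mydb.mycollection  # hypothetical database/collection

def get_page(record_count, page_size=10):
    """Return the next page, skipping the records the client has already seen."""
    cursor = (collection.find()
              .sort("time", 1)      # stable order so pages don't shift
              .skip(record_count)   # resume where the client left off
              .limit(page_size))
    return list(cursor)

first_page = get_page(0)     # records 1-10
second_page = get_page(10)   # records 11-20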

Why is GetPaged() Executing two database calls?

I'm a bit new to SubSonic (i.e. evaluating 3.0.0.3) and have come across a strange behavior in GetPaged(int pageIndex, int pageSize). When I execute the method it makes two SQL calls. Any ideas why?
Details
Let's say I have a "Cultures" table with 200 rows. In my code I do something like...
var sonicCollection = from c in RTE.Culture.GetPaged(1, 25)
select c;
Now, I would expect this to execute a single query returning the first 25 entries in my Cultures table. Instead, when I watch SQL Profiler, I see two queries go by.
First this:
SELECT [dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
FROM [dbo].[Cultures]
Then this:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (
ORDER BY cultureID ASC) AS Row,
[dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
FROM [dbo].[Cultures]
)
AS PagedResults
WHERE Row >= 1 AND Row <= 25
I expect the 2nd query to roll by, as it is the one returning the 25 rows I politely requested of SubSonic. The first query, however, appears to return all 200 rows (at least according to SQL Profiler).
Any ideas what's going on?
It's a bug in the code. The code actually queries every record and then iterates over each one to compute the count. I've created an issue in the GitHub repo here:
https://github.com/subsonic/SubSonic-3.0/issues/259
You can download the source, fix the issue, and recompile pretty easily. I've done this and it's fixed my issue.
You just want to use RTE.Culture.GetPaged() - it runs the paged query for you.
