DataStax driver limit option - Cassandra

I construct a SELECT query using the DataStax Java driver and set the limit with the LIMIT option. But I see another property that can be set too:
setFetchSize(int size)
DEFAULT_FETCH_SIZE is 5000, according to the docs:
http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/QueryOptions.html#DEFAULT_FETCH_SIZE
Does this mean that if I have around 10000 columns in a row and I run a query with a limit of 3, it will always fetch the default number of rows (5000) and then take the last 3 rows from that?
I thought a query used like this fetches just the 3 values by default. Can someone clarify?

LIMIT sets the maximum number of rows the engine retrieves, while setFetchSize sets the maximum number of rows that are returned to the client in one round trip.

In Cassandra, LIMIT does not work the same as in MySQL or any other RDBMS.
By default, when you execute a SELECT query in cqlsh it displays at most 10000 rows, after which a message appears saying the limit was reached. So let's say you have 50000 records in the database: a plain SELECT will show only 10000 of them. If you instead execute select * from table LIMIT 50000, all 50000 rows will be displayed.

I think the difference between the fetch size and the limit is the same as in JDBC with other databases like MySQL.
LIMIT restricts your query results to those that fall within the range specified.
The fetch size is the number of rows physically retrieved from the database at one time by the driver as you scroll through a query ResultSet with next().
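To make the distinction concrete, here is a minimal sketch with the DataStax Java driver 2.x (the keyspace and table names are hypothetical): the LIMIT in the CQL caps the total number of rows the query can return, while setFetchSize only controls how many of those rows travel per round trip.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class LimitVsFetchSize {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // hypothetical keyspace
        // LIMIT 3 caps the result at 3 rows total, so the default
        // fetch size of 5000 never comes into play for this query.
        Statement stmt = new SimpleStatement("SELECT * FROM my_table LIMIT 3");
        stmt.setFetchSize(2); // rows per round trip, not a cap on the result
        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            System.out.println(row);
        }
        cluster.close();
    }
}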

Related

Amazon Keyspaces ALLOW FILTERING without all keys in Node JS

I have a table with two keys KeyA and KeyB on AWS Keyspaces.
I open a CQL editor on AWS Console and run the query:
SELECT fields FROM table WHERE KeyA = 'value' AND KeyB = 'value' LIMIT 10 ALLOW FILTERING
It returns 10 rows. I copy the query into my Node.js project and run the same query; it also returns 10 rows.
Now I want to filter by only ONE key, I open a CQL editor on AWS Console and run the query:
SELECT fields FROM table WHERE KeyA = 'value' LIMIT 10 ALLOW FILTERING
and it returns 10 rows; now I copy and paste the same query into my Node project, but this time it returns 0 rows.
Am I missing some configuration in my Node.js setup? Is it a library issue? An AWS issue?
I'm using Node v14.16.1, cassandra-driver v4.6.1
When using ALLOW FILTERING you should also implement paging, even though you are using a limit of 10.
Amazon Keyspaces paginates results based on the number of rows that it reads to process a request, not the number of rows returned in the result set. As a result, some pages might contain fewer rows than you specify in PAGE SIZE for filtered queries, and some pages may not contain any rows at all. In addition, Amazon Keyspaces paginates results automatically after reading 1 MB of data to provide customers with consistent, single-digit millisecond read performance.
https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-queries.html#paginating-results
client.eachRow(query, parameters, { prepare: true, autoPage: true }, function (n, row) {
  // invoked once per row, across all pages
}, function (err) {
  // invoked when all pages have been retrieved, or on error
});
In the end, a full table scan is not a typical Cassandra access pattern, and it's recommended that you always access data based on a fully qualified partition key.

Remote pagination and last_page: filter during, or after, database query?

I would like to use Tabulator's remote pagination to load records from my database table, one page at a time. I expect that I should be able to use the page and size parameters sent by Tabulator to my remote back-end to determine which records to select from the database.
For example, with page=2 and size=10, I can use MySQL's LIMIT 10,10 (skip 10 rows, take 10) to select the records to be shown on page 2.
However, doing this precludes me from using the count of all records to determine the number of pages in the table. Doing a count on the returned records will only yield 10 records, even if there are a total of 500 records (for example), so only one pagination button will be shown (instead of the expected 50 buttons).
So in order to do remote pagination "correctly" in Tabulator, it seems I must run a query to count all records in my database (with no limits), use that count to determine last_page, and then do something like PHP's array_slice to extract the nth page's worth of records to return as the dataset. Or I can do 2 database queries: count all records to determine the number of pages, and then do a LIMIT [offset],[count] query.
Is this correct?
Tabulator needs to know the last page number in order to lay out the pagination buttons in the table footer so that users can select the page they want to view from the list of pages.
You simply need to run a query that counts the total number of records and divide the count by the page size passed in the request. A count query can run quite efficiently, returning only the count and no data.
You can then run a standard query with a LIMIT set to retrieve the records for that page.
If you want to optimize things further, you could stick the count value in a cache so that you don't need to generate it on each request.
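A minimal JDBC sketch of that two-query approach, assuming a MySQL table named records and connection details that are purely illustrative:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PageEndpoint {
    public static void main(String[] args) throws Exception {
        int page = 2, size = 10; // from Tabulator's request parameters
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass")) {
            // Query 1: count everything to derive last_page.
            ResultSet count = conn.createStatement()
                    .executeQuery("SELECT COUNT(*) FROM records");
            count.next();
            int lastPage = (int) Math.ceil(count.getInt(1) / (double) size);
            // Query 2: fetch only the rows for the requested page.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM records LIMIT ?, ?");
            ps.setInt(1, (page - 1) * size); // offset
            ps.setInt(2, size);              // row count
            ResultSet rows = ps.executeQuery();
            // Respond with { "last_page": lastPage, "data": [rows...] }.
        }
    }
}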

Cassandra, what is the efficient way to run subquery

I have a huge table of employees (around 20 to 30 million rows), and I have around 50,000 employee IDs to select from this table.
What is the fastest way to query? Is it a query like this:
select * from employee_table where employeeid in (1,400,325 ....50000)
The IDs are not necessarily sequential; they are in random order.
When the IN clause is used in a query, the load on the coordinator node increases, because for every value (in your case the employee ID) it needs to hit the required nodes (based on the consistency level of your query) and collate the results before returning them to the client. Hence, if your IN clause has only a few values, using IN is OK.
But in your case, since you need to fetch ~50K employee IDs, I would suggest you fire select * from employee_table where employeeid = <your_employee_id> in parallel for those 50K IDs.
I would also suggest that when you do this you monitor your Cassandra cluster to ensure these parallel queries are not causing high load. (This last statement is based on my personal experience :))
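A rough sketch of that parallel approach with the DataStax Java driver (the session setup is omitted and the table/column names match the question). executeAsync sends the queries without blocking, and the futures are drained afterwards:
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class ParallelLookup {
    static List<Row> fetchEmployees(Session session, List<Integer> employeeIds) {
        PreparedStatement ps = session.prepare(
                "SELECT * FROM employee_table WHERE employeeid = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (Integer id : employeeIds) {
            // In production, throttle this loop (e.g. with a semaphore)
            // so 50K in-flight queries do not overload the cluster.
            futures.add(session.executeAsync(ps.bind(id)));
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSetFuture f : futures) {
            rows.add(f.getUninterruptibly().one());
        }
        return rows;
    }
}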

Why does aggregating a paginated query take less time than fetching the entire table

I have a table in my database indexed over three columns: PropertyId, ConceptId and Sequence. This particular table has about 90,000 rows in it.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
This seems counterintuitive to me because the pagination requires additional operations over and beyond the simpler query, and we're adding the extra latency of sequential network calls because I haven't parallelized this query at all. And I know it's not a caching issue, because running these queries one after the other does not affect the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries
Pagination queries sometimes work very fast, if you have the right index...
For example, with the query below
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is 20,000 (offset plus fetch, when x = 1). Below is an excerpt from an actual execution plan which shows the same behavior:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
but with the plain SELECT * query you are reading all the rows.
After a prolonged search into what's going on here, I discovered that the reason behind this difference in performance (> 2 minutes) was that the database is hosted on Azure. Since Azure partitions any tables you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly because the query pulls data in from all the partitions before ordering it, which could result in multiple queries across multiple partitions of the same table. By paginating the query over indexed properties, I was looking at a particular partition and querying the portion of the table stored there, which is why it performed significantly better than the un-paginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran anemically compared to the first paginated query because Narrative is not a primary key and therefore is not used by Azure to build a partition key. So ordering on Narrative required the same operation as the first query plus additional operations on top of that, because the entire table had to be retrieved beforehand.

SELECT COUNT(*) returns 0 but I have 800 rows

I use Cassandra 2.0 and cqlsh:
cqlsh:node1> SELECT count(*) FROM users;
count
-------
0
(1 rows)
but when I do:
cqlsh:node1> select id from users LIMIT 10;
id
--------------------
8acecf2
f638215
8b33e24
470a2cb
0f9a5c2
4c49298
2e28a56
b42ce98
19b68c5
2a207f2
(10 rows)
My users table has 5 text columns with more than 100 KB of base64 data.
When I do a SELECT * FROM users; cqlsh takes 3 seconds before showing the data.
Does anyone have a solution?
Is it possible to do a COUNT(column)?
PS: what do you need? Logs? From where?
What needs to be done when counting is to specify a limit:
if you are sure that the number of "rows" is less than 5,000,000 (5 million), then in CQL 3.0 you can do:
select count(*) from mycolumnfamilyname limit 5000000;
You need to re-think why your application needs counts. If you have millions/billions of rows, counting will be time- and resource-consuming.
If your application is OK with an "approximate" count of users, then you may use "nodetool cfstats". It will get you an estimate of the number of keys (users) that is generally accurate.
If you need an "exact" count, there are different techniques to do that.
For example, you can maintain a special row and keep adding columns to it whenever a new row is inserted. Counting the columns of that row then gives you the number of rows.
In order to count a specific column, you have to have the column in the WHERE clause.
For example, assuming the 'id' column is your primary key, you could do this:
SELECT COUNT(id) FROM users WHERE id > '';
If the column is not the primary key, then you have to allow filtering as in:
SELECT COUNT(name) FROM users WHERE name > '' ALLOW FILTERING;
As mentioned by others, this is slow, and the LIMIT keyword is required if you expect a large number of users. The slowness comes from the fact that Cassandra reads all the rows one by one and, from what I understand, it reads entire rows (i.e. your really big columns will be loaded each time), because there is no way to read just one column when filtering. Cassandra 3.x may have improved on that by now.
If you really need that number often, you could use a lock and increment a field representing the number of users. You could also have a process that adjusts the number once in a while if it somehow gets out of sync.
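As a sketch of that idea using a Cassandra counter column instead of a lock (the user_count table and its key are hypothetical, session is an open com.datastax.driver.core.Session, and counter updates are not transactional with your inserts, so the periodic adjustment process mentioned above still applies):
// Executed once, e.g. in cqlsh:
//   CREATE TABLE user_count (name text PRIMARY KEY, num counter);

// On every user insert, bump the counter:
session.execute("UPDATE user_count SET num = num + 1 WHERE name = 'users'");

// Reading the count is then a single-row lookup, not a full scan:
Row row = session.execute("SELECT num FROM user_count WHERE name = 'users'").one();
long totalUsers = row.getLong("num");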
