Paging Large Queries: total number - cassandra

Regarding Cassandra and paging. I might guess the answer, but just to be sure:
I know how to ask for a page size, but is it possible to get the eventual total number of rows for a query? For example, you run select * from tableName with a page size of 10, but if you did not use paging you would get 100 rows. Is it possible to get the number 100 when using a page size of 10?
Note: Just if it is of any use, I am using gocql.

The plain answer is no. Getting the number 100 in your example means knowing the result of "SELECT count(*) FROM table", which is a performance-killer query.
The best you can get is an estimate of the number of partitions per node, using nodetool or by calling the JMX beans directly. But it won't give you an estimate of the number of CQL rows (because a single partition may contain N rows if your table has clustering columns).
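For example, on Cassandra 3.x the command nodetool tablestats ks_name.table_name prints a per-table "Number of partitions (estimate)" line (on older versions the equivalent command is nodetool cfstats); the keyspace and table names here are placeholders.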

Related

What is the best way to get row count from a table in Cassandra?

Is there a best way to get the total number of rows from a Cassandra table?
Regards,
Mani
The DataStax Bulk Loader (DSBulk) is probably the easiest to install and run.
The Apache Spark Cassandra connector could also be handy: once the data is loaded with sc.cassandraTable(), you can call count on it.
Avoid counting in your own code; it does not scale, as it performs a full scan of the cluster and the response time will be in seconds.
Avoid counting with CQL's select count(*), as you will likely hit the timeout quickly.
You can simply use COUNT(*) to get the number of rows from the table.
For example:
Syntax:
SELECT COUNT(*)
FROM tablename;
and the expected output looks like this:
count
-------
4
(1 rows)
Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (no filter or WHERE clause), it will retrieve all the partitions in the table, which you can count, for example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended since it requires a full table scan that queries every single node, which is very expensive and will affect the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions) but in practice it would likely timeout and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database, but the easiest way is to use the DataStax Bulk Loader (DSBulk). It is open-source, so it's free to use. It was originally designed for bulk-loading data to and exporting data from a Cassandra cluster, as a scalable alternative to the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but has optimisations that break up the table scan into small token range queries, so it doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to set up. First, you need to download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data
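If you'd rather reproduce DSBulk's range-splitting idea in your own application code, here is a minimal gocql sketch of it; the table name my_table, the partition key pk, and the choice of split count are placeholders, and this is not DSBulk's actual implementation:

// assumes: import "math" and "github.com/gocql/gocql"
// Split the Murmur3 token ring into small sub-ranges, count each sub-range
// with a cheap bounded query, and sum the results.
func countByTokenRanges(session *gocql.Session, splits int) (int64, error) {
	// Width of each sub-range over the full int64 token ring.
	step := uint64(math.MaxUint64)/uint64(splits) + 1
	start := int64(math.MinInt64)
	var total int64
	for i := 0; i < splits; i++ {
		end := int64(uint64(start) + step) // wraps safely in two's complement
		if i == splits-1 {
			end = math.MaxInt64
		}
		// Each query touches only one small token range, so it stays well
		// under the server-side timeout. (A row sitting exactly at the
		// minimum token is missed by the ">" bound; DSBulk handles such
		// edge cases properly.)
		var c int64
		if err := session.Query(
			`SELECT COUNT(*) FROM my_table WHERE token(pk) > ? AND token(pk) <= ?`,
			start, end,
		).Scan(&c); err != nil {
			return 0, err
		}
		total += c
		start = end
	}
	return total, nil
}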
You can also use cqlsh as an alternative for small tables.
Refer to this documentation:
https://www.datastax.com/blog/running-count-expensive-cassandra

Best way to Fetch N rows in ScyllaDB

I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what the most efficient way to query the table is. So I have a few queries whose behaviour I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() will perform a full-scan query, meaning that it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records returned to the client, meaning it will still query all records that match its predicates but only return the limited number.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly, it should only read up until it has received the first N records, without having to query the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And will it give a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query still uses limit, but utilizes the driver to achieve this with https://github.com/gocql/gocql:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis was correct and which method would be best to choose
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimal for what you are actually trying to achieve and, even if it is not, we have to consider such a set-up at scale. E.g.: there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct, and with the options you have given, you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time-frame, but your queries don't specify such a time-frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination alone can determine the number of records within a time-frame.
Of course, I may be wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
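For illustration, a minimal gocql sketch of that time-window query; the schema and all names here are hypothetical, not from the original posts:

// Hypothetical schema, with ts as a clustering column so a time window
// is a cheap clustering-range slice instead of a full partition scan:
//   CREATE TABLE orders (
//       hash_id bigint,
//       user_id bigint,
//       ts      timestamp,
//       PRIMARY KEY ((hash_id, user_id), ts)
//   );
// assumes: import "time" and "github.com/gocql/gocql"
func countOrdersInWindow(session *gocql.Session, hashID, userID int64, from, to time.Time) (int64, error) {
	var c int64
	err := session.Query(
		`SELECT COUNT(*) FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?`,
		hashID, userID, from, to,
	).Scan(&c)
	return c, err
}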
If the above is not applicable, then perhaps a counter table would suit your needs. You could simply increment a counter after an order is placed, and afterwards query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
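For illustration, a minimal gocql sketch of that flow; table and column names are placeholders, and note that counter columns cannot carry a TTL, which is why a date component is part of the key:

// assumes: import "github.com/gocql/gocql"
func recordAndCountOrder(session *gocql.Session, hashID, userID int64, date string) (int64, error) {
	// Increment the counter after a successful order.
	if err := session.Query(
		`UPDATE counter_table SET count = count + 1 WHERE hash_id=? AND user_id=? AND date=?`,
		hashID, userID, date,
	).Exec(); err != nil {
		return 0, err
	}
	// Read back the day's total with a single-row query (no scans involved).
	var count int64
	err := session.Query(
		`SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?`,
		hashID, userID, date,
	).Scan(&count)
	return count, err
}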
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know, and decide whether you care, that the limit you are trying to enforce is not a hard limit in the face of concurrent operations: imagine that the given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add a third, and when both add their record you end up with four records. You'll need to decide whether it is fine for you that a lucky user can get 4 requests through in a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (though in the usual case they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution, perhaps to one based on LWT rather than TTL.
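For illustration, one hedged way to turn this into a hard limit with LWT; the order_slots table and all names here are hypothetical, not from the original posts:

// assumes: import "github.com/gocql/gocql"
// Hypothetical schema:
//   CREATE TABLE order_slots (
//       hash_id bigint,
//       user_id bigint,
//       slot    int,
//       PRIMARY KEY ((hash_id, user_id), slot)
//   );
// Each order must claim one of three slots with a conditional insert; the
// 24h TTL frees the slot again. If all three inserts fail, the limit is hit.
func tryPlaceOrder(session *gocql.Session, hashID, userID int64) (bool, error) {
	for slot := 0; slot < 3; slot++ {
		applied, err := session.Query(
			`INSERT INTO order_slots (hash_id, user_id, slot)
			 VALUES (?, ?, ?) IF NOT EXISTS USING TTL 86400`,
			hashID, userID, slot,
		).MapScanCAS(map[string]interface{}{}) // applied == false if slot taken
		if err != nil {
			return false, err
		}
		if applied {
			return true, nil // claimed a free slot, order allowed
		}
	}
	return false, nil // all 3 slots used within the last 24h
}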
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number of) results. If you assume that the SELECT only yields three or fewer results, and can never yield a thousand, then it doesn't really matter whether you retrieve them or count them - just do whichever is more convenient for you. In any case, I think that paging is not a good fit for your need. For such short results you can just use the default page size and you'll never reach it anyway; moreover, paging hints to the server that you will likely continue reading on the next page, so it caches the buffers it needs to do that, while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here: just use the default page size (which is 1MB) and it will never be reached anyway.

what is the impact of limit in cassandra cql

When executing a cqlsh query like select * from table limit 10, would Cassandra scan the entire table and just return the first 10 records, or can it precisely locate the first 10 records across the whole datacenter without scanning the entire table?
The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.
Cassandra has internal mechanisms such as request timeouts which prevent bad queries from crashing the cluster, so queries are more likely to time out than to overload the cluster with scans on all nodes/replicas.
As a side note, the LIMIT option is irrelevant when used with SELECT COUNT() since the count function returns just 1 row (by design). COUNT() needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!

Cassandra Table Count Timeout

I am trying to get the number of rows in a table, but Cassandra returns a timeout for this query: select count(*) from events;
I think my table is too big, and even when I give a larger timeout value for my query (cqlsh --request-timeout=200000), it still always times out.
The table size is 1.3 TB. Is there any way to learn how many rows are in this table?
Do not use count(*) to get the number of rows. You can use the following link and download the jar file to get the count:
https://github.com/brianmhess/cassandra-count
One solution that can help you find the total number of rows is to paginate over the result set and count as you go.
Please refer to the below doc :
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
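The linked doc is for the Java driver; the same idea in gocql, as a minimal sketch (the events table is from the question, the id column name is a placeholder):

// assumes: import "github.com/gocql/gocql"
// Count rows by paging through the whole table; each page is a separate
// small request, so no single query has to run anywhere near the timeout.
func countEvents(session *gocql.Session) (int64, error) {
	iter := session.Query(`SELECT id FROM events`).PageSize(5000).Iter()
	var id gocql.UUID
	var count int64
	for iter.Scan(&id) { // gocql fetches subsequent pages automatically
		count++
	}
	return count, iter.Close()
}

Note this still scans the entire table, so it is only a development aid, not something to run routinely against a live production cluster.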
Note: you can also try ALLOW FILTERING for development! But its use should be avoided, as it makes queries very expensive and might impact the performance of Cassandra.

Select All Records From Cassandra

I am trying to select all records from one Cassandra table (~10M records), which should be distributed over 4 nodes, using CQL shell, but every time I do that it limits the output to 1K records max. So my question is: is it possible to select all records at once, as I am trying to see how much time it takes Cassandra to retrieve all records?
When you write SELECT * FROM cf, the CQL client will never select everything at once; that would be a foolish action for large data. Instead it will load only the first page and give you an iterator. Cassandra from version 2.0 supports automatic query paging, so you should run your select-all query and iterate over the pages to load the full column family. See an example for the Python client. There is no way to load everything in one action in CQL now, and there shouldn't be.
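For illustration, roughly what that page-by-page iteration looks like with gocql (the table name cf follows the answer above; the page size is arbitrary). Note that setting PageState in gocql disables automatic paging, so each loop iteration below issues exactly one page-sized request:

// assumes: import "github.com/gocql/gocql"
func scanAllRows(session *gocql.Session) error {
	var pageState []byte // nil means start from the beginning
	for {
		iter := session.Query(`SELECT * FROM cf`).
			PageSize(1000).
			PageState(pageState).
			Iter()
		for {
			row := map[string]interface{}{}
			if !iter.MapScan(row) {
				break // end of this page
			}
			// ... process row here ...
		}
		pageState = iter.PageState() // token for the next page
		if err := iter.Close(); err != nil {
			return err
		}
		if len(pageState) == 0 {
			return nil // no more pages
		}
	}
}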
While it was already pointed out that it's a bad idea to try to load all the data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user#host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!
