What is the best way to get row count from a table in Cassandra?

Is there a recommended way to get the total number of rows from a Cassandra table?
Regards,
Mani

The DataStax Bulk Loader (DSBulk) is probably the easiest to install and run.
The Apache Spark Cassandra Connector can also be handy: once the table is loaded with sc.cassandraTable(), you can simply call count() on it (see the PySpark sketch below).
Avoid counting in your own code: it does not scale, since it performs a full scan of the cluster, and the response time will be in seconds.
Avoid counting with CQL's SELECT COUNT(*), as you will likely hit the timeout quickly.
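For illustration, here is a minimal PySpark sketch of the Spark approach; the contact point, keyspace, and table names are placeholders, and the connector package must be on the classpath:

from pyspark.sql import SparkSession

# Launch with the connector on the classpath, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:<version> count_table.py
spark = (SparkSession.builder
         .appName("cassandra-count")
         .config("spark.cassandra.connection.host", "10.0.0.1")  # placeholder node
         .getOrCreate())

# The connector splits the scan into parallel token-range reads across
# executors, so the count does not funnel through a single coordinator.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_ks", table="my_table")  # placeholder names
      .load())

print(df.count())
spark.stop()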

You can simply use COUNT(*) to get the number of rows in the table.
For example,
Syntax:
SELECT Count(*)
FROM tablename;
and the expected output looks like this:
count
-------
4
(1 rows)

Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (no filter or WHERE clause), it retrieves all the partitions in the table so they can be counted, for example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended, since it requires a full table scan that queries every single node, which is very expensive and will affect the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions), but in practice it will likely time out and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database, but the easiest is to use the DataStax Bulk Loader (DSBulk). It is open-source, so it's free to use. It was originally designed for bulk-loading data into, and exporting data from, a Cassandra cluster as a scalable alternative to the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but with optimisations that break the table scan up into small token range queries, so it doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to set up. First, download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data

You can also use cqlsh as an alternative for small tables.
Refer to this documentation:
https://www.datastax.com/blog/running-count-expensive-cassandra

Related

ScyllaDB count(*) returns different results

I have a question about a query in ScyllaDB. I want to count the rows in a table with:
SELECT COUNT(*)
FROM tabledata;
The first run returns a result of 5732 rows.
The second run returns a result of 5432 rows.
The result is always different.
Any suggestions on how to count rows in Scylla?
Consistency level?
(You can find a very funny picture about eventual consistency on the internet.)
If you have RF=3, and you wrote all your rows with LOCAL_QUORUM, then I'd set CONSISTENCY LOCAL_QUORUM and rerun the count (see the sketch below).
If you are not sure whether all your writes were properly done, use CL ALL.
Another option is to run a full repair and then rerun the count.
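For illustration, a minimal sketch with the Python driver that reruns the count at LOCAL_QUORUM; the contact point and keyspace are placeholders:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])     # placeholder contact point
session = cluster.connect("my_ks")  # placeholder keyspace

# Rerun the count at LOCAL_QUORUM so a majority of replicas must agree
# on each read, instead of trusting whichever single replica answers first.
stmt = SimpleStatement("SELECT COUNT(*) FROM tabledata",
                       consistency_level=ConsistencyLevel.LOCAL_QUORUM)
print(session.execute(stmt).one().count)
cluster.shutdown()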
Also, your table might have a TTL; in that case, getting a different count every time is expected (if you are also writing, the count may grow; if you are only reading, it will shrink).
For an efficient count, look at https://github.com/scylladb/scylla-code-samples/tree/master/efficient_full_table_scan_example_code - but the same caveat about consistency level applies. (Of course, this script will tell you with a timeout error when a token range couldn't be queried, which means that node/shard was overloaded with other traffic; by default it doesn't retry - it's a simple script.)
The problem you're running into is inherent in any distributed row store (Cassandra or Scylla). For an unbounded count to work, a coordinator node needs to contact all other nodes, query them, and assemble the result set. That causes a lot of contention, which may prevent some replicas from reporting properly.
I recommend downloading and using DSBulk for this type of operation. It has a count feature designed just for this purpose.
dsbulk count -k ks1 -t table1 -h '10.200.1.3,10.200.1.4'

Cassandra Query Language (CQL): Can we combine SELECT * FROM table_name; and SELECT count(*) FROM table_name;

I am trying to find out whether we can combine the SELECT * and SELECT count(*) queries in CQL and, if yes, whether doing so improves execution performance compared to running 2 separate queries on a large dataset.
Short answer - no, you can't combine these two queries.
Longer answer - don't do that. On big datasets your query will most probably time out, because Cassandra will need to go through all data on all machines. This increases the load on the coordinator node and can potentially crash it. If you need to fetch all data, use another approach:
use the Spark Cassandra Connector, which is heavily optimized for such tasks
if you want to offload data, or just count it, you can use the DSBulk utility
if you still need to do it from code, you need to perform so-called token range scanning - the technique used by the Spark Cassandra Connector and DSBulk (a minimal sketch follows below). You can either implement it yourself (here is an example for Java driver 3.x), or for Java use DSBulk's API - the jars are available via Maven (I don't have an example for that)
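As a rough illustration of token range scanning, here is a minimal sketch using the Python driver; the contact point, keyspace, table, and partition key column (pk) are placeholders, and a real implementation would add retries and parallelism:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect("ks1")   # placeholder keyspace

# One COUNT per token sub-range: each query touches only a small slice of
# the ring, so no single request has to scan the whole table.
tokens = sorted(t.value for t in cluster.metadata.token_map.ring)
q = "SELECT COUNT(*) FROM table1 WHERE token(pk) > %s AND token(pk) <= %s"

total = 0
for lo, hi in zip(tokens, tokens[1:]):
    total += session.execute(q, (lo, hi)).one().count

# The ring wraps around: also cover everything above the highest token
# and everything at or below the lowest one.
total += session.execute(
    "SELECT COUNT(*) FROM table1 WHERE token(pk) > %s", (tokens[-1],)).one().count
total += session.execute(
    "SELECT COUNT(*) FROM table1 WHERE token(pk) <= %s", (tokens[0],)).one().count

print(total)
cluster.shutdown()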

Cassandra Table Count Timeout

I am trying to get the number of rows in a table, but Cassandra returns a timeout for this query: select count(*) from events;
I think my table is too big, so even when I give a timeout value for my query (cqlsh --request-timeout=200000), it still times out.
The table size is 1.3 TB. Is there any way to learn how many rows are in this table?
Do not use count(*) to get the number of rows. You can use the following link and download the jar file to get the count:
https://github.com/brianmhess/cassandra-count
One solution that can help you find the total rows is paging through the result set.
Please refer to the doc below; a Python sketch of the same idea follows.
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
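For illustration, a minimal sketch of counting via paging with the Python driver; the contact point, keyspace, table, and column names are placeholders. Note this still scans the whole table, just one page at a time:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])    # placeholder contact point
session = cluster.connect("my_ks")  # placeholder keyspace

# Select only the partition key and let the driver page through the result
# set; each page is a small request rather than one huge scan.
stmt = SimpleStatement("SELECT pk FROM events", fetch_size=1000)
total = sum(1 for _ in session.execute(stmt))
print(total)
cluster.shutdown()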
Note: you can also try ALLOW FILTERING during development, but its use should be avoided, as it is a very expensive kind of query and might impact the performance of Cassandra.

Select All Records From Cassandra

I am trying to select all records from one Cassandra table (~10M records), distributed over 4 nodes, using the CQL shell, but every time I do that it limits the output to 1K records max. So my question is: is it possible to select all records at once, as I am trying to see how much time it takes Cassandra to retrieve all records?
When you write SELECT * FROM cf, the CQL client will never select everything at once - that would be a foolish move for large data. Instead it loads only the first page and gives you an iterator. Cassandra supports automatic query paging since version 2.0, so you should run your select-all query and ITERATE over the pages to load the full column family (see the sketch below for the Python client). There is currently no way to load everything in one action in CQL, and there shouldn't be.
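For illustration, a minimal Python-driver sketch that iterates over the automatically fetched pages and times the full retrieval; the contact point and names are placeholders:

import time
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])    # placeholder contact point
session = cluster.connect("my_ks")  # placeholder keyspace

# fetch_size sets the page size; while you iterate, the driver transparently
# fetches the next page, streaming the full table page by page.
stmt = SimpleStatement("SELECT * FROM my_table", fetch_size=5000)

start = time.time()
rows = 0
for _ in session.execute(stmt):
    rows += 1
print("%d rows in %.1fs" % (rows, time.time() - start))
cluster.shutdown()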
While it was already pointed out that it's a bad idea to try and load all data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user#host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!

Cassandra multi row selection

Somewhere I have heard that multi-row selection in Cassandra is bad, because for each row selected it runs a new query; so, for example, if I want to fetch 1000 rows at once, it would be the same as running 1000 separate queries at once. Is that true?
And if it is, how bad would it be to keep selecting around 50 rows each time a page is loaded, if, say, I have 1000 page views in a single minute? Would it severely slow Cassandra down or not?
P.S. I'm using phpcassa for my project.
Yes, running a query for 1000 rows is the same as running 1000 queries (if you use the recommended RandomPartitioner). However, I wouldn't be overly concerned by this: in Cassandra, querying for a row by its key is a very common, very fast operation.
As to your second question, it's difficult to tell ahead of time. Build it and test it. Note that Cassandra does use in-memory caching, so if you query the same rows repeatedly, they will be cached.
We are using PlayOrm for Cassandra, and it has a "findAll" pattern which provides support for fetching many rows quickly. Visit
https://github.com/deanhiller/playorm/wiki/Support-for-retrieving-many-entities-in-parallel for more details.
1) I have debugged the Cassandra code base a little, and from my observation, to query multiple rows at the same time Cassandra provides the multiget() functionality, which is also exposed in phpcassa.
2) Multiget is optimized to handle batched requests, and it saves network hops (for 1k rows there would otherwise be 1k round trips, so it definitely saves the time of 999 round trips).
3) More about multiget() in phpcassa: php cassa multiget()
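phpcassa's multiget() belongs to the old Thrift API; with the modern CQL Python driver, the equivalent idea is to run many single-key queries concurrently. A minimal sketch, where the contact point, keyspace, table, and key column are placeholders:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])    # placeholder contact point
session = cluster.connect("my_ks")  # placeholder keyspace

# Prepare once, then let the driver pipeline many single-key lookups in
# parallel instead of issuing them sequentially.
lookup = session.prepare("SELECT * FROM users WHERE user_id = ?")
keys = [(k,) for k in range(1000)]  # placeholder key values

results = execute_concurrent_with_args(session, lookup, keys, concurrency=50)
rows = [row for ok, rs in results if ok for row in rs]
print(len(rows))
cluster.shutdown()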
