ScyllaDB count(*) return difference result - cassandra

I have a question about query in scylladb. I want to count the rows in a table with:
SELECT COUNT(*)
FROM tabledata;
First run returns a result of 5732 rows
Second run returns a result of 5432 rows
Always different result.
Any suggestions on how to count rows in scylla?

Consistency level?
(you can find on internet a very funny picture about eventual consistency)
IF you have RF=3
If you wrote all your rows with LOCAL_QUORUM
then I'd set CONSISTENCY LOCAL_QUORUM
and rerun the count
if you are not sure whether all your writes were properly done, use CL ALL
another option is to run a full repair and rerun the count
ALSO your table might have TTL, in such case having a different count every time is expected (and if you wrote it might be bigger, if you just read, then it will be smaller)
For efficient count look at https://github.com/scylladb/scylla-code-samples/tree/master/efficient_full_table_scan_example_code - but the same applies re consistency level (and of course this script will tell you with a timeout error that a token range couldn't queried and it means that node/shard was overloaded with other traffic, by default it doesn't retry, it's a simple script)

The problem you're running into is inherent in any distributed row store (Cassandra or Scylla). In order for that to work, a coordinator node needs to contact all other nodes, query them, and assemble the result set. That causes a lot of contention which may prevent some replicas from reporting properly.
I recommend (downloading) using DSBulk for this type of operation. It has a count feature designed just for this purpose.
dsbulk count -k ks1 -t table1 -h '10.200.1.3,10.200.1.4'

Related

What is the best way to get row count from a table in Cassandra?

Is there any best way that we can get the total number of rows from the Cassandra table?
Regards,
Mani
DatastaxBulk is probably the easiest to install and run.
Apache Spark Cassandra connector could be handy. Once the dataframe is loaded with sc.cassandraTable() you can count
Avoid counting in your code, it does not scale as it performs a full scan of the cluster, the response time will be in seconds.
Avoid counting with CQL select count(*) as you will likely hit the timeout quickly.
you can simply use Count(*) to get row numbers from the table.
For example,
Syntax:
SELECT Count(*)
FROM tablename;
and the expected output looks like this,
count
-------
4
(1 rows)
Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (no filter or WHERE clause), it will retrieve all the partitions in the table which you can count, for example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended since it requires a full table scan that would query every single node which is very expensive and will affect the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions) but in practice it would likely timeout and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database but the easiest way is to use the DataStax Bulk Loader (DSBulk). It is open-source so it's free to use. It was originally designed for bulk-loading data to and exporting data from a Cassandra cluster as a scalable solution for the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function but has optimisations that break up the table scan into small range queries so doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to setup. First, you need to download the binaries from DataStax Downloads then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data
You can also use cqlsh as an alternate for small tables.
Refer this documentation
https://www.datastax.com/blog/running-count-expensive-cassandra

Timeout on CQL COUNT() restricted by partition key

I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....

what is the impact of limit in cassandra cql

When executing a cqlsh query like select * from table limit 10, would cassandra scan the entire table and just return the first 10 records, or it can precisely locate the first 10 records across whole datacenter without scanning the entire table?
The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.
Cassandra has internal mechanisms such as request timeouts which prevent bad queries from causing the cluster to crash so queries are more likely to timeout rather than overloading the cluster with scans on all nodes/replicas.
As a side note, the LIMIT option is irrelevant when used with SELECT COUNT() since the count function returns just 1 row (by design). COUNT() needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!
LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.

Cassandra unpredictable failure depending on WHERE clause

I am attempting to execute a SELECT statement against a large Cassandra table (10m rows) with various WHERE clauses. I am issuing these from the Datastax DevCenter application. The columns I am using in the where clause have secondary indexes.
The where clause looks like WHERE fileid = 18000 or alternatively WHERE fileid < 18000. In this example, the second where clause results in the error Unable to execute CQL script on 'connection1': Cassandra failure during read query at consistency ONE (1 responses were required but only 0 replica responded, 1 failed)
I have no idea why it is failing in this unpredictable manner. Any ideas?
NOTE: I am aware that this is a terrible idea, and Cassandra is not meant to be used in this way. I am issuing these queries and timing them to prove to others how inefficient Cassandra is for our use case compared to other solutions.
Your query is probably failing because of a READ timeout (the timeout on waiting to read data). You could try updating the Cassandra.yaml with a larger read timeout time with read_request_timeout_in_ms: 200000 (for 200s) to give an output rather than an error. However, if you're trying to prove the inefficiency of Cassandra in your use case, this error seems like a pretty good way to do it.

Select All Performance in Cassandra

I'm current using DB2 and planning to use cassandra because as i know cassandra have a read performance greater than RDBMS.
May be this is a stupid question but I have experiment that compare read performance between DB2 and Cassandra.
Testing with 5 million records and same table schema.
With query SELECT * FROM customer. DB2 using 25-30s and Cassandra using 40-50s.
But query with where condition SELECT * FROM customer WHERE cusId IN (100,200,300,400,500) DB2 using 2-3s and Cassandra using 3-5ms.
Why Cassandra faster than DB2 with where condition? So i can't prove which database is greater with SELECT * FROM customer right?
FYI.
Cassandra: RF=3 and CL=1 with 3 nodes each node run on 3 computers (VM-Ubuntu)
DB2: Run on windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed data base. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (I.E., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (A.K.A. multi-key queries) incur the extra network time because a coordinator node is required. So one node acts as the coordinator, queries all other nodes, collates data, and returns the result set. Specifying a WHERE clause (with at least a partition key) and while using a "Token Aware" load balancing policy, performs well for two reasons:
A coordinator node is not required.
The node primarily responsible for the range is queried, returning the result set in a single netowrk hop.
tl;dr;
Querying Cassandra with an unbound query, causes it to incur a lot of extra processing and network time that it normally wouldn't have to do, had the query been specified with a WHERE clause.
Even as a troublesome query like a no-condition range query, 40-50s is pretty extreme for C*. Is the coordinator hitting GCs with the coordination? Can you include code used for your test?
When you make a select * vs millions of records, it wont fetch them all at once, it will grab the fetchSize at a time. If your just iterating through this, the iterator will actually block even if you used executeAsync initially. This means that every 10k (default) records it will issue a new query that you will block on. The serialized nature of this will take time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use this to to kick off the next page fetch while processing the current which would help.
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges (parallelism is possible here but its heuristic is not perfect) one at a time until it has read enough. If it has to walk too many nodes to respond it will be slow, this is why empty tables can be very slow to do a select * on, it may serially walk every replica set. With 256 vnodes this can be very bad.

Resources