I want to know how to configure Cassandra to get better read performance, because when I run a SELECT query on a table with 1M rows I get a TimedOutException.
I've already changed request_timeout_in_ms and added more nodes, but I still get the same error.
You are querying too many rows at once. You need to query fewer rows at a time and page through them.
Update:
First query:
select <KEY>,p0001 from eExtension limit 1000;
Repeat:
take the last result from that query:
select <KEY>,p0001 from eExtension where token(<KEY>) > token(<LAST KEY RETURNED FROM PREVIOUS>) limit 1000;
repeat that pattern until done.
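The loop above can be sketched as a small helper that builds each page's query from the last key seen. This is only an illustration: it assumes the partition key column is named key (standing in for <KEY> above) and is string-typed; in real code you'd use bound parameters instead of interpolating values into the CQL string.

```go
package main

import "fmt"

// nextPageQuery builds the CQL for one page of a token-range walk.
// lastKey is empty for the first page; otherwise it is the last
// partition key returned by the previous page.
func nextPageQuery(lastKey string, pageSize int) string {
	if lastKey == "" {
		return fmt.Sprintf("SELECT key, p0001 FROM eExtension LIMIT %d", pageSize)
	}
	return fmt.Sprintf(
		"SELECT key, p0001 FROM eExtension WHERE token(key) > token('%s') LIMIT %d",
		lastKey, pageSize)
}

func main() {
	fmt.Println(nextPageQuery("", 1000))
	fmt.Println(nextPageQuery("abc", 1000))
}
```

You keep calling this with the last key of the previous page until a page comes back with fewer than pageSize rows.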
Sounds like you're trying to read all 1M rows at once. Don't.
One way to do pagination is to use a Cassandra client library like PlayOrm. PlayOrm returns a cursor when you query, and as your first page reads in the first 100 results and displays them, the next page can just use the same cursor in your session and it picks up right where it left off without rescanning the first 100 rows again.
See this for a cursor example and this for all the features and more details about PlayOrm.
Related
I have a use case in which I use ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to place an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know the most efficient way to query the table. Here are a few queries whose behavior I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() performs a full scan, meaning that it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit only limits the number of records returned to the client, meaning it will still read all the records that match its predicates but return only up to the limit.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly, it should only read until it has received the first N records, without having to scan the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And would it give a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query still uses limit, but it goes through the driver, https://github.com/gocql/gocql:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis was correct and which method would be best to choose
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. Your example seems a bit minimalist for what you actually want to achieve, and even if it isn't, we have to consider such a set-up at scale. E.g.: there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - and with the options at hand - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time-frame, but your queries don't specify such a time-frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination alone can determine the number of records within a time-frame.
Of course, it may be that I am wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suit your needs. You could simply increment a counter after an order is placed, and afterwards query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit in the face of concurrent operations: imagine that the given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records and decide it's fine to add the third - and when both then add their record, you end up with four records. You'll need to decide whether it's fine for you that a lucky user can get 4 requests in within a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (though in the usual case they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT, and not use TTL.
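The check-then-act race described above can be replayed deterministically by interleaving the two clients' steps by hand:

```go
package main

import "fmt"

func main() {
	const limit = 3
	records := 2 // rows already present in the partition

	// Both clients run their "is there room?" check before either inserts.
	check1 := records < limit // client 1 sees 2 < 3: true
	check2 := records < limit // client 2 sees 2 < 3: true

	// Both checks passed, so both insert.
	if check1 {
		records++
	}
	if check2 {
		records++
	}
	fmt.Println(records) // 4: one more than the intended limit of 3
}
```

Nothing in a plain read-then-write sequence prevents this interleaving; only a conditional write (LWT) makes the check and the insert atomic.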
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you assume that the SELECT only ever yields three or fewer results, and can never yield a thousand, then it doesn't really matter whether you retrieve them or count them - just do whichever is more convenient for you. In any case, I think that paging is not a good solution for your need. For such short results you can just use the default page size and you'll never reach it anyway; moreover, paging hints to the server that you will likely continue reading on the next page - and it caches the buffers it needs to do that - while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.
I have to jump to specific rows in an AWS Keyspaces table. Specifically, I am doing pagination, so I want to be able to jump to a specific page. Do I have any options for that?
For example, I have to fetch 100 rows after the 1,000,000th row. And of course I want to do it as quickly as possible.
My ideas/solutions:
Set the page size to the requested one (100 in this case) and iterate over the rows, following next_page until I come to the specific set
Find the maximum possible page size and use it to iterate over the biggest possible sets of rows
But maybe there are more clever solutions?
I don't have the option of changing the table by adding additional columns!
Pagination isn't a best practice in Cassandra because you don't know how many results you will have until you query for them. Amazon Keyspaces paginates results based on the number of rows that it reads to process a request, not the number of rows returned in the result set. As a result, some pages might contain fewer rows than you specify in PAGE SIZE for filtered queries. In addition, Amazon Keyspaces paginates results automatically after reading 1 MB of data to provide customers with consistent, single-digit millisecond read performance.
For paginating through pages we have a tool called the Export tool. This tool allows you to asynchronously read page by page. It will read partition keys even if you skip them; Keyspaces still has to read every partition key, which means using more RCUs, but this will accomplish your goal. When you use the tool to read by page, you may see it stop after a certain number of pages; just restart the tool at the page it left off at when you run it again.
When executing a cqlsh query like select * from table limit 10, does Cassandra scan the entire table and just return the first 10 records, or can it precisely locate the first 10 records across the whole datacenter without scanning the entire table?
The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.
Cassandra has internal mechanisms such as request timeouts which prevent bad queries from causing the cluster to crash so queries are more likely to timeout rather than overloading the cluster with scans on all nodes/replicas.
As a side note, the LIMIT option is irrelevant when used with SELECT COUNT() since the count function returns just 1 row (by design). COUNT() needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!
I am trying to get the number of rows in a table, but Cassandra returns a timeout for the query select count(*) from events;
I think my table is too big, because even when I give the query a longer timeout (cqlsh --request-timeout=200000), it still times out.
The table size is 1.3TB. Is there any way to find out how many rows are in this table?
Do not use count(*) to get the number of rows. Instead, you can download the jar file from the link below and use it to get the count:
https://github.com/brianmhess/cassandra-count
Another option that can help you find the total number of rows is pagination of the result.
Please refer to the doc below:
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
Note: you can also try ALLOW FILTERING in development! But its use should be avoided, as it is a very expensive query that might impact the performance of Cassandra.
I am trying to select all records from one Cassandra table (~10M records), which should be distributed over 4 nodes, using CQL shell, but every time I do that it paginates the output to 1K records max. So my question is: is it possible to select all records at once, as I am trying to see how much time it takes Cassandra to retrieve all the records?
When you write "SELECT * FROM CF", the CQL client will never select everything at once - loading everything in one go is a bad idea for large data. Instead it will load only the first page and give you an iterator. Cassandra from version 2.0 supports automatic query paging, so you should issue your select-all query and iterate over the pages to load the full column family. See an example for the Python client. There is no way to load it all in one action in CQL now, and there shouldn't be.
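The iterate-over-pages loop looks the same regardless of driver. Here is a minimal sketch with a fake in-memory pager standing in for the driver's iterator (gocql, for instance, exposes the page cursor as an opaque byte slice); none of these names come from a real driver API.

```go
package main

import "fmt"

// pager abstracts a driver iterator: each call returns one page of rows
// plus an opaque cursor that is nil when no pages remain.
type pager interface {
	NextPage(state []byte) (rows []string, next []byte)
}

// drain walks every page, which is what a "select everything" client
// actually has to do.
func drain(p pager) []string {
	var all []string
	var state []byte
	for {
		rows, next := p.NextPage(state)
		all = append(all, rows...)
		if next == nil {
			return all
		}
		state = next
	}
}

// slicePager fakes a paged result set over an in-memory slice; the cursor
// is just the next start offset encoded in one byte.
type slicePager struct {
	data     []string
	pageSize int
}

func (s slicePager) NextPage(state []byte) ([]string, []byte) {
	start := 0
	if state != nil {
		start = int(state[0])
	}
	end := start + s.pageSize
	if end >= len(s.data) {
		return s.data[start:], nil
	}
	return s.data[start:end], []byte{byte(end)}
}

func main() {
	p := slicePager{data: []string{"a", "b", "c", "d", "e"}, pageSize: 2}
	fmt.Println(drain(p)) // [a b c d e]
}
```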
While it was already pointed out that it's a bad idea to try and load all data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user@host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!