Fetch more than 2147483647 records from Cassandra - cassandra

I inherited a Cassandra database with years of data in it. I was tasked with deleting all records older than 2 years. I don't know how many rows the table contains, but it is a lot.
The table structure is this:
CREATE TABLE IF NOT EXISTS my_table (
key1 bigint,
key2 text,
"timestamp" timestamp,
some more columns,
PRIMARY KEY ((key1, key2), "timestamp")
) WITH CLUSTERING ORDER BY ("timestamp" DESC);
Since key1 and key2 are partition keys, I cannot simply delete everything with a timestamp older than 2 years in one statement. The delete has to be issued per partition key.
So I went ahead and created a small tool in Java based on the async paging pattern described in the manual: https://docs.datastax.com/en/developer/java-driver/4.11/manual/core/paging/
I do a SELECT DISTINCT key1, key2 from my_table;, iterate over the keys, delete rows for those keys older than 2 years, fetch the next page and repeat.
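A minimal sketch of the per-partition delete step, written in Python for brevity even though the tool described above is Java. It assumes a Cassandra version that accepts range deletes on a clustering column (3.0 or later, as far as I know). The execute callable is a stand-in for something like a driver session's execute(query, params), and the %s placeholders are an assumption about how that callable binds parameters.

```python
from datetime import datetime, timezone

# Range delete on the clustering column, restricted to a single partition.
DELETE_OLD = (
    "DELETE FROM my_table "
    'WHERE key1 = %s AND key2 = %s AND "timestamp" < %s'
)


def two_years_before(now):
    """Return the same calendar date two years earlier (Feb 29 falls back to Feb 28)."""
    try:
        return now.replace(year=now.year - 2)
    except ValueError:  # now is Feb 29 and the target year is not a leap year
        return now.replace(year=now.year - 2, day=28)


def delete_old_rows(execute, partition_keys, now=None):
    """Send one range delete per (key1, key2) pair and return how many were sent.

    `execute` is a hypothetical callable taking (query, params).
    """
    cutoff = two_years_before(now or datetime.now(timezone.utc))
    sent = 0
    for key1, key2 in partition_keys:
        execute(DELETE_OLD, (key1, key2, cutoff))
        sent += 1
    return sent
```

Each statement produces a single range tombstone per partition rather than one tombstone per row.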
After a few hours, the tool completes and reports that it has modified the rows of 2147483647 partition keys. That is exactly 2^31-1, the maximum value of a signed 32-bit integer. This is probably some limit in Cassandra, because having exactly that number of keys is improbable.
My questions:
How can I fetch ALL of the table?
Is 2147483647 some (configurable) limit and why?
The other strategy would be to start a new table, use a TTL and write to both tables until two years have passed. But I would like to avoid that if I can.

I work at ScyllaDB - Scylla is a Cassandra compatible database.
There is indeed a limitation in Cassandra paging - https://issues.apache.org/jira/browse/CASSANDRA-14683 and it is not yet fixed.
What you can try is to take the last token returned and continue paging from there:
select distinct token (key1,key2), key1,key2 from my_table ;
and then, when the paging ends, change the query to use the last returned token (as an example):
select distinct token (key1,key2), key1,key2 from my_table where token(key1,key2) >= -3748018335291956378;
(you need to repeat the query with >= because multiple key pairs may be mapped to the same token)
PS: Scylla has lifted this limitation (https://github.com/scylladb/scylla/issues/5101), so we are bound by 2^64-1.
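The restart-by-token idea can be sketched as follows. This is only an illustration in Python, not the Java tool from the question. The execute callable is hypothetical (anything that takes a query and parameters and returns rows as (token, key1, key2) tuples in token order), and the %s placeholder is an assumption about its parameter binding. Because the restart uses >=, the sketch remembers which key pairs it already saw at the last token and skips them.

```python
def scan_partition_keys(execute, table="my_table"):
    """Yield every (key1, key2), restarting from the last token when a scan stops early."""
    first_query = f"SELECT DISTINCT token(key1, key2), key1, key2 FROM {table}"
    resume_query = first_query + " WHERE token(key1, key2) >= %s"

    last_token = None
    seen_at_last_token = set()  # key pairs already yielded for last_token
    while True:
        if last_token is None:
            rows = execute(first_query, ())
        else:
            rows = execute(resume_query, (last_token,))

        progressed = False
        for token, key1, key2 in rows:
            if token == last_token:
                if (key1, key2) in seen_at_last_token:
                    continue  # duplicate caused by the >= restart
            else:
                last_token = token
                seen_at_last_token = set()
            seen_at_last_token.add((key1, key2))
            progressed = True
            yield key1, key2

        if not progressed:
            return  # the restart returned nothing new, so the scan is complete
```

The loop stops once a restarted query yields no key pair that was not already processed.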

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all of it and find the highest value. That requires scanning data on multiple nodes and will take a very long time.
A materialized view won't help much either, because ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of following workaround:
Have an additional table that has all columns of the original table, but with a fake partition key and no clustering columns.
Insert into that table in parallel with the normal inserts, but use a fixed value for the fake partition key and explicitly set the write timestamp of the record to start_time (don't forget to multiply by 1000, as the write timestamp is in microseconds). The table is then guaranteed to hold the row with the highest start_time, because Cassandra won't overwrite it with data that has a lower timestamp.
But this doesn't solve the problem of data skew: all traffic will be handled by a fixed number of nodes equal to RF.
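A rough sketch of that workaround in Python. The table name latest_event, the column fake_key and its constant value 'latest' are hypothetical names chosen for illustration, and execute stands in for a driver call taking (query, params). The only real logic is converting start_time to microseconds and passing it through USING TIMESTAMP.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical single-row table: one fixed partition key, no clustering columns.
UPSERT_LATEST = (
    "INSERT INTO latest_event (fake_key, event_id, number, start_time) "
    "VALUES ('latest', %s, %s, %s) USING TIMESTAMP %s"
)
SELECT_LATEST = "SELECT * FROM latest_event WHERE fake_key = 'latest'"


def to_micros(dt):
    """Convert a timezone-aware datetime to microseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def record_event(execute, event_id, number, start_time):
    """Write the event with its start_time as the write timestamp.

    A write carrying an older start_time loses to the stored one, so the
    row always reflects the highest start_time seen so far.
    """
    execute(UPSERT_LATEST, (event_id, number, start_time, to_micros(start_time)))


def read_latest(execute):
    """Return the single stored row, or None if nothing was written yet."""
    rows = list(execute(SELECT_LATEST, ()))
    return rows[0] if rows else None
```

The read is a plain single-partition lookup, which is what makes the workaround cheap on the query side.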
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Cassandra count with limit

I need to find out whether the number of records in a Cassandra table is greater than a certain number, e.g. 10000.
I don't have a large data set yet, but at large scale, with possibly billions of records, how could I do this efficiently?
There could be billions of records, or just thousands. I just need to know whether there are more or fewer than 10K.
The query below doesn't seem right; I think it would fail or be very slow for a large number of records.
SELECT COUNT(*) FROM data WHERE sourceId = {id} AND timestamp <
{endDate} AND timestamp > {startDate};
I could also do something like this:
SELECT * FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate} LIMIT 10000;
and count in memory
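A sketch of that select-with-limit check in Python (the question mentions Spring, so treat this purely as an illustration). One detail worth noting: with LIMIT 10000 you can only tell that there are at least 10000 rows, so the sketch asks for threshold + 1 rows and selects only the clustering column to keep the result small. The execute callable and the %s placeholders are assumptions standing in for a driver call.

```python
def has_more_than(execute, source_id, start, end, threshold=10000):
    """Return True if the time range holds more than `threshold` rows.

    Reads at most threshold + 1 rows from a single partition, so the cost
    is bounded no matter how large the partition is.
    """
    query = (
        "SELECT timestamp FROM data WHERE sourceId = %s "
        "AND timestamp > %s AND timestamp < %s LIMIT %s"
    )
    rows = execute(query, (source_id, start, end, threshold + 1))
    return sum(1 for _ in rows) > threshold
```

Since sourceId is the partition key and timestamp the clustering key, this stays a single-partition range read.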
I can't use a separate table for counting (e.g. incrementing a counter whenever a new record is written); that option is unacceptable.
Is there some other way to do this? A select with a limit looks dumb, but seems the most viable.
sourceId is partition key and timestamp is clustering key.
Cassandra version is 3.11.4, and I work in Spring if it has any relevance.
You could add a bucket_id to the partition key, so the primary key becomes ((sourceId, bucket_id), timestamp). Bucketing is used in Cassandra to constrain the number of rows in a single partition, i.e. a large partition is split into smaller chunks. To count all rows, issue an async query for each partition (sourceId, bucket_id) with the additional timestamp restriction. The bucket_id can be derived from the timestamp, so you can work out which buckets a given time range needs to touch.
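A minimal sketch of the bucketing idea in Python, assuming one bucket per calendar month encoded as an integer like 201701 (the bucket size is an assumption; pick whatever keeps partitions small). The queries run one after another here for simplicity, whereas the suggestion above is to issue them asynchronously. execute is a hypothetical callable taking (query, params) and returning rows.

```python
from datetime import datetime


def bucket_id(ts):
    """Derive a monthly bucket such as 201701 from a timestamp."""
    return ts.year * 100 + ts.month


def buckets_between(start, end):
    """List every monthly bucket touched by the range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def count_in_range(execute, source_id, start, end):
    """Sum COUNT(*) over each (sourceId, bucket_id) partition in the range."""
    query = (
        "SELECT COUNT(*) FROM data WHERE sourceId = %s AND bucket_id = %s "
        "AND timestamp > %s AND timestamp < %s"
    )
    total = 0
    for bucket in buckets_between(start, end):
        rows = list(execute(query, (source_id, bucket, start, end)))
        total += rows[0][0]
    return total
```

If only a threshold matters, the loop can stop as soon as the running total exceeds it.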
Other solutions:
use Cassandra's counters (but I have read that they affect performance and cannot correctly handle retried and speculative queries)
use another db, like Redis, which has atomic counters (but how would you keep Redis and Cassandra in sync?)
precalculate the values and save them during writes (for example into static columns)
something else
The first query:
SELECT COUNT(*) FROM data WHERE sourceId = {id}
AND timestamp < {endDate} AND timestamp > {startDate};
should work if you have a table with the following primary key: (sourceId, timestamp, ...). In this case the aggregation is executed inside a single partition, so it won't involve hitting multiple nodes, etc. It may still time out if you have very slow disks or too much data in the given time range.
If you have a different table structure, then you'll need to use something like Spark, which will read the data from Cassandra and perform the filtering and counting.

Cassandra Data modelling : Timestamp as partition keys

I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ( "from" timestamp, "to" timestamp, user text, PRIMARY KEY(("from", "to"), user))
I'm trying to implement the following query in Cassandra:
select * from t WHERE "from" > :startInterval and "to" < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated ( yearFrom int, monthFrom int, yearTo int, monthTo int, "from" timestamp, "to" timestamp, user text, PRIMARY KEY((yearFrom,monthFrom,yearTo,monthTo), user) )
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, the partition key can only be restricted with the equals operator (or IN). It is better to use PRIMARY KEY (BUCKET, TIME_STAMP) here, where the bucket can be a combination of year and month (or include days, hours, etc., depending on how big your data set is).
It is better to execute multiple queries, one per bucket, and combine the results on the client side.
The answer depends on the expected number of entries. A rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to use the year as the partition key.
We use Week-First-Date as a partition key in an IoT scenario, where values get written at most once a minute.
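A sketch of a week-first-date partition key in Python, with one query per week combined on the client. Several things here are assumptions for illustration: the week is taken to start on Monday, the table actions_by_week with columns week, ts and user is hypothetical, and execute stands in for a driver call taking (query, params).

```python
from datetime import date, datetime, timedelta


def week_first_date(day):
    """Return the Monday of the week containing `day` (Monday start is an assumption)."""
    return day - timedelta(days=day.weekday())


def weeks_between(start, end):
    """List the first date of every week touched by the range [start, end]."""
    weeks = []
    week = week_first_date(start)
    while week <= end:
        weeks.append(week)
        week += timedelta(days=7)
    return weeks


def users_between(execute, start, end):
    """Run one single-partition query per week and merge the users client-side."""
    query = (
        "SELECT user FROM actions_by_week "
        "WHERE week = %s AND ts >= %s AND ts <= %s"
    )
    users = set()
    for week in weeks_between(start.date(), end.date()):
        for row in execute(query, (week, start, end)):
            users.add(row[0])
    return users
```

Each query hits exactly one partition, so the number of round trips grows with the length of the interval rather than with the size of the table.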

Can we restrict in cassandra that a table only have limited number of records or rows?

Can we restrict a Cassandra table to a limited number of records or rows? If we want to allow at most 20 rows in a table, how do we do that?
Cassandra does not support this kind of operation. This is part of the business logic in your application and it should be done on application level.
No, but you can put a PER PARTITION LIMIT on the query, then periodically issue a delete that creates a range tombstone for everything past that limit. For example, in a table like
CREATE TABLE mytable (
"primary" text,
clustering timestamp,
value text,
PRIMARY KEY (("primary"), clustering)
);
you can run SELECT * FROM mytable WHERE "primary" = 'mykey' PER PARTITION LIMIT 20 and, if the last row returned has a clustering of 1548857236000, follow up with DELETE FROM mytable WHERE "primary" = 'mykey' AND clustering > 1548857236000. For the most part I'd just issue that delete very infrequently (once an hour or once a day, depending on load, to keep the partition size down) and use LeveledCompactionStrategy. Under enough load, include a date component in the partition key, like (("primary", yyyyMMdd), clustering), to prevent too much tombstone buildup in the partition. (The column name primary is quoted because PRIMARY is a reserved word in CQL.)
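The select-then-delete trim can be sketched like this in Python. It assumes the mytable layout from this answer, with the column name primary quoted because PRIMARY is a reserved word in CQL. execute is a hypothetical callable taking (query, params) and returning rows as tuples, and the %s placeholders are an assumption about its binding style.

```python
SELECT_FIRST = (
    'SELECT clustering FROM mytable WHERE "primary" = %s PER PARTITION LIMIT %s'
)
DELETE_REST = 'DELETE FROM mytable WHERE "primary" = %s AND clustering > %s'


def trim_partition(execute, key, keep=20):
    """Keep the first `keep` rows of a partition and range-delete the rest.

    Returns the clustering value of the last kept row, or None when the
    partition holds fewer than `keep` rows and nothing needs deleting.
    """
    rows = list(execute(SELECT_FIRST, (key, keep)))
    if len(rows) < keep:
        return None
    last_kept = rows[-1][0]
    # One range tombstone covers everything after the last kept row.
    execute(DELETE_REST, (key, last_kept))
    return last_kept
```

This would be run from a periodic job rather than on every write, in line with the advice to issue the delete infrequently.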

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of which has a business entity id, a timestamp and a value. I need to get only the latest value, in terms of the in-event timestamp, for each business key, but events may arrive out of order.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using version 3.6 or later, there is a query option named PER PARTITION LIMIT (CASSANDRA-7017), which you can set to 1. It won't autocomplete in cqlsh until 3.10 (CASSANDRA-12803).
SELECT * FROM table1 PER PARTITION LIMIT 1;
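To make the semantics concrete, here is a small in-memory model in Python of what that query returns for a table clustered by time DESC. It is only a model and does not talk to Cassandra; rows are plain (id, time, value) tuples, and partitions come back sorted by id here, whereas Cassandra returns them in token order.

```python
from itertools import groupby


def per_partition_limit(rows, n=1):
    """Model PER PARTITION LIMIT n over rows of (id, time, value).

    Within each id the rows are ordered by time descending, mirroring
    CLUSTERING ORDER BY (time DESC), and only the first n are kept.
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)  # newest first
    ordered.sort(key=lambda r: r[0])  # stable sort groups rows by partition key
    result = []
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        result.extend(list(partition)[:n])
    return result
```

With n=1 this yields the latest event per id regardless of the order in which the events were inserted, which is the deduplication the question asks for.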
In a word: no.
The partitioning key is why Cassandra can work with essentially any amount of data: it decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.
