Cassandra count with limit - cassandra

I need to find out if the count of records in Cassandra table is greater than certain number, e.g 10000.
I still don't have large data set, but at a large scale, with possible billions of records, how would I be able to achieve this efficiently?
There could potentially be billions of records, or just thousands. I just need to know if there are more or less than 10K.
This below doesn't seem right, I think it would fail or be very slow for large number of records.
SELECT COUNT(*) FROM data WHERE sourceId = {id} AND timestamp <
{endDate} AND timestamp > {startDate};
I could also do something like this:
SELECT * FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate} LIMIT 10000;
and count in memory
I can't have new table used for counting, e.g, when a new record is written, increase counter, that option is unacceptable.
Is there some other way to do this? Select with limit looks dumb, but seems most viable.
sourceId is partition key and timestamp is clustering key.
Cassandra version is 3.11.4, and I work in Spring if it has any relevance.

You may introduce bucket_id into partition key, so primary key will be ((sourceId, bucket_id), timestamp). Bucketing is used cassandra to constraint data rows belonging to single partition, i.e. partition will be split into smaller chunks. To count all rows issue async query for each partition (source_id, bucket_id) with additional timestamp field. Bucket_id may_be derived from timestamp so that is possible define which bucket_id is required to access.
Another solutions:
use cassandra's counters (but I read it affect performance, and cannot correctly handle repeat and speculative queries)
use another db, like redis which has atomic counters (but how synchronize redis and cassandra?)
precalculate values and save it's during write (for example into static columns)
something else

The first query:
SELECT COUNT(*) FROM data WHERE sourceId = {id}
AND timestamp < {endDate} AND timestamp > {startDate};
should work if you have a table with following primary key: (sourceId, timestamp, ...) - in this case, aggregation operation is executed inside the single partition, so it won't involve the hitting of multiple nodes, etc. It still may timeout if you have very slow disks, and too much data in given time range.
If you have another table structure, then you'll need to use something like Spark, that will read data from Cassandra, perform filtering, and counting...

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select an event with the highest start_time. It is possible? I've tried to create a secondary index, but no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all data and find the highest value. And this will require scanning of data on multiple nodes, and will be very long.
Materialized view also won't help much as order for data only exists inside an individual partition, so you will need to put all your data into a single partition that could be huge and data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel to normal inserts, but use a fixed value for that fake partition key, and explicitly setting a timestamp for a record equal to start_time (don't forget to multiple by 1000 as timestamp uses microseconds). In this case it will guaranteed to be the value with the highest timestamp as Cassandra won't override it with other data with lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
Problem with your existing implementation is that deletes create tombstones which eventually cause latencies in the read. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark , read from the orignal table , migrate all required records (filter extra records) and write to new temp table.
3) Truncate the orignal table. Note that truncate operation do not create Tombstones.
4) Migrate everything from the temp table back to orignal table using Spark.
5) Truncate the temp table.
You can do this in maintenance window of your application ( something like once in a month) until then you can restrict reads with Limit 100 per partition.

Cassandra order by timestemp desc

I just begin study cassandra.
It was a table and queries.
CREATE TABLE finance.tickdata(
id_symbol int,
ts timestamp,
bid double,
ask double,
PRIMARY KEY(id_symbol,ts)
);
And query is successful,
select ts,ask,bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next it was decision move id_symbol in table name, new table(s) scripts.
CREATE TABLE IF NOT EXISTS mts_src.ticks_3(
ts timestamp PRIMARY KEY,
bid double,
ask double
);
And now query fails,
select * from mts_src.ticks_3 order by ts desc
I read from docs, that I need use and filter (WHERE) by primary key (partition key),
but technically my both examples same. Why cassandra so restricted in this aspect?
And one more question, It is good idea in general? move id_symbol in table name -
potentially it can be 1000 of unique id_symbol and a lot of data for each. Separate this data on individual tables look like good idea!? But I lose order by possibility, that is so necessary for me to take fresh data by each symbol_id.
Thanks.
You can't sort on the partition key, you can sort only on clustering columns inside the single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using ticker_id as partition key, for example). In this case you may need to create a composite keys, like, ticker_id + year, or month, depending on how often you're inserting the data.
Regarding the table per ticker, that's not very good idea, because every table has overhead, it will lead to increased resource consumption. 200 tables is already high number, and 500 is almost "hard limit"

Cassandra, filter latest rows from an append only table

Currently I have a simple table as follows:
CREATE TABLE datatable (timestamp bigint, value bigint, PRIMARY KEY (timestamp))
This table is only growing and never being modified. The key is unique timestamp. All queries are range queries of the form:
SELECT * from datatable WHERE timestamp > 123456 ALLOW FILTERING
Moreover, queries request only a small set of the latest rows inserted. The problem that I have right now is that performance of these queries negatively correlated with the table size. As table grows, it takes significantly longer to get response, even if query returns just a few rows.
Could you advise on how I should modify table schema to avoid performance degradation (e.g., create index or set clustering)?
Thanks!
Add some time bucketing like
CREATE TABLE datatable (
bucket timestamp,
time timestamp,
value bigint,
PRIMARY KEY ((bucket), time)
) WITH CLUSTERING ORDER BY (time DESC);
where bucket is the date truncated to the day or week or month (can figure out how many based on approx ingestion rate, a decent goal is about 64mb per partition but thats very flexible), that way you will collect all the rows for a period within a single partition very efficiently.
Having billions of partitions per node will cause slow down repairs and compactions significantly. Also partitioning order is random (murmur3 hash of the partition key order) so you cannot do things like have your above your query in order.
With the above you can then iterate from the bucket of your start time to the current bucket without ALLOW FILTERING (which you should never ever use outside of toy amounts of data or test environment kinda things) and the results will be in the order of the timestamps.

How do secondary indexes work in Cassandra?

Suppose I have a column family:
CREATE TABLE update_audit (
scopeid bigint,
formid bigint,
time timestamp,
record_link_id bigint,
ipaddress text,
user_zuid bigint,
value text,
PRIMARY KEY ((scopeid, formid), time)
) WITH CLUSTERING ORDER BY (time DESC)
With two secondary indexes, where record_link_id is a high-cardinality column:
CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);
CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);
According to my knowledge Cassandra will create two hidden column families like so:
CREATE TABLE update_audit_id_idx(
record_link_id bigint,
scopeid bigint,
formid bigint,
time timestamp
PRIMARY KEY ((record_link_id), scopeid, formid, time)
);
CREATE TABLE update_audit_user_zuid_idx(
user_zuid bigint,
scopeid bigint,
formid bigint,
time timestamp
PRIMARY KEY ((user_zuid), scopeid, formid, time)
);
Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.
Consider the following query:
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How will this query execute 'under the hood' in Cassandra?
How will a high-cardinality column index (record_link_id) affect its performance?
Will Cassandra touch all nodes for the above query? Why?
Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How the above query will work internally in cassandra?
Essentially, all data for partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. It will look for the record_link_id entry for 9897, and attempt to match-up entries that match the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition keys and the index keys will be returned.
How high-cardinality column (record_link_id)index will affect the query performance for the above query?
High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.
Does cassandra will touch all nodes for the above query? WHY?
No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes likewise are stored locally, only contain entries that are valid for the local node.
creating index over high-cardinality columns will be the fastest and best data model
The problem here is that approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes(The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:
If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.
...
In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.
Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
Edit
How does having index on low cardinality index when there are millions of users scale even when we provide the primary key
It will depend on how wide your rows are. The tricky thing about extremely low cardinality indexes, is that the % of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter-out about half of those rows, which won't perform well.
Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).
Edit 20180913
For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination?
Consider the following diagram, taken from the Java Driver documentation (v3.6):
Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It'd be less likely to timeout, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.
TL;DR; The more requested results spread over more nodes, the longer it will take.
Query with only secondary index is also possible in Cassandra 2.x
select * from update_audit where record_link_id=9897;
But this has a large impact on fetching data, because it reads all partitions on distributed environment. The data fetched by this query is also not consistent and could not relay on it.
Suggestion:
Use of Secondary index is considered to be a DIRT query from NoSQL Data Model view.
To avoid secondary index, we could create a new table and copy data to it. Since this is a query of the application, Tables are derived from queries.

Resources