I'm looking to sanity check my approach to paginating a Cassandra table. My use case is the following: I need a table that gives me the last X visitors to a website on a given day, to power an analytics dashboard. I log the visits with a session_id, and I have the following table schema:
CREATE TABLE recent_visitors (
session_id text,
yyyymmdd text,
bucket int,
timeuuid timeuuid,
PRIMARY KEY ((yyyymmdd, bucket), timeuuid)
) WITH CLUSTERING ORDER BY (timeuuid DESC);
The bucket is there to avoid hotspots on one node. On to pagination:
The query will look something like this:
SELECT session_id FROM recent_visitors WHERE yyyymmdd = ? AND bucket IN (?) LIMIT 1000;
Now, this query will most likely hit every node, since the number of buckets is larger than the number of nodes. Will this query be too expensive, or is there a better way? Also, I know that within each partition the data is sorted by the clustering column, but will Cassandra sort the result from all the partitions? In other words, the data will be returned sorted within each (yyyymmdd, bucket) group, but across groups will I have to sort the result for the final display? Then, if I take the oldest timeuuid from the result, I am planning on paginating with the following query:
SELECT session_id FROM recent_visitors WHERE yyyymmdd = ? AND bucket IN (?) AND timeuuid < previous_oldest_timeuuid LIMIT 1000;
Is that a sane approach? Thank you in advance for your time.
For some basics of modeling a time series in Cassandra see the following article:
http://planetcassandra.org/blog/getting-started-with-time-series-data-modeling/
Your data model looks sane, but I would change your read query. You are going to be better off sending a separate query for each bucket asynchronously, rather than querying them all at once with IN like that.
The result set from the IN query is only going to be ordered within each bucket, so you will have to merge the different buckets together on the client either way, and it is better to hit only one server with each query rather than have one query that hits multiple servers.
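For illustration, here is roughly what that could look like with the DataStax Java driver 4.x: one asynchronous, single-partition query per bucket, with the per-bucket results merged and re-sorted by timeuuid on the client. The driver version, the session setup, the NUM_BUCKETS constant, and the lastVisitors helper are assumptions of this sketch, not part of the question.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.uuid.Uuids;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletionStage;

public class RecentVisitors {

    private static final int NUM_BUCKETS = 16; // assumption: match whatever bucket range you write with

    // One async query per (yyyymmdd, bucket) partition, merged on the client by timeuuid.
    static List<Row> lastVisitors(CqlSession session, String yyyymmdd, int limit) {
        List<CompletionStage<AsyncResultSet>> futures = new ArrayList<>();
        for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
            SimpleStatement stmt = SimpleStatement.newInstance(
                "SELECT session_id, timeuuid FROM recent_visitors "
                    + "WHERE yyyymmdd = ? AND bucket = ? LIMIT ?",
                yyyymmdd, bucket, limit);
            futures.add(session.executeAsync(stmt)); // each query touches a single partition
        }

        List<Row> merged = new ArrayList<>();
        for (CompletionStage<AsyncResultSet> future : futures) {
            AsyncResultSet rs = future.toCompletableFuture().join();
            // A LIMIT of 1000 fits in the driver's default page of 5000 rows, so one page per bucket is enough here.
            rs.currentPage().forEach(merged::add);
        }

        // Each bucket arrives newest-first; sort the union and keep the newest `limit` rows overall.
        merged.sort(Comparator.comparingLong(
            (Row r) -> Uuids.unixTimestamp(r.getUuid("timeuuid"))).reversed());
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}

For the next page you would add AND timeuuid < ? to each per-bucket query, bound to the oldest timeuuid returned so far, exactly as in your second query.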
I need to find out whether the count of records in a Cassandra table is greater than a certain number, e.g. 10,000.
I still don't have a large data set, but at a large scale, with possibly billions of records, how would I be able to achieve this efficiently?
There could potentially be billions of records, or just thousands. I just need to know if there are more or fewer than 10K.
The query below doesn't seem right; I think it would fail or be very slow for a large number of records.
SELECT COUNT(*) FROM data WHERE sourceId = {id} AND timestamp <
{endDate} AND timestamp > {startDate};
I could also do something like this:
SELECT * FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate} LIMIT 10000;
and count the rows in memory.
I can't have a separate table used for counting (e.g., incrementing a counter whenever a new record is written); that option is unacceptable.
Is there some other way to do this? Selecting with a limit looks dumb, but it seems the most viable option.
sourceId is the partition key and timestamp is the clustering key.
The Cassandra version is 3.11.4, and I work with Spring, if that is of any relevance.
You may introduce a bucket_id into the partition key, so the primary key becomes ((sourceId, bucket_id), timestamp). Bucketing is used in Cassandra to constrain the number of rows belonging to a single partition, i.e. each partition is split into smaller chunks. To count all rows, issue an asynchronous query for each partition (sourceId, bucket_id), with the timestamp range as an additional restriction. The bucket_id can be derived from the timestamp, so it is possible to determine which bucket_ids need to be accessed for a given range.
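For illustration, a minimal sketch of that pattern with the DataStax Java driver 4.x, assuming day-sized buckets encoded as yyyyMMdd integers and a text sourceId; the bucket encoding and the helper names are assumptions of the sketch, not something prescribed by the schema above.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionStage;

public class BucketedCount {

    // Assumption: bucket_id is the UTC day of the timestamp, encoded as yyyyMMdd (e.g. 20240131).
    static int bucketFor(LocalDate day) {
        return day.getYear() * 10000 + day.getMonthValue() * 100 + day.getDayOfMonth();
    }

    // Sum per-bucket counts for one source over the time range; one async query per partition.
    static long countBetween(CqlSession session, String sourceId, Instant start, Instant end) {
        List<CompletionStage<AsyncResultSet>> futures = new ArrayList<>();
        LocalDate first = start.atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate last = end.atZone(ZoneOffset.UTC).toLocalDate();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            futures.add(session.executeAsync(SimpleStatement.newInstance(
                "SELECT count(*) FROM data "
                    + "WHERE sourceId = ? AND bucket_id = ? AND timestamp > ? AND timestamp < ?",
                sourceId, bucketFor(d), start, end)));
        }
        long total = 0;
        for (CompletionStage<AsyncResultSet> f : futures) {
            total += f.toCompletableFuture().join().one().getLong(0); // count(*) comes back as a single bigint
        }
        return total;
    }
}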
Other solutions:
use Cassandra's counters (but I have read that they affect performance and cannot correctly handle retried or speculative queries)
use another DB, like Redis, which has atomic counters (but how do you keep Redis and Cassandra in sync?)
precalculate the values and save them during writes (for example into static columns)
something else
The first query:
SELECT COUNT(*) FROM data WHERE sourceId = {id}
AND timestamp < {endDate} AND timestamp > {startDate};
should work if you have a table with the following primary key: (sourceId, timestamp, ...). In this case the aggregation is executed inside a single partition, so it won't involve hitting multiple nodes, etc. It may still time out if you have very slow disks and too much data in the given time range.
If you have another table structure, then you'll need to use something like Spark, which will read the data from Cassandra, filter it, and count it.
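Since you only need to know whether there are more or fewer than 10K rows, another option, close to the SELECT-with-LIMIT idea in the question, is to page through the matching rows and stop as soon as the threshold is crossed. A rough sketch with the DataStax Java driver 4.x (the session setup, a text sourceId, and the hasMoreThan helper are assumptions):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Instant;

public class ThresholdCheck {

    // Returns true if the partition holds more than `threshold` rows in the time range.
    // The synchronous ResultSet pages transparently; we stop iterating once we have seen enough.
    static boolean hasMoreThan(CqlSession session, String sourceId,
                               Instant start, Instant end, int threshold) {
        SimpleStatement stmt = SimpleStatement.newInstance(
                "SELECT timestamp FROM data "
                    + "WHERE sourceId = ? AND timestamp > ? AND timestamp < ?",
                sourceId, start, end)
            .setPageSize(1000); // fetch in modest pages so we never pull much more than we need
        ResultSet rs = session.execute(stmt);
        int seen = 0;
        for (Row ignored : rs) {
            if (++seen > threshold) {
                return true; // stop early; remaining pages are not fetched
            }
        }
        return false;
    }
}

With paging, this reads at most the threshold plus roughly one page of rows, no matter how many billions sit behind them.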
I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ("from" timestamp, "to" timestamp, user text, PRIMARY KEY(("from", "to"), user));
I'm trying to implement the following query in Cassandra:
select * from t WHERE "from" > :startInterval and "to" < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated ( yearFrom int, monthFrom int, yearTo int, monthTo int, "from" timestamp, "to" timestamp, user text, PRIMARY KEY((yearFrom, monthFrom, yearTo, monthTo), user) );
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, the partition key only works with the equals operator in queries. It is better to use PRIMARY KEY (bucket, time_stamp) here, where the bucket can be a combination of year and month (or also include day, hour, etc., depending on how big your data set is).
It is better to execute multiple queries and combine the results on the client side.
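As a rough illustration of that (the table name t_bucketed, the yyyyMM bucket encoding, adding user to the clustering key, and the DataStax Java driver 4.x are all assumptions of this sketch, not part of the answer above):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

public class MonthBuckets {

    // Assumed table: t_bucketed (bucket int, time_stamp timestamp, user text,
    // PRIMARY KEY (bucket, time_stamp, user)) with bucket encoded as yyyyMM, e.g. 201707.
    static List<Row> usersBetween(CqlSession session, Instant start, Instant end) {
        List<Row> rows = new ArrayList<>();
        YearMonth first = YearMonth.from(start.atZone(ZoneOffset.UTC));
        YearMonth last = YearMonth.from(end.atZone(ZoneOffset.UTC));
        // One single-partition query per month bucket; results are combined on the client.
        for (YearMonth ym = first; !ym.isAfter(last); ym = ym.plusMonths(1)) {
            int bucket = ym.getYear() * 100 + ym.getMonthValue();
            SimpleStatement stmt = SimpleStatement.newInstance(
                "SELECT user FROM t_bucketed WHERE bucket = ? AND time_stamp >= ? AND time_stamp < ?",
                bucket, start, end);
            session.execute(stmt).forEach(rows::add);
        }
        return rows;
    }
}

The per-bucket queries could also be issued asynchronously, as suggested for the earlier questions, if the interval spans many months.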
The answer depends on the expected number of entries. The rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to go with the year as the partition key.
We use the week's first date as the partition key in an IoT scenario, where values get written at most once a minute.
Currently I have a simple table as follows:
CREATE TABLE datatable (timestamp bigint, value bigint, PRIMARY KEY (timestamp))
This table only grows and is never modified. The key is a unique timestamp. All queries are range queries of the form:
SELECT * from datatable WHERE timestamp > 123456 ALLOW FILTERING
Moreover, queries request only a small set of the most recently inserted rows. The problem I have right now is that the performance of these queries is negatively correlated with the table size. As the table grows, it takes significantly longer to get a response, even if the query returns just a few rows.
Could you advise on how I should modify the table schema to avoid this performance degradation (e.g., create an index or set a clustering key)?
Thanks!
Add some time bucketing like
CREATE TABLE datatable (
bucket timestamp,
time timestamp,
value bigint,
PRIMARY KEY ((bucket), time)
) WITH CLUSTERING ORDER BY (time DESC);
where bucket is the timestamp truncated to the day, week, or month (you can figure out which based on your approximate ingestion rate; a decent goal is about 64 MB per partition, but that's very flexible). That way you will collect all the rows for a period within a single partition very efficiently.
Having billions of partitions per node will slow down repairs and compactions significantly. Also, partition order is effectively random (partitions are ordered by the Murmur3 hash of the partition key), so you cannot get the results of a query like yours above back in time order.
With the above you can then iterate from the bucket of your start time to the current bucket without ALLOW FILTERING (which you should never use outside of toy amounts of data or test environments), and the results will be in timestamp order.
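A minimal write/read sketch of that layout with the DataStax Java driver 4.x, assuming day-sized buckets; the helper names and the walk-back limit are assumptions:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

public class DayBucketedTable {

    // Assumption: bucket is the event time truncated to the UTC day.
    static Instant bucketOf(Instant time) {
        return time.truncatedTo(ChronoUnit.DAYS);
    }

    static void write(CqlSession session, Instant time, long value) {
        session.execute(SimpleStatement.newInstance(
            "INSERT INTO datatable (bucket, time, value) VALUES (?, ?, ?)",
            bucketOf(time), time, value));
    }

    // Read the latest rows by walking backwards one day-bucket at a time,
    // always within a single partition and never using ALLOW FILTERING.
    static List<Row> latest(CqlSession session, int limit, int maxDaysBack) {
        List<Row> rows = new ArrayList<>();
        Instant bucket = bucketOf(Instant.now());
        for (int day = 0; day < maxDaysBack && rows.size() < limit; day++) {
            session.execute(SimpleStatement.newInstance(
                "SELECT time, value FROM datatable WHERE bucket = ? LIMIT ?",
                bucket, limit - rows.size()))
                .forEach(rows::add); // rows come back newest-first thanks to CLUSTERING ORDER BY (time DESC)
            bucket = bucket.minus(1, ChronoUnit.DAYS);
        }
        return rows;
    }
}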
I am trying to figure out what would be the best way to implement a valid from/to data filtering in Cassandra.
I need to have a table with records that are only valid in a certain time window - always defined. Each such record would not be valid for more than, let's say, 3 months.
I would like to have a structure like this (more or less, of course):
userId bigint,
validFrom timestamp ( or maybe split into columns like: from_year, from_month etc. if that helps )
validTo timestamp ( or as above )
someCollection set
All queries would be performed by userId, validFrom, validTo.
I know the limits of querying in Cassandra (both PK and clustering keys) but maybe I am missing some trick or clever usage of what is available in CQL.
Any help appreciated!
You could just select by validFrom but TTL the data by the validTo to make sure the number of records you need to filter in your app doesn't get too large. However, depending on how many records you have per user this may result in a lot of tombstones.
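A small sketch of that TTL idea with the DataStax Java driver 4.x; the table name user_validity, the key layout, and the insert helper are assumptions made for the example:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Duration;
import java.time.Instant;

public class ValidityWindow {

    // Assumed table: user_validity (userId bigint, validFrom timestamp, validTo timestamp, ...,
    // PRIMARY KEY ((userId), validFrom)); the collection column is omitted to keep the sketch minimal.
    static void insert(CqlSession session, long userId, Instant validFrom, Instant validTo) {
        // Let the row expire on its own when the validity window closes.
        long ttlSeconds = Math.max(1, Duration.between(Instant.now(), validTo).getSeconds());
        session.execute(SimpleStatement.newInstance(
            "INSERT INTO user_validity (userId, validFrom, validTo) VALUES (?, ?, ?) USING TTL ?",
            userId, validFrom, validTo, (int) ttlSeconds));
    }
}

Each row then disappears once validTo has passed, so a SELECT by userId (optionally restricted on validFrom) only ever sees currently valid records, minus whatever tombstone overhead the expirations leave behind, as noted above.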
I'm using (the latest version of) the Cassandra NoSQL DBMS to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically overwrites the data for any existing primary key match, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make is: how many distinct customer accounts were active within the last month, say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to perform. The reason that Cassandra is trying to keep you from querying on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response times. Unfortunately, Cassandra isn't able to determine automatically what the best way to partition your data is. The end user must design a schema which fits into the C* data model and allows the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts WHERE customer_name = ? AND account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of CustomerName and AccountName. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition, making it very fast to do range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might span, as sketched below. This schema takes a lot of (dev) work, so if these queries are very infrequent you should use a distributed analytics platform instead.
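As a hedged sketch of that application-side code (DataStax Java driver 4.x assumed; the month-key derivation and the countActive helper are illustrative, the table is the revised active_accounts above):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.HashSet;
import java.util.Set;

public class ActiveAccounts {

    // Count distinct (customer, account) pairs active in [start, end], one query per month partition.
    static int countActive(CqlSession session, Instant start, Instant end) {
        Set<String> accounts = new HashSet<>();
        YearMonth first = YearMonth.from(start.atZone(ZoneOffset.UTC));
        YearMonth last = YearMonth.from(end.atZone(ZoneOffset.UTC));
        for (YearMonth ym = first; !ym.isAfter(last); ym = ym.plusMonths(1)) {
            Instant startMonth = ym.atDay(1).atStartOfDay(ZoneOffset.UTC).toInstant();
            SimpleStatement stmt = SimpleStatement.newInstance(
                "SELECT customer_name, account_name FROM active_accounts "
                    + "WHERE start_month = ? AND date >= ? AND date <= ?",
                startMonth, start, end);
            for (Row row : session.execute(stmt)) {
                // Dedupe on the client; the same account may appear under several dates in a month.
                accounts.add(row.getString("customer_name") + "/" + row.getString("account_name"));
            }
        }
        return accounts.size();
    }
}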
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for those doing analytics jobs, or for C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily contact every node in the cluster. This will become less and less performant as the cluster grows, and is definitely not something you want for a frequent query.