I am trying to figure out the best way to implement valid-from/valid-to filtering in Cassandra.
I need a table with records that are only valid within a certain time window, which is always defined. No record would be valid for more than, let's say, 3 months.
I would like to have a structure like this (more or less, of course):
userId bigint,
validFrom timestamp ( or maybe split into columns like: from_year, from_month etc. if that helps )
validTo timestamp ( or as above )
someCollection set
All queries would be performed by userId, validFrom, validTo.
I know the limits of querying in Cassandra (both PK and clustering keys) but maybe I am missing some trick or clever usage of what is available in CQL.
Any help appreciated!
You could just select by validFrom and TTL the data based on validTo, to make sure the number of records you need to filter in your app doesn't get too large. However, depending on how many records you have per user, this may result in a lot of tombstones.
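To make the TTL idea concrete: Cassandra's USING TTL takes a whole number of seconds, so the client has to convert validTo into a remaining lifetime at insert time. A minimal sketch (the `ttl_seconds` helper and the commented-out table/column names are illustrative, not from the original post):

```python
from datetime import datetime, timedelta, timezone

def ttl_seconds(valid_to: datetime, now: datetime) -> int:
    """Seconds until valid_to; Cassandra's USING TTL takes whole seconds."""
    remaining = int((valid_to - now).total_seconds())
    if remaining <= 0:
        raise ValueError("record is already expired")
    return remaining

# The insert would then look something like (hypothetical table/columns):
# INSERT INTO user_windows (userId, validFrom, validTo, someCollection)
# VALUES (?, ?, ?, ?) USING TTL <ttl_seconds(validTo, now)>
```

Expired rows then disappear on their own, at the cost of the tombstones mentioned above.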
Say I have this Cassandra table:
CREATE TABLE orders (
customerId int,
datetime date,
amount int,
PRIMARY KEY (customerId, datetime)
);
Then why would the following query require an ALLOW FILTERING:
SELECT * FROM orders WHERE datetime >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key datetime. Since datetime is sorted, there is no need to retrieve all the rows in orders and filter out the ones that match my WHERE clause (as far as I understand it).
I hope someone can enlighten me.
Thanks
This happens because, for normal operation, Cassandra needs the partition key - it is used to find which machine(s) store the data. If you don't provide the partition key, as in your example, Cassandra needs to scan all the data to find the rows matching your query, and this requires ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.
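Since an efficient query must pin the partition key, the usual workaround is to fan out one query per customer on the client side. A minimal sketch, assuming you already know the customer ids to query (the `orders_since` helper is hypothetical; it only builds the query strings):

```python
def orders_since(customer_ids, start_date):
    """Build one partition-restricted query per customer. Each query hits
    only that customer's partition and slices the sorted `datetime`
    clustering column, so no ALLOW FILTERING is needed."""
    template = ("SELECT * FROM orders "
                "WHERE customerId = {cid} AND datetime >= '{start}'")
    return [template.format(cid=cid, start=start_date) for cid in customer_ids]
```

These queries can then be issued concurrently and the results combined in the application.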
I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ( "from" timestamp, "to" timestamp, user text, PRIMARY KEY(("from", "to"), user))
I'm trying to implement the following query in Cassandra:
select * from t WHERE "from" > :startInterval and "to" < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated ( yearFrom int, monthFrom int, yearTo int, monthTo int, "from" timestamp, "to" timestamp, user text, PRIMARY KEY((yearFrom, monthFrom, yearTo, monthTo), user) )
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, queries on the partition key have to use the equals operator (or IN). It is better to use PRIMARY KEY (bucket, time_stamp) here, where bucket can be a combination of year and month (or also day, hour, etc., depending on how big your data set is).
It is better to execute multiple queries and combine the results on the client side.
The answer depends on the expected number of entries. The rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to go with the year as the partition key.
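The multi-query approach above needs the client to enumerate the bucket keys covering the requested interval. A minimal sketch for (year, month) buckets (the `month_buckets` helper is illustrative, not from the answer):

```python
def month_buckets(start_year, start_month, end_year, end_month):
    """All (year, month) partition keys covering the range, inclusive.
    One query is then issued per key and the results merged client-side."""
    buckets = []
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        buckets.append((y, m))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return buckets
```

For the Jan-July 2017 example this yields seven keys, i.e. seven single-partition queries instead of one multi-partition IN query.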
We use the week-first date as a partition key in an IoT scenario, where values get written at most once a minute.
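A week-first-date bucket is cheap to derive on the client. A minimal sketch, assuming weeks start on Monday (the helper name is illustrative):

```python
from datetime import date, timedelta

def week_first_date(d: date) -> date:
    """Monday of the week containing d, used as the partition key."""
    return d - timedelta(days=d.weekday())
```

Every timestamp in the same Monday-to-Sunday week then maps to the same partition.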
Our problem is a bit different from the usual time-series problem, as we do not have a natural partition key in our data. In our system we get no more than 5k messages/s, so following many publications (like this one) we came up with the following schema (the real one is more complex, but the part below matters most):
CREATE TABLE IF NOT EXISTS test.messages (
date TEXT,
hour INT,
createdAt TIMESTAMP,
uuid UUID,
data TEXT,
PRIMARY KEY ((date, hour), createdAt, uuid)
)
We mostly want to query the system based on the creation (event) time; other filtering will likely be done in different engines like Spark. The problem is that a query may span e.g. two months, so ideally we should put 60+ dates and 24 hours in the WHERE ... IN part of the query, which is cumbersome to say the least. Of course, we can execute queries like the one below:
SELECT * FROM messages WHERE createdat >= '2017-03-01 00:00:00' LIMIT 10 ALLOW FILTERING;
My understanding is that, while the above works, it will do a full scan, which will be expensive on a larger cluster. Or am I mistaken, and C* knows which partitions to scan?
I was thinking of adding an index, but this likely falls into the high-cardinality antipattern, as I understand it.
EDIT: the question is not so much about the data model, though suggestions are welcome, but about the feasibility of querying with a createdAt range instead of listing all the date and hour values required in the WHERE ... IN part of the query, to avoid full scans.
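To see the size of the enumeration being avoided: with (date, hour) as the partition key, a createdAt range translates into one key per hour. A minimal sketch of that expansion (the helper is illustrative; each resulting key would back one single-partition query, needing no ALLOW FILTERING):

```python
from datetime import datetime, timedelta

def partitions_for_range(start: datetime, end: datetime):
    """All (date, hour) partition keys touched by the half-open
    interval [start, end), matching the (date, hour) partition key."""
    keys = []
    cur = start.replace(minute=0, second=0, microsecond=0)
    while cur < end:
        keys.append((cur.strftime("%Y-%m-%d"), cur.hour))
        cur += timedelta(hours=1)
    return keys
```

A two-month span (March-April) expands to 61 × 24 = 1464 keys, which illustrates why the poster finds the WHERE ... IN form cumbersome, even though each individual query is efficient.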
I'm looking to sanity check my approach to paginating a Cassandra table. My use case is the following: I need a table that gives me the last X visitors to a website on a given day, to power an analytics dashboard. I log the visits with a session_id, and I have the following table schema:
session_id text,
yyyymmdd text,
bucket int,
timeuuid timeuuid,
primary key((yyyymmdd, bucket), timeuuid)
WITH CLUSTERING ORDER BY (timeuuid DESC)
The bucket is there to avoid hotspots on one node. On to pagination:
The query will look something like this:
SELECT session_id FROM recent_visitors WHERE yyyymmdd = ? AND bucket IN (?) LIMIT 1000;
Now, this query will most likely hit every node, since the bucket number is larger than the number of nodes. Will this query be too expensive, or is there a better way? Also, I know that within each partition the data is sorted by the clustering column, but will Cassandra sort the results from all the partitions? In other words, the data will be returned sorted within each (yyyymmdd, bucket) group, but across groups will I have to sort the result myself for final display? Then, if I take the oldest timeuuid from the result, I am planning to paginate with the following query:
SELECT session_id FROM recent_visitors WHERE yyyymmdd = ? AND bucket IN (?) AND timeuuid < previous_oldest_timeuuid LIMIT 1000;
Is that a sane approach? Thank you in advance for your time.
For some basics of modeling a time series in Cassandra see the following article:
http://planetcassandra.org/blog/getting-started-with-time-series-data-modeling/
Your data model looks sane, but I would change your read query. You are going to be better off sending off a bunch of queries for the different buckets asynchronously rather than querying them as a batch like that.
Your result set from the batch is going to be ordered per bucket, so you will have to combine the different buckets together either way, and it is better to hit only one server with each query rather than have one query that hits multiple servers.
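The per-bucket results still have to be merged into one newest-first page on the client. Since each bucket's rows already arrive sorted by timeuuid descending, this is a k-way merge. A minimal sketch, representing rows as (sort_key, session_id) pairs for illustration:

```python
import heapq

def merge_buckets(per_bucket_rows, limit=1000):
    """Merge per-bucket result lists, each already sorted newest-first,
    into a single newest-first list of at most `limit` rows."""
    merged = heapq.merge(*per_bucket_rows, key=lambda row: row[0], reverse=True)
    return [row for _, row in zip(range(limit), merged)]
```

Because the inputs are pre-sorted, the merge is linear in the page size rather than requiring a full re-sort of all buckets.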
I have the following Cassandra table which records the user access to a web page.
create table user_access (
id timeuuid primary key,
user text,
access_time timestamp
);
and would like to do a query like this:
get the list of users who access the page for more than 10 times in the last hour.
Is it possible to do it in Cassandra? (I'm kind of stuck with the limited CQL query functionalities)
If not, how do I remodel the table to do this?
Can you do it? yes.
Can you do it efficiently? I'm not convinced.
It's not clear what the timeuuid you are using represents.
You could reorganize this to
CREATE TABLE user_access (
user_id text,
access_time timestamp,
PRIMARY KEY (user_id, access_time)
);
SELECT COUNT(*)
FROM user_access
WHERE user_id = '101'
AND access_time > 'current unix timestamp - 3600'
AND access_time < 'current unix timestamp';
Then filter the results on your own in your language of choice. I wouldn't hold your breath waiting for subquery support.
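The client-side filtering step could look like the following minimal sketch, assuming the rows have already been fetched as (user, access_time) pairs (the `heavy_users` helper and the >10 threshold mirror the question, but the code is illustrative):

```python
from collections import Counter
from datetime import datetime, timedelta

def heavy_users(accesses, now: datetime, threshold=10):
    """accesses: iterable of (user, access_time) pairs. Return the users
    with more than `threshold` accesses in the hour before `now`."""
    cutoff = now - timedelta(hours=1)
    counts = Counter(user for user, t in accesses if cutoff < t <= now)
    return {user for user, count in counts.items() if count > threshold}
```

With the remodeled table, one such count query per user would feed this filter; that is exactly the per-user fan-out the answer warns becomes inefficient with many users.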
That's going to be horribly inefficient if you have lots of users though.
There may be a better solution using CQL's counter columns and binning accesses to the start of the hour. That could get you per-hour accesses, but that's not the same as "within the last hour".