Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all tasks that are applicable within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two range restrictions on Starts_On and Ends_On here, because to add a range query on Ends_On I would have to have an equality check on Starts_On.

In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client-side filtering in your application. You could range query on starts_on and, as rows are returned, check ends_on in your application and discard the rows you don't want.
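As a rough sketch (assuming the table is remodeled so that starts_on is a clustering column under some bucket-style partition key; tasks_by_bucket and bucket are made-up names), only the starts_on restriction runs server side, and the ends_on check happens in the application:
-- server side: tasks that start before the end of the window (T2)
SELECT task_id, starts_on, ends_on
FROM tasks_by_bucket
WHERE bucket = ? AND starts_on <= ?;   -- ? = T2
-- client side: keep only rows where ends_on >= T1 (the start of the window)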

Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_ids, and you have to ignore the null rows (I haven't figured out a way to drop rows using UDFs). You'll note that the filter of start >= 2 dropped one row before it was passed to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)

A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
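As a hedged sketch of what that could look like for the Tasks table (bucketing by day is just one assumed choice; the table and column names here are illustrative):
CREATE TABLE tasks_by_day (
day text,                 -- e.g. '2015-08-27', a coarser partition key
starts_on timestamp,
ends_on timestamp,
task_id uuid,
PRIMARY KEY ((day), starts_on, task_id));
SELECT * FROM tasks_by_day
WHERE day = '2015-08-27'
AND starts_on >= '2015-08-27 08:00'
AND starts_on < '2015-08-27 17:00';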

There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On < ?, where the parameter is the end of the period of interest, i.e. I filter out events that start after our period of interest has finished.
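A minimal sketch of that layout (the partition key and names are assumed for illustration):
CREATE TABLE tasks_by_start (
bucket text,
starts_on timestamp,
ends_on timestamp,          -- regular column, checked client side
task_id uuid,
PRIMARY KEY ((bucket), starts_on, task_id)
) WITH CLUSTERING ORDER BY (starts_on DESC, task_id DESC);
-- ? = the end of the period of interest (T2)
SELECT task_id, starts_on, ends_on FROM tasks_by_start
WHERE bucket = ? AND starts_on < ?;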
I then iterate through the results until a row's Ends_On is earlier than the start of the period of interest, and throw away the rest of the result rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small the driver ends up retrieving multiple pages, which could hurt performance. If it is too large you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.

Related

How to scale a range sharded index on a timestamp column in YugabyteDB?

Is there any performance tuning to do for a write-bound workload in YugabyteDB? We thought that by simply adding nodes to our YugabyteDB cluster, without further tuning, we would see some noticeable increase in writes; however, this is not the case. The schema can be found below.
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
update_id | character varying(255) | | not null | | extended | |
node_id | character varying(255) | | not null | | extended | |
data | character varying | | not null | | extended | |
created_at | timestamp without time zone | | | timezone('utc'::text, now()) | plain | |
Indexes:
"test_pkey" PRIMARY KEY, lsm (update_id HASH)
"test_crat" lsm (created_at DESC)
This table has tablets spread across all tservers with RF=3. created_at is a timestamp that changes all of the time. At this point it has no more than two days of data; all new inserts acquire a new timestamp.
In the case of the schema called out above, the test_crat index here is limited to 1 tablet because it is range-sharded. Since created_at has only recent values they will end up going to 1 shard/tablet even with tablet splitting, meaning that all inserts will go to 1 shard. As explained in this Google Spanner documentation, whose sharding, replication, and transactions architecture YugabyteDB is based off of, this is an antipattern for scalability. As mentioned in that documentation:
If you need a global (cross node) timestamp ordered table, and you need to support higher write rates to that table than a single node is capable of, use application-level sharding. Sharding a table means partitioning it into some number N of roughly equal divisions called shards. This is typically done by prefixing the original primary key with an additional ShardId column holding integer values between [0, N). The ShardId for a given write is typically selected either at random, or by hashing a part of the base key. Hashing is often preferred because it can be used to ensure all records of a given type go into the same shard, improving performance of retrieval. Either way, the goal is to ensure that, over time, writes are distributed across all shards equally. This approach sometimes means that reads need to scan all shards to reconstruct the original total ordering of writes.
What that would mean is: to get recent changes, you would have to query each of the shards. Suppose you have 32 shards:
select * from raw3 where shard_id = 0 and created_at > now() - INTERVAL 'xxx';
..
select * from raw3 where shard_id = 31 and created_at > now() - INTERVAL 'xxx';
On the insert, every row could just be given a random value for your shard_id column from 0..31. And your index would change from:
(created_at DESC)
to
(shard_id HASH, created_at DESC)
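Sketched as DDL against the table from the example queries above (raw3; the new index name is made up, and the exact syntax should be checked against your YugabyteDB version):
ALTER TABLE raw3 ADD COLUMN shard_id int;   -- set to a random value 0..31 on insert
DROP INDEX test_crat;
CREATE INDEX raw3_shard_crat ON raw3 (shard_id HASH, created_at DESC);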
Another approach you could use that may not be as intuitive, but may be more effective, would be to use a partial index for each shard_id that you would want.
Here is a simple example using 4 shards:
create index partial_0 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4=0;
The partial index above only includes rows where the modulus of the epoch in milliseconds of created_at timestamp is 0. And you repeat for the other 3 shards:
create index partial_1 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 1;
create index partial_2 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 2;
create index partial_3 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3;
And then when you query, PostgreSQL is smart enough to pick the right index:
yugabyte=# explain analyze select * from raw3 where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3 AND created_at < now();
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using partial_3 on raw3 (cost=0.00..5.10 rows=10 width=16) (actual time=1.429..1.429 rows=0 loops=1)
Index Cond: (created_at < now())
Planning Time: 0.210 ms
Execution Time: 1.502 ms
(4 rows)
No need for a new shard_id column in the base table or in the index. If you want to reshard down the road, you can recreate new partial indexes with different shards and drop the old indexes.
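For example, moving from 4 shards to 8 might look roughly like this (the index names are assumed):
CREATE INDEX partial_0_of_8 ON raw3(created_at DESC) WHERE (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 8 = 0;
-- ...repeat for remainders 1 through 7, then drop the old % 4 indexes:
DROP INDEX partial_0;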
More information about the DocDB sharding layer within YugabyteDB can be found here. If you are interested in the different sharding strategies we evaluated, and why we decided on consistent hash sharding as the default sharding strategy, take a look at this blog written by our Co-Founder and CTO Karthik Ranganathan.

How to find range in Cassandra Primary key?

Use case: Find maximum counter value in a specific id range
I want to create a table with these columns: time_epoch int, t_counter counter
The frequent query is:
select time_epoch, MAX(t_counter) where time_epoch >= ... and time_epoch < ...
This is to find the counter in a specific time range. I am planning to make time_epoch the primary key, but I am not able to query the data: it always asks for ALLOW FILTERING. Since that is a very costly operation, we don't want to use it.
How should I design the table and query for this use case?
Let's assume that we can "bucket" (partition) your data by day, assuming that there won't be enough writes in a day to make the partitions too large. Then, we can cluster by time_epoch in DESCending order. With time-based data, storing data in descending order often makes the most sense (as business requirements usually care more about the most-recent data).
Therefore, I'd build a table like this:
CREATE TABLE event_counter (
day bigint,
time_epoch timestamp,
t_counter counter,
PRIMARY KEY(day,time_epoch))
WITH CLUSTERING ORDER BY (time_epoch DESC);
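Note that counter columns can only be written with UPDATE (not INSERT), so loading a row looks like this:
UPDATE event_counter SET t_counter = t_counter + 1
WHERE day = 20210219 AND time_epoch = '2021-02-19 14:09:21.625+0000';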
After inserting a few rows, the clustering order becomes evident:
> SELECT * FROM event_counter
WHERE day=20210219
AND time_epoch>='2021-02-18 18:00'
AND time_epoch<'2021-02-19 8:00';
day | time_epoch | t_counter
----------+---------------------------------+-----------
20210219 | 2021-02-19 14:09:21.625000+0000 | 1
20210219 | 2021-02-19 14:08:32.913000+0000 | 2
20210219 | 2021-02-19 14:08:28.985000+0000 | 1
20210219 | 2021-02-19 14:08:05.389000+0000 | 1
(4 rows)
Now SELECTing the MAX t_counter in that range should work:
> SELECT day,max(t_counter) as max
FROM event_counter
WHERE day=20210219
AND time_epoch>='2021-02-18 18:00'
AND time_epoch<'2021-02-19 09:00';
day | max
----------+-----
20210219 | 2
Unfortunately there is no better way. Think about it.
If you know Cassandra's architecture, then you know that your data is spread across multiple nodes based on the partition key. The only way to apply a range filter on the partition key is to traverse each node, which is essentially what ALLOW FILTERING does.

As far as I know, in range queries Cassandra retrieves results ordered by clustering key. Can I change this behavior in my query?

I'm trying to store and retrieve last active sensors by this schema:
CREATE TABLE last_signals (
section bigint,
sensor bigint,
time bigint,
PRIMARY KEY (section, sensor)
);
Rows of this table will be updated every few seconds, so hot sensors will remain in the memtable. But what will happen when I run a query like this:
SELECT * FROM last_signals
WHERE section = ? AND time > ?
Limit ?
ALLOW FILTERING;
And the result will be something like this (Ordered by clustering key):
sect | sens | time
------+------+------
1 | 1 | 4
1 | 2 | 3
1 | 4 | 2
1 | 5 | 9
The first question: is this result guaranteed to be the same in all versions? (I'm using 3.7.) The second: how can I change this behavior (with a query option, data modeling, etc.)? I really need to get the last writes first, without considering clustering-key order. I think in this case my reads will be much faster.
I don't think there is any way to guarantee order besides using clustering keys. Thus your ALLOW FILTERING query is potentially costly and may even time out. You could consider the following schema:
CREATE TABLE last_signals_by_time (
section bigint,
sensor bigint,
time bigint,
dummy bool,
PRIMARY KEY ((section, sensor), time)
) WITH CLUSTERING ORDER BY (time DESC);
Instead of updates, do inserts with a TTL so that you do not have to clean up old entries manually. (The dummy field is needed in order for the TTL to work.)
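For example (the TTL value here is just illustrative):
INSERT INTO last_signals_by_time (section, sensor, time, dummy)
VALUES (?, ?, ?, true) USING TTL 86400;    -- expire the entry after 24 hours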
And then just run your read queries per section/sensor in parallel:
SELECT * FROM last_signals_by_time
WHERE section = ? AND sensor = ?
LIMIT 1;

Data schema in Cassandra using various data types

Currently I am developing a solution in the field of time-series data. Within these data we have: an ID, a value and a timestamp.
So here it comes: the value might be of type boolean, float or string. I consider three approaches:
a) A distinct table for every data type: all sensor values of type boolean go into one table, all values of type string into another, and so on. The obvious disadvantage is that you have to know where to look for a certain sensor.
b) A meta-column describing the data type, with all values stored as strings. The obvious disadvantage is the data conversion, e.g. for calculating MAX, AVG and so on.
c) Three columns of different types, only one of which holds a value per record. The disadvantage: 500000 sensors firing every 100ms ... plenty of unused space.
As my knowledge is limited, any help is appreciated.
500000 sensors firing every 100ms
First thing is to make sure that you partition properly, so that you don't exceed the limit of 2 billion cells per partition.
CREATE TABLE sensorData (
stationID uuid,
datebucket text,
recorded timeuuid,
intValue bigint,
strValue text,
blnValue boolean,
PRIMARY KEY ((stationID,datebucket),recorded));
With a half-million every 100ms, that's 5 million in a second. So you'll want to set your datebucket to be very granular...down to the second. Next I'll insert some data:
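The inserts look roughly like this, with only the column matching the value's type populated (station UUIDs taken from the output below):
INSERT INTO sensorData (stationID, datebucket, recorded, intValue)
VALUES (8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c, '2015-04-22T14:54:29', now(), 59);
INSERT INTO sensorData (stationID, datebucket, recorded, strValue)
VALUES (8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c, '2015-04-22T14:54:29', now(), 'CD');
INSERT INTO sensorData (stationID, datebucket, recorded, blnValue)
VALUES (8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c, '2015-04-22T14:54:29', now(), true);
A SELECT * then shows: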
stationid | datebucket | recorded | blnvalue | intvalue | strvalue
--------------------------------------+---------------------+--------------------------------------+----------+----------+----------
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6338df40-e929-11e4-88c8-21b264d4c94d | null | 59 | null
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 633e0f60-e929-11e4-88c8-21b264d4c94d | null | null | CD
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6342f160-e929-11e4-88c8-21b264d4c94d | True | null | null
3221b1d7-13b4-40d4-b41c-8d885c63494f | 2015-04-22T14:56:19 | a48bbdf0-e929-11e4-88c8-21b264d4c94d | False | null | null
...plenty of unused space.
You might be surprised. With the CQL output of SELECT * above, it appears that there are null values all over the place. But watch what happens when we use the cassandra-cli tool to view how the data is stored "under the hood":
RowKey: 3221b1d7-13b4-40d4-b41c-8d885c63494f:2015-04-22T14\:56\:19
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:, value=, timestamp=1429733297352000)
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:blnvalue, value=00, timestamp=1429733297352000)
As you can see, the data (above) stored for the CQL row where stationid=3221b1d7-13b4-40d4-b41c-8d885c63494f AND datebucket='2015-04-22T14:56:19' shows that blnValue has a value of 00 (false). But also notice that intValue and strValue are not present. Cassandra doesn't force a null value like an RDBMS does.
The obvious disadvantage is the data conversion e.g. for calculating the MAX, AVG and so on.
Perhaps you already know this, but I did want to mention that Cassandra CQL does not contain definitions for MAX, AVG or any other data aggregation function. You'll either need to do that client-side, or use Apache Spark to perform OLAP-type queries.
Be sure to read through Patrick McFadin's Getting Started With Time Series Data Modeling. It contains good suggestions on how to solve time series problems like this.

Cassandra compound clustering key and queries with ordering

We use Cassandra wide rows heavily to store per-user time series, as they are perfect for that use case. Let's assume we have a table:
create table user_events (
user_id text,
timestmp timestamp,
event text,
primary key((user_id), timestmp));
What if clashes on the timestamp happen (the same user can emit two different events with the same timestamp)? What is the best way to tweak this schema to resolve that, assuming we have an ordering for all events (a sequence int for each event)?
If I modify schema the following way:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq));
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
You might be seeing an error because you are repeating ASC. This should work:
WHERE user_id = ? ORDER BY timestmp,seq ASC
Also, as long as you have defined your primary key as PRIMARY KEY((user_id),timestmp,seq) you don't even need to specify ORDER BY x[,y] ASC. It will cluster the data on disk in that order, and thus return it to you already sorted in that order. ORDER BY should only be necessary when you want to put your results in descending order (or whatever the opposite of how you have it defined is).
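For example, to read the most recent events first you reverse the full clustering order:
SELECT * FROM user_events WHERE user_id='River' ORDER BY timestmp DESC, seq DESC;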
What if clashes on timestamp may happen?
I think your extra seq column should be sufficient, depending on how you plan on inserting the data. If you are setting the timestmp from the client, then you should be ok. However, look what happens when I (using your second table) INSERT rows while creating the timestamp two different ways.
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Mal',dateof(now()),1,'commanding');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Wash',dateof(now()),1,'piloting');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),2,'killing reavers');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',2,'killing reavers');
Querying that data by a user_id of "River" yields:
aploetz#cqlsh:stackoverflow> SELECT * FROM user_events WHERE user_id='River';
user_id | timestmp | seq | event
---------+--------------------------+-----+-----------------
River | 2015-01-13 13:14:00-0600 | 1 | freaking out
River | 2015-01-13 13:14:00-0600 | 2 | killing reavers
River | 2015-01-13 13:14:00-0600 | 3 | being weird
River | 2015-01-14 12:58:41-0600 | 1 | freaking out
River | 2015-01-14 12:58:57-0600 | 3 | being weird
River | 2015-01-14 12:58:57-0600 | 2 | killing reavers
(6 rows)
Notice that using the now() function to generate a timeuuid, and then converting that to a timestamp with dateof() causes the two rows with the timestmp "2015-01-14 12:58:57-0600" to appear to be the same. But they are not the same, as you can tell by the seq column.
So just a bit of caution on using/generating timestamps. They might look the same, but they may not be stored as the same value. Just to be on the safe side, I would use a timeuuid instead.
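A hedged sketch of that variant (the table and column names are assumed):
CREATE TABLE user_events_by_timeuuid (
user_id text,
event_time timeuuid,
event text,
PRIMARY KEY ((user_id), event_time));
INSERT INTO user_events_by_timeuuid (user_id, event_time, event)
VALUES ('River', now(), 'freaking out');
-- dateof() recovers the wall-clock time at read time:
SELECT user_id, dateof(event_time), event FROM user_events_by_timeuuid WHERE user_id='River';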
