Cassandra two-dimensional data modelling

The Use case:
For a game I am collecting the results of each match. It's always Team A against Team B. Each team consists of 5 players, each picking a champion, and the possible outcomes of a match are a win for one team (and a loss for the other) or a draw for both.
I would like to figure out the best champion combinations: I want to create win/lose/draw statistics based on the chosen champion combination of each team. In total there are ~100 champions a player can choose from, so there are many different champion combinations possible.
More (bonus) features:
I would like to figure out how one combination performed against another specific combination (in short: what's the best combination to counter a very strong champion combination)
As balance changes are applied to the game, it makes sense to be able to select / filter stats by specific time ranges (for instance the past 14 days only); daily precision is fine for that.
My problem:
I wonder what's the best way to collect the statistics based on the champion combination. What would the data model look like?
My idea:
Create a hash of all championIds in a combination, which would literally represent a championCombinationId: a unique identifier for the champion combo a team uses (a small hashing sketch follows my table attempt below).
Create a two-dimensional table which allows tracking combination vs combination stats. Something like this:
Timeframes (daily dates) and the actual championIds for a combinationId are missing there.
I tried to create a model for the above requirements myself, but I am absolutely not sure about it, nor do I know which keys I would need to specify.
CREATE TABLE team_combination_statistics (
combinationIdA text, // Team A
combinationIdB text, // Team B
championIdsA text, // An array of all champion IDs of combination A
championIdsB text, // An array of all champion IDs of combination B
trackingTimeFrame text, // A date?
wins int,
losses int,
draws int
);
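For illustration, here is roughly how I picture deriving such a championCombinationId (just a sketch: the ids are sorted so that pick order doesn't matter, and SHA-1 as well as the function name are arbitrary choices of mine):
import hashlib

def combination_id(champion_ids):
    # Sort the ids so the same 5 champions always produce the same id,
    # regardless of pick order, then hash the canonical string.
    canonical = ','.join(str(c) for c in sorted(champion_ids))
    return hashlib.sha1(canonical.encode('ascii')).hexdigest()

combination_id([12, 5, 33, 7, 21])  # same result as combination_id([5, 7, 12, 21, 33])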

This question is quite long, so I'll cover a few topics before suggesting my approach; be ready for a long answer:
Data normalization
Two-dimensional tables with same value axes
Data normalization
Storing the total counts is useful, but ordering by them isn't: the raw counts don't tell you whether one combination is good against another, only which combination has won/lost the most times against it, and the total number of games played also matters.
When ordering the results, you want to order by win ratio, draw ratio or loss ratio (any two of them are enough, as the third is a linear combination of the other two).
Two-dimensional tables with same value axes
The problem with two-dimensional tables where both dimensions represent the same kind of data, in this case a group of 5 champs, is that either you make a triangular table or you double the data, as you would have to store combinationA vs combinationB and combinationB vs combinationA, where combinationX is a specific group of 5 champs.
There are two approaches here: using triangular tables or doubling the data manually.
1. Triangular tables:
You create a table where either the top right half or the bottom left half is empty. You then handle in the app which hash is A and which is B, and you may need to swap their order, as there is no duplicate data. You could, for example, impose alphabetical order so that A < B always; if you then request the data in the wrong order you would get no data. The other option would be making both the A vs B and the B vs A query and then joining the results (swapping the wins and losses, obviously). A small sketch of this ordering logic appears after the pros and cons below.
2. Doubling the data manually:
By making two inserts with reflected values (A, B, wins, draws, losses and B, A, losses, draws, wins) you duplicate the data. This lets you query in any order, at the cost of using twice the space and requiring two inserts per result.
Pros and cons:
The pros of one approach are the cons of the other.
Pros of triangular tables
Does not store duplicate data
Requires half the inserts
Pros of doubling the data
The application doesn't care in which order you make the request
I would probably use the triangular tables approach, as the increase in application complexity is not big enough to be relevant, while the storage savings do matter for scalability.
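To make the triangular-table bookkeeping concrete, here is a minimal application-side sketch (the names are illustrative, not part of the schema proposed below): both combination ids are always stored and queried in a fixed lexicographic order, and the win/loss counts are swapped back whenever the caller asked for the reverse orientation.
def canonical_pair(comb_a, comb_b):
    # Return the pair in storage order plus a flag telling whether it was swapped.
    if comb_a <= comb_b:
        return comb_a, comb_b, False
    return comb_b, comb_a, True

def oriented_stats(comb_a, comb_b, fetch_row):
    # fetch_row(a, b) reads the single triangular-table row stored for (a, b)
    # and returns (wins, draws, losses) from a's point of view.
    a, b, swapped = canonical_pair(comb_a, comb_b)
    wins, draws, losses = fetch_row(a, b)
    if swapped:
        wins, losses = losses, wins  # B's wins are A's losses and vice versa
    return wins, draws, losses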
Proposed schema
Use whatever keyspace you want; I chose so, from stackoverflow. Modify the replication strategy or factor as needed.
CREATE KEYSPACE so WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Champion names table
The champions table will contain info about the different champions; for now it only holds the name, but you could store other things in the future.
CREATE TABLE so.champions (
c boolean,
id smallint,
name text,
PRIMARY KEY(c, id)
) WITH comment='Champion names';
A boolean is used as the partition key because we want to store all champs in a single partition for query performance; as there will be a low number of records (~100), we will always use c=True. A smallint was chosen for the id because 2^7 = 128 (the positive range of a tinyint) is too close to the actual number of champs, and we want to leave room for future champs without using negative numbers.
When querying the champs you could get them all by doing:
SELECT id, name FROM so.champions WHERE c=True;
or request a specific one by:
SELECT name FROM so.champions WHERE c=True and id=XX;
Historic match results table
This table will store the results of the matches without aggregating:
CREATE TABLE so.matches (
dt date,
ts time,
id XXXXXXXX,
teams list<frozen<set<smallint>>>,
winA boolean,
winB boolean,
PRIMARY KEY(dt, ts, id)
) WITH comment='Match results';
For the partition key of a historic data table, and as you mentioned daily precision, date seems to be a nice choice. A time column is used as the first clustering key for ordering reasons and to complete the timestamp; it doesn't matter whether these timestamps refer to the starting or the finishing instant, just choose one and stick with it. An additional identifier is required in the clustering key because two games may end in the same instant (time has nanosecond precision, which would mean the data lost to overlaps is insignificant, but your data source will probably not have that precision, making this last key column necessary). You can use whatever type you want for this column; you probably already have some kind of identifier in the data that you can use here. You could also go for a random number, an incremental int managed by the application, or even the name of the first player, as you can be sure the same player will not start/finish two games at the same second.
The teams column is the most important one: it stores the ids of the champs that were played in the game. A sequence of two elements is used, one for each team; the inner (frozen) set holds the champ ids of each team, for example {1,3,5,7,9}. I tried a few different options: set<frozen<set<smallint>>>, tuple<set<smallint>, set<smallint>> and list<frozen<set<smallint>>>. The first option doesn't store the order of the teams, so we would have no way to know who won the game. The second one doesn't accept an index on the column and partial searches through CONTAINS, so I've opted for the third, which keeps the order and allows partial searches.
The other two values are booleans indicating which team won the game. You could add more columns, such as a draw boolean (not strictly necessary, as it can be derived from the two win flags), a duration column if you want to store the length of the game (I'm deliberately not using Cassandra's duration type, as it is only worthwhile for spans of days or months), or the start/end timestamp that you are not already using in the partition and clustering key, etc.
Partial searchs
It may be useful to create an index on teams so that you are allowed to query on this column:
CREATE INDEX matchesByTeams ON so.matches( teams );
Then we can execute the following SELECT statements:
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9};
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9} AND dt=toDate(now());
The first one selects the matches in which either team used that composition, and the second one further filters them to today's matches.
Stats cache table
With these two tables you can hold all the info, and then request the data you need to calculate the stats involved. Once you calculate some stats, you can store them back in Cassandra as a "cache" in an additional table, so that when a user requests some stats to be shown, you first check whether they were already calculated and, if they weren't, calculate and store them. This table would need a column for each parameter that the user can enter, for example: champion composition, starting date, final date, enemy team; and additional columns for the stats themselves. (A sketch of this check-then-compute flow follows the table definition below.)
CREATE TABLE so.stats (
team frozen<set<smallint>>,
s_ts timestamp,
e_ts timestamp,
enemy frozen<set<smallint>>,
win_ratio float,
loss_ratio float,
wins int,
draws int,
losses int,
PRIMARY KEY(team, s_ts, e_ts, enemy)
) WITH comment='Already calculated queries';
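As a rough illustration of the check-then-compute flow described above, a sketch using the DataStax Python driver (an assumption on my side; any driver works). compute_stats is a hypothetical helper that would scan so.matches for the requested window:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('so')  # contact point is an assumption

def get_stats(team, enemy, s_ts, e_ts):
    # team/enemy are sets of champion ids, s_ts/e_ts are datetime boundaries.
    row = session.execute(
        "SELECT wins, draws, losses FROM so.stats "
        "WHERE team = %s AND s_ts = %s AND e_ts = %s AND enemy = %s",
        (team, s_ts, e_ts, enemy)).one()
    if row is not None:
        return row.wins, row.draws, row.losses              # already calculated
    wins, draws, losses = compute_stats(team, enemy, s_ts, e_ts)  # hypothetical helper
    total = wins + draws + losses
    session.execute(
        "INSERT INTO so.stats (team, s_ts, e_ts, enemy, wins, draws, losses, "
        "win_ratio, loss_ratio) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
        (team, s_ts, e_ts, enemy, wins, draws, losses,
         wins / total if total else 0.0, losses / total if total else 0.0))
    return wins, draws, losses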
Ordered by win/loss ratios:
To get the results ordered by ratios instead of by enemy team you can use materialized views.
CREATE MATERIALIZED VIEW so.statsByWinRatio AS
SELECT * FROM so.stats
WHERE team IS NOT NULL AND s_ts IS NOT NULL AND e_ts IS NOT NULL AND win_ratio IS NOT NULL AND enemy IS NOT NULL
PRIMARY KEY(team, s_ts, e_ts, win_ratio, enemy)
WITH comment='Allow ordering by win ratio';
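A minimal sketch of reading the view (again assuming the Python driver): rows within a (team, s_ts, e_ts) slice come back clustered by win_ratio ascending, so they are simply reversed client-side here to list the best-performing enemy compositions first.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('so')

def ranking_for(team, s_ts, e_ts):
    rows = session.execute(
        "SELECT enemy, win_ratio, wins, draws, losses FROM so.statsByWinRatio "
        "WHERE team = %s AND s_ts = %s AND e_ts = %s",
        (team, s_ts, e_ts))
    return list(rows)[::-1]  # highest win ratio first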
NOTE:
While I was answering I realized that introducing the concept of a "patch" into the DB, so that the user selects patches rather than arbitrary dates, could be a better solution. If you are interested, comment and I'll edit the answer to include the patch concept. It would mean modifying both the so.matches and so.stats tables a bit, but the changes would be quite minor.

You can create a statistics table which holds game stats for a champion combination on a given day.
CREATE TABLE champion_stats_by_day (
champion_ids FROZEN<SET<INT>>,
competing_champion_ids FROZEN<SET<INT>>,
competition_day DATE,
win_ratio DECIMAL,
loss_ratio DECIMAL,
draw_ratio DECIMAL,
wins INT,
draws INT,
losses INT,
matches INT,
PRIMARY KEY(champion_ids, competition_day, competing_champion_ids)
) WITH CLUSTERING ORDER BY(competition_day DESC, competing_champion_ids ASC);
You can ask for stats for a champion combination starting from a certain date, but you have to do the sorting / aggregation in the client (a small aggregation sketch follows the sample output below):
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competition_day > '2017-10-17';
champion_ids | competition_day | competing_champion_ids | draw_ratio | draws | loss_ratio | losses | matches | win_ratio | wins
--------------+-----------------+------------------------+------------+-------+------------+--------+---------+-----------+------
{1, 2, 3, 4} | 2017-11-01 | {2, 9, 21, 33} | 0.04 | 4 | 0.57 | 48 | 84 | 0.38 | 32
{1, 2, 3, 4} | 2017-11-01 | {5, 6, 22, 32} | 0.008 | 2 | 0.55 | 128 | 229 | 0.43 | 99
{1, 2, 3, 4} | 2017-11-01 | {12, 21, 33, 55} | 0.04 | 4 | 0.57 | 48 | 84 | 0.38 | 32
{1, 2, 3, 4} | 2017-10-29 | {3, 8, 21, 42} | 0 | 0 | 0.992 | 128 | 129 | 0.007 | 1
{1, 2, 3, 4} | 2017-10-28 | {2, 9, 21, 33} | 0.23 | 40 | 0.04 | 8 | 169 | 0.71 | 121
{1, 2, 3, 4} | 2017-10-22 | {7, 12, 23, 44} | 0.57 | 64 | 0.02 | 3 | 112 | 0.4 | 45
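As a rough sketch of that client-side aggregation (assuming the DataStax Python driver and an arbitrary keyspace name), summing the daily rows since a given date and recomputing the overall win ratio:
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')  # keyspace name is an assumption

def totals_since(champion_ids, since):
    rows = session.execute(
        "SELECT wins, draws, losses, matches FROM champion_stats_by_day "
        "WHERE champion_ids = %s AND competition_day > %s",
        (champion_ids, since))
    wins = draws = losses = matches = 0
    for r in rows:
        wins += r.wins
        draws += r.draws
        losses += r.losses
        matches += r.matches
    return {'wins': wins, 'draws': draws, 'losses': losses, 'matches': matches,
            'win_ratio': wins / matches if matches else 0.0}

totals_since({1, 2, 3, 4}, date(2017, 10, 17))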
Updates and inserts work as follows: you first select the existing statistic for that date and champion combination, and then do an update. If the row is not yet in the table, that's not a problem, as Cassandra performs an UPSERT in that case (an application-side sketch follows the CQL below):
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competing_champion_ids = {21,2,9,33} AND competition_day = '2017-11-01';
UPDATE champion_stats_by_day
SET win_ratio = 0.38, draw_ratio = 0.04, loss_ratio = 0.57, wins = 32, draws = 4, losses = 48, matches = 84
WHERE champion_ids = {1,2,3,4}
AND competing_champion_ids = {21,2,9,33}
AND competition_day = '2017-11-01';
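The same read-then-update cycle in application code might look roughly like this (assuming the Python driver; note that the read-modify-write is not atomic, so concurrent writers updating the same row would need some coordination):
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')  # keyspace name is an assumption

def record_result(champion_ids, competing_ids, day, outcome):
    # outcome is 'win', 'draw' or 'loss' from champion_ids' point of view;
    # the boolean comparisons below count as 0/1 when added to the totals.
    row = session.execute(
        "SELECT wins, draws, losses, matches FROM champion_stats_by_day "
        "WHERE champion_ids = %s AND competition_day = %s "
        "AND competing_champion_ids = %s",
        (champion_ids, day, competing_ids)).one()
    wins = (row.wins if row else 0) + (outcome == 'win')
    draws = (row.draws if row else 0) + (outcome == 'draw')
    losses = (row.losses if row else 0) + (outcome == 'loss')
    matches = (row.matches if row else 0) + 1
    session.execute(
        "UPDATE champion_stats_by_day "
        "SET wins=%s, draws=%s, losses=%s, matches=%s, "
        "win_ratio=%s, draw_ratio=%s, loss_ratio=%s "
        "WHERE champion_ids=%s AND competition_day=%s AND competing_champion_ids=%s",
        (wins, draws, losses, matches,
         wins / matches, draws / matches, losses / matches,
         champion_ids, day, competing_ids))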
I also added the sample CQL commands here.
Let me know what you think.

Related

How to scale a range sharded index on a timestamp column in YugabyteDB?

Is there any performance tuning to do for a write-bound workload in YugabyteDB? We thought that by simply adding additional nodes to our YugabyteDB cluster, without further tuning, we would see some noticeable increase in write throughput; however, this is not the case. The schema can be found below.
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
update_id | character varying(255) | | not null | | extended | |
node_id | character varying(255) | | not null | | extended | |
data | character varying | | not null | | extended | |
created_at | timestamp without time zone | | | timezone('utc'::text, now()) | plain | |
Indexes:
"test_pkey" PRIMARY KEY, lsm (update_id HASH)
"test_crat" lsm (created_at DESC)
This table has tablets spread across all tservers with RF=3. Created_at is a timestamp that changes all of the time. At this point it has no more than two days of data, all new inserts are acquiring a new timestamp.
In the case of the schema called out above, the test_crat index here is limited to 1 tablet because it is range-sharded. Since created_at has only recent values they will end up going to 1 shard/tablet even with tablet splitting, meaning that all inserts will go to 1 shard. As explained in this Google Spanner documentation, whose sharding, replication, and transactions architecture YugabyteDB is based off of, this is an antipattern for scalability. As mentioned in that documentation:
If you need a global (cross node) timestamp ordered table, and you need to support higher write rates to that table than a single node is capable of, use application-level sharding. Sharding a table means partitioning it into some number N of roughly equal divisions called shards. This is typically done by prefixing the original primary key with an additional ShardId column holding integer values between [0, N). The ShardId for a given write is typically selected either at random, or by hashing a part of the base key. Hashing is often preferred because it can be used to ensure all records of a given type go into the same shard, improving performance of retrieval. Either way, the goal is to ensure that, over time, writes are distributed across all shards equally. This approach sometimes means that reads need to scan all shards to reconstruct the original total ordering of writes.
What that would mean is: to get recent changes, you would have to query each of the shards. Suppose you have 32 shards:
select * from raw3 where shard_id = 0 and created_at > now() - INTERVAL 'xxx';
..
select * from raw3 where shard_id = 31 and created_at > now() - INTERVAL 'xxx';
On the insert, every row could just be given a random value for your shard_id column from 0..31. And your index would change from:
(created_at DESC)
to
(shard_id HASH, created_at DESC)
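To illustrate the read side of this fan-out, here is a rough sketch assuming psycopg2, 32 shards, the raw3/shard_id layout suggested above (column names are taken from the schema in the question and are therefore assumptions) and default YSQL connection settings; the per-shard results are merged and re-sorted client-side to reconstruct the global ordering:
import psycopg2

NUM_SHARDS = 32
conn = psycopg2.connect(host='127.0.0.1', port=5433, dbname='yugabyte', user='yugabyte')

def recent_rows(interval='1 hour'):
    rows = []
    with conn.cursor() as cur:
        for shard in range(NUM_SHARDS):
            cur.execute(
                "SELECT shard_id, update_id, created_at FROM raw3 "
                "WHERE shard_id = %s AND created_at > now() - %s::interval",
                (shard, interval))
            rows.extend(cur.fetchall())
    # Reconstruct the global ordering that the single range-sharded index used to give.
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows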
Another approach you could use that may not be as intuitive, but may be more effective, would be to use a partial index for each shard_id that you would want.
Here is a simple example using 4 shards:
create index partial_0 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4=0;
The partial index above only includes rows where the modulus of the epoch in milliseconds of created_at timestamp is 0. And you repeat for the other 3 shards:
create index partial_1 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 1;
create index partial_2 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 2;
create index partial_3 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3;
And then when you query, PostgreSQL is smart enough to pick the right index:
yugabyte=# explain analyze select * from raw3 where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3 AND created_at < now();
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using partial_3 on raw3 (cost=0.00..5.10 rows=10 width=16) (actual time=1.429..1.429 rows=0 loops=1)
Index Cond: (created_at < now())
Planning Time: 0.210 ms
Execution Time: 1.502 ms
(4 rows)
No need for a new shard_id column in the base table or in the index. If you want to reshard down the road, you can recreate new partial indexes with different shards and drop the old indexes.
More information about the DocDB sharding layer within YugabyteDB can be found here. If you are interested in the different sharding strategies we evaluated, and why we decided on consistent hash sharding as the default sharding strategy, take a look at this blog written by our Co-Founder and CTO Karthik Ranganathan.

Salting Technique to tackle Skew in Spark SQL

I am trying to understand Salting techniques to tackle Skew in Spark SQL. I have done some reading online and I have come up with a very rudimentary implementation of the same in Spark SQL API.
Let's assume that table1 is Skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more than other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
cid, item, concat(cid, '-', floor(rand() * 20)) as salted_key
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.vehicle
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
1. Is there any rule of thumb for choosing the optimal number of splits to be used? For example, if table1 has 10 million records, how many bins/buckets should I use? (In this simple test example I have used 20.)
2. As shown above, when I am creating table2_salt, I am hardcoding the salts (0, 1, 2, 3 ... through 19). Is there a better way to implement the same functionality without the hardcoding and the clutter? (What if I want to use 100 splits!)
3. Since we are replicating the second table (table2) N times, doesn't that mean it will degrade the join performance?
Note: I need to use Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.
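Regarding question 2, one possible way to parameterise the salts without listing them by hand (just a sketch, assuming Spark 2.4+, where the sequence() SQL function is available, driven from PySpark; the view and column names are taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 20  # single knob to tune the number of buckets

spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW table1_salt AS
    SELECT cid, item, concat(cid, '-', floor(rand() * {NUM_SALTS})) AS salted_key
    FROM table1
""")

spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW table2_salt AS
    SELECT cid, vehicle, explode(sequence(0, {NUM_SALTS} - 1)) AS salted_key
    FROM table2
""")

result = spark.sql("""
    SELECT a.cid, a.item, b.vehicle
    FROM table1_salt a
    JOIN table2_salt b
      ON a.salted_key = concat(b.cid, '-', b.salted_key)
""")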

Duplicate rows/columns for the same primary key in Cassandra

I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically write time of these two rows. the writetime()-function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field, and additionally I'm of the impression they're not fully supported. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... it seems like only the bigint case is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind that the partition key (the part inside the inner parentheses of the PRIMARY KEY, which can be simple (one key) or composite (multiple keys)) identifies the partition, and that the clustering keys sort the data within it, the following has been observed.
Query using select: it is sufficient to query using all the partition key(s); additionally you can filter on clustering key(s), but only in the same order in which they are listed in the primary key during table creation.
Update using set or update: the update statement's condition clause needs to include not only all the partition key(s) but also all the clustering key(s).
Answering the question - is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record appears.
If the same data is pushed multiple times from code alone, from a querying tool alone, or repeatedly from any single source, the write behaves idempotently and the data is not inserted again.
A possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since the storage is column based).
Note:
The above behaviour (duplicates when the same data is pushed from different sources) has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can run the following experiment (please ignore and fix any syntax errors I may make).
This is supposed to create a table with a composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on disk this will be stored in a single row, like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
This means that
"house_id", "sensor_id", "time_bucket" is the "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split and shown as if there were several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

Is there a way to get random rows each time if the data does not change in Cassandra like MySQL RAND()

CREATE TABLE users (
userId uuid,
firstname varchar,
mobileNo varchar,
PRIMARY KEY (userId)
);
CREATE TABLE users_by_firstname (
userId uuid,
firstname varchar,
mobileNo varchar,
PRIMARY KEY (firstname,userId)
);
I have 100 rows in these tables. I want to get randomly selected 10 rows each time.
In MySQL
select * from users order by RAND() limit 10;
In Cassandra
select * from users limit 10;
select * from users_by_firstname limit 10;
But from the first table I would get the same static 10 rows, sorted by the generated hash of the partition key (userId).
From the second one I would also get a static 10 rows, ordered by the token of firstname and then by userId within each partition.
So the result will not be random if the data does not change.
Is there any way to get random rows each time in Cassandra?
Thanks
Chaity
It's not possible to achieve this directly. There are ways to emulate it (the result is not truly random, but you should receive different values each time), although it's not a perfect idea.
What you could do is generate a random value in the Cassandra token range, which for the default Murmur3Partitioner is -2^63 to 2^63 - 1. With this random value you can perform a query such as:
select * from users where token(userId) > #generated_value# limit 10;
Using this method you define a random 'starting point' from which you receive 10 users. As I said, this method is not perfect and it certainly needs some thought on how to generate the random token. An edge case could be that your random value lands so far towards the end of the ring that you would receive fewer than 10 values.
Here is a short example:
Let's say you have a users table with the following users:
token(uuid) | name
----------------------+---------
-2540966642987085542 | Kate
-1621523823236117896 | Pauline
-1297921881139976049 | Stefan
-663977588974966463 | Anna
-155496620801056360 | Hans
958005880272148645 | Max
3561637668096805189 | Doro
5293579765126103566 | Paul
8061178154297884044 | Frank
8213365047359667313 | Peter
Let's now say you generate the value 42 as a start token; the select would be:
select token(uuid), name from test where token(uuid) > 42 limit 10;
In this example the result would be
token(id) | name
---------------------+-------
958005880272148645 | Max
3561637668096805189 | Doro
5293579765126103566 | Paul
8061178154297884044 | Frank
8213365047359667313 | Peter
This method might be a reasonable approach if you have a lot of data and a balanced cluster. To make sure you don't run into this edge case you could limit the range so it does not come too close to the end of the Cassandra token range, or simply wrap around to the start of the ring, as in the sketch below.
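For illustration, a rough sketch of generating the random start token and handling that edge case by wrapping around to the start of the ring (assuming the default Murmur3 partitioner, the users table from the question, the DataStax Python driver and an arbitrary keyspace name):
import random
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')  # keyspace name is an assumption

def random_users(n=10):
    start = random.randint(-2**63, 2**63 - 1)  # Murmur3 token range
    rows = list(session.execute(
        "SELECT userId, firstname FROM users WHERE token(userId) > %s LIMIT %s",
        (start, n)))
    if len(rows) < n:
        # Edge case: the start token landed near the end of the ring,
        # so fetch the remainder from the beginning of the token range.
        rows += list(session.execute(
            "SELECT userId, firstname FROM users WHERE token(userId) <= %s LIMIT %s",
            (start, n - len(rows))))
    return rows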

Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all tasks that are applicable within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two range conditions on Starts_On and Ends_On here, because to add a range query on Ends_On I would have to have an equality check on Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On<? where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small the driver ends up retrieving multiple pages, which could hurt performance. If it is too large you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.
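As a rough sketch of this paging approach (assuming the DataStax Python driver, a hypothetical bucket column p as the partition key, a hypothetical table name tasks_by_bucket, and clustering by starts_on in descending order):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('mykeyspace')  # keyspace name is an assumption

def overlapping_tasks(p, period_start, period_end, page_size=50):
    # Only fetch tasks that start before the period ends; the small fetch_size
    # keeps the rows discarded after the cut-off cheap.
    stmt = SimpleStatement(
        "SELECT task_id, starts_on, ends_on FROM tasks_by_bucket "
        "WHERE p = %s AND starts_on < %s",
        fetch_size=page_size)
    results = []
    for row in session.execute(stmt, (p, period_end)):
        if row.ends_on < period_start:
            # Rows are ordered by starts_on DESC; assuming non-overlapping events,
            # every remaining row also ends before the period of interest starts.
            break
        results.append(row)
    return results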

Resources