Partial Partition Key Querying With Per Partition Limit In Cassandra

I have a table (let's call it T) set up with a PRIMARY KEY like the following:
PRIMARY KEY ((A, B), C, ....);
I want to query it like the following:
SELECT * FROM T WHERE A = ? AND C <= ? PER PARTITION LIMIT 1 ALLOW FILTERING;
(Note that C is a timestamp value. I am essentially asking for the most recent rows across all partitions whose first partition key component matches my input.)
This works with the allow filtering command, and it makes sense why I need it; I do not know beforehand the partition keys B, and I do not care - I want all of them. Therefore, it makes sense that Cassandra would need to scan the entire partition to give me the results, and it also makes sense why I would need to specify it to allow filtering for this to occur.
However, I have read that we should avoid 'ALLOW FILTERING' at all costs, as it can have a huge performance impact, especially in production environments. Indeed, I only use allow filtering very sparingly in my existing applications, and this is usually for one-off queries that calculate something of this nature.
My question is this: is there a way to restructure this table or query to avoid filtering? I am thinking it is impossible, as I do not have knowledge of the keys that make up B beforehand, but I want to double-check just to be sure. Thanks!

You cannot make that query efficient if (A, B) is your partition key. Your key would need to be ((A), B), dropping the clustering keys, so that B becomes a clustering column. Then SELECT * FROM T WHERE A = ? works without filtering. If you only care about the latest, the row for a given A, B would then always be replaced by the most recent write.
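A minimal sketch of that restructuring (the types and the payload column D are assumptions, not from the original schema):
CREATE TABLE t_latest (
    A int,
    B int,
    C timestamp,   -- time of the most recent write for this (A, B)
    D text,        -- hypothetical payload
    PRIMARY KEY ((A), B)
);
-- each write for a given (A, B) overwrites the previous row:
INSERT INTO t_latest (A, B, C, D) VALUES (1, 2, '2018-04-06 00:00:00', 'latest');
SELECT * FROM t_latest WHERE A = 1;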
Another option, if you are looking to get the (A, B) tuples for a point in time, is to create a table indexed by time, with the tuples as clustering columns, like ((time_bucket), A, B, C) - time_bucket being a string like '2018-04-06:00:00:00' that contains all the events for that day. Then you can query like:
> CREATE TABLE example (time_bucket text, A int, B int, C int, D int, PRIMARY KEY ((time_bucket), A, B, C)) WITH CLUSTERING ORDER BY (A ASC, B ASC, C DESC);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 100, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 120, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 130, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 2, 130, 999);
> SELECT * FROM example WHERE time_bucket = '2018-04' GROUP BY time_bucket, A, B;
 time_bucket | a | b | c   | d
-------------+---+---+-----+-----
     2018-04 | 1 | 1 | 130 | 999
     2018-04 | 1 | 2 | 130 | 999
You would get the first row from each (A, B) clustering group in the time-bucket partition. If you make the partitions small enough (use finer-grained time buckets, like hours or 15 minutes, depending on data rate), it is more acceptable to use ALLOW FILTERING here, like:
SELECT * FROM example WHERE time_bucket = '2018-04' AND A = 1 AND C < 120 GROUP BY time_bucket, A, B ALLOW FILTERING ;
 time_bucket | a | b | c   | d
-------------+---+---+-----+-----
     2018-04 | 1 | 1 | 100 | 999
This is because it is all within one partition of a bounded size (monitor it closely with nodetool tablestats / max partition size). Make sure you always query with time_bucket, though, so it doesn't become a range query. You want to be sure you do not end up scanning through too many rows without returning a result, which is one of the dangers of ALLOW FILTERING.
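For example, with hourly buckets the same pattern might look like this (the bucket string format is just an illustration):
INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04-06:14', 1, 1, 100, 999);
SELECT * FROM example WHERE time_bucket = '2018-04-06:14' AND A = 1 AND C < 120 GROUP BY time_bucket, A, B ALLOW FILTERING;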

Related

Use IN in any column in a Cassandra Table

I want to be able to use IN on any column, in any order, in Cassandra.
So I have the following table:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
and this data:
INSERT INTO test (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO test (a, b, c) VALUES ('a2', 'b2', 'c2');
This query works:
SELECT * FROM test WHERE c IN ('c1', 'c2') AND b IN ('b1') ALLOW FILTERING;
But if you remove the b IN it gives this error:
SELECT * FROM test WHERE c IN ('c1', 'c2') ALLOW FILTERING;
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN
restrictions are not supported on indexed columns"
It seems like if I want to use IN on a column, I should also have used IN on some previous columns?
Is there a way to avoid this?
Modifying the schema is valid, but I need to use Cassandra and allow filtering through any columns (if there's no need to filter through a column then there would simply be no IN clause for that column).
Thanks for reading.
P.S: I know you are not supposed to use ALLOW FILTERING please assume there's no other way.
Edit: Seems like they may have fixed this?: https://issues.apache.org/jira/browse/CASSANDRA-14344
There is a lot of confusion about Cassandra's primary keys.
To answer your question, I think you need to understand how Cassandra primary keys work internally.
When you create a primary key with multiple fields, as in your case:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
"a" will be the partition key; you can imagine it as a hash that chooses the partition on which the data will be distributed.
b and c will be the clustering keys. These keys act like a sorted list of your data, with c nested within each b value; that means you have to provide b in order to put constraints on c.
The Cassandra documentation states that you can only use the IN clause on the last column of the partition key and on the last clustering key - but beware, you'll have to provide all the other clustering keys.
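For example, with the test table above, these IN queries are valid without ALLOW FILTERING, because every preceding key column is restricted:
SELECT * FROM test WHERE a IN ('a1', 'a2') AND b = 'b1' AND c = 'c1';
SELECT * FROM test WHERE a = 'a1' AND b = 'b1' AND c IN ('c1', 'c2');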
So basically there is no way to do that in one table.
You should think about the trade-off between query flexibility and data duplication.
One solution would be to denormalize your data into 2 tables, like this:
CREATE TABLE test1 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b));
CREATE TABLE test2 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (c, a, b));
By doing so, you will be able to query each table depending on your use case.
The following queries will work:
SELECT * FROM test2 WHERE c IN ('c1', 'c2');
SELECT * FROM test1 WHERE a IN ('a1', 'a2');
SELECT * FROM test1 WHERE b IN ('b1', 'b2') ALLOW FILTERING;
And so on; I think you get the point.
But really try to find the best trade-off, in order to minimize ALLOW FILTERING usage, and remember that queries directly on partition keys will be the fastest.

Cassandra two dimensional data modelling

The Use case:
For a game I am collecting the results of each game match. It's always Team A against Team B. Each team consists of 5 players each picking a champion and the possible outcome of a match is for one team either Won / Lost or for both teams a draw.
I would like to figure out the best champion combinations. I want to create win/lose/draw statistics based on the chosen champion combination of each team. In total there are ~100 champions a player can choose from, so there are many different champion combinations possible.
More (bonus) features:
I would like to figure out how one combination performed against another specific combination (in short: what's the best combination to counter a very strong champion combination)
As balance changes are applied to the game, it makes sense to have the possibility to select / filter stats by specific time ranges (for instance the past 14 days only) - daily precision is fine for that.
My problem:
I wonder what's the best way to collect the statistics based on the champion combination? How would the data modelling look like?
My idea:
Create a hash of all championIds in a combination, which would effectively be a championCombinationId - a unique identifier for the champion combo a team uses.
Create a two dimensional table which allows tracking combination vs combination stats. Something like this:
Timeframes (daily dates) and the actual championIds for a combinationId are missing there.
I tried creating a model for the above requirements myself, but I am absolutely not sure about it, nor do I know what keys I would need to specify.
CREATE TABLE team_combination_statistics (
combinationIdA text, // Team A
combinationIdB text, // Team B
championIdsA text, // An array of all champion IDs of combination A
championIdsB text, // An array of all champion IDs of combination B
trackingTimeFrame text, // A date?
wins int,
losses int,
draws int
);
This question is quite long, so I'll talk about different topics before suggesting my approach - be ready for a long answer:
Data normalization
Two-dimensional tables with same value axes
Data normalization
Storing the total amounts is useful, but ordering by them isn't: the raw counts don't determine whether a combination is good against another - they only tell you which combination has won/lost the most times - because the total amount of games played also matters.
When ordering the results, you want to order by win ratio and draw ratio (or loss ratio - any two of the three, as the third is a linear combination of the other two).
Two-dimensional tables with same value axes
The problem with two-dimensional tables where both dimensions represent the same data - in this case a group of 5 champs - is that either you make a triangular table or you store the data twice, as you would have to store combinationA vs combinationB as well as combinationB vs combinationA, where combinationX is a specific group of 5 champs.
There are two approaches here: using triangular tables or doubling the data manually:
1. Triangular tables:
You create a table where either the top-right half or the bottom-left half is empty. You then handle in the application which hash is A and which is B, and you may need to swap their order, as there is no duplicate data. You could, for example, enforce alphabetical order where A < B always; if you then request the data in the wrong order you would get no data. The other option would be to make both the A vs B and the B vs A queries and then join the results (swapping the wins and losses, obviously).
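A minimal sketch of the triangular variant (table and column names are hypothetical); the application must order the two hashes so that combo_a < combo_b on every read and write:
CREATE TABLE team_vs_team (
    combo_a text,   -- lexicographically smaller combination hash
    combo_b text,   -- lexicographically larger combination hash
    wins int,       -- wins of combo_a over combo_b
    draws int,
    losses int,
    PRIMARY KEY ((combo_a), combo_b)
);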
2. Doubling the data manually:
By making two inserts with reflected values (A, B, wins, draws, losses and B, A, losses, draws, wins) you duplicate the data. This lets you query in either order, at the cost of using twice the space and requiring double inserts.
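With the same hypothetical table, but without the ordering rule, the reflected double insert would look like this:
INSERT INTO team_vs_team (combo_a, combo_b, wins, draws, losses) VALUES ('hashA', 'hashB', 10, 2, 5);
INSERT INTO team_vs_team (combo_a, combo_b, wins, draws, losses) VALUES ('hashB', 'hashA', 5, 2, 10);  -- wins and losses swapped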
Pros and cons:
The pros of one approach are the cons of the other.
Pros of triangular tables
Does not store duplicate data
Requires half the inserts
Pros of doubling the data
The application doesn't care in which order you make the request
I would probably use the triangular tables approach, as the increase in application complexity is not big enough to be relevant, while the scalability gain does matter.
Proposed schema
Use whatever keyspace you want; I chose so, for Stack Overflow. Modify the replication strategy or factor as needed.
CREATE KEYSPACE so WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Champion names table
The champion table will contain info about the different champions, for now it will only hold the name but you could store other things in the future.
CREATE TABLE so.champions (
c boolean,
id smallint,
name text,
PRIMARY KEY(c, id)
) WITH comment='Champion names';
A boolean is used as the partition key because we want to store all champs in a single partition for query performance, and there will be a low amount of records (~100); we will always use c=True. A smallint was chosen for the id, as 2^7 = 128 (the positive range of a tinyint) was too close to the actual number of champs; smallint leaves room for future champs without resorting to negative numbers.
When querying the champs you could get them all by doing:
SELECT id, name FROM so.champions WHERE c=True;
or request a specific one by:
SELECT name FROM so.champions WHERE c=True and id=XX;
Historic match results table
This table will store the results of the matches without aggregating:
CREATE TABLE so.matches (
dt date,
ts time,
id XXXXXXXX,
teams list<frozen<set<smallint>>>,
winA boolean,
winB boolean,
PRIMARY KEY(dt, ts, id)
) WITH comment='Match results';
For the partition key of a historic data table, and as you mentioned daily precision, date seems like a nice choice. A time column is used as the first clustering key for ordering reasons and to complete the timestamp; it doesn't matter whether these timestamps belong to the starting or the ending instant - choose one and stick with it. An additional identifier is required in the clustering key, as two games may end in the same instant (time has nanosecond precision, which would mean the data lost to overlap would be quite insignificant, but your data source will probably not have this precision, making this last key column necessary). You can use whatever type you want for this column; you will probably already have some kind of identifier in the data that you can use here. You could also go for a random number, an incremental int managed by the application, or even the name of the first player, as you can be sure the same player will not start/finish two games at the same second.
The teams column is the most important one: it stores the ids of the champs that were played in the game. A sequence of two elements is used, one for each team; the inner (frozen) set holds the champ ids of each team, for example: {1,3,5,7,9}. I tried a few different options: set<frozen<set<smallint>>>, tuple<set<smallint>, set<smallint>> and list<frozen<set<smallint>>>. The first option doesn't store the order of the teams, so we would have no way to know who won the game. The second doesn't accept an index on the column for partial searches through CONTAINS, so I opted for the third, which both keeps the order and allows partial searches.
The other two values are booleans representing who won the game. You could have additional columns, such as a draw boolean (though this one is not necessary), a duration time if you want to store the length of the game (I'm not using Cassandra's duration type on purpose, as it is only worthwhile when durations span months or at least days), or the end/start timestamp - whichever one you are not using in the partition and clustering key, etc.
Partial searches
It may be useful to create an index on teams so that you are allowed to query on this column:
CREATE INDEX matchesByTeams ON so.matches( teams );
Then we can execute the following SELECT statements:
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9};
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9} AND dt=toDate(now());
The first one selects the matches in which either team picked that composition, and the second one further filters that down to today's matches.
Stats cache table
With these two tables you can hold all the info, and then request the data you need to calculate the stats involved. Once you calculate some stats, you could store them back in Cassandra as a "cache" in an additional table, so that when a user requests some stats to be shown, you first check whether they were already calculated, and calculate them only if they weren't. This table would need a column for each parameter the user can enter - for example: champion composition, starting date, final date, enemy team - and additional columns for the stats themselves.
CREATE TABLE so.stats (
team frozen<set<smallint>>,
s_ts timestamp,
e_ts timestamp,
enemy frozen<set<smallint>>,
win_ratio float,
loose_ratio float,
wins int,
draws int,
looses int,
PRIMARY KEY(team, s_ts, e_ts, enemy)
) WITH comment='Already calculated queries';
Ordered by win/loss ratios:
To get the results ordered by ratio instead of by enemy team, you can use materialized views.
CREATE MATERIALIZED VIEW so.statsByWinRatio AS
SELECT * FROM so.stats
WHERE team IS NOT NULL AND s_ts IS NOT NULL AND e_ts IS NOT NULL AND win_ratio IS NOT NULL AND enemy IS NOT NULL
PRIMARY KEY(team, s_ts, e_ts, win_ratio, enemy)
WITH comment='Allow ordering by win ratio';
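With the view in place, fixing the team and the time range returns the enemies ordered by win ratio (the dates here are hypothetical):
SELECT enemy, win_ratio, wins, looses FROM so.statsByWinRatio WHERE team = {1, 3, 5, 7, 9} AND s_ts = '2018-04-01 00:00:00' AND e_ts = '2018-04-15 00:00:00';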
NOTE:
While I was answering I realized that introducing the concept of a "patch" into the DB, so that the user selects patches rather than arbitrary dates, could be a better solution. If you are interested, comment and I'll edit the answer to include the patch concept. It would mean modifying both the so.matches and so.stats tables a bit, but the changes are quite minor.
You can create a statistics table which holds the game stats for a champion combination on a given day.
CREATE TABLE champion_stats_by_day (
champion_ids FROZEN<SET<INT>>,
competing_champion_ids FROZEN<SET<INT>>,
competition_day DATE,
win_ratio DECIMAL,
loss_ratio DECIMAL,
draw_ratio DECIMAL,
wins INT,
draws INT,
losses INT,
matches INT,
PRIMARY KEY(champion_ids, competition_day, competing_champion_ids)
) WITH CLUSTERING ORDER BY(competition_day DESC, competing_champion_ids ASC);
You can ask for stats for a champion combination starting from a certain date, but you have to do the sorting / aggregation in the client:
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competition_day > '2017-10-17';
 champion_ids | competition_day | competing_champion_ids | draw_ratio | draws | loss_ratio | losses | matches | win_ratio | wins
--------------+-----------------+------------------------+------------+-------+------------+--------+---------+-----------+------
 {1, 2, 3, 4} |      2017-11-01 |         {2, 9, 21, 33} |       0.04 |     4 |       0.57 |     48 |      84 |      0.38 |   32
 {1, 2, 3, 4} |      2017-11-01 |         {5, 6, 22, 32} |      0.008 |     2 |       0.55 |    128 |     229 |      0.43 |   99
 {1, 2, 3, 4} |      2017-11-01 |       {12, 21, 33, 55} |       0.04 |     4 |       0.57 |     48 |      84 |      0.38 |   32
 {1, 2, 3, 4} |      2017-10-29 |         {3, 8, 21, 42} |          0 |     0 |      0.992 |    128 |     129 |     0.007 |    1
 {1, 2, 3, 4} |      2017-10-28 |         {2, 9, 21, 33} |       0.23 |    40 |       0.04 |      8 |     169 |      0.71 |  121
 {1, 2, 3, 4} |      2017-10-22 |        {7, 12, 23, 44} |       0.57 |    64 |       0.02 |      3 |     112 |       0.4 |   45
Updates and inserts work as follows: you first select the existing statistics for that date and those champion ids, and then do an update. If the row is not yet in the table, that is not a problem, as Cassandra performs an UPSERT in this case:
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competing_champion_ids = {21,2,9,33} AND competition_day = '2017-11-01';
UPDATE champion_stats_by_day
SET win_ratio = 0.38, draw_ratio = 0.04, loss_ratio = 0.57, wins = 32, draws = 4, losses = 48, matches = 84
WHERE champion_ids = {1,2,3,4}
AND competing_champion_ids = {21,2,9,33}
AND competition_day = '2017-11-01';
I also added the sample CQL commands here.
Let me know what you think.

How do I select all rows for a clustering column in cassandra?

I have a partition key: A
Clustering columns: B, C
I do understand I can query like this
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
In certain cases, I want the B value to be any value in that column.
Is there a way I can query like the following?
Select * from table where A = ? and B = 'any value' and C = ?
Option 1:
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your fourth query (querying by A and C, without necessarily knowing the B value) is to create a new table to handle that specific query. This table will be pretty much the same, except the clustering columns will be in a slightly different order:
PRIMARY KEY (A, C, B)
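A minimal sketch of that second table (the column types and the payload column D are assumptions):
CREATE TABLE table_by_a_c (
    A text,
    B text,
    C text,
    D text,   -- hypothetical payload
    PRIMARY KEY (A, C, B)
);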
Now this query will work:
Select * from table where A = ? and C = ?
Option 2:
Alternatively you can create a materialized view, with a different clustering order. Now Cassandra will keep the MV in sync with your table data.
create materialized view mv_acbd as
select A, B, C, D
from TABLE1
where A is not null and B is not null and C is not null
primary key (A, C, B);
Now the query against this MV will work like a charm
Select * from mv_acbd where A = ? and C = ?
Option 3:
Not the best, but you could use the following query with your table as it is
Select * from table where A = ? and C = ? ALLOW FILTERING
Relying on ALLOW FILTERING is never a good idea, and is certainly not something you should do in a production cluster. For this particular case the scan stays within a single partition, so performance will vary depending on how many rows per partition your use case has.

Cassandra list type conflicts

If I have a List field in Cassandra and two people write to it at the same time, is it a simple last write wins or will it merge the writes?
For example: [a, b, c, d]
User1 -> [b, a, c, d] (move b to index 0)
User2 -> [a, b, d, c] (move c to index 3)
Will Cassandra merge the results and end up with [b, a, d, c] or will it use last write wins to the microsecond?
You will get the merged result.
Every time data is written to Cassandra, a timestamp is also stored with each column. When you execute a read query, those timestamps are used to pick the "winning" update within a single column or collection element.
What if you have truly concurrent writes with the same timestamp? In the unlikely case that you end up with two timestamps that match to the microsecond, you might end up with an arbitrary winner, but Cassandra ensures that ties are broken consistently by comparing the byte values.
Cassandra stores a list (collection) differently than a normal column.
Example:
CREATE TABLE friendlists (
user text PRIMARY KEY,
friends list <text>
);
If we insert some dummy data:
 user     | friends
----------+-------------------------
     john | [doug, patricia, scott]
 patricia | [john, lucifer]
The internal representation:
RowKey: john
=> (column=, value=, timestamp=1374687324950000)
=> (column=friends:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374687206993000)
=> (column=friends:26017c11f48711e2801fdf9895e5d0f8, value='patricia', timestamp=1374687206993000)
=> (column=friends:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374687206993000)
=> (column=friends:6c504b60f48711e2801fdf9895e5d0f8, value='matt', timestamp=1374687324950000)
=> (column=friends:6c504b61f48711e2801fdf9895e5d0f8, value='eric', timestamp=1374687324950000)
-------------------
RowKey: patricia
=> (column=, value=, timestamp=1374687352290000)
=> (column=friends:3b817b80f48711e2801fdf9895e5d0f8, value='john', timestamp=1374687243064000)
Here the internal column name is more complicated because a UUID is appended to the name of the CQL field "friends". This is used to keep track of the order of items in the list.
Every time you insert data into Cassandra with a query like the ones below:
INSERT INTO friendlists (user , friends ) VALUES ( 'patricia', ['john', 'lucifer']);
//or
UPDATE friendlists SET friends = ['john', 'lucifer'] where user = 'patricia';
Cassandra will create a tombstone with a slightly lower timestamp than the write itself, marking the previous data as deleted. So if a concurrent insert happens with the exact same timestamp, both writes are ahead of the tombstone, and both sets of data will live.
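You can see the timestamp mechanics by supplying write timestamps explicitly; a small sketch against the friendlists table above (the timestamp values are made up):
-- two overwrites with distinct explicit timestamps: the later one wins
UPDATE friendlists USING TIMESTAMP 1374687400000000 SET friends = ['doug'] WHERE user = 'john';
UPDATE friendlists USING TIMESTAMP 1374687500000000 SET friends = ['eric'] WHERE user = 'john';
-- reads resolve per element by timestamp, so this returns ['eric']
SELECT friends FROM friendlists WHERE user = 'john';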
Source :
http://mighty-titan.blogspot.com/2012/06/understanding-cassandras-consistency.html
http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure-sets-lists-and-maps/

Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here, because to add a range query on Ends_On I would have to have an equality check on Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
 p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
 1 |     1 |        15 |   5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
 1 |     2 |        15 |   7 | 3233a040-4a88-11e5-a180-433e07a8bafb
 1 |     4 |        15 |  22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
 1 |     8 |        15 |  15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
 start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
     2 |   7 |        3233a040-4a88-11e5-a180-433e07a8bafb
     4 |  22 |                                        null
     8 |  15 |        37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether an event is starting or ending (I did not), you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end; see the sketch after these notes). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
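A sketch of that double write for a single event, using the userEvents table above (the values are made up):
INSERT INTO userEvents (userid, eventtime, eventtype, eventdesc) VALUES (dd95c5a7-e98d-4f79-88de-565fab8e9a68, now(), 'Begin', 'lunch');
-- written later, when the event actually finishes:
INSERT INTO userEvents (userid, eventtime, eventtype, eventdesc) VALUES (dd95c5a7-e98d-4f79-88de-565fab8e9a68, now(), 'End', 'lunch');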
There doesn't seem to be a completely satisfactory way to do this in Cassandra, but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order (Ends_On is just a regular column). Then I constrain the query with Starts_On < ?, where the parameter is the end of the period of interest - i.e. I filter out events that start after our period of interest has finished.
I then iterate through the results until a row's Ends_On is earlier than the start of the period of interest, and throw away the rest of the result rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size to be just a little bigger than the total number of relevant rows you expect to receive. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than necessary. In practice you can probably find a good compromise.
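For example, in cqlsh you can experiment with the page size via the PAGING command (the table name and values here are hypothetical, assuming a table partitioned by p and clustered by starts_on DESC as described):
PAGING 50
SELECT * FROM tasks_by_start WHERE p = 1 AND starts_on < '2018-04-06 00:00:00';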
