Can Cassandra partition tables?

I'm inserting ~8 rows per second, and I would like to have one big table with all rows, which I then want to partition into many tables, one per week. For example:
select * from keyspace.rootTable; -> returns all rows from all tables
select * from keyspace.27-2016Table; -> returns all rows from week 27

At ~8 rows per second, with 86400 seconds per day and 604800 seconds per week, you'll be storing 691200 rows per day and 4838400 rows each week. Even without knowing how wide your rows are, that's far too many to return in a single query. Cassandra is great for storing lots of data like this. But querying lots of data like this... not so much.
You would probably want to partition by hour, but even that gives you 28800 rows per partition. That's at least semi-manageable, so let's go with that.
I'd build a table that looks like this, partitioning on week and hourBucket while clustering on writeTime:
CREATE TABLE youAreAskingCassandraForTooManyRows (
    week text,
    hourBucket text,
    writeTime timestamp,
    value text,
    PRIMARY KEY ((week, hourBucket), writeTime))
WITH CLUSTERING ORDER BY (writeTime DESC);
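For what it's worth, here is a sketch of how those bucket strings could be built at write time. The week-of-month convention (calendar weeks starting on Monday) is my assumption; the sample data below only pins down the 'yyyyMM-w' and 'yyyyMMdd-H' shapes.

from datetime import datetime, timezone

def time_buckets(ts: datetime) -> tuple[str, str]:
    # Week bucket 'yyyyMM-w': calendar week within the month,
    # assuming weeks start on Monday.
    first_weekday = ts.replace(day=1).weekday()
    week = f"{ts:%Y%m}-{(ts.day + first_weekday - 1) // 7 + 1}"
    # Hour bucket 'yyyyMMdd-H', e.g. '20160713-14'
    hour_bucket = f"{ts:%Y%m%d}-{ts.hour}"
    return week, hour_bucket

ts = datetime(2016, 7, 13, 14, 1, 18, tzinfo=timezone.utc)
print(time_buckets(ts))  # ('201607-3', '20160713-14')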
Then I could query by a specific week and hour, just by the partition keys:
aploetz@cqlsh:stackoverflow> SELECT *
FROM youareaskingcassandrafortoomanyrows
WHERE week='201607-3' AND hourBucket ='20160713-14';
week | hourBucket | writetime | value
----------+--------------+--------------------------+--------
201607-3 | 20160713-14 | 2016-07-13 14:01:18+0000 | value6
201607-3 | 20160713-14 | 2016-07-13 14:01:14+0000 | value5
201607-3 | 20160713-14 | 2016-07-13 14:01:12+0000 | value4
201607-3 | 20160713-14 | 2016-07-13 14:01:10+0000 | value3
201607-3 | 20160713-14 | 2016-07-13 14:01:07+0000 | value2
201607-3 | 20160713-14 | 2016-07-13 14:01:04+0000 | value1
(6 rows)
Or even for a specific range, based on the clustering key writetime.
aploetz@cqlsh:stackoverflow> SELECT *
FROM youareaskingcassandrafortoomanyrows
WHERE week='201607-3' AND hourBucket ='20160713-14'
AND writetime > '2016-07-13 14:01:05+0000'
AND writetime < '2016-07-13 14:01:18+0000';
week | hourBucket | writetime | value
----------+--------------+--------------------------+--------
201607-3 | 20160713-14 | 2016-07-13 14:01:14+0000 | value5
201607-3 | 20160713-14 | 2016-07-13 14:01:12+0000 | value4
201607-3 | 20160713-14 | 2016-07-13 14:01:10+0000 | value3
201607-3 | 20160713-14 | 2016-07-13 14:01:07+0000 | value2
(4 rows)
select * from keyspace.rootTable; -> returns all rows from all tables
It should go without saying: if querying an entire week's worth of 4-million-plus rows is so huge that it will time out, then querying your entire table is a monumentally bad idea.
It's important to note that Cassandra is not a relational database. It is a distributed system, so running unbound queries (queries without a WHERE clause) introduces LOTS of network time into your equation. That's why you always want to specify at least the partition key(s) in every SELECT query: then you can guarantee that the query will be satisfied by a single node.
You should take a look at Patrick McFadin's article on Getting Started with Time Series Data Modeling. That should help you to understand how to partition data like this, and get you on the right path.

Related

partition by multiple columns in Spark SQL not working properly

I want to partition by three columns in my query:
user_id
cancelation month year
retention month year
I used row_number with partition by as follows:
row_number() over (
    partition by user_id,
                 cast(date_format(cancelation_date, 'yyyyMM') as integer),
                 cast(date_format(retention_date, 'yyyyMM') as integer)
    order by cast(date_format(cancelation_date, 'yyyyMM') as integer) asc,
             cast(date_format(retention_date, 'yyyyMM') as integer) asc
) as row_count
example of the output I got :
| user_id | cancelation_date | cancelation_month_year | retention_date | retention_month_year | row_count |
| ------- | ---------------- | ---------------------- | -------------- | -------------------- | --------- |
| 566     | 28-5-2020        | 202005                 | 20-7-2020      | 202007               | 1         |
| 566     | 28-5-2020        | 202005                 | 30-7-2020      | 202007               | 2         |
example of the output I want to get:
| user_id | cancelation_date | cancelation_month_year | retention_date | retention_month_year | row_count |
| ------- | ---------------- | ---------------------- | -------------- | -------------------- | --------- |
| 566     | 28-5-2020        | 202005                 | 20-7-2020      | 202007               | 1         |
| 566     | 28-5-2020        | 202005                 | 30-7-2020      | 202007               | 1         |
Note that a user may have more than one cancelation month; for example, if he has canceled in August, I want row_count = 2 for all dates in August, and so on.
It's not obvious why partition by is partitioning by retention date instead of by retention month year.
I get the impression that row_number is not what you want; rather, you are interested in dense_rank, which would give you your expected output.
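One way to get that, as a sketch: the month expressions move from partition by into order by, since dense_rank only increments across distinct ordering values, and partitioning by the months as well would reset the rank to 1 in every partition.

dense_rank() over (
    partition by user_id
    order by cast(date_format(cancelation_date, 'yyyyMM') as integer),
             cast(date_format(retention_date, 'yyyyMM') as integer)
) as row_count

With this, both July retention rows for user 566 get row_count 1, and a later cancelation month would bump the rank to 2.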

Cassandra: How to model data so that percentage change in time ranges can be calculated?

I have a huge amount of data, which I plan to store in Cassandra. I am new to Cassandra and am trying to find a data model that will work for me.
My data is various parameters for commodities gathered over irregular time intervals:
commodity_id | timestamp    | param1 | param2
c1           | '2018-01-01' | 5      | 15
c1           | '2018-01-03' | 7      | 15
c1           | '2018-01-08' | 8      | 10
c2           | '2018-01-01' | 100    | 13
c2           | '2018-01-02' | 140    | 13
c2           | '2018-01-05' | 130    | 13
c2           | '2018-01-06' | 150    | 13
I need to query the database, and get commodity IDs by "percentage change" in the params.
Ex. Find out all commodities whose param2 increased by more than 50% between '2018-01-02' and '2018-01-06'
CREATE TABLE "commodity" (
commodity_id text,
timestamp date,
param1 int,
param2 int,
PRIMARY KEY (commodity_id, timestamp)
)
You should be fine with this table. You can expect about 365 entries per commodity partition per year, which is reasonably small, so you don't need any artificial keys. Even if you have a large number of commodities, you won't run out of partitions, as the Murmur3 partitioner actually has a range of -2^63 to +2^63-1. That's 18,446,744,073,709,551,616 possible values.
I would pull the data from Cassandra and calculate the values in the app layer.
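A minimal sketch of that app-layer calculation, assuming the DataStax Python driver; the contact point, keyspace name, and the way you enumerate commodity IDs are all placeholders:

from datetime import date
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(['127.0.0.1']).connect('mykeyspace')

def pct_change(commodity_id, start, end):
    # One partition, one date range; rows arrive in timestamp order
    # because timestamp is the clustering column.
    rows = list(session.execute(
        "SELECT timestamp, param2 FROM commodity "
        "WHERE commodity_id = %s AND timestamp >= %s AND timestamp <= %s",
        (commodity_id, start, end)))
    if len(rows) < 2 or rows[0].param2 == 0:
        return None
    return (rows[-1].param2 - rows[0].param2) * 100.0 / rows[0].param2

# e.g. find commodities whose param2 rose by more than 50%
for cid in ('c1', 'c2'):
    change = pct_change(cid, date(2018, 1, 2), date(2018, 1, 6))
    if change is not None and change > 50:
        print(cid, change)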

Cassandra query table based on row range

I am new to Cassandra. I am using Cassandra 3.0 and the DataStax Java driver for development. I would like to know whether Cassandra provides any option to fetch data based on a rowkey range?
something like
select * from <table-name> where rowkey > ? and rowkey < ?;
If not, is there any other option in Cassandra (Java/CQL) to fetch data based on row ranges?
Unfortunately, there really isn't a mechanism in Cassandra that works in the way that you are asking. The only way to run a range query on your partition keys (rowkey) is with the token function. This is because Cassandra orders its rows in the cluster by the hashed token value of the partition key. That value won't really have any meaning for you, but it does allow you to "page" through a large table without encountering timeouts.
SELECT * FROM <table-name>
WHERE token(rowkey) > -9223372036854775807
AND token(rowkey) < -5534023222112865485;
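If you actually need to walk the entire table, that token range can drive a simple paging loop. A sketch with the DataStax Python driver, where 'mykeyspace', 'mytable', and 'rowkey' are hypothetical stand-ins for your own names:

from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(['127.0.0.1']).connect('mykeyspace')

last = -2**63  # lower bound of the Murmur3 token range
while True:
    rows = list(session.execute(
        "SELECT token(rowkey) AS t, rowkey FROM mytable "
        "WHERE token(rowkey) > %s LIMIT 1000", (last,)))
    if not rows:
        break
    for row in rows:
        print(row.rowkey)  # process each row here
    last = rows[-1].t      # resume after the last token seen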
The way to go about range querying on meaningful values is to find a value to partition your rows by, and then cluster by a numeric or time value. For example, I can query a table of events by date range if I partition my data by month (PRIMARY KEY (monthbucket, eventdate)):
aploetz@cqlsh:stackoverflow> SELECT * FROM events
WHERE monthbucket='201509'
AND eventdate > '2015-09-19' AND eventdate < '2015-09-26';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 06:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 05:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 06:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-20 05:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 06:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(5 rows)
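For reference, a table definition consistent with that query and output might look like this (a reconstruction on my part; the clustering columns after eventdate are an assumption to keep begin/end rows unique):

CREATE TABLE events (
    monthbucket text,
    eventdate timestamp,
    beginend text,
    eventid uuid,
    eventname text,
    PRIMARY KEY ((monthbucket), eventdate, beginend, eventid)
) WITH CLUSTERING ORDER BY (eventdate DESC, beginend ASC, eventid ASC);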

Cassandra: low cardinality partition

Let's say I have a table, something like this:
CREATE TABLE Users (
    user UUID,
    seq INT,
    group TEXT,
    time BIGINT,
    PRIMARY KEY ((user), seq)
);
This follows the desired pattern of Cassandra, with good distribution across partitions (assuming the default Murmur3 hash partitioner).
However, I also need to (rarely) perform range queries on time, in time order. This doesn't seem possible in Cassandra. In reality I do need to access the data by group, so (group, time) is acceptable. Since there doesn't seem to be a way to have a secondary index span multiple columns, I guess the right thing is to denormalize, into something like this:
CREATE TABLE UsersByGroupTime (
    user UUID,
    seq INT,
    group TEXT,
    time BIGINT,
    PRIMARY KEY ((group), time)
) WITH CLUSTERING ORDER BY (time ASC);
This works entirely as it should, except that group is really low cardinality, let's say ('A','B','C'), with uneven distribution across users. Since queries on that table are rare, I'm not worried about hot nodes, but I am worried about uneven distribution, perhaps even a single node getting all of it.
Is this a common scenario and is there any way to mitigate this or are there alternative solutions?
One technique to help avoid hot spots in Cassandra time-series models is to make use of a "time bucket." Essentially, you determine the "happy medium" level of time precision that provides adequate data distribution, while also being known and semi-convenient to query by.
For the purposes of this example, I'll choose year and month ("yyyyMM"). Note: I have no idea if year and month will work for you...it's just an example. Once you determine your time bucket, you would add it as an additional partition key, like this:
CREATE TABLE UsersByGroupTime (
    user UUID,
    seq INT,
    group TEXT,
    time TIMEUUID,
    yearmonth BIGINT,
    PRIMARY KEY ((group, yearmonth), time)
) WITH CLUSTERING ORDER BY (time DESC);
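A sketch of computing that bucket at write time, assuming the DataStax Python driver (the contact point, the keyspace, and the example values are placeholders):

from datetime import datetime, timezone
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.util import uuid_from_time  # builds a TIMEUUID

session = Cluster(['127.0.0.1']).connect('stackoverflow2')

now = datetime.now(timezone.utc)
yearmonth = int(now.strftime('%Y%m'))  # e.g. 201505
session.execute(
    "INSERT INTO usersbygrouptime (group, yearmonth, time, seq, user) "
    "VALUES (%s, %s, %s, %s, %s)",
    ('B', yearmonth, uuid_from_time(now), 1, uuid4()))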
After inserting some rows, queries like this will work:
aploetz@cqlsh:stackoverflow2> SELECT group, yearmonth, dateof(time), time, seq, user
FROM usersbygrouptime WHERE group='B' AND yearmonth=201505;
group | yearmonth | dateof(time) | time | seq | user
-------+-----------+--------------------------+--------------------------------------+-----+--------------------------------------
B | 201505 | 2015-05-16 10:04:10-0500 | ceda56f0-fbdc-11e4-bd43-21b264d4c94d | 1 | d57ba8a4-db24-440c-a983-b1dd6b0d2e27
B | 201505 | 2015-05-16 10:04:09-0500 | ce1cac40-fbdc-11e4-bd43-21b264d4c94d | 1 | 66d07cbb-a2ff-4d56-8fa1-14dfaf684474
B | 201505 | 2015-05-16 10:04:08-0500 | cd525760-fbdc-11e4-bd43-21b264d4c94d | 1 | 07b589ac-4d5f-401e-a34f-e3479e269e01
B | 201505 | 2015-05-16 10:04:06-0500 | cc76c470-fbdc-11e4-bd43-21b264d4c94d | 1 | 984f85b5-ea58-4cf8-b512-43abacb227c9
(4 rows)
Now that may or may not help you query-wise, so you will need to spend some time ensuring that you pick an appropriate time bucket. But, this does help in terms of data distribution in the ring, which you can see with the token function:
aploetz@cqlsh:stackoverflow2> SELECT group, yearmonth, token(group,yearmonth)
FROM usersbygrouptime ;
group | yearmonth | token(group, yearmonth)
-------+-----------+-------------------------
A | 201503 | -3784784210711042553
A | 201504 | -610775546464185720
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
(12 rows)
Notice how different tokens are generated for each group/yearmonth pair, even though some of them have the same group ("A").

Range query - Data modeling for time series in CQL Cassandra

I have a table like this:
CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, rowkey)
);
with some data:
partitionkey | rowkey | policyid | policyname | date
p1           | r1     | pl1      | plicy1     | 2007-01-02 00:00:00+0000
p1           | r2     | pl2      | plicy2     | 2007-01-03 00:00:00+0000
p2           | r3     | pl3      | plicy3     | 2008-01-03 00:00:00+0000
I want to be able to find:
1/ data from a particular partition key
2/ data from a particular partition key & rowkey
3/ Range query on date given a partitionkey
1/ and 2/ are trivial:
select * from test where partitionkey='p1';
partitionkey | rowkey | policyid | policyname | date
p1           | r1     | pl1      | plicy1     | 2007-01-02 00:00:00+0000
p1           | r2     | pl2      | plicy2     | 2007-01-03 00:00:00+0000
but what about 3/?
Even with an index it doesn't work:
create index i1 on test (date);
select * from test where partitionkey='p1' and date = '2007-01-02';
partitionkey | rowkey | policyid | policyname | date
p1           | r1     | pl1      | plicy1     | 2007-01-02 00:00:00+0000
but
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
Any idea?
thanks,
Matt
CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
First of all, you really should use more descriptive column names instead of partitionkey and rowkey (and even date, for that matter). By looking at those column names, I really can't tell what kind of data this table is supposed to be indexed by.
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
As for this issue, try making your "date" column a part of your primary key.
primary key (partitionkey, rowkey, date)
Once you do that, I think your date range queries will function appropriately.
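As a sketch of the difference that makes: with primary key (partitionkey, rowkey, date), the date range works once rowkey is also pinned. If the range has to work with only the partition key (your requirement 3/), date must come before rowkey among the clustering columns:

CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, date, rowkey)
);

SELECT * FROM test
WHERE partitionkey = 'p1'
AND date > '2007-01-02';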
For more information on this, check out DataStax Academy's (free) course called Java Development With Apache Cassandra. Session 5, Module 104 discusses how to model time series data and that should help you out.
