Data Modeling for range queries in Cassandra

I’m designing a new table in Cassandra
CREATE TABLE student (
    studentid text PRIMARY KEY,
    department text,
    major text,
    updatedon timestamp
);
I need to perform three queries on this table:
Query all data (findByAll)
Query all data based on major, planning on adding a secondary index on this column
Query data based on a time range, i.e. the updatedon column
I can achieve this using a composite primary key; however, I also want rows to be uniquely identifiable by id alone. For example:
Row 1 :
1| engineering | electrical | 01-01-2021
If the student were to switch to a different major:
1| engineering | mechanical | 02-02-2021
I would like to perform an upsert where only the major and updatedon columns change.
My conundrum is that I don't understand what I should use as my primary key if I want to perform range queries on updatedon while rows are uniquely identified by id only.
I came across a bucketing approach but wasn’t sure if that would add additional complexity to my simple/minimal design.

It looks like you're approaching it backwards by starting with the table design. When modelling your data in Cassandra, it sounds counter-intuitive but you need to start with the application queries first and design tables against those queries.
Let me illustrate by listing all your app queries and designing a table for each of them.
APP QUERY 1 - Query all data (findByAll)
If your intention is to retrieve all the records to display them, this is a bad idea in Cassandra since it will require a full table scan. I'm aware that developers are used to doing this on toy applications with a small amount of data but in Cassandra, data is distributed across nodes so full table scans don't scale.
Think of situations where you have a million or more records with hundreds of nodes in the cluster. It doesn't make sense for an app to wait for the query to finish retrieving all records.
APP QUERY 2 - Query all data based on major, planning on adding a secondary index on this column
Adding an index on major isn't a good idea if performance matters to you. You should design a table specifically optimised for this query. For example:
CREATE TABLE students_by_major (
    major text,
    studentid text,
    department text,
    updatedon timestamp,
    PRIMARY KEY (major, studentid)
);
In this table, each major partition has 1 or more rows of studentid. For example:
major | studentid | department | updatedon
------------------------+-----------+-------------+---------------------------------
computer science | 321 | science | 2020-01-23 00:00:00.000000+0000
electrical engineering | 321 | engineering | 2020-02-24 00:00:00.000000+0000
electrical engineering | 654 | engineering | 2019-05-06 00:00:00.000000+0000
chemical engineering | 654 | engineering | 2019-07-08 00:00:00.000000+0000
arts | 987 | law | 2020-09-12 00:00:00.000000+0000
civil engineering | 654 | engineering | 2019-02-04 00:00:00.000000+0000
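To serve this query you then read a single partition. A minimal sketch against the sample data above:
SELECT * FROM students_by_major WHERE major = 'electrical engineering';
This returns the two rows in that partition (student IDs 321 and 654).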
APP QUERY 3 - Query data based on a time range, i.e. the updatedon column
You'll only be able to do a range query on updatedon if the column is defined in the primary key.
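For example, this range query on the original student table is rejected because updatedon is not part of the primary key; Cassandra will only run it with ALLOW FILTERING appended, which forces a full table scan (shown only to illustrate the problem):
SELECT * FROM student WHERE updatedon > '2021-01-01 +0000';
The table in APP QUERY 5 below is designed so this kind of range query becomes a normal single-partition read.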
APP QUERY 4 - If the student were to switch to a different major?
You can have a table where each student has multiple rows of majors:
CREATE TABLE majors_by_student (
    studentid text,
    major text,
    department text,
    updatedon timestamp,
    PRIMARY KEY (studentid, major)
);
For example, student ID 654 has updated their major 3 times:
cqlsh> SELECT * FROM majors_by_student WHERE studentid = '654';
 studentid | major                  | department  | updatedon
-----------+------------------------+-------------+---------------------------------
       654 | chemical engineering   | engineering | 2019-07-08 00:00:00.000000+0000
       654 | civil engineering      | engineering | 2019-02-04 00:00:00.000000+0000
       654 | electrical engineering | engineering | 2019-05-06 00:00:00.000000+0000
Note that the rows come back sorted by the clustering column major, not by updatedon.
APP QUERY 5 - Perform range queries on updatedon where rows are uniquely identified by studentid only
CREATE TABLE updated_majors_by_student (
    studentid text,
    updatedon timestamp,
    department text,
    major text,
    PRIMARY KEY (studentid, updatedon)
);
Using student 654 above as an example, you can do a range query for any updates made after April 30 with:
SELECT * FROM updated_majors_by_student WHERE studentid = '654' AND updatedon > '2019-04-30 +0000';
Note that since updatedon is a timestamp, you need to specify the timezone for precision and +0000 is the TZ for UTC.
studentid | updatedon | department | major
-----------+---------------------------------+-------------+------------------------
654 | 2019-07-08 00:00:00.000000+0000 | engineering | chemical engineering
654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
To keep the tables above in sync, you need to use CQL BATCH statements as I've described in this article -- https://community.datastax.com/articles/2744/.
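A minimal sketch of such a logged batch, writing one student's change to all three tables (in practice you would bind a single timestamp from the application rather than calling toTimestamp(now()) three times):
BEGIN BATCH
    INSERT INTO students_by_major (major, studentid, department, updatedon)
    VALUES ('mechanical', '1', 'engineering', toTimestamp(now()));
    INSERT INTO majors_by_student (studentid, major, department, updatedon)
    VALUES ('1', 'mechanical', 'engineering', toTimestamp(now()));
    INSERT INTO updated_majors_by_student (studentid, updatedon, department, major)
    VALUES ('1', toTimestamp(now()), 'engineering', 'mechanical');
APPLY BATCH;
Cheers!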

Related

Cassandra CLUSTERING ORDER does not order data properly

I created a table that has timestamps in it but when I try to Cluster Order By the timestamp variable, it is not ordered properly.
To create the table I wrote:
CREATE TABLE videos_by_tag (
    tag text,
    video_id uuid,
    added_date timestamp,
    title text,
    PRIMARY KEY ((tag), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date ASC);
And the output I got when doing a SELECT * FROM videos_by_tag is:
tag | added_date | video_id | title
-----------+---------------------------------+--------------------------------------+------------------------------
datastax | 2013-04-16 00:00:00.000000+0000 | 5645f8bd-14bd-11e5-af1a-8638355b8e3a | What is DataStax Enterprise?
datastax | 2013-10-16 00:00:00.000000+0000 | 4845ed97-14bd-11e5-8a40-8338255b7e33 | DataStax Studio
cassandra | 2012-04-03 00:00:00.000000+0000 | 245e8024-14bd-11e5-9743-8238356b7e32 | Cassandra & SSDs
cassandra | 2013-03-17 00:00:00.000000+0000 | 3452f7de-14bd-11e5-855e-8738355b7e3a | Cassandra Intro
cassandra | 2014-01-29 00:00:00.000000+0000 | 1645ea59-14bd-11e5-a993-8138354b7e31 | Cassandra History
(5 rows)
As you can see the dates are out of order. There is a 2012 year value in the middle of the output.
You can fine-tune the display order using the ORDER BY clause. The partition key must be restricted in the WHERE clause, and the ORDER BY clause names the clustering column to sort by.
Example:
SELECT * FROM videos_by_tag
WHERE tag = 'datastax' ORDER BY added_date ASC;
This is a very common misconception in Cassandra. The data is in fact ordered correctly in the sample data you posted.
The CLUSTERING ORDER applies to the sort order of the rows within a partition -- NOT across ALL partitions.
Using the example you posted, the clustering column added_date is correctly sorted in ascending order for the partition tag = 'datastax':
tag | added_date
-----------+---------------------------------
datastax | 2013-04-16 00:00:00.000000+0000
datastax | 2013-10-16 00:00:00.000000+0000
Similarly, added_date is sorted in ascending order for tag = 'cassandra':
tag | added_date
-----------+---------------------------------
cassandra | 2012-04-03 00:00:00.000000+0000
cassandra | 2013-03-17 00:00:00.000000+0000
cassandra | 2014-01-29 00:00:00.000000+0000
Like I said, the sort order only applies to rows within a partition.
It would be impossible to sort all rows in all partitions because such a task does not scale. Imagine if you had billions of partitions in the table across hundreds of nodes. Every time you inserted a new row into any partition, Cassandra would have to do a full table scan to re-sort the data, and it just wouldn't make sense to do so.
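If what you actually want is the newest videos first within each tag, declare the clustering order as descending when creating the table. A sketch of the same schema with the order flipped:
CREATE TABLE videos_by_tag (
    tag text,
    video_id uuid,
    added_date timestamp,
    title text,
    PRIMARY KEY ((tag), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC);
Cheers!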

How do I model a CQL table such that it can be queried by zip_code, or by zip_code and hash?

Hi all, I have a Cassandra table with hash as the primary key and another column containing a list. I want to add another column named zipcode such that I can query Cassandra based on either zipcode, or zipcode and hash:
Hash | List | zipcode
select * from table where zip_code = '12345';
select * from table where zip_code = '12345' AND hash = 'abcd';
Is there any way that I could do this?
The recommendation in Cassandra is to design your tables based on your access patterns. In your case you would like to get results by zipcode, and by zipcode and hash, so ideally you would have two tables like this:
CREATE TABLE keyspace.table1 (
    zipcode text,
    field1 text,
    field2 text,
    PRIMARY KEY (zipcode)
);
and
CREATE TABLE keyspace.table2 (
    hashcode text,
    zipcode text,
    field1 text,
    field2 text,
    PRIMARY KEY ((hashcode, zipcode))
);
Then you may be required to redesign your tables based on your data. I recommend you understand data model design in Cassandra before proceeding further.
The ALLOW FILTERING construct can also be used, but whether that is sensible depends on how big your data is. If you have very large data, avoid it: it requires a complete scan of the table, which is expensive in both resources and time.
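For instance, with table2 above partitioned by (hashcode, zipcode), a query on zipcode alone is rejected unless ALLOW FILTERING is appended (on recent Cassandra versions), and it then scans every partition. A sketch, shown only to illustrate the cost:
SELECT * FROM keyspace.table2 WHERE zipcode = '12345' ALLOW FILTERING;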
It is possible to design a single table that will satisfy both app queries.
In this example schema, the table is partitioned by zip code with hash as the clustering key:
CREATE TABLE table_by_zipcode (
    zipcode int,
    hash text,
    ...
    PRIMARY KEY (zipcode, hash)
);
With this design, each zip code can have one or more rows of hash. Here's the table with some test data in it:
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
456 | tuv | 5 | banana
456 | xyz | 4 | apple
The table contains two partitions zipcode = 123 and zipcode = 456. The first zip code has three rows (abc, def, ghi) and the second has two rows (tuv, xyz).
You can query the table using just the partition key (zipcode), for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123;
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
It is also possible to query the table with the partition key zipcode and clustering key hash, for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123 AND hash = 'abc';
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
Cheers!

Cassandra query table based on row range

I am new to Cassandra. I am using Cassandra 3.0 and the DataStax Java driver for development. I would like to know whether Cassandra provides any option to fetch data based on a rowkey range?
something like
select * from <table-name> where rowkey > ? and rowkey < ?;
If not, is there any other option in Cassandra (Java/CQL) to fetch data based on row ranges?
Unfortunately, there really isn't a mechanism in Cassandra that works in the way that you are asking. The only way to run a range query on your partition keys (rowkey) is with the token function. This is because Cassandra orders its rows in the cluster by the hashed token value of the partition key. That value would not really have any meaning for you, but it would allow you to "page" through a large table without encountering timeouts.
SELECT * FROM <table-name>
WHERE token(rowkey) > -9223372036854775807
AND token(rowkey) < -5534023222112865485;
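One way to page with this (a sketch; 'last-rowkey-seen' stands for whatever partition key your previous page ended on):
SELECT * FROM <table-name>
WHERE token(rowkey) > token('last-rowkey-seen')
LIMIT 100;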
The way to go about range querying on meaningful values is to find a value to partition your rows by, and then cluster by a numeric or time value. For example, I can query a table of events by date range if I partition my data by month (PRIMARY KEY(monthbucket, eventdate)):
aploetz#cqlsh:stackoverflow> SELECT * FROM events
WHERE monthbucket='201509'
AND eventdate > '2015-09-19' AND eventdate < '2015-09-26';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 06:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 05:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 06:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-20 05:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 06:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(5 rows)
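The underlying table for that query might look something like this (a sketch inferred from the output above; the clustering order on eventdate is assumed to be descending, since the rows come back newest-first):
CREATE TABLE events (
    monthbucket text,
    eventdate timestamp,
    beginend text,
    eventid uuid,
    eventname text,
    PRIMARY KEY (monthbucket, eventdate, beginend)
) WITH CLUSTERING ORDER BY (eventdate DESC, beginend ASC);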

Cassandra storage internal

I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
CREATE TABLE log_date (
    userid bigint,
    time timeuuid,
    category text,
    subcategory text,
    itemid text,
    count int,
    price int,
    PRIMARY KEY ((userid), time) - #1
    PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price) - #2
);
Suppose that I have a table like above.
In case of #1, a CQL row will generate 6(or 5?) columns in storage.
In case of #2, a CQL row will generate a very composite column in storage.
I'm wondering what's more effective way for storing logs into Cassandra.
Please focus on those given two situations.
I don't need any real-time reads. Just writes.
If you want to suggest other options please refer to the following.
The reasons I chose Cassandra for storing logs are
Linear scalability and good for heavy writing.
It has schema in CQL. I really prefer having a schema.
Seems to support Spark well enough. Datastax's cassandra-spark connector seems to have data locality awareness.
I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli, and query all rows for userid 1002:
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)
Simple enough, right? We see userid 1002 as the RowKey, and our clustering column of time as a column key. Following that are all of our columns for each column key (time). And I believe your first instance generates 6 columns, as I'm pretty sure that includes the placeholder for the column key, because your PRIMARY KEY could point to an empty value (as your 2nd example key does).
But what about the 2nd version for userid 1002?
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)
Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).
So what does this all mean for you? Well, a few things:
This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example), you really can't unless you DELETE and recreate the row. Although from a logging perspective, that's probably ok.
Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you were concerned about querying and sorting your data, it would be important to understand that you would have to query for each specific userid for sort order to make any difference.
The biggest issue I see, is that right now you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed more than 2 billion in a year, or whatever).
It looks to me like your 2nd option might be the better choice. But honestly for what you're doing, either of them will probably work ok.
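A sketch of that "date bucket" idea (the yearbucket column is hypothetical, added here purely for illustration):
CREATE TABLE log_date_bucketed (
    userid bigint,
    yearbucket int,
    time timeuuid,
    category text,
    subcategory text,
    itemid text,
    count int,
    price int,
    PRIMARY KEY ((userid, yearbucket), time, category, subcategory, itemid, count, price)
);
Each (userid, yearbucket) pair now becomes its own partition, which caps how wide any single partition can grow.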

Financial data analysis modelling in Apache Cassandra?

I need to model and store financial data in Apache Cassandra.
Data is accessed by date and business unit, so currently my model uses the date and business unit id as a compound row key.
I want to use wide-rows so I can pull the figures for a whole day (and unit) in one query.
For any given day, for a particular business unit, I need to store a series of increasingly granular breakdowns, like so (ignore the figures, they're purely illustrative):
| rowkey | USD | GBP | JPY | etc ....
|-------------|-------|------|------|----------
| 31122014-1 | 112 | 3006 | 234 |
| 31122014-2 | 3378 | -12.4| 998 |
| 31122014-3 | -456 | 2034 | 127 |
And then a more detailed breakdown, using compound columns:
| rowkey | USD-D1 | USD-D2 | GBP-D1 | GBP-D2 | etc ....
|-------------|--------|--------|--------|------------------
| 31122014-1 | 65 | 54 | 175 | 29 |
| 31122014-2 | 2003 | -6.4 | 603 | 349 |
| 31122014-3 | -230 | -198 | -53 | 217 |
And then an even more detailed breakdown:
| rowkey | USD-D1-X1 | USD-D1-X2 | USD-D1-X3 | USD-D2-X1 | etc ....
|-------------|-----------|-----------|-----------|-----------|-------
| 31122014-1 | 23 | 16 | 98 | 29 |
| 31122014-2 | 389 | -3.2 | 237 | 119 |
| 31122014-3 | -105 | -67 | -28 | 178 |
Is this the best way to model these breakdowns using three separate column families (as shown here)?
Or does it make more sense to store only the most granular breakdown and then use some form of column aggregation (if it exists) to extract the less granular data-sets?
I know Cassandra's aggregation capability is limited/non-existent; I haven't found anything in the API to suggest how I might aggregate across columns like this.
I know I could do the aggregation in the application tier, but then the question is about the trade offs between retrieving unnecessary data, moving computational overhead and maintaining additional column families. I'm hoping Cassandra provides some way of solving this at the data tier.
Depending on how you want the data to be modeled, you can:
Use your solution, in which you create a column family for each level of detail.
If you feel that there are far too many column families, or that you will always use the next column family, I would suggest making the breakdown part of the primary key, either as a clustering key or directly as part of the partition key.
For example, if row key access is always going to include a currency, you could model it like this:
| rowkey         | amount |
|----------------|--------|
| 31122014-1,GBP | 112    |
Obviously this will spread the data for a single rowkey much better, but it will increase the number of row keys.
You could also use aggregation, as well as the custom types that Cassandra allows.
Consider the following before you choose one of the strategies:
a. Distribution of the rows across nodes
b. Sparse columns vs wide columns
c. Effects on row cache (if you are going to turn it on) and key cache
d. And the most important, your selection queries
I think your solution is likely to be effective. For Cassandra it's generally better to store data in multiple places based on what queries you're expecting to run against it.
If you see each of these use cases as three separate use cases that will be queried at different times, then you've got a solid datamodel.
For what it's worth, this use case plays very well to the strengths of CQL which would model it as follows:
CREATE TABLE finance0 (
    day DATE,
    unit INT,
    currency TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency)
);
CREATE TABLE finance1 (
    day DATE,
    unit INT,
    currency TEXT,
    sorter1 TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency, sorter1)
);
CREATE TABLE finance2 (
    day DATE,
    unit INT,
    currency TEXT,
    sorter1 TEXT,
    sorter2 TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency, sorter1, sorter2)
);
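With this layout each (day, unit) pair is a single partition, so the original goal of pulling a whole day and unit in one query becomes a single-partition read, e.g. (the literal values are illustrative):
SELECT * FROM finance2 WHERE day = '2014-12-31' AND unit = 1;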
