I need to model and store financial data in Apache Cassandra.
Data is accessed by date and business unit, so currently my model uses the date and business unit id as a compound row key.
I want to use wide-rows so I can pull the figures for a whole day (and unit) in one query.
For any given day, for a particular business unit, I need to store a series of increasingly granular breakdowns, like so (ignore the figures, they're purely illustrative):
| rowkey | USD | GBP | JPY | etc ....
|-------------|-------|------|------|----------
| 31122014-1 | 112 | 3006 | 234 |
| 31122014-2 | 3378 | -12.4| 998 |
| 31122014-3 | -456 | 2034 | 127 |
And then a more detailed breakdown, using compound columns:
| rowkey | USD-D1 | USD-D2 | GBP-D1 | GBP-D2 | etc ....
|-------------|--------|--------|--------|------------------
| 31122014-1 | 65 | 54 | 175 | 29 |
| 31122014-2 | 2003 | -6.4 | 603 | 349 |
| 31122014-3 | -230 | -198 | -53 | 217 |
And then an even more detailed breakdown:
| rowkey | USD-D1-X1 | USD-D1-X2 | USD-D1-X3 | USD-D2-X1 | etc ....
|-------------|-----------|-----------|-----------|-----------|-------
| 31122014-1 | 23 | 16 | 98 | 29 |
| 31122014-2 | 389 | -3.2 | 237 | 119 |
| 31122014-3 | -105 | -67 | -28 | 178 |
Is this the best way to model these breakdowns using three separate column families (as shown here)?
Or does it make more sense to store only the most granular breakdown and then use some form of column aggregation (if it exists) to extract the less granular data-sets?
I know Cassandra's aggregation capability is limited, if not non-existent; I haven't found anything in the API to suggest how I might aggregate across columns like this.
I know I could do the aggregation in the application tier, but then the question is about the trade-offs between retrieving unnecessary data, moving computational overhead, and maintaining additional column families. I'm hoping Cassandra provides some way of solving this at the data tier.
Depending on how you want the data to be modeled, you can:
Use your solution as-is, creating a separate column family for each level of detail.
Alternatively, if you feel there are far too many column families, or that you will always end up using the next (more detailed) column family anyway, I would suggest making the breakdown part of the primary key, either as a clustering key or directly as part of the partition key.
For example:
If, according to your data model, row key access is always going to include a currency, you could model it like this:
| rowkey         | amount |
|----------------|--------|
| 31122014-1,GBP | 112    |
Obviously this will spread the data for what used to be a single row key much better across the cluster, but it will increase the number of row keys, as sketched below.
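A minimal CQL sketch of that option (table and column names are assumed for illustration, not taken from the question):
CREATE TABLE figures_by_day_unit_currency (
    day TEXT,
    unit INT,
    currency TEXT,
    amount DECIMAL,
    PRIMARY KEY ((day, unit, currency))
);
Each figure then lives in its own partition, addressed by the full (day, unit, currency) combination.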
You could also use aggregation, as well as the custom types that Cassandra allows.
Consider the following before you choose one of these strategies:
a. Distribution of the rows across nodes
b. Sparse columns vs wide columns
c. Effects on row cache (if you are going to turn it on) and key cache
d. And the most important, your selection queries
I think your solution is likely to be effective. With Cassandra it's generally better to store data in multiple places, based on the queries you expect to run against it.
If you see these as three separate use cases that will be queried at different times, then you've got a solid data model.
For what it's worth, this use case plays very well to the strengths of CQL, which would model it as follows:
CREATE TABLE finance0 (
    day DATE,
    unit INT,
    currency TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency)
);
CREATE TABLE finance1 (
    day DATE,
    unit INT,
    currency TEXT,
    sorter1 TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency, sorter1)
);
CREATE TABLE finance2 (
    day DATE,
    unit INT,
    currency TEXT,
    sorter1 TEXT,
    sorter2 TEXT,
    amount BIGINT,
    PRIMARY KEY ((day, unit), currency, sorter1, sorter2)
);
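With this layout, a whole day's figures for one unit come back in a single-partition query, for example:
SELECT currency, sorter1, sorter2, amount
FROM finance2
WHERE day = '2014-12-31' AND unit = 1;
As for aggregating at the data tier: on Cassandra 3.10 or newer you could derive the coarser totals from the detailed table with GROUP BY on the clustering columns instead of maintaining all three tables, at the cost of reading and summing more cells per query:
SELECT day, unit, currency, sum(amount)
FROM finance2
WHERE day = '2014-12-31' AND unit = 1
GROUP BY day, unit, currency;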
I’m designing a new table in Cassandra
create table student (
studentid text PRIMARY KEY,
department text,
major text,
updatedon timestamp)
I would need to perform three queries on this table:
Query all data (findByAll)
Query all data based on major, planning on adding a secondary index on this column
Query data based on a time range, i.e. the updatedon column
I can achieve this using a composite primary key; however, I also want rows to be uniquely identifiable based on id only. For example:
Row 1:
1 | engineering | electrical | 01-01-2021
What if the student were to switch to a different major?
1 | engineering | mechanical | 02-02-2021
I would like to perform an upsert where only the major and updated on columns would change.
My conundrum is that I don't understand what I should have as my primary key if I want to perform range queries on updatedon, where rows are uniquely identified by id only.
I came across a bucketing approach but wasn’t sure if that would add additional complexity to my simple/minimal design.
It looks like you're approaching it backwards by starting with the table design. When modelling your data in Cassandra, it sounds counter-intuitive but you need to start with the application queries first and design tables against those queries.
Let me illustrate by listing all your app queries and designing a table for each of them.
APP QUERY 1 - Query all data (findByAll)
If your intention is to retrieve all the records to display them, this is a bad idea in Cassandra since it will require a full table scan. I'm aware that developers are used to doing this on toy applications with a small amount of data, but in Cassandra data is distributed across nodes, so full table scans don't scale.
Think of situations where you have a million or more records with hundreds of nodes in the cluster. It doesn't make sense for an app to wait for the query to finish retrieving all records.
APP QUERY 2 - Query all data based on major, planning on adding a secondary index on this column
Adding an index on major isn't a good idea if performance matters to you. You should design a table specifically optimised for this query. For example:
CREATE TABLE students_by_major (
major text,
studentid text,
department text,
updatedon timestamp,
PRIMARY KEY (major, studentid)
)
In this table, each major partition has 1 or more rows of studentid. For example:
major | studentid | department | updatedon
------------------------+-----------+-------------+---------------------------------
computer science | 321 | science | 2020-01-23 00:00:00.000000+0000
electrical engineering | 321 | engineering | 2020-02-24 00:00:00.000000+0000
electrical engineering | 654 | engineering | 2019-05-06 00:00:00.000000+0000
chemical engineering | 654 | engineering | 2019-07-08 00:00:00.000000+0000
arts | 987 | law | 2020-09-12 00:00:00.000000+0000
civil engineering | 654 | engineering | 2019-02-04 00:00:00.000000+0000
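A lookup by major is then a single-partition read, for example:
SELECT * FROM students_by_major WHERE major = 'electrical engineering';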
APP QUERY 3 - Query data based on a time range, i.e. the updatedon column
You'll only be able to do a range query on updatedon if the column is defined in the primary key.
APP QUERY 4 - What if the student were to switch to a different major?
You can have a table where each student has multiple rows of majors:
CREATE TABLE majors_by_student (
studentid text,
major text,
department text,
updatedon timestamp,
PRIMARY KEY (studentid, major)
)
For example, student ID 654 has updated their major 3 times:
cqlsh> SELECT * FROM majors_by_student WHERE studentid = '654';
studentid | updatedon | department | major
-----------+---------------------------------+-------------+------------------------
654 | 2019-07-08 00:00:00.000000+0000 | engineering | chemical engineering
654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
654 | 2019-02-04 00:00:00.000000+0000 | engineering | civil engineering
APP QUERY 5 - You want to perform range queries on updatedon where rows are uniquely identified by studentid only.
CREATE TABLE community.updated_majors_by_student (
studentid text,
updatedon timestamp,
department text,
major text,
PRIMARY KEY (studentid, updatedon)
)
Using student 654 above as an example, you can do a range query for any updates made after April 30 with:
SELECT * FROM updated_majors_by_student WHERE studentid = '654' AND updatedon > '2019-04-30 +0000';
Note that since updatedon is a timestamp, you need to specify the timezone for precision and +0000 is the TZ for UTC.
studentid | updatedon | department | major
-----------+---------------------------------+-------------+------------------------
654 | 2019-07-08 00:00:00.000000+0000 | engineering | chemical engineering
654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
To keep the tables above in sync, you need to use CQL BATCH statements as I've described in this article -- https://community.datastax.com/articles/2744/.
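A minimal sketch of such a batch, assuming we're recording a new major for student 654 across the tables above:
BEGIN BATCH
  INSERT INTO students_by_major (major, studentid, department, updatedon)
  VALUES ('mechanical engineering', '654', 'engineering', toTimestamp(now()));
  INSERT INTO majors_by_student (studentid, major, department, updatedon)
  VALUES ('654', 'mechanical engineering', 'engineering', toTimestamp(now()));
  INSERT INTO updated_majors_by_student (studentid, updatedon, department, major)
  VALUES ('654', toTimestamp(now()), 'engineering', 'mechanical engineering');
APPLY BATCH;
Cheers!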
I have a table with a structure like this:
CREATE TABLE kaefko.se_vi_f55dfeebae00d2b3 (
value text PRIMARY KEY,
id text,
popularity bigint);
With data that looks like this:
value | id | popularity
--------+------------------+------------
rally | 4eff16cb91f96cd6 | 2
reddit | 11aa39686ed66ba5 | 3
red | 552d7e95af481415 | 1
really | 756bfa499965863c | 1
right | c5850c6b08f7966b | 1
redis | 7f1d251f399442d7 | 1
And I've created a materialized view that should sort these values by popularity, from biggest to smallest:
CREATE MATERIALIZED VIEW kaefko.se_vi_f55dfeebae00d2b3_by_popularity AS
SELECT *
FROM kaefko.se_vi_f55dfeebae00d2b3
WHERE popularity IS NOT null
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
But the data in the materialized view looks like this:
value | popularity | id
--------+------------+------------------
rally | 2 | 4eff16cb91f96cd6
reddit | 3 | 11aa39686ed66ba5
really | 1 | 756bfa499965863c
right | 1 | c5850c6b08f7966b
redis | 1 | 7f1d251f399442d7
As you can see there are two main issues:
Data is not sorted as defined in the materialized view
There is just a part of all data in the materialized view
I'm not very experienced in Cassandra and I've already spent hours trying to find the reason why this happens, to no avail. Could somebody please help me? Thank you <3
I'm using ScyllaDB 4.1.9-0 and cqlsh shows this:
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Alex's comment is 100% correct, the order is within the partition.
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
This means that the ordering by popularity is descending only among rows where the value field is the same. If I were to alter your example data to illustrate, you would get the following:
value | popularity | id
--------+------------+------------------
rally | 3 | 4eff16cb91f96cd6
rally | 2 | 11aa39686ed66ba5
really | 3 | 756bfa499965863c
really | 2 | c5850c6b08f7966b
really | 1 | 7f1d251f399442d7
The order is on a per partition key basis, not globally ordered.
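Within a single partition, though, the clustering order does apply; using the altered example data above:
SELECT * FROM kaefko.se_vi_f55dfeebae00d2b3_by_popularity WHERE value = 'rally';
 value | popularity | id
-------+------------+------------------
 rally |          3 | 4eff16cb91f96cd6
 rally |          2 | 11aa39686ed66ba5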
I have a very large amount of data which I plan to store in Cassandra. I am new to Cassandra and am trying to find a data model that will work for me.
My data is various parameters for commodities gathered over irregular time intervals:
commodity_id | timestamp | param1 | param2
c1 | '2018-01-01' | 5 | 15
c1 | '2018-01-03' | 7 | 15
c1 | '2018-01-08' | 8 | 10
c2 | '2018-01-01' | 100 | 13
c2 | '2018-01-02' | 140 | 13
c2 | '2018-01-05' | 130 | 13
c2 | '2018-01-06' | 150 | 13
I need to query the database and get commodity IDs by "percentage change" in the params.
For example: find all commodities whose param2 increased by more than 50% between '2018-01-02' and '2018-01-06'.
CREATE TABLE "commodity" (
commodity_id text,
timestamp date,
param1 int,
param2 int,
PRIMARY KEY (commodity_id, timestamp)
)
You should be fine with this table. You can expect roughly daysPerYear entries per commodity partition, which is reasonably small, so you don't need any artificial keys. Even if you have a large number of commodities, you won't run out of partitions: the Murmur3 partitioner has a token range of -2^63 to +2^63-1, which is 18,446,744,073,709,551,616 possible values.
I would pull the data from Cassandra and calculate the values in the app layer.
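The per-commodity slice needed for the percentage-change calculation is then a single-partition range query, for example:
SELECT timestamp, param2
FROM commodity
WHERE commodity_id = 'c2'
  AND timestamp >= '2018-01-02'
  AND timestamp <= '2018-01-06';
Note this fetches one commodity at a time; finding all commodities that moved by more than 50% means iterating over the commodity IDs in the application.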
I created a table in Cassandra for monitoring inserts from an application.
My partition key is an int composed of year+month+day, my clustering key is a timestamp, and after that come my username and some other fields.
I would like to display the last 5 inserts, but it seems that the partition key takes precedence over the "order by desc".
How can I get the correct result? Normally the clustering key induces the order, so why do I get this result? (Thanks in advance)
Informations :
Query : select tsp_insert, txt_name from ks_myKeyspace.myTable limit 5;
Result :
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
Wanted :
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
My table :
CREATE TABLE ks_myKeyspace.myTable(
idt_day int,
tsp_insert timestamp,
txt_name text, ...
PRIMARY KEY (idt_day, tsp_insert)) WITH CLUSTERING ORDER BY (tsp_insert DESC);
Ultimately, you are seeing the current order because you are not using a WHERE clause. You can see what's going on if you use the token function on your partition key:
aploetz#cqlsh:stackoverflow> SELECT idt_day,tsp_insert,token(idt_day),txt_name FROM mytable ;
idt_day | tsp_insert | system.token(idt_day) | txt_name
----------+---------------------------------+-----------------------+----------
20161028 | 2016-10-28 15:21:09.000000+0000 | 810871225231161248 | Jean
20161028 | 2016-10-28 15:21:01.000000+0000 | 810871225231161248 | Michel
20161028 | 2016-10-28 15:20:44.000000+0000 | 810871225231161248 | Quentin
20161031 | 2016-10-31 09:24:32.000000+0000 | 5928478420752051351 | Jacquie
20161031 | 2016-10-31 09:23:32.000000+0000 | 5928478420752051351 | Gabriel
(5 rows)
Results in Cassandra CQL will always come back in order of the hashed token value of the partition key (which you can see by using token). Within the partition keys, your CLUSTERING ORDER will be enforced.
That's key to understand... Result set ordering in Cassandra can only be enforced within a partition key. You have no control over the order that the partition keys come back in.
In short, use a WHERE clause on your idt_day and you'll see the order you expect.
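For example:
SELECT tsp_insert, txt_name FROM ks_myKeyspace.myTable WHERE idt_day = 20161031 LIMIT 5;
That restricts the read to the 20161031 partition, where your CLUSTERING ORDER BY (tsp_insert DESC) is honoured.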
It seems to me that you are getting the whole thing wrong. Partition keys are not used for ordering data; they are used only to determine the location of your data in the cluster, specifically the node. Moreover, the order really matters within a partition only.
Your query results really are unpredictable. Depending on which node is faster to answer (assuming a cluster and not a single node), you can get a different result every time. You should avoid selecting without partition restrictions; they don't scale.
You can, however, change your queries and perform one SELECT per day. Then you'd retrieve data ordered by your clustering key, in an order you control (you manually choose the order of the days in your queries). As a side note, this would also be faster because you could query multiple partitions in parallel.
I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
category text,
subcategory text,
itemid text,
count int,
price int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price) - #2
);
Suppose that I have a table like above.
In case #1, a CQL row will generate 6 (or 5?) columns in storage.
In case #2, a CQL row will generate a single, very composite column in storage.
I'm wondering which is the more effective way to store logs in Cassandra.
Please focus on those given two situations.
I don't need any real-time reads, just writes.
If you want to suggest other options please refer to the following.
The reasons I chose Cassandra for storing logs are
Linear scalability and good performance for heavy writes.
It has schema in CQL. I really prefer having a schema.
It seems to support Spark well enough; DataStax's cassandra-spark connector seems to have data-locality awareness.
I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli, and query all rows for userid 1002:
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)
Simple enough, right? We see userid 1002 as the RowKey, and our clustering column of time as a column key. Following that are all of our columns for each column key (time). I believe your first instance generates 6 columns, as I'm pretty sure that count includes the placeholder for the column key, because your PRIMARY KEY could point to an empty value (as your 2nd example key does).
But what about the 2nd version for userid 1002?
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)
Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).
So what does this all mean for you? Well, a few things:
This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example), you really can't without DELETEing and recreating the row. Although from a logging perspective, that's probably ok.
Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you were concerned about querying and sorting your data, it would be important to understand that you would have to query for each specific userid for sort order to make any difference.
The biggest issue I see, is that right now you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed more than 2 billion in a year, or whatever).
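A minimal sketch of that bucketing idea (year_bucket is an assumed name, not part of your original schema):
CREATE TABLE log_date_bucketed (
    userid BIGINT,
    year_bucket INT,      -- e.g. 2015; caps partition growth per user
    time TIMEUUID,
    category TEXT,
    subcategory TEXT,
    itemid TEXT,
    count INT,
    price INT,
    PRIMARY KEY ((userid, year_bucket), time, category, subcategory, itemid, count, price)
);
The trade-off is that reads must now supply both userid and the bucket, so this works best when queries can be scoped to a time window.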
It looks to me like your 2nd option might be the better choice. But honestly for what you're doing, either of them will probably work ok.