Cassandra Skinny vs Wide Row for time series - consumption

I want to store one value per second in a table, so I tested two approaches against each other. If I have understood correctly, the data should be stored internally in an almost identical way.
Wide-Row
CREATE TABLE timeseries (
id int,
date date,
timestamp timestamp,
value decimal,
PRIMARY KEY ((id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND compaction = {'class': 'DateTieredCompactionStrategy'}
  AND compression = {'sstable_compression': 'DeflateCompressor'};
Skinny Row
CREATE TABLE timeseries(
id int,
date date,
"0" decimal, "1" decimal,"2" decimal, -- ... 86400 decimal values
-- each column index is the second of the day
PRIMARY KEY ((id, date))
)
Test:
10 different ids
1 million values (100,000 per id)
each value's timestamp is advanced by one minute
In my test the skinny-row approach consumes only half of the storage for 1 million values of a sine function. Even the test with random values shows a significant difference. Can somebody explain this behaviour?

The only difference between these schemas is the cell key.
A sample cell of the wide-row model:
["2017-06-09 15\\:05+0600:value","3",1496999149885944]
  (cell key = clustering timestamp + column name "value", cell value = "3", write timestamp = 1496999149885944)
And a sample cell of the skinny-row model:
["0","3",1497019292686908]
  (cell key = column name "0", cell value = "3", write timestamp = 1497019292686908)
You can clearly see that in the wide-row model the cell key is the timestamp plus the column name ("value"), while in the skinny-row model the cell key is only the column name.
The overhead of the wide-row model is therefore the timestamp (8 bytes) plus the size of the column name ("value"). You can keep the column name small and, instead of a timestamp, use an int holding the second of the day, like the skinny-row column names. This will save more space.
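A minimal sketch of that suggestion, assuming the second of the day can be computed on the client; the table name and the short column names sec and v are illustrative only:
CREATE TABLE timeseries_by_second (
    id int,
    date date,
    sec int,       -- second of the day (0..86399), replaces the timestamp clustering column
    v decimal,     -- short value column name to keep the cell key small
    PRIMARY KEY ((id, date), sec)
) WITH CLUSTERING ORDER BY (sec DESC);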

Related

Cassandra asking to allow filter even after mentioning all partition key in query?

I have been trying to model data in Cassandra and to filter it by date, following an answer here on SO; the second answer there does not use ALLOW FILTERING.
This is my current schema:
CREATE TABLE Banking.BankData (
    acctID text,
    email text,
    transactionDate date,
    transactionAmount double,
    balance double,
    currentTime timestamp,
    PRIMARY KEY ((acctID, transactionDate), currentTime)
) WITH CLUSTERING ORDER BY (currentTime DESC);
Now I have inserted data with
INSERT INTO banking.BankData(acctID, email, transactionDate, transactionAmount, balance, currentTime) values ('11', 'alpitanand20#gmail.com','2013-04-03',10010, 10010, toTimestamp(now()));
Now when I try to query, like
SELECT * FROM banking.BankData WHERE acctID = '11' AND transactionDate > '2012-04-03';
It asks me to add ALLOW FILTERING; however, in the link mentioned above, that was not the case.
The final requirement is to get data by year, month, week and so on; that's why I partitioned it by day, but the date range query is not working.
Any suggestion on remodelling, or am I doing something wrong?
Thanks
Cassandra supports only equality predicates on partition key columns, so you can use only the = operation on them.
Range predicates (>, <, >=, <=) are supported only on clustering columns, and only on the last clustering column of the condition.
For example, if you have the following primary key: (pk, c1, c2, c3), you can have range predicates as follows:
where pk = xxxx and c1 > yyyy
where pk = xxxx and c1 = yyyy and c2 > zzzz
where pk = xxxx and c1 = yyyy and c2 = zzzz and c3 > wwww
but you can't have:
where pk = xxxx and c2 > zzzz
where pk = xxxx and c3 > zzzz
because you need to restrict the previous clustering columns before using a range operation.
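For illustration, a table with that primary key shape might look like this (a sketch; all names are placeholders):
CREATE TABLE example (
    pk int,
    c1 int,
    c2 int,
    c3 int,
    val text,
    PRIMARY KEY (pk, c1, c2, c3)
);
-- allowed: all preceding clustering columns are restricted with =
SELECT * FROM example WHERE pk = 1 AND c1 = 2 AND c2 > 3;
-- rejected without ALLOW FILTERING, because c1 is not restricted
SELECT * FROM example WHERE pk = 1 AND c2 > 3;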
If you want to perform a range query on this data, you need to declare the corresponding column as a clustering column, like this:
PRIMARY KEY(acctID, transactionDate, currentTime )
In this case you can perform your query. But because you have a time component, you can simply do:
PRIMARY KEY(acctID, currentTime )
and do the query like this:
SELECT * FROM banking.BankData WHERE acctID = '11'
AND currentTime > '2012-04-03T00:00:00Z';
But you need to take 2 things into consideration:
your primary key should be unique - you may need to add another clustering column, such as a transaction ID (for example, of uuid type), so that even if 2 transactions happen in the same millisecond they won't overwrite each other;
if you have a lot of transactions per account, you may need to add another column to the partition key, for example year, or year/month, so you don't get big partitions.
P.S. In the linked answer the non-equality (range) operation is possible because ts is a clustering column.
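Putting both considerations together, one possible remodel could look like the sketch below; the year bucket column, the id column and the table name are assumptions for illustration, not part of the original schema:
CREATE TABLE banking.BankDataByYear (
    acctID text,
    year int,                  -- bucket to keep partitions bounded
    currentTime timestamp,
    id timeuuid,               -- makes the primary key unique
    email text,
    transactionDate date,
    transactionAmount double,
    balance double,
    PRIMARY KEY ((acctID, year), currentTime, id)
) WITH CLUSTERING ORDER BY (currentTime DESC, id DESC);

SELECT * FROM banking.BankDataByYear
WHERE acctID = '11' AND year = 2013
AND currentTime > '2013-04-03T00:00:00Z';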

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows.
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS: Edited the question to make it clearer.
The exact requirement cannot be met as stated, but we can get close to it with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering column of your table.
Then, you need a partition key. You could choose a fixed value, but that can cause problems once the partition accumulates too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you will need to know the value of the partition key:
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with maximum start_time
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
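Since the table is clustered by start_time DESC, both steps can also be combined into a single statement (a sketch, assuming the relevant year is known):
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time
LIMIT 1;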
Create two separate tables like below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables; for the second table, always use a static partition value like 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table the data will be upserted, so you will get the latest value whenever you select.
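A minimal sketch of how the two tables could be used together (the timestamp and values are just example data):
-- record every reading in the history table
INSERT INTO data (start_time, value) VALUES ('2017-06-09 15:05:00', 10);
-- overwrite the single current value (always the same partition key)
INSERT INTO current_value (partition, value) VALUES (1, 10);
-- read the latest value back
SELECT value FROM current_value WHERE partition = 1;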

Schema for tick data on cassandra

Overview
I would like to determine the correct schema in cassandra for financial tick data.
Data and schema
I have the following sample data in csv:
SYMBOL,DATE,TIME,PRICE,SIZE
A,2011-01-03,9:28:00,41.46,200
A,2011-01-03,9:28:00,41.46,100
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.55,100
A,2011-01-03,9:30:19,41.55,200
A,2011-01-03,9:30:23,41.5169,100
A,2011-01-03,9:30:29,41.44,66534
A,2011-01-03,9:30:29,41.45,225
A,2011-01-03,9:30:30,41.44,100
A,2011-01-03,9:30:30,41.43,100
A,2011-01-03,9:30:30,41.49,100
A,2011-01-03,9:30:30,41.45,200
and I store into the following table:
CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
PRIMARY KEY ((symbol,date),time)
);
This is a slice of a SELECT on the table:
symbol | date | time | price | size
--------+------------+--------------------+---------+-------
A | 2011-01-03 | 09:28:00.000000000 | 41.46 | 100
A | 2011-01-03 | 09:30:00.000000000 | 41.56 | 1300
A | 2011-01-03 | 09:30:19.000000000 | 41.55 | 200
A | 2011-01-03 | 09:30:23.000000000 | 41.5169 | 100
A | 2011-01-03 | 09:30:29.000000000 | 41.45 | 66534
Use case
The data will be written to Cassandra once, and mostly read with conditions on date and symbol, e.g. a set of symbols for a given time-period.
Questions
The tuple (symbol,date,time) is not a proper PRIMARY KEY, since my granularity is limited to seconds. Hence COPY FROM, for example, drops the second row of the CSV during the import due to the repetition in the key. How can I preserve the record?
Assuming the PRIMARY KEY is unique, how can I avoid storing repeated values of SYMBOL and DATE? Or is partitioning taking care of that under the hood?
I was thinking to use the following schema:
CREATE TABLE tickdata (
symbol text,
date date,
time blob,
price blob,
size blob,
PRIMARY KEY ((symbol,date))
);
to store raw data. Is this the correct way to address the points above?
The data is NOT ordered according to the definition of the PRIMARY KEY when I SELECT it. Is that related to the non-uniqueness problem mentioned above?
Should I stick with my binary file-store which keeps a map of symbols and dates and loads the relevant files on request? This avoids repeating symbol and date for each row and is indifferent to limited granularity (repetition) of the timestamp.
The tuple (symbol,date,time) is not a proper PRIMARY KEY, since my
granularity is limited to seconds. Hence, the COPY FROM e.g. drops the
second row of the csv during the import due to the repetition in the
key. How can I preserve the record?
The primary key in your first table definition is ((symbol,date),time), NOT (symbol,date,time). They are different things in Cassandra.
((symbol,date),time) => stores all records for the same symbol (A) and date on one node. Records for the same symbol (A) but another date may go to another node.
The row key will be symbol+date.
Physical Data layout (example)
|A_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|A_2011-01-04||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-04||time1.price & time1.value||time2.price & time2.value|
(symbol,date,time) => all records for the same symbol will reside on one node. This might result in very wide rows.
The row key will be symbol.
Physical Data layout (example)
|A||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|
|B||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|
To avoid dropping records you can add one more column, such as a uuid or timeuuid:
CREATE TABLE tickdata (
    symbol text,
    date date,
    time time,
    price float,
    size int,
    id timeuuid,
    PRIMARY KEY ((symbol, date), time, id)
);
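With the extra id column, rows that share the same second no longer collide. For example (a sketch, using now() to generate the timeuuid; note that COPY FROM would then need the id values supplied in the CSV itself):
INSERT INTO tickdata (symbol, date, time, price, size, id)
VALUES ('A', '2011-01-03', '09:28:00', 41.46, 200, now());
INSERT INTO tickdata (symbol, date, time, price, size, id)
VALUES ('A', '2011-01-03', '09:28:00', 41.46, 100, now());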
Assuming the PRIMARY KEY is unique, how can I avoid storing repeated
values of SYMBOL and DATE? Or is partitioning taking care of that
under the hood?
Based on the physical storage structure explained above, this is already taken care of.
The alternative schema you are talking about will have only one record per symbol and date. You would have to handle the blobs on the application side... which I think would be an overhead.
The data is NOT ordered according to the definition of the PRIMARY KEY
when I SELECT it. Is that related to the non-uniqueness problem
mentioned above?
By default data is ordered by the clustering key in ascending order (in your case time), though you can reverse this by setting the CLUSTERING ORDER BY property of the table to descending.
Example:
CREATE TABLE tickdata (
    symbol text,
    date date,
    time time,
    price float,
    size int,
    id timeuuid,
    PRIMARY KEY ((symbol, date), time, id)
) WITH CLUSTERING ORDER BY (time DESC, id DESC);
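With that clustering order, a simple query (a sketch) returns the most recent ticks for a symbol and date first:
SELECT time, price, size FROM tickdata
WHERE symbol = 'A' AND date = '2011-01-03'
LIMIT 5;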
Should I stick with my binary file-store which keeps a map of symbols
and dates and loads the relevant files on request? This avoids
repeating symbol and date for each row and is indifferent to limited
granularity (repetition) of the timestamp.
you can decide this on your own :)

Understanding Cassandra's storage overhead

I have been reading this section of the Cassandra docs and found the following a little puzzling:
Determine column overhead:
regular_total_column_size = column_name_size + column_value_size + 15
counter - expiring_total_column_size = column_name_size + column_value_size + 23
Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
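Taken literally, the formula means that, for example, a cell for a column named end_date (8 bytes of name) holding a timestamp value (8 bytes) would cost roughly 8 + 8 + 15 = 31 bytes.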
The way I interpret the above for a CQL3 defined schema such as:
CREATE TABLE mykeyspace.mytable(
id text,
report_id text,
subset_id text,
report_date timestamp,
start_date timestamp,
end_date timestamp,
subset_descr text,
x int,
y double,
z int,
PRIMARY KEY (id, report_id, subset_id)
);
is that each row will contain the metadata for the column names, e.g., the strings report_date, start_date, end_date, etc. and their type along with the data. However, it's not clear to me what it means that each row in a table can have different column names. This sounds wrong to me given the schema above is totally static, i.e., Cassandra 2.0 will most certainly complain if I try to write:
INSERT INTO mykeyspace.mytable (id, report_id , subset_id, x, y, z, w)
VALUES ( 'asd','qwe','rty',100,1.234,12, 123.123);
Bad Request: Unknown identifier w
Now it looks to me like column names are fixed given this table schema and thus the metadata should not need to be stored per each row. I am guessing either the phrasing in the documentation is outdated (it's the same as Cassandra 1.2) or I'm misunderstanding some core concept at work here.
Can anybody clarify? Bottom line: do I have to worry about the length of the names of my columns or not?
We have been playing it safe and used single-character names where possible (so the above columns would actually be i, r, s, dr, ds, de, sd, ...), but that is hard for humans to read and can be confusing to work with.
The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what actually ends up being saved on disk.
Here is the example for your situation
[
{"key": "4b6579","columns": [
["rid1:ssid1:","",1401469033325000],
["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:subset_descr","descr",1401469033325000],
["rid1:ssid1:x","1",1401469033325000],
["rid1:ssid1:y","5.5",1401469033325000],
["rid1:ssid1:z","1",1401469033325000],
["rid2:ssid2:","",1401469938599000],
["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:subset_descr","descr",1401469938599000],
["rid2:ssid2:x","1",1401469938599000],
["rid2:ssid2:y","5.5",1401469938599000],
["rid2:ssid2:z","1",1401469938599000]
]}
]
The value of the partition key is saved once per partition (per sstable), as you can see above; its column name doesn't matter at all since it is implicit given the table. The column names for the clustering columns are also not present, because with C* you aren't allowed to insert without specifying all portions of the key.
What's left does have the column name; this is needed in case a partial update to a row is made, so it can be saved without the rest of the row information. Imagine an update to a single field of a row: to indicate which field it is, C* currently uses the column name, but there are tickets to change this to a smaller representation.
https://issues.apache.org/jira/browse/CASSANDRA-4175
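For example, a partial update like the following (a sketch against the schema above) writes only the x cell, which is why each cell must carry its own column name:
UPDATE mykeyspace.mytable SET x = 2
WHERE id = 'Key' AND report_id = 'rid1' AND subset_id = 'ssid1';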
To generate this output:
cqlsh
CREATE TABLE mykeyspace.mytable( id text, report_id text, subset_id text, report_date timestamp, start_date timestamp, end_date timestamp, subset_descr text, x int, y double, z int, PRIMARY KEY (id, report_id, subset_id) );
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db

Multiple columns in Cassandra tables

I am wondering what happens when there are multiple non-PK columns in a table. I've read this example:
http://johnsanda.blogspot.co.uk/2012/10/why-i-am-ready-to-move-to-cql-for.html
which shows that with a single value column:
CREATE TABLE raw_metrics (
schedule_id int,
time timestamp,
value double,
PRIMARY KEY (schedule_id, time)
);
We get one wide storage row per schedule_id, with one cell per sample whose cell name is the timestamp plus the column name "value".
Now I wonder what happens when we have two columns:
CREATE TABLE raw_metrics (
schedule_id int,
time timestamp,
value1 double,
value2 int,
PRIMARY KEY (schedule_id, time)
);
Are we going to end up with something like:
row key columns...
123 1339707619:"value1" | 1339707679:"value2" | 1339707784:"value2"
...
or rather:
row key columns...
123 1339707619:"value1":"value2" | 1339707679:"value1":"value2" | 1339707784:"value1":"value2"
...
etc. I guess what I am asking is if this is going to be a sparse table given that I only insert "value1" or "value2" at a time.
In such situations, if I want to store more columns (one per type, e.g. double, int, date, etc.), would it perhaps be better to have separate tables rather than storing everything in a single table?
This post might help in explaining what is happening when composite keys are created:
Cassandra Composite Columns - How are CompositeTypes chosen?
So essentially the table will look the following way:
row key columns...
123 1339707619:"value1" | 1339707679:"value2" | 1339707784:"value2"
See also reference to secondary indexes:
http://wiki.apache.org/cassandra/SecondaryIndexes
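To see the sparseness in practice, consider inserting only one value column at a time (a sketch; the timestamps are example data). Each INSERT creates cells only for the columns it actually names, so no storage is wasted on the missing value:
INSERT INTO raw_metrics (schedule_id, time, value1) VALUES (123, '2012-06-14 21:00:19', 1.5);
INSERT INTO raw_metrics (schedule_id, time, value2) VALUES (123, '2012-06-14 21:01:19', 2);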
