Overview
I would like to determine the correct schema in cassandra for financial tick data.
Data and schema
I have the following sample data in csv:
SYMBOL,DATE,TIME,PRICE,SIZE
A,2011-01-03,9:28:00,41.46,200
A,2011-01-03,9:28:00,41.46,100
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.55,100
A,2011-01-03,9:30:19,41.55,200
A,2011-01-03,9:30:23,41.5169,100
A,2011-01-03,9:30:29,41.44,66534
A,2011-01-03,9:30:29,41.45,225
A,2011-01-03,9:30:30,41.44,100
A,2011-01-03,9:30:30,41.43,100
A,2011-01-03,9:30:30,41.49,100
A,2011-01-03,9:30:30,41.45,200
and I store into the following table:
CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
PRIMARY KEY ((symbol,date),time)
);
This a slice of a SELECT of the table:
symbol | date | time | price | size
--------+------------+--------------------+---------+-------
A | 2011-01-03 | 09:28:00.000000000 | 41.46 | 100
A | 2011-01-03 | 09:30:00.000000000 | 41.56 | 1300
A | 2011-01-03 | 09:30:19.000000000 | 41.55 | 200
A | 2011-01-03 | 09:30:23.000000000 | 41.5169 | 100
A | 2011-01-03 | 09:30:29.000000000 | 41.45 | 66534
Use case
The data will be written to Cassandra once, and mostly read with conditions on date and symbol, e.g. a set of symbols for a given time-period.
Questions
The tuple (symbol,date,time) are not a proper PRIMARY KEY, since my granularity is limited to seconds. Hence, the COPY FROM e.g. drops the second row of the csv during the import due to the repetition in the key. How can I preserve the record?
Assuming the PRIMARY KEY is unique, how can I avoid storing repeated values of SYMBOL and DATE? Or is partitioning taking care of that under the hood?
I was thinking to use the following schema:
CREATE TABLE tickdata (
symbol text,
date date,
time blob,
price blob,
size blob,
PRIMARY KEY ((symbol,date))
);
to store raw data. Is this the correct way to address the points above?
The data is NOT ordered according to the definition of the PRIMARY KEY when I SELECT it. Is that related to the non-uniqueness problem entioned above?
Should I stick with my binary file-store which keeps a map of symbols and dates and loads the relevant files on request? This avoids repeating symbol and date for each row and is indifferent to limited granularity (repetition) of the timestamp.
The tuple (symbol,date,time) are not a proper PRIMARY KEY, since my
granularity is limited to seconds. Hence, the COPY FROM e.g. drops the
second row of the csv during the import due to the repetition in the
key. How can I preserve the record?
The primary key in your first table definition is ((symbol,date),time) NOT (symbol,date,time). Both are different in cassandra.
((symbol,date),time) => will store all records for same symbol (A) and date in one node. For same symbol(A) but other date might go on other node.
Row Key will be symbol+date
Physical Data layout (example)
|A_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|A_2011-01-04||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-04||time1.price & time1.value||time2.price & time2.value|
(symbol,date,time) => All records for same symbol will reside on one node. This might result in wide rows.
Row key will be symbol.
Physical Data layout (example)
|A||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|
|B||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|
To Avoid dropping of records you can add one more column like uuid or timeuuid
CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
id timeuuid
PRIMARY KEY ((symbol,date),time,id)
);
Assuming the PRIMARY KEY is unique, how can I avoid storing repeated
values of SYMBOL and DATE? Or is partitioning taking care of that
under the hood?
Based on physical storage structure explained above this issue is already taken care of.
The alternate schema you are talking about will have only 1 record for one symbol and a date. You will have to handle the blob at application side... which i think might be overhead.
The data is NOT ordered according to the definition of the PRIMARY KEY
when I SELECT it. Is that related to the non-uniqueness problem
entioned above?
By default data is ordered by clustering key in ascending order (in your case time). Though you can change order by changing CLUSTERING ORDER BY property of table to descending.
Example:
CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
id timeuuid
PRIMARY KEY ((symbol,date),time,id)
) WITH CLUSTERING ORDER BY(time desc,id desc);
Should I stick with my binary file-store which keeps a map of symbols
and dates and loads the relevant files on request? This avoids
repeating symbol and date for each row and is indifferent to limited
granularity (repetition) of the timestamp.
you can decide this on your own :)
Related
i want to store every second one value to a table. Therefore i testet two approches against each other. If I have understood correctly, the data should be stored internally almost identical.
Wide-Row
CREATE TABLE timeseries (
id int,
date date,
timestamp timestamp,
value decimal,
PRIMARY KEY ((id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC) AND
compaction={'class': 'DateTieredCompactionStrategy'}
and compression = { 'sstable_compression' : 'DeflateCompressor' };
Skinny Row
CREATE TABLE timeseries(
id int,
date date,
"0" decimal, "1" decimal,"2" decimal, -- ... 86400 decimal values
-- each column index is the second of the day
PRIMARY KEY ((id, date))
)
Test:
10 different id's
1 million values (100.000 per id)
each values is increased by one minute
In my test the skinny row approach for a sinus function only consums half of the storage for 1 million values. Even the random test is significant. Can somebody explain this behaviour?
The only difference between these schema is the cell key
A sample cell of The wide row model :
["2017-06-09 15\\:05+0600:value","3",1496999149885944]
| | | |
timestamp column value timestamp
And A sample cell of the Skinny row model :
["0","3",1497019292686908]
| | |
column value timestamp
You can clearly see that wide row model cell key is timestamp value and column name of value. And for skinny model cell key is only column name.
The overhead of wide row model is the timestamp(8 bytes) and the size of column name (value).you can keep the column name small and instead of using timestamp, use int and put the seconds of the day, like your skinny row column name. This will save more space.
I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically write time of these two rows. the writetime()-function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field and additionally i'm of the impression they're fully not supported? Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it doing by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... seems like only bigint is reproducable
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind the concepts of partition keys used within angular brackets which can be simple (one key) or composite (multiple keys) for unique identification and clustering keys to sort data, the below have been observed.
Query using select: sufficient to query using all the partition key(s) provided, additionally can query using clustering key(s) but in the same order in which they have been mentioned in primary key during table creation.
Update using set or update: the update statement needs to have search/condition clauses which not only include all the partition key(s) but also all the clustering key(s)
Answering the question - Is there anyway this could happen under known circumstances?
Yes, it is possible when same data is inserted from different sources.
To explain further, incase one tries to insert data from code (API etc) into Cassandra and then tries inserting the same data from DataStax Studio/any tool used to perform direct querying, a duplicate record is inserted.
Incase the same data is being pushed multiple times either from code alone or querying tool alone or from another source used to do the same operation multiple times, the data behaves idempotently and is not inserted again.
The possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row pertaining to set of columns (since column based).
Note:
The above information of duplicacy incase same data is pushed from different sources has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create following experiment (please ignore and fix any syntax errors I will do).
This suppose to do table with composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is spitted and show as if there are several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails all together.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also no option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach, to only storing the latest value.
I used this schema:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
The problem is, I don't get only the latest results, but all the old values too.
Since you are storing in a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records, all you need to do is add 'LIMIT' to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
Would return you at most 10 records with the highest changedate. Even though you are using limit, you are still guaranteed to get the latest records since your data is ordered that way.
If I remove the changedate from the primary key, the select fails all together.
This is because you cannot order on a column that is not the clustering key(s) (the secondary part of the primary key) except maybe with a secondary index, which I would not recommend.
Updating the sensor values is also no option
Your update query is failing because it is not legal to include part of the primary key in 'set'. To make this work all you need to do is update your query to include changedate in the where clause, i.e.:
update overview set value = '5' and sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now())
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition with exception to the primary key. The primary key will now be 'customerid, sensorid' so you can only have 1 record per sensor. The process of creating separate tables is called denormalization and is a common use pattern particularly in Cassandra data modeling. When you insert sensor data you would now insert data into both 'sensors' and 'latest_sensor_data'.
CREATE TABLE latest_sensor_data (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid)
);
In cassandra 3.0 'materialized views' will be introduced which will make this unnecessary as you can use materialized views to accomplish this for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it more clear what the data is. Additionally you should change the primary key to 'customerid, changedate, sensorid' as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
I am wondering what happens when there are multiple Non-PK columns in a table. I've read this example:
http://johnsanda.blogspot.co.uk/2012/10/why-i-am-ready-to-move-to-cql-for.html
Which shows that with single column:
CREATE TABLE raw_metrics (
schedule_id int,
time timestamp,
value double,
PRIMARY KEY (schedule_id, time)
);
We get:
Now I wonder what happens when we have two columns:
CREATE TABLE raw_metrics (
schedule_id int,
time timestamp,
value1 double,
value2 int,
PRIMARY KEY (schedule_id, time)
);
Are we going to end up with something like:
row key columns...
123 1339707619:"value1" | 1339707679:"value2" | 1339707784:"value2"
...
or rather:
row key columns...
123 1339707619:"value1":"value2" | 1339707679:"value1":"value2" | 1339707784:"value1""value2"
...
etc. I guess what I am asking is if this is going to be a sparse table given that I only insert "value1" or "value2" at a time.
In such situations if I want to store more columns (one per each type, eg. double, int, date, etc) would it be better perhaps to have separate tables rather than storing everything in a single table?
This post might help in explaining what is happening when composite keys are created:
Cassandra Composite Columns - How are CompositeTypes chosen?
So essentially the table will look in the following way:
row key columns...
123 1339707619:"value1" | 1339707679:"value2" | 1339707784:"value2"
See also reference to secondary indexes:
http://wiki.apache.org/cassandra/SecondaryIndexes
I have to create and query a column family with composite key as [timestamp,long]. Also,
while querying I want to fire range query for timestamp (like timestamp between xxx and yyy) Is this possible ?
Currently I am doing something really funny (Which I know its not correct). I create keys with timestamp string for given range and concatenate with long.
like ,
1254345345435-1234
3423432423432-1234
1231231231231-9999
and pass set of keys to hector api. (so if i have date range for 1 month and I want every minute data, i create 30 * 24 * 60 * [number of secondary key - long])
I can solve concatenation issue with composite key. But query part is what I am trying to understand.
As far as I understood, As we are using RandomPartitioner we cannot really query based on range as keys are MD5 checksum. Whats ideal design for this kind of use case ?
my schema and requirements are as follows : (actual csh)
CREATE TABLE report(
ts timestamp,
user_id long,
svc1 long,
svc2 long,
svc3 long,
PRIMARY KEY(ts, user_id));
select from report where ts between (123445345435 and 32423423424) and user_id is in (123,567,987)
You cannot do range queries on the first component of a composite key. Instead, you should write a sentinel value such as a daystamp (the unix epoch at midnight on the current day) as the key, then write a composite column as timestamp:long. This way you can provide the keys that comprise your range, and slice on the timestamp component of the composite column.
Denormalize! You must model your schema in a manner that will enable the types of queries you wish to perform. We create a reverse (aka inverted, inverse) index for such scenarios.
CREATE TABLE report(
KEY uuid PRIMARY KEY,
svc1 bigint,
svc2 bigint,
svc3 bigint
);
CREATE TABLE ReportsByTime(
KEY ascii PRIMARY KEY
) with default_validation=uuid AND comparator=uuid;
CREATE TABLE ReportsByUser(
KEY bigint PRIMARY KEY
)with default_validation=uuid AND comparator=uuid;
See here for a nice explanation. What you are doing now is generating your own ascii key in the times table, to enable yourself to perform the range slice query you want - it doesn't have to be ascii though just something you can use to programmatically generate your own slice keys with.
You can use this approach to facilitate all of your queries, this likely isn't going to suit your application directly but the idea is the same. You can squeeze more out of this by adding meaningful values to the column keys of each table above.
cqlsh:tester> select * from report;
KEY | svc1 | svc2 | svc3
--------------------------------------+------+------+------
1381b530-1dd2-11b2-0000-242d50cf1fb5 | 332 | 333 | 334
13818e20-1dd2-11b2-0000-242d50cf1fb5 | 222 | 223 | 224
13816710-1dd2-11b2-0000-242d50cf1fb5 | 112 | 113 | 114
cqlsh:tester> select * from times;
KEY,1212051037 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5 | 1381b530-1dd2-11b2-0000-242d50cf1fb5,1381b530-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051035 | 13816710-1dd2-11b2-0000-242d50cf1fb5,13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051036 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
cqlsh:tester> select * from users;
KEY | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
-------------+--------------------------------------+--------------------------------------
23123123231 | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
Why don't you use wide rows, where Key is timestamp and Column Name as Long-Value then you can pass multiple key's (timestamp's) to getKeySlice and select multiple column's to withColumnSlice by there name (which is id).
As I don't know what is column name and value, I feel this can help you. Can you provide more details of your column family definition.