I'm inserting into a Cassandra table with timestamp columns. The data I have comes with microsecond precision, so the time data string looks like this:
2015-02-16T18:00:03.234+00:00
However, in cqlsh when I run a select query the microsecond data is not shown, I can only see time down to second precision. The 234 microseconds data is not shown.
I guess I have two questions:
1) Does Cassandra capture microseconds with timestamp data type? My guess is yes?
2) How can I see that with cqlsh to verify?
Table definition:
create table data (
datetime timestamp,
id text,
type text,
data text,
primary key (id, type, datetime)
)
with compaction = {'class' : 'DateTieredCompactionStrategy'};
Insert query ran with Java PreparedStatment:
insert into data (datetime, id, type, data) values(?, ?, ?, ?);
Select query was simply:
select * from data;
In an effort to answer your questions, I did a little digging on this one.
Does Cassandra capture microseconds with timestamp data type?
Microseconds no, milliseconds yes. If I create your table, insert a row, and try to query it by the truncated time, it doesn't work:
aploetz#cqlsh:stackoverflow> INSERT INTO data (datetime, id, type, data)
VALUES ('2015-02-16T18:00:03.234+00:00','B26354','Blade Runner','Deckard- Filed and monitored.');
aploetz#cqlsh:stackoverflow> SELECT * FROM data
WHERE id='B26354' AND type='Blade Runner' AND datetime='2015-02-16 12:00:03-0600';
id | type | datetime | data
----+------+----------+------
(0 rows)
But when I query for the same id and type values while specifying milliseconds:
aploetz#cqlsh:stackoverflow> SELECT * FROM data
WHERE id='B26354' AND type='Blade Runner' AND datetime='2015-02-16 12:00:03.234-0600';
id | type | datetime | data
--------+--------------+--------------------------+-------------------------------
B26354 | Blade Runner | 2015-02-16 12:00:03-0600 | Deckard- Filed and monitored.
(1 rows)
So the milliseconds are definitely there. There was a JIRA ticket created for this issue (CASSANDRA-5870), but it was resolved as "Won't Fix."
How can I see that with cqlsh to verify?
One possible way to actually verify that the milliseconds are indeed there, is to nest the timestampAsBlob() function inside of blobAsBigint(), like this:
aploetz#cqlsh:stackoverflow> SELECT id, type, blobAsBigint(timestampAsBlob(datetime)),
data FROM data;
id | type | blobAsBigint(timestampAsBlob(datetime)) | data
--------+--------------+-----------------------------------------+-------------------------------
B26354 | Blade Runner | 1424109603234 | Deckard- Filed and monitored.
(1 rows)
While not optimal, here you can clearly see the millisecond value of "234" on the very end. This becomes even more apparent if I add a row for the same timestamp, but without milliseconds:
aploetz#cqlsh:stackoverflow> INSERT INTO data (id, type, datetime, data)
VALUES ('B25881','Blade Runner','2015-02-16T18:00:03+00:00','Holden- Fine as long as nobody unplugs him.');
aploetz#cqlsh:stackoverflow> SELECT id, type, blobAsBigint(timestampAsBlob(datetime)),
... data FROM data;
id | type | blobAsBigint(timestampAsBlob(datetime)) | data
--------+--------------+-----------------------------------------+---------------------------------------------
B25881 | Blade Runner | 1424109603000 | Holden- Fine as long as nobody unplugs him.
B26354 | Blade Runner | 1424109603234 | Deckard- Filed and monitored.
(2 rows)
You can configure the output format of datetime objects in the .cassandra/cqlshrc file, using python's 'strftime' syntax.
Unfortunately, the %f directive for microseconds (there does not seem to be a directive for milliseconds) does not work for older python versions, which means you have to fall back to the blobAsBigint(timestampAsBlob(date)) solution.
I think by "microseconds" (e.g 03.234567) you mean "milliseconds" (e.g. (03.234).
The issue here was a cqlsh bug that failed to support fractional seconds when dealing with timestamps.
So, while your millisecond value was preserved in the actual persistence layer (cassandra), the shell (cqlsh) failed to display them.
This was true even if you were to change time_format in .cqlshrc to display fractional seconds with an %f directive (e.g. %Y-%m-%d %H:%M:%S.%f%z). In this configuration cqlsh would render 3.000000 for our 3.234 value, since the issue was in how cqlsh loaded the datetime objects without loading the partial seconds.
That all being said, this issue was fixed in CASSANDRA-10428, and released in Cassandra 3.4.
It is impossible to show microseconds (1 millionth of a second) using the Cassandra datatype 'timestamp' because the greatest precision available for that datatype is milliseconds (1 thousandth of a second).
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/timestamp_type_r.html
Values for the timestamp type are encoded as 64-bit signed integers
representing a number of milliseconds since the standard base time
known as the epoch
Some related code:
cqlsh> CREATE KEYSPACE udf
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
cqlsh> USE udf;
cqlsh:udf> CREATE OR REPLACE FUNCTION udf.timeuuid_as_us ( t timeuuid )
RETURNS NULL ON NULL INPUT
RETURNS bigint LANGUAGE JAVA AS '
long msb = t.getMostSignificantBits();
return
( ((msb >> 32) & 0x00000000FFFFFFFFL)
| ((msb & 0x00000000FFFF0000L) << 16)
| ((msb & 0x0000000000000FFFL) << 48)
) / 10
- 12219292800000000L;
';
cqlsh:udf> SELECT
toUnixTimestamp(now()) AS now_ms
, udf.timeuuid_as_us(now()) AS now_us
FROM system.local;
now_ms | now_us
---------------+------------------
1525995892841 | 1525995892841000
Related
I'm new to Cassandra and I'm trying to define a data model that fits my requirements.
I have a sensor that collects one value every millisecond and I have to store those data in Cassandra. The queries that I want to perform are:
1) Give me all the sensor values from - to these timestamp values
2) Tell me when this range of values was recorded
I'm not sure if there exist a common schema that can satisfy both queries because I want to perform range queries on both values. For the first query I should use something like:
CREATE TABLE foo (
value text,
timestamp timestamp,
PRIMARY KEY (value, timestamp));
but then for the second query I need the opposite since I can't do range queries on the partition key without using a token that restricts the timestamp:
CREATE TABLE foo (
value text,
timestamp timestamp,
PRIMARY KEY (timestamp, value));
So do I need two tables for this? Or there exist another way?
Thanks
PS: I need to be as fast as possible while reading
I have a sensor that collects one value every millisecond and I have to store those data in Cassandra.
The main problem I see here, is that you're going to run into Cassandra's limit of 2 billion col values per partition fairly quickly. DataStax's Patrick McFadin has a good example for weather station data (Getting Started with Time Series Data Modeling) that seems to fit here. If I apply it to your model, it looks something like this:
CREATE TABLE fooByTime (
sensor_id text,
day text,
timestamp timestamp,
value text,
PRIMARY KEY ((sensor_id,day),timestamp)
);
This will partition on both sensor_id and day, while sorting rows within the partition by timestamp. So you could query like:
> SELECT * FROM fooByTime WHERE sensor_id='5' AND day='20151002'
AND timestamp > '2015-10-02 00:00:00' AND timestamp < '2015-10-02 19:00:00';
sensor_id | day | timestamp | value
-----------+----------+--------------------------+-------
5 | 20151002 | 2015-10-02 13:39:22-0500 | 24
5 | 20151002 | 2015-10-02 13:49:22-0500 | 23
And yes, the way to model in Cassandra, is to have one table for each query pattern. So your second table where you want to range query on value might look something like this:
CREATE TABLE fooByValues (
sensor_id text,
day text,
timestamp timestamp,
value text,
PRIMARY KEY ((sensor_id,day),value)
);
And that would support queries like:
> SELECT * FROm foobyvalues WHERE sensor_id='5'
AND day='20151002' AND value > '20' AND value < '25';
sensor_id | day | value | timestamp
-----------+----------+-------+--------------------------
5 | 20151002 | 22 | 2015-10-02 14:49:22-0500
5 | 20151002 | 23 | 2015-10-02 13:49:22-0500
5 | 20151002 | 24 | 2015-10-02 13:39:22-0500
I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On<? where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small the driver ends up retrieving multiple pages, which could hurt performance. If it is too large you end up throwing away a lot of rows and again this could hurt performance by transfering more data than is necessary. In practice you can probably find a good compromise.
I have a table like this:
CREATE TABLE mytable (
user_id int,
device_id ascii,
record_time timestamp,
timestamp timeuuid,
info_1 text,
info_2 int,
PRIMARY KEY (user_id, device_id, record_time, timestamp)
);
When I ask Cassandra to delete a record (an entry in the columnfamily) like this:
DELETE from my_table where user_id = X and device_id = Y and record_time = Z and timestamp = XX;
it returns without an error, but when I query again the record is still there. Now if I try to delete a whole row like this:
DELETE from my_table where user_id = X
It works and removes the whole row, and querying again immediately doesn't return any more data from that row.
What I am doing wrong? How you can remove a record in Cassandra?
Thanks
Ok, here is my theory as to what is going on. You have to be careful with timestamps, because they will store data down to the millisecond. But, they will only display data to the second. Take this sample table for example:
aploetz#cqlsh:stackoverflow> SELECT id, datetime FROM data;
id | datetime
--------+--------------------------
B25881 | 2015-02-16 12:00:03-0600
B26354 | 2015-02-16 12:00:03-0600
(2 rows)
The datetimes (of type timestamp) are equal, right? Nope:
aploetz#cqlsh:stackoverflow> SELECT id, blobAsBigint(timestampAsBlob(datetime)),
datetime FROM data;
id | blobAsBigint(timestampAsBlob(datetime)) | datetime
--------+-----------------------------------------+--------------------------
B25881 | 1424109603000 | 2015-02-16 12:00:03-0600
B26354 | 1424109603234 | 2015-02-16 12:00:03-0600
(2 rows)
As you are finding out, this becomes problematic when you use timestamps as part of your PRIMARY KEY. It is possible that your timestamp is storing more precision than it is showing you. And thus, you will need to provide that hidden precision if you will be successful in deleting that single row.
Anyway, you have a couple of options here. One, find a way to ensure that you are not entering more precision than necessary into your record_time. Or, you could define record_time as a timeuuid.
Again, it's a theory. I could be totally wrong, but I have seen people do this a few times. Usually it happens when they insert timestamp data using dateof(now()) like this:
INSERT INTO table (key, time, data) VALUES (1,dateof(now()),'blah blah');
CREATE TABLE worker_login_table (
worker_id text,
logged_in_time timestamp,
PRIMARY KEY (worker_id, logged_in_time)
);
INSERT INTO worker_login_table (worker_id, logged_in_time)
VALUES ("worker_1",toTimestamp(now()));
after 1 hour executed the above insert statement once again
select * from worker_login_table;
worker_id| logged_in_time
----------+--------------------------
worker_1 | 2019-10-23 12:00:03+0000
worker_1 | 2015-10-23 13:00:03+0000
(2 rows)
Query the table to get absolute timestamp
select worker_id, blobAsBigint(timestampAsBlob(logged_in_time )), logged_in_time from worker_login_table;
worker_id | blobAsBigint(timestampAsBlob(logged_in_time)) | logged_in_time
--------+-----------------------------------------+--------------------------
worker_1 | 1524109603000 | 2019-10-23 12:00:03+0000
worker_1 | 1524209403234 | 2019-10-23 13:00:03+0000
(2 rows)
The below command will not delete the entry from Cassandra as the precise value of timestamp is required to delete the entry
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='2019-10-23 12:00:03+0000';
By using the timestamp from blob we can delete the entry from Cassandra
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='1524209403234';
Currently I am developing a solution in the field of time-series data. Within these data we have: an ID, a value and a timestamp.
So here it comes: the value might be of type boolean, float or string. I consider three approaches:
a) For every data type a distinct table, all sensor values of type boolean into a table, all sensor values of type string into another. The obvious disadvantage is that you have to know where to look for a certain sensor.
b) A meta-column describing the data type plus all values of type string. The obvious disadvantage is the data conversion e.g. for calculating the MAX, AVG and so on.
c) Having three columns of different type but only one will be with a value per record. The disadvantage is 500000 sensors firing every 100ms ... plenty of unused space.
As my knowledge is limited any help is appreciated.
500000 sensors firing every 100ms
First thing, is to make sure that you partition properly, to make sure that you don't exceed the limit of 2 billion columns per partition.
CREATE TABLE sensorData (
stationID uuid,
datebucket text,
recorded timeuuid,
intValue bigint,
strValue text,
blnValue boolean,
PRIMARY KEY ((stationID,datebucket),recorded));
With a half-million every 100ms, that's 500 million in a second. So you'll want to set your datebucket to be very granular...down to the second. Next I'll insert some data:
stationid | datebucket | recorded | blnvalue | intvalue | strvalue
--------------------------------------+---------------------+--------------------------------------+----------+----------+----------
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6338df40-e929-11e4-88c8-21b264d4c94d | null | 59 | null
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 633e0f60-e929-11e4-88c8-21b264d4c94d | null | null | CD
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6342f160-e929-11e4-88c8-21b264d4c94d | True | null | null
3221b1d7-13b4-40d4-b41c-8d885c63494f | 2015-04-22T14:56:19 | a48bbdf0-e929-11e4-88c8-21b264d4c94d | False | null | null
...plenty of unused space.
You might be suprised. With the CQL output of SELECT * above, it appears that there are null values all over the place. But watch what happens when we use the cassandra-cli tool to view how the data is stored "under the hood:"
RowKey: 3221b1d7-13b4-40d4-b41c-8d885c63494f:2015-04-22T14\:56\:19
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:, value=, timestamp=1429733297352000)
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:blnvalue, value=00, timestamp=1429733297352000)
As you can see, the data (above) stored for the CQL row where stationid=3221b1d7-13b4-40d4-b41c-8d885c63494f AND datebucket='2015-04-22T14:56:19' shows that blnValue has a value of 00 (false). But also notice that intValue and strValue are not present. Cassandra doesn't force a null value like an RDBMS does.
The obvious disadvantage is the data conversion e.g. for calculating the MAX, AVG and so on.
Perhaps you already know this, but I did want to mention that Cassandra CQL does not contain definitions for MAX, AVG or any other data aggregation function. You'll either need to do that client-side, or implement Apache-Spark to perform OLAP-type queries.
Be sure to read through Patrick McFadin's Getting Started With Time Series Data Modeling. It contains good suggestions on how to solve time series problems like this.
I am trying to INSERT (also UPDATE and DELETE) data in Cassandra using timestamp, but no change occur to the table. Any help please?
BEGIN BATCH
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('1',null,null,null) USING TIMESTAMP 0;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('2',null,null,null) USING TIMESTAMP 1;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('3',null,null,null) USING TIMESTAMP 2;
APPLY BATCH;
I think you're falling into Cassandra's "control of timestamps". Operations in C* are (in effect1) executed only if the timestamp of the new operation is "higher" than previous one.
Let's see an example. Given the following insert
INSERT INTO test (key, value ) VALUES ( 'mykey', 'somevalue') USING TIMESTAMP 1000;
You expect this as output:
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
And it should be like this unless someone before you didn't perform an operation on this information with a higher timestamp. For instance, if you now write
INSERT INTO test (key, value ) VALUES ( 'mykey', '999value') USING TIMESTAMP 999;
Here's the output
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
As you can see neither the value nor the timestamp have been updated.
1 That's a slight simplification. Unless you are doing a specialised 'compare-and-set' write, Cassandra doesn't read anything from the table before it writes and it doesn't know if there is existing data or what its timestamp is. So you end up with two versions of the row, with different timestamps. But when you read the row back you always get the one with the latest timestamp. Normally Cassandra will compact such duplicate rows after a while, which is when the older timestamp row gets discarded.