Cassandra - Delete Time Series Rows in cqlsh - cassandra

Running Cassandra 2.0.11 and I'm having difficulty deleting a row of time series data in CQLSH. Since I'm unable to use > or < in the WHERE clause of a DELETE statement, I'm assuming I need the exact time
Schema:
CREATE TABLE account_data_by_user (
user_id int,
time timestamp,
account_id int,
account_desc text,
...
PRIMARY KEY ((user_id), time, account_id)
Row in question:
user_id | time | account_id | account_desc |
--------+--------------------------+-----------------+------------------+-
1 | 2015-02-20 08:51:55-0600 | 1 | null |
Attempting:
DELETE
FROM account_data_by_user
WHERE user_id = 1 and time = '2015-02-20 08:51:55-0600' and account_id = 1
The above executes successfully, but the row is still there. I'm assuming the cqlsh output [time] is the problem.
I should note that I can delete a row like this through cqlengine.Model.delete, but I'm not sure what it's executing to accomplish the delete.

So after much google, I've discovered the blob conversion functions from this JIRA issue: https://issues.apache.org/jira/browse/CASSANDRA-5870
Query:
SELECT user_id, host_account_id, blobasbigint(timestampasblob(time))
FROM account_data_by_user where user_id
Returns:
user_id | account_id | blobasbigint(timestampasblob(time))
---------+-----------------+-------------------------------------
1 | 1 | 1424458973126
1 | 184531 | 1423738054142
DELETE
FROM account_data_by_user
WHERE user_id = 1 and time = 1424458973126 and host_account_id = 1;
This successfully removed the desired row.

Related

Cassandra: get last with a non-null value in a column

I have a Cassandra table where each column can contain a value or a NULL. But if it contains a NULL, I know that all the next values in that column are also NULL.
Something like this:
+------------+---------+---------+---------+
| date | column1 | column2 | column3 |
+------------+---------+---------+---------+
| 2017-01-01 | 1 | 'a' | NULL |
| 2017-01-02 | 2 | 'b' | NULL |
| 2017-01-03 | 3 | NULL | NULL |
| 2017-01-04 | 4 | NULL | NULL |
| 2017-01-05 | NULL | NULL | NULL |
+------------+---------+---------+---------+
I need a query that, for a given column, returns the date of the last column with a non-null value. In this case:
For column1, '2017-01-04'
For column2, '2017-01-02'
For column3, no result returned.
In SQL it would be something like this:
SELECT date
FROM my_table
WHERE column1 IS NOT NULL
ORDER BY date DESC LIMIT 1
Is it possible in any way, or should I break the table into one table for each column to avoid the NULL situation at all?
tldr; Create a new table that tracks this separately.
This would only be possible if 'column 1' was part of the primary key, with secondary indexes or with a materialized view.
You don't want your primary key to have nulls. As an aside make sure you're writing 'UNSET' inplace of null to the rest of your table. This should be handled by the driver but some drivers are not terribly mature. Writing nulls is effectively a delete operation and will cause tombstones.
Secondary indexes come with performance problems as potentially they hit the entire cluster and don't scale very well beyond a certain point.
Materialized views are being deprecated, so probably avoid those.
You are likely better served by creating a separate table that tracks this exact functionality. This would mean multiple writes and multiple reads but would avoid large table scans and secondary indexes.
I'm going to assume your partition isn't by date and that you've got wide rows because it makes this simpler but this is what that would look like.
CREATE TABLE my_table (
partition bigint,
date text,
column1 bigint,
column2 text,
column3 text,
PRIMARY KEY(partition, date);
CREATE TABLE offset_tracker(
partition bigint,
date text,
PRIMARY KEY(partition);
Here you would do a select date FROM offset_tracker WHERE partition=x to get your 'largest date with values'.

Delete Rows using timestamp column cassandra

I want to delete data between timestamp from my table.
CREATE TABLE propatterns_test.test (
clientId text,
meterId text,
meterreading text,
date timestamp,
PRIMARY KEY (meterId, date) );
My delete query is:
DELETE FROM test WHERE meterid = 'M5' AND date > '2016-12-27 10:00:00+0000';
Which returned this error :
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid operator < for PRIMARY KEY part date"
After that I tried to delete a specific row :
DELETE FROM test WHERE meterid = 'M5' AND date = '2016-12-27 09:42:30+0000';
Actually the table contains the same record, but it was not deleted.
This is what my data looks like:
meterid | date | clientid | meterreading
---------+--------------------------+----------+--------------
M5 | 2016-12-27 09:42:30+0000 | RDS | 35417.8
M5 | 2016-12-27 09:42:44+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:20+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:33+0000 | RDS | 35417.8
Nothing is deleting from table. So how can I delete data between timestamp dates which is part of the primary key?
I see a couple of things happening here. First of all, like iconnj mentioned, range deletes are not possible in versions prior to Cassandra 3.0.
Secondly, your single-row delete attempt is failing (I believe) due to the fact that you are not accounting for the milliseconds present on the timestamp. You can see this if you nest your date column inside the timestsampasblob and blobasbigint functions:
aploetz#cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
M5 | 2016-12-30 17:32:08+0000 | 1483119128812
(3 rows)
Note the zeros on the end of the 2016-12-27 09:42:30+0000 row, that I explicitly INSERTed from your example. Note that the two rows I INSERTed using the dateof(now()) nested functions actually has the milliseconds as the last three digits on the timestamps.
Watch what happens when I take those three digits and add them as milliseconds when I delete one of the rows:
aploetz#cqlsh:stackoverflow> DELETE FROM propatterns WHERE meterid='M5'
AND date='2016-12-30 17:32:08.812+0000';
aploetz#cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
(2 rows)
In summary:
You cannot perform range deletes prior to Cassandra 3.0.
You cannot delete individual rows keyed by timestamps without specifying milliseconds, if milliseconds are indeed present.
Delete with range clause is possible in C* 3.0 onwards. Looking at the error you got I think you are on a pre 3.0 version in which case you won't be able to do this via CQL
In Cassandra 3 you can use the "...from Y using timestamp XXX where ..." command:
create table mytime (
location_id text,
tour_id text,
mytime timestamp,
PRIMARY KEY (location_id, tour_id));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '1', toTimeStamp(now()));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '2', toTimeStamp(now()));
Be aware: the value you need to use for the timestamp is nanoseconds not miliseconds:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.110 |1543395232110 |1543395232109517 |
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |
So now you can do
delete from mytime using timestamp 1543395232109517 where location_id = 'location1';
Which correctly deletes the entry <= 1543395232109517:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |

Cassandra Data Model for Sensor Data - Value | Timestamp

I'm new to Cassandra and I'm trying to define a data model that fits my requirements.
I have a sensor that collects one value every millisecond and I have to store those data in Cassandra. The queries that I want to perform are:
1) Give me all the sensor values from - to these timestamp values
2) Tell me when this range of values was recorded
I'm not sure if there exist a common schema that can satisfy both queries because I want to perform range queries on both values. For the first query I should use something like:
CREATE TABLE foo (
value text,
timestamp timestamp,
PRIMARY KEY (value, timestamp));
but then for the second query I need the opposite since I can't do range queries on the partition key without using a token that restricts the timestamp:
CREATE TABLE foo (
value text,
timestamp timestamp,
PRIMARY KEY (timestamp, value));
So do I need two tables for this? Or there exist another way?
Thanks
PS: I need to be as fast as possible while reading
I have a sensor that collects one value every millisecond and I have to store those data in Cassandra.
The main problem I see here, is that you're going to run into Cassandra's limit of 2 billion col values per partition fairly quickly. DataStax's Patrick McFadin has a good example for weather station data (Getting Started with Time Series Data Modeling) that seems to fit here. If I apply it to your model, it looks something like this:
CREATE TABLE fooByTime (
sensor_id text,
day text,
timestamp timestamp,
value text,
PRIMARY KEY ((sensor_id,day),timestamp)
);
This will partition on both sensor_id and day, while sorting rows within the partition by timestamp. So you could query like:
> SELECT * FROM fooByTime WHERE sensor_id='5' AND day='20151002'
AND timestamp > '2015-10-02 00:00:00' AND timestamp < '2015-10-02 19:00:00';
sensor_id | day | timestamp | value
-----------+----------+--------------------------+-------
5 | 20151002 | 2015-10-02 13:39:22-0500 | 24
5 | 20151002 | 2015-10-02 13:49:22-0500 | 23
And yes, the way to model in Cassandra, is to have one table for each query pattern. So your second table where you want to range query on value might look something like this:
CREATE TABLE fooByValues (
sensor_id text,
day text,
timestamp timestamp,
value text,
PRIMARY KEY ((sensor_id,day),value)
);
And that would support queries like:
> SELECT * FROm foobyvalues WHERE sensor_id='5'
AND day='20151002' AND value > '20' AND value < '25';
sensor_id | day | value | timestamp
-----------+----------+-------+--------------------------
5 | 20151002 | 22 | 2015-10-02 14:49:22-0500
5 | 20151002 | 23 | 2015-10-02 13:49:22-0500
5 | 20151002 | 24 | 2015-10-02 13:39:22-0500

How to delete a record in Cassandra?

I have a table like this:
CREATE TABLE mytable (
user_id int,
device_id ascii,
record_time timestamp,
timestamp timeuuid,
info_1 text,
info_2 int,
PRIMARY KEY (user_id, device_id, record_time, timestamp)
);
When I ask Cassandra to delete a record (an entry in the columnfamily) like this:
DELETE from my_table where user_id = X and device_id = Y and record_time = Z and timestamp = XX;
it returns without an error, but when I query again the record is still there. Now if I try to delete a whole row like this:
DELETE from my_table where user_id = X
It works and removes the whole row, and querying again immediately doesn't return any more data from that row.
What I am doing wrong? How you can remove a record in Cassandra?
Thanks
Ok, here is my theory as to what is going on. You have to be careful with timestamps, because they will store data down to the millisecond. But, they will only display data to the second. Take this sample table for example:
aploetz#cqlsh:stackoverflow> SELECT id, datetime FROM data;
id | datetime
--------+--------------------------
B25881 | 2015-02-16 12:00:03-0600
B26354 | 2015-02-16 12:00:03-0600
(2 rows)
The datetimes (of type timestamp) are equal, right? Nope:
aploetz#cqlsh:stackoverflow> SELECT id, blobAsBigint(timestampAsBlob(datetime)),
datetime FROM data;
id | blobAsBigint(timestampAsBlob(datetime)) | datetime
--------+-----------------------------------------+--------------------------
B25881 | 1424109603000 | 2015-02-16 12:00:03-0600
B26354 | 1424109603234 | 2015-02-16 12:00:03-0600
(2 rows)
As you are finding out, this becomes problematic when you use timestamps as part of your PRIMARY KEY. It is possible that your timestamp is storing more precision than it is showing you. And thus, you will need to provide that hidden precision if you will be successful in deleting that single row.
Anyway, you have a couple of options here. One, find a way to ensure that you are not entering more precision than necessary into your record_time. Or, you could define record_time as a timeuuid.
Again, it's a theory. I could be totally wrong, but I have seen people do this a few times. Usually it happens when they insert timestamp data using dateof(now()) like this:
INSERT INTO table (key, time, data) VALUES (1,dateof(now()),'blah blah');
CREATE TABLE worker_login_table (
worker_id text,
logged_in_time timestamp,
PRIMARY KEY (worker_id, logged_in_time)
);
INSERT INTO worker_login_table (worker_id, logged_in_time)
VALUES ("worker_1",toTimestamp(now()));
after 1 hour executed the above insert statement once again
select * from worker_login_table;
worker_id| logged_in_time
----------+--------------------------
worker_1 | 2019-10-23 12:00:03+0000
worker_1 | 2015-10-23 13:00:03+0000
(2 rows)
Query the table to get absolute timestamp
select worker_id, blobAsBigint(timestampAsBlob(logged_in_time )), logged_in_time from worker_login_table;
worker_id | blobAsBigint(timestampAsBlob(logged_in_time)) | logged_in_time
--------+-----------------------------------------+--------------------------
worker_1 | 1524109603000 | 2019-10-23 12:00:03+0000
worker_1 | 1524209403234 | 2019-10-23 13:00:03+0000
(2 rows)
The below command will not delete the entry from Cassandra as the precise value of timestamp is required to delete the entry
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='2019-10-23 12:00:03+0000';
By using the timestamp from blob we can delete the entry from Cassandra
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='1524209403234';

Cassandra: can you add dynamic columns within existing column clustering?

I'm using Cassandra 1.2.12 with CQL 3, and am having trouble modeling my column family.
I currently store snapshots of customer data at particular times. Works great:
CREATE TABLE data (
cust_id varchar,
time timeuuid,
data_text text,
PRIMARY KEY (cust_id, time)
);
The cust_id is the partition key and time is the clustering id, so, as I understand it, I can think of each row in the table like:
| cust_id | timeuuid1 : data_text | timeuuid2 : data_text |
| CUST1 | data at this time | data at this time |
Now I'd like to store another group of metrics for each snapshot - but the name of each of these columns isn't fixed. So something like:
| cust_id | timeuuid1 : data_text | timeuuid1 : dynamicCol1 | timeuuid1 : dynamicCol2 | timeuuid1 : dynamicColN |
| CUST1 | data |{some value} |{some value} |{some value} |
I've achieved dynamic columns for timestamp by using a composite primary key, but I can't see how to achieve this within each cluster of columns, if you see what I mean.
If I add, say, "dynamicColumnName" to the existing composite key, I'll end up with customer data stored for each dynamic column, which is not what I want.
Is this possible, without using a Map column? Hope you can help, thanks!
I am not a CQL user... With the thrift API you dynamically add a column to a column family by inserting/updating a record with a value for a column with name X. The column X will start to exist right there and then for that record.
Have you tried an INSERT statement specifying a column that you have not explicitly defined? I would expect that to have the same effect (column is created).

Resources