Is it necessary to SET something while updating the TTL in Cassandra?
I have a table like this,
CREATE TABLE session(
tokenId text PRIMARY KEY,
username text);
I insert the data like this,
INSERT INTO session(tokenId, username) VALUES ('123123123123','admin') USING TTL 30;
I update the TTL like this,
UPDATE session USING TTL 30 SET username = 'admin' WHERE tokenid = '123123123123' IF EXISTS;
which forces me to update 'username' as well, since UPDATE demands a SET clause. Is there any way to just update the TTL?
This is a very tricky question. Let me try my best to explain my understanding.
Basically, Cassandra doesn't allow you to update the TTL for a row. TTL is maintained for each column, as mentioned here. Coming back to your example, the insert statement with a TTL value will create the same TTL for all columns in the insert statement. Meanwhile, the update statement sets the TTL only for the columns it touches (e.g. username). Hope it helps.
Both the INSERT and UPDATE commands support setting a time for data in
a column to expire. Use CQL to set the expiration time (TTL).
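So the only way to "just update the TTL" is to write the column again with a new TTL. A small sketch against the session table above, using the same values as in the question:

-- Re-writing the only non-key column refreshes its TTL; the value itself is unchanged:
UPDATE session USING TTL 30 SET username = 'admin' WHERE tokenId = '123123123123';

-- Check how many seconds are left before the column expires:
SELECT TTL(username) FROM session WHERE tokenId = '123123123123';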
I tried to update a record in Cassandra using CQL, and noticed that for some reason I cannot change the column back to its old value. Here are the steps I performed:
insert a brand new record with column token set to value1
insert into instrucment(instrument_id, account_id, token) values('CDX-IT-359512FD43D3', 'CDX-IT-970A44E2DAF4','value1') USING TIMESTAMP 1605546853130000
update the record to set column token to value2
insert into instrucment(instrument_id, token) values('CDX-IT-359512FD43D3', 'value2') USING TIMESTAMP 1605546853130000
update the record to set column token back to value1
insert into instrucment(instrument_id, token) values('CDX-IT-359512FD43D3', 'value1') USING TIMESTAMP 1605546853130000
Steps 1 & 2 worked fine, but step 3 failed: the DB record showed the column token is still value2. Why is that? Is it because Cassandra thinks value1 + timestamp 1605546853130000 is an old record and thus won't update it?
You are updating the same row (same partition key) with different values.
Cassandra normally determines the valid record for a row by timestamp. The record with the most recent timestamp 'wins'.
See here for more information how updates work:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlWriteUpdate.html
Since you are inserting with the same timestamp, you are simulating concurrent writes to the same row, concurrent down to the exact same timestamp. If you are not setting the timestamp explicitly for your inserts, such concurrency is very unlikely.
In such truly concurrent cases Cassandra needs to turn to other methods to determine the 'winner'. Cassandra breaks the timestamp tie by comparing the cell values byte-wise in a deterministic manner, and the greater value wins. In your case, the record with value2 wins because 'value2' compares greater than 'value1'.
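If the goal is simply to get value1 back, here is a sketch of two ways to avoid the tie, reusing the statements above:

-- Let the coordinator assign the write timestamp; it will be newer than the explicit one used earlier:
insert into instrucment(instrument_id, token) values('CDX-IT-359512FD43D3', 'value1');

-- Or supply an explicitly larger timestamp so this write wins outright:
insert into instrucment(instrument_id, token) values('CDX-IT-359512FD43D3', 'value1') USING TIMESTAMP 1605546853130001;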
I'm looking into Cassandra for a potential upcoming project which I think it could be a good fit for. The one potential place where it is stumping me is around a requirement for data retention. Basically we have a schema like this:
CREATE TABLE Things (
user_id int,
thing_id int,
a text static,
b text static,
.... more static fields
updated_at timestamp static,
type text,
subthing_id int,
PRIMARY KEY (user_id, thing_id, subthing_id)
)
In relational database terms I would say that a Thing belongs to a User and a Thing has many Subthings.
A Thing has various subthings associated with it that come in at later times; each one does a new insert that in turn updates the appropriate static fields. We need to store each Thing for 30 days after the last time a subthing was inserted for that Thing. So for example, Thing A and Thing B get inserted. A subthing for Thing B is inserted a week later. Thing A is deleted 30 days after initial insertion. Thing B (and all associated subthings) are deleted 7 days later.
As far as I can tell, I can't just insert with a TTL since I need to update the TTL of the other Thing rows sharing the same user_id and thing_id. I'm also not entirely sure how I would just run a DELETE command here since I'm not deleting by any of the keys. I believe the primary key is correct here since ALL queries will be based on the user_id (except the deletion which is determined by the updated_at).
My other concern is the idea of the tombstones. I have only read about them but the concern here is that I would be deleting potentially millions of these Things each day. Is that going to require daily compaction after the daily deletes are performed?
Update:
An alternative I have thought of since the original posting was having a second table that gets inserted to each time a subthing is added. It would look like:
CREATE TABLE Expirations (
expiry date,
user_id int,
thing_id int,
PRIMARY KEY (expiry, user_id, thing_id)
)
Where expiry is the date on which the given user_id and thing_id are to be deleted. This table would have to be updated as necessary as things are inserted into the Things table, and then I would have to run something each day to query for values where expiry is today and iterate over them to delete things from the Things table. I am not sure if this is considered the "Cassandra way", but it seems like it could work.
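A rough sketch of that daily pass against the two tables above (the literal date and ids are placeholders for illustration):

-- Everything expiring today lives in a single partition of Expirations:
SELECT user_id, thing_id FROM Expirations WHERE expiry = '2016-01-01';

-- For each (user_id, thing_id) returned, delete the Thing and all of its subthings
-- (a range delete covering every subthing_id):
DELETE FROM Things WHERE user_id = 123 AND thing_id = 456;

-- Finally, drop the processed expiry partition:
DELETE FROM Expirations WHERE expiry = '2016-01-01';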
This is an interesting challenge. I would use a map data type to map each thing_id to all its subthing_ids. I'd go for something like:
CREATE TABLE Things (
partition_date timestamp,
insertion_date timestamp,
user_id int,
thing_map map<int,int>,
a text static,
b text static,
.... more static fields
updated_at timestamp static,
type text,
PRIMARY KEY (partition_date, insertion_date, user_id)
) WITH CLUSTERING ORDER BY (insertion_date DESC)
Here I introduced a new field insertion_date that should hold exactly the insertion date, and a new field partition_date that becomes the new (and only) PARTITION KEY; it should store a truncation of the insertion_date field, just to avoid hotspots (I'm assuming you can simply query based on a day field due to your TTL requirements; if you need to query on the user_id field, things are a bit different). I recently answered similar questions about this modeling problem here and here, so have a look at those to get more information about the technique used (it's called bucketing).
Then there's the thing_map, which is the core of your problem. Pushing a new object into the map should reset the TTL for the map entirely, so that could give you exactly the desired behavior. Note that the TTL will remove the field only, not the entire row; you'll simply need to test whether it's null or not.
Finally, the tombstone behavior is a problem you're going to have to face. If you can afford a complete row rewrite, that is, instead of updating only the map field you upsert the whole row at once, you'd get a delete at the partition level, and the "reverse time-series" I've modeled with the clustering key should take care of that without too many problems.
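A minimal sketch of such a full-row upsert, with placeholder values (the column values, TTL, and exact field list are assumptions for illustration):

-- Re-writing the whole row with a fresh 30-day TTL each time a subthing arrives
-- gives every column, including the map, the same new expiry:
INSERT INTO Things (partition_date, insertion_date, user_id, thing_map, a, b, updated_at, type)
VALUES ('2016-01-01', '2016-01-01 10:00:00', 42, {1: 100, 2: 200}, 'a-value', 'b-value', '2016-01-01 10:00:00', 'some-type')
USING TTL 2592000;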
How can I update an entire table and set a TTL for every entry?
Current Scenario (Cassandra 2.0.11):
table:
CREATE TABLE external_users (
external_id text,
type int,
user_id text,
PRIMARY KEY (external_id, type)
)
Currently there are ~40 million entries in this table and I want to add a TTL of, let's say, 86,400 seconds (1 day).
It's no problem for new entries with USING TTL 86400, or to UPDATE current entries, but how do I apply a TTL to every already existing entry?
My idea was to select all data and update every single row with a little script. I was just wondering if there is an easier way to achieve this (because even with batch updates this is going to take a while and be a big effort).
Thanks in advance
There is no way to alter the TTL of existing data in C*. TTL is just an internal column attribute which is written together with all other column data into immutable SSTables. A quote from the docs:
If you want to change the TTL of expiring data, you have to re-insert the data with a new TTL. In Cassandra, the insertion of data is actually an insertion or update operation, depending on whether or not a previous version of the data exists.
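In practice that means a one-off pass over the table: page through the existing rows, then re-write each row's non-key data with the desired TTL. A rough sketch against the external_users table above (the literal key values are placeholders; a real pass would iterate over the SELECT results in a small script):

-- Page through all existing rows to get their primary keys and current values:
SELECT external_id, type, user_id FROM external_users;

-- Re-write each row's data with the desired TTL; because this write is newer,
-- it supersedes the original un-TTL'd version:
UPDATE external_users USING TTL 86400
SET user_id = 'some-user-id'
WHERE external_id = 'some-external-id' AND type = 1;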
Consider the following Insert statement.
INSERT INTO NerdMovies (movie, director, main_actor, year)
VALUES ('Serenity', 'Joss Whedon', 'Nathan Fillion', 2005)
USING TTL 86400;
Does the TTL field specify the time to live for the whole set of columns for a particular primary key, or just one particular column? Because I would want to specify a TTL for a whole set of columns that should get deleted after the TTL expires.
OK, I figured it out myself. It sets the TTL for the whole set of columns, so all the columns for a particular primary key will be deleted once the TTL expires.
#sayed-jalil
To be more precise, it will set TTL for the columns that you mentioned in the INSERT/UPDATE statement.
So for instance, if at time t you do
INSERT INTO NerdMovies (movie, director, main_actor, year)
VALUES ('Serenity', 'Joss Whedon', 'Nathan Fillion', 2005)
USING TTL 86400;
if you then do the following at time t + 10
UPDATE NerdMovies USING TTL 86400 SET year = 2004 WHERE movie = 'Serenity';
then columns movie, director, and main_actor will expire at t + 86400, while column year will expire at t + 10 + 86400.
Hope that makes sense.
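You can verify the per-column behaviour by reading the remaining TTLs back; a small sketch, assuming movie is the primary key of NerdMovies:

SELECT TTL(director), TTL(main_actor), TTL(year) FROM NerdMovies WHERE movie = 'Serenity';

-- TTL(year) will be roughly 10 seconds larger than the other two, because year
-- was re-written 10 seconds later with a fresh 86400-second TTL.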
Is there any way to select TTL value for an element in a map in Cassandra with CQL3?
I've tried this, but it doesn't work:
SELECT TTL (mapname['element']) FROM columnfamily
Sadly, I'm pretty sure the answer is that it is not possible as of Cassandra 1.2 and CQL3. You can't query individual elements of a collection. As this blog entry says, "You can only retrieve a collection in its entirety". I'd really love to have the capability to query for collection elements, too, though.
You can still set the TTL for individual elements in a collection. I suppose if you wanted to be assured that a TTL is some value for your collection elements, you could read the entire collection and then update the collection (the entire thing or just a chosen few elements) with your desired TTL. Or, if you absolutely needed to know the TTL for individual data, you might just need to change your schema from collections back to good old dynamic columns, for which the TTL query definitely works.
Or, a third possibility could be that you add another column to your schema that holds the TTL of your collection. For example:
CREATE TABLE test (
key text PRIMARY KEY,
data map<text, text>,
data_ttl text
) WITH ...
You could then keep track of the TTL of the entire map column 'data' by always updating column 'data_ttl' whenever you update 'data'. Then, you can query 'data_ttl' just like any other column:
SELECT ttl(data_ttl) FROM test;
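For that to work, 'data_ttl' has to be written with the same TTL as 'data' on every update; a minimal sketch with placeholder values:

-- Writing both columns in one statement gives them the same TTL, so
-- ttl(data_ttl) reports the TTL carried by the newly written map entries:
UPDATE test USING TTL 3600 SET data = data + {'k1': 'v1'}, data_ttl = 'marker' WHERE key = 'some-key';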
I realize none of these solutions are perfect... I'm still trying to figure out what will work best for me, too.