Cassandra TTL on a row

I know that there are TTLs on columns in Cassandra. But is it also possible to set a TTL on a row? Setting a TTL on each column doesn't solve my problem, as can be seen in the following use case:
At some point a process wants to delete a complete row with a TTL (let's say row "A" with TTL 1 week). It could do this by replacing all existing columns with the same content but with a TTL of 1 week.
But there may be another process running concurrently on that row "A" which inserts new columns or replaces existing ones without a TTL, because that process can't know that the row is to be deleted (it runs concurrently!). So after 1 week all columns of row "A" will be deleted because of the TTL, except for those newly inserted ones. And I also want them to be deleted.
So is there or will there be Cassandra support for this use case or do I have to implement something on my own?
Kind Regards
Stefan

There is currently no way to set a TTL on a row in Cassandra. TTLs are designed for deleting individual columns whose lifetime is known when they are written.
You could achieve what you want by delaying your process: instead of inserting with a TTL of 1 week, run the process a week later and delete the row. Row deletes have the following semantics: any column inserted just before the delete will get deleted, but columns inserted just after it won't be.
If columns that are inserted in the future still need to be deleted, you could insert a row delete with a timestamp in the future to ensure this, but be very careful: if you later wanted to insert into that row you couldn't; columns would just disappear when written to that row (until the tombstone is garbage collected).
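As a sketch of that (risky) future-tombstone approach, assuming a hypothetical table events with a text partition key id, the row delete with a future timestamp might look like this:
-- Hypothetical names; CQL timestamps here are microseconds since the Unix epoch.
-- The value should be one week in the future (the number shown is illustrative).
-- This tombstone shadows every write to the row with a lower timestamp,
-- including writes made during the coming week.
DELETE FROM events USING TIMESTAMP 1735689600000000 WHERE id = 'A';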

You can set a TTL covering a whole row in Cassandra 3 by specifying the TTL on the write, provided the statement writes all of the row's columns:
INSERT INTO Counter (key, eventTime, value) VALUES ('1001', dateof(now()), 100) USING TTL 10;
Note that the TTL applies only to the columns actually written by that statement.

Although I do not recommend this approach, there is a Cassandra way to address the problem:
SELECT TTL(value) FROM table WHERE ...;
Get the current TTL of a value first, then use the result to set the TTL in an INSERT or UPDATE:
INSERT ... USING TTL ttl-of-value;
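Concretely, with illustrative table and column names, the round trip might look like this (the TTL value 86400 stands in for whatever the SELECT returned):
-- Read the remaining TTL of an existing cell...
SELECT TTL(value) FROM tbl WHERE id = 'A';
-- ...then reuse the returned number of seconds when writing another cell,
-- so both expire at roughly the same time:
UPDATE tbl USING TTL 86400 SET other_value = 'x' WHERE id = 'A';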
So... I think that the SELECT TTL() is slow (from experience with TTL() and WRITETIME() in some of my CQL commands). Not only that, the TTL is correct at the time the select results are generated on the Cassandra node, but by the time the insert happens, it will be off. Cassandra should have offered a time to delete rather than a time to live...
So, as mentioned by Richard, having your own process to delete data after 1 week is probably safer. You should have one column that saves the date of creation, or the date when the data becomes obsolete. Then a background process can read that date and, if the data is viewed as obsolete, drop the entire row.
Other processes can also use that date to know whether the row is considered valid or not (so even if it has not yet been deleted, you can still treat the row as invalid once the date has passed).
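A minimal sketch of that pattern, with hypothetical table and column names:
-- The obsolescence date is stored explicitly alongside the data.
CREATE TABLE app.data (
    id text PRIMARY KEY,
    payload text,
    expires_on timestamp
);
-- A background job pages through the table (shown as a full scan for brevity)
-- and drops rows whose expires_on lies in the past:
SELECT id, expires_on FROM app.data;
DELETE FROM app.data WHERE id = 'A';
Readers can apply the same check on expires_on to treat a not-yet-deleted row as invalid.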

Related

Does Cassandra store only the affected columns when updating a record or does it store all columns every time it is updated?

If the answer is yes:
Does that mean that, unlike Mongo or an RDBMS, retrieving every column versus only some columns will have a big performance impact in Cassandra? (I am not talking about transfer time over the network, as that will affect all of the above.)
Does that mean that during compaction it cannot just stop when it finds the latest row for a primary key, and has to go through the full set of SSTables? (I understand there will be optimisations, as a previously compacted SSTable will contain at most one occurrence of a row.)
Please ask only one question per question.
That is entirely up to you. If you write one column value, it'll persist just that one. If you write them all, they will all persist, even if they are the same as the current value.
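For example (hypothetical table and column names), this statement persists only the single cell it writes; the other columns of the partition are untouched:
-- Only the 'email' cell is written to the memtable and, later, an SSTable.
UPDATE users SET email = 'a@example.com' WHERE id = 1001;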
retrieving every column versus only some columns will have a big performance impact
This is definitely the case. Queries for column values that are small, or that haven't been overwritten or deleted, will be much faster than the opposite.
during compaction it cannot just stop when it finds the latest row for a primary key, and has to go through the full set of SSTables?
Yes. And not just during compaction, but read queries will also check multiple SSTable files.

Deleting a column in Cassandra for a large dataset

We have a redundant column that we'd like to delete from our Cassandra database (version 2.1.15). It is a text column and represents the majority of the data on disk (15 nodes x 1.8 TB per node).
The easiest option seems to be an ALTER TABLE to remove that column, and then let Cassandra compaction take care of things (we also run Cassandra Reaper to manage repairs). However, given the size of the dataset, I'm concerned I will knock over the cluster with such a massive delete.
Another option I've considered is a process that runs through the keyspace setting the value to null. I think this will have the same effect as removing the column, but is more under our control (though it also requires writing something to do it).
Would anyone have any advice on how to approach this?
Thanks!
Dropping a column does mark the deleted values as tombstones. The column value becomes unavailable immediately and the column data is removed in the next compaction cycle.
If you want to expedite the removal of the column before the compaction occurs, you can run nodetool upgradesstables to remove the data, after you use the ALTER TABLE command to change the metadata for the column.
See Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html
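Concretely, the sequence would look something like this (keyspace, table, and column names are illustrative):
-- Drop the column from the schema; this is a quick metadata change:
ALTER TABLE mykeyspace.mytable DROP big_text_column;
Then, on each node:
# Rewrite SSTables so the dropped column's data is removed now
# rather than at the next compaction:
nodetool upgradesstables -a mykeyspace mytable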
If I remember correctly, dropping a column doesn't really mark the deleted values with tombstones; instead it inserts a corresponding entry into the system.dropped_columns table, and then code such as SerializationHelper & BTreeRow performs the filtering on the fly. The data will be deleted when compaction happens.
Explicitly setting the value to null won't make the situation better, because you'll be adding data to the table.
I would recommend testing the deletion on a small cluster and checking how it behaves.

Update the column I searched for

Is there any possibility of updating a column value in Cassandra that I searched for (it is part of my primary key)?
I have a (huge) list of items with a field called "LastUpdateDateTime", and from time to time I search for columns that haven't been updated for a while.
So, the reason I searched for these columns is that I want to update them, and after I update them I want to set the timestamp to the current date.
How do I do this with Cassandra?
You can't update a primary key column; writing a new value will insert another record.
That's how Cassandra works.
You may have to use the spark-cassandra-connector, or delete the records with the old values and insert new ones.
Note: deleting and inserting is not recommended if you have many records, as it will create a corresponding number of tombstones.
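A minimal sketch of the delete-and-insert approach, assuming a hypothetical items table where the timestamp is part of the primary key (all names are illustrative):
-- Remove the row stored under the old key value...
DELETE FROM items WHERE last_update = '2016-01-01 00:00:00+0000' AND item_id = 42;
-- ...and re-insert the same data under the new key value:
INSERT INTO items (last_update, item_id, payload)
VALUES (toTimestamp(now()), 42, 'unchanged payload');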

How does Cassandra manage insertion, update, and deletion of columns and column data internally?

Actually I am getting confused by some concepts regarding Cassandra.
What do we actually mean by updating a Cassandra row? Does it mean adding more columns, or updating the value of a column, or both?
When we add more columns to a row, is the previous row in the SSTable invalidated and a new row entry inserted into the SSTable with the newly added columns?
Since an SSTable is immutable, does each update of column data, addition of a column, or deletion of column data result in invalidating the previous row and inserting a new row with all the previous columns plus the new column?
Please help.
What do we actually mean by updating a Cassandra row? Does it mean adding more columns, or updating the value of a column, or both?
In Cassandra, updating a row and inserting a row are the same operation; both lead to adding data to a memtable (an in-memory sstable), which is later flushed to disk and becomes an sstable (a log line is also written to the commit log if persistent writes are enabled). If you insert a column (by the way, in Cassandra terms a column is the same as a cell, and a row is known as a partition; you might find this useful if you do any further reading) which already exists, e.g.:
INSERT INTO db.tbl (id, value) VALUES ('text_id1', 'some text as a value');
INSERT INTO db.tbl (id, value) VALUES ('text_id1', 'some text as a value');
You'll end up with 1 partition, since the first insert is overwritten by the second. This means that inserting partitions with duplicate keys leads to the previous one being overwritten (the overwrite is resolved by the timestamp at the time of insert: last write wins).
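You can observe the last-write-wins resolution directly by inspecting the write timestamp (a sketch, reusing the table above):
-- Returns the microsecond timestamp of the winning (latest) write.
SELECT value, WRITETIME(value) FROM db.tbl WHERE id = 'text_id1';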
When we add more columns (cells) to a row (partition), is the previous row in the SSTable invalidated and a new row entry inserted into the SSTable with the newly added columns?
For CQL, the previous columns will just contain a null value. No invalidation will happen; you can alter schemas as you please. If you delete a column, its data will be removed during the next compaction, with the aim of reclaiming disk space.
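For example (a sketch, reusing the table above), existing partitions simply read back null for a newly added column; no rows are rewritten:
-- Add a column to the schema:
ALTER TABLE db.tbl ADD added_col text;
-- Existing rows return null for the new column:
SELECT id, value, added_col FROM db.tbl WHERE id = 'text_id1';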
Since an SSTable is immutable, does each update of column data, addition of a column, or deletion of column data result in invalidating the previous row and inserting a new row with all the previous columns plus the new column?
Kind of. SSTables are merged into larger SSTables when necessary; how this is done depends on the compaction strategy in use. There are two flavours, size-tiered and levelled compaction. Covering how they work is a whole separate question that has been answered by people who are smarter than me, so have a read here.
Updating is covered here:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_write_update_c.html
As you note, SSTables are immutable, so you're probably wondering what happens when a later write supersedes data already in an SSTable. The storage engine reads from all SSTables that might have data for a requested row (as determined by the bloom filter for each SSTable). Understanding the read path might clarify this for you:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_reads_c.html
Specifically:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_read_path_c.html

Update TTL for entire row when doing CQL update statement

Assume you have a row with 4 columns, and when you created it you set a TTL of 1 hour.
I need to occasionally update the date column of the row, and at the same time update the TTL of the entire row.
Assuming this doesn't work, what's the correct way to achieve this?
update mytable using ttl 3600
set accessed_on=?
Cassandra supports TTL per column only, which is a nice, flexible feature, but the ability to TTL a row is a feature that has been requested many times.
Your only option is to update all columns on the row, thereby updating the TTL on all the columns.
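A sketch of that workaround, assuming the row's non-key columns are accessed_on, col_b, col_c, and col_d (illustrative names): rewrite all of them in one statement so that every cell receives the new TTL:
-- Every column written here gets the fresh 1-hour TTL;
-- any column left out would keep its old expiry.
UPDATE mytable USING TTL 3600
SET accessed_on = ?, col_b = ?, col_c = ?, col_d = ?
WHERE id = ?;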
