I have a scenario where the application has inserted data into a Cassandra table with a TTL of 5 days. I have also set gc_grace_seconds to 5 days so that tombstones get evicted as soon as compaction kicks in.
Now I have a requirement where, for one table, I need to keep data for 60 days. I have changed the application writes to use a TTL of 60 days for new data, but I'm looking for a solution to change the TTL of the existing data (from 5 days to 60 days).
I have tried Instaclustr/TTLRemover, but for some reason the code didn't work for us.
We are using Apache Cassandra 3.11.3.
Just to provide clarity on the parameters:
default_time_to_live: TTL (Time To Live) in seconds, where zero is disabled. The maximum configurable value is 630720000 (20 years). If the value is greater than zero, TTL is enabled for the entire table and an expiration timestamp is added to each column. A new TTL timestamp is calculated each time the data is updated and the row is removed after all the data expires.
Default value: 0 (disabled).
gc_grace_seconds: Seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage collection. Default value: 864000 (10 days). The default value allows time for Cassandra to maximize consistency prior to deletion.
Note: Tombstoned records within the grace period are excluded from hints or batched mutations.
In your case you can set the TTL and gc_grace_seconds to 60 days so that new data expires in 60 days' time. But as your existing data was already written with a TTL of 5 days, it will not pick up the new TTL and will be deleted after 5 days. As far as I know there is no way to change the TTL of existing data in place; it has to be re-written.
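For the new-data part, both settings can be changed on the table itself. Here is a minimal sketch using the DataStax Java driver 3.x; the contact point and the keyspace/table name ks.events are assumptions, and 5184000 seconds corresponds to 60 days:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SetTableTtl {
    public static void main(String[] args) {
        // Assumed contact point and table name; adjust to your cluster and schema.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // 5184000 s = 60 days. Only writes made AFTER this change pick up the new default TTL.
            session.execute("ALTER TABLE ks.events "
                    + "WITH default_time_to_live = 5184000 AND gc_grace_seconds = 5184000");
        }
    }
}
```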
TTL can be set in two ways: 1) per query (USING TTL) 2) on the table (default_time_to_live).
Changing the table-level TTL only affects writes made after the change, so for the existing rows you have to re-insert (or update) them with the new TTL value.
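As a rough sketch of the re-insert approach (assuming the DataStax Java driver 3.x and a hypothetical table ks.events with columns id and payload), the existing live rows could be re-written with a 60-day TTL like this:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReinsertWithNewTtl {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Read every live row and write it back with a 60-day TTL (5184000 s).
            // Column and table names are hypothetical; adapt them to your schema.
            ResultSet rs = session.execute("SELECT id, payload FROM ks.events");
            for (Row row : rs) {
                session.execute(
                        "INSERT INTO ks.events (id, payload) VALUES (?, ?) USING TTL 5184000",
                        row.getUUID("id"), row.getString("payload"));
            }
        }
    }
}
```

For a large table you would want to page through the results and throttle the writes rather than use this naive loop.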
I know that for C* the minimum TTL is 1 second. But does it consider the millisecond part of the insertion time when expiring the column?
For e.g. I inserted a record at 11:05:06:320 am with 1 second ttl.
I am expecting it to expire at 11:05:07:320 am
or will it expire at 11:05:07 am?
A record inserted at 11:05:06:320 with 1 second TTL will expire at 11:05:07:000.
Cassandra calculates localExpirationTime for each expiring cell, which is local time in seconds plus TTL [1]. In your example, it will be 11:05:07. When Cassandra decides if a cell is alive, it checks that the current time is strictly less than expiration time [2]. As a result, starting at 11:05:07, our cell will be considered expired.
[1] https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/ExpirationDateOverflowHandling.java#L118
[2] https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/LivenessInfo.java#L330
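To make the arithmetic concrete, here is an illustrative sketch (not Cassandra's actual code) that mirrors the "local time in seconds plus TTL" computation from the linked sources:

```java
public class TtlGranularity {
    public static void main(String[] args) {
        long insertMillis = 1_409_144_706_320L;   // e.g. a write at ...:06.320
        int ttlSeconds = 1;

        // localExpirationTime is computed in whole seconds: the millisecond part is dropped.
        long localExpirationSec = insertMillis / 1000 + ttlSeconds;   // ...:07

        long nowMillis = 1_409_144_707_000L;      // ...:07.000
        boolean live = nowMillis / 1000 < localExpirationSec;         // strict "less than" check
        System.out.println("still live at :07.000? " + live);         // false -> already expired
    }
}
```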
I'm seeing the stats below for one of my tables when running nodetool cfstats:
Maximum tombstones per slice (last five minutes): 23571
Per the Datastax doc:
Maximum number of tombstones scanned by single key queries during the last five minutes
All my other tables have low numbers like 1 or 2. Should I be worried? Should I try to lower the tombstone creation?
Tombstones can impact read performance if they reside in frequently used tables, so you should rework the data modelling part. Also, you can lower the value of gc_grace_seconds so that tombstones are cleared faster instead of waiting for the default value of 10 days.
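If you decide to lower gc_grace_seconds, it is a per-table setting. A minimal sketch with the Java driver follows; the keyspace/table name ks.events and the one-day value are assumptions, and repairs need to complete within whatever window you choose:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LowerGcGrace {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // 86400 s = 1 day; tombstones become eligible for removal at compaction after this.
            session.execute("ALTER TABLE ks.events WITH gc_grace_seconds = 86400");
        }
    }
}
```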
I'm working on an event processing system where I have to read my event data from an HBase table.
The events I read are stored based on their timestamp.
When I read in a whole day (24 hours), I find periods of the day where I get 1 million events per hour (e.g. during regular business hours) and other periods where I only get several thousand events.
So when I partition the day into equal parts, I get some partitions (and workers) with a lot of work and some with very little.
Is there any concept for partitioning my day so that in the off-peak time I use more hours per partition and during the peak hours I use fewer hours per partition?
This would result in something like:
* from 0-6am use 4 partitions
* from 6am to 6pm use 60 partitions
* from 6pm to 12am use 6 partitions
If you just use the timestamp for the row key, this means you already have problems with region hot-spotting, even before any processing. The simple solution is to add a sharding key before the timestamp.
Row key = (timestamp % number of regions) + timestamp, i.e. a shard prefix concatenated with the timestamp.
This will distribute rows evenly across regions.
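A minimal sketch of building such a salted row key with the HBase client utilities (org.apache.hadoop.hbase.util.Bytes); the shard count of 16 is an assumption you would replace with your actual region count:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
    // Prefix the key with (timestamp % numShards) so consecutive timestamps
    // land on different regions instead of hot-spotting a single one.
    static byte[] rowKey(long timestampMillis, int numShards) {
        byte salt = (byte) (timestampMillis % numShards);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(timestampMillis));
    }

    public static void main(String[] args) {
        byte[] key = rowKey(System.currentTimeMillis(), 16);
        System.out.println(Bytes.toStringBinary(key));
    }
}
```

Note that a reader then has to scan all shard prefixes to cover a single time range.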
I have a table with TTL = 10 days in Cassandra, and I usually run a full compaction every Monday and Thursday.
I noticed that after compaction on Thursday, Cassandra did not touch/compact the files generated on Monday.
Why is that? Is it possible that the file generated on Monday is too big? How can I fix it? BTW, I use SizeTieredCompactionStrategy.
When you say you do a "full compaction" what exactly are you doing to trigger this?
In general, SizeTieredCompaction will only compact a set number of similarly sized SSTables. This means that if your SSTable (table 1) from Monday is, say, X MB and you have min_threshold on the table set to 4, then it would require 4 SSTables of ~X MB (table 1 plus three more) before table 1 would be compacted again. So if you generate a new compacted SSTable of ~X MB every 3 days, it would take about 9 days before the original one was compacted again.
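A toy illustration of that timeline (this is not Cassandra's real bucketing code, just the arithmetic described above, assuming one new ~X MB SSTable every 3 days):

```java
public class StcsTimeline {
    public static void main(String[] args) {
        int minThreshold = 4;        // min_threshold on the table
        int similarSized = 1;        // Monday's ~X MB SSTable
        int day = 0;
        while (similarSized < minThreshold) {
            day += 3;                // another compacted ~X MB SSTable appears every 3 days
            similarSized++;
        }
        System.out.println("Monday's SSTable is eligible again after ~" + day + " days"); // ~9
    }
}
```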
Let's say I have an application, which receives periodically some measurement data.
I know the exact time the data was measured, and I want every piece of data to be deleted 30 days after it was measured.
I'm not inserting the data into the database immediately, but I want to use the time-to-live functionality of Cassandra.
Is there a way to manipulate the internal timestamp of a row in Cassandra, so that I can set a fixed time-to-live but have the lifespan of each row measured from my own timestamp?
E.g. I measure something at 27.08.2014 19:00. I insert this data at 27.08.2014 20:00 into the database and set the time-to-live value to 1 day. I now want the row to be deleted at 28.08.2014 19:00 and not at 28.08.2014 20:00 like it normally would be.
Is something like this possible?
I suggest the following approach, based on your example (see the sketch below):
before insertion, calculate Δx = insertTime - measureTime
set TTL = 1 day - Δx for the inserted row
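A minimal sketch of that calculation (the one-day lifespan is taken from the example above; the table and column names in the commented CQL are hypothetical):

```java
import java.time.Duration;
import java.time.Instant;

public class AdjustedTtl {
    public static void main(String[] args) {
        Instant measureTime = Instant.parse("2014-08-27T19:00:00Z");
        Instant insertTime  = Instant.parse("2014-08-27T20:00:00Z");

        // Δx = insertTime - measureTime, then TTL = 1 day - Δx
        long deltaSeconds = Duration.between(measureTime, insertTime).getSeconds();
        long ttlSeconds = Duration.ofDays(1).getSeconds() - deltaSeconds;

        System.out.println("TTL to use: " + ttlSeconds + " s"); // 82800 s = 23 h
        // e.g. session.execute("INSERT INTO measurements (id, value) VALUES (?, ?) USING TTL " + ttlSeconds, id, value);
    }
}
```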
Addition based on a comment:
You can use the Astyanax client with a batch mutation "to simultaneously enter multiple values at once". It is possible to set a TTL on each column and on the whole row at once.