Does Cassandra consider milliseconds when expiring a column?

I know that for C* the minimum TTL is 1 second. But does it consider the millisecond part of the insertion time when expiring the column?
For example, I insert a record at 11:05:06:320 am with a 1 second TTL.
I am expecting it to expire at 11:05:07:320 am,
or will it expire at 11:05:07 am?

A record inserted at 11:05:06:320 with a 1 second TTL will expire at 11:05:07:000.
Cassandra calculates a localExpirationTime for each expiring cell, which is the local time in seconds plus the TTL [1]. In your example, that is 11:05:07. When Cassandra decides whether a cell is alive, it checks that the current time is strictly less than the expiration time [2]. As a result, starting at 11:05:07, the cell is considered expired.
[1] https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/ExpirationDateOverflowHandling.java#L118
[2] https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/LivenessInfo.java#L330
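To make the arithmetic concrete, here is a minimal sketch in plain Python (not Cassandra's actual code; the names just mirror the linked Java):

```python
from datetime import datetime

def local_expiration_time(insert_time: datetime, ttl_seconds: int) -> int:
    # The expiration is stored in whole seconds: the millisecond part of
    # the insertion time is truncated before the TTL is added.
    return int(insert_time.timestamp()) + ttl_seconds

def is_live(now: datetime, expiration_seconds: int) -> bool:
    # A cell is alive only while the current time, in seconds, is
    # strictly less than its localExpirationTime.
    return int(now.timestamp()) < expiration_seconds

insert = datetime(2024, 1, 1, 11, 5, 6, 320_000)   # 11:05:06.320
exp = local_expiration_time(insert, 1)             # corresponds to 11:05:07

print(is_live(datetime(2024, 1, 1, 11, 5, 6, 999_000), exp))  # True
print(is_live(datetime(2024, 1, 1, 11, 5, 7, 0), exp))        # False - expired at 11:05:07.000
```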

Related

TTL Remover on Cassandra Data

I have a scenario where the application has inserted data into a Cassandra table with a TTL of 5 days. I also have gc_grace_seconds set to 5 days so that tombstones get evicted as soon as compaction kicks in.
Now, for one table I need to keep data for 60 days. I have changed the application writes to use a TTL of 60 days for new data, but I'm looking for a solution where I could change the TTL of existing data (from 5 days to 60 days).
I have tried Instaclustr/TTLRemover, but for some reason the code didn't work for us.
We are using Apache Cassandra 3.11.3.
Just to provide clarity on the parameters:
default_time_to_live: TTL (Time To Live) in seconds, where zero is disabled. The maximum configurable value is 630720000 (20 years). If the value is greater than zero, TTL is enabled for the entire table and an expiration timestamp is added to each column. A new TTL timestamp is calculated each time the data is updated and the row is removed after all the data expires.
Default value: 0 (disabled).
gc_grace_seconds: Seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage collection. Default value: 864000 (10 days). The default value allows time for Cassandra to maximize consistency prior to deletion.
Note: Tombstoned records within the grace period are excluded from hints or batched mutations.
In your case you can update the TTL and gc_grace_seconds to 60 days so that new data expires after 60 days. But as your existing data is already marked with a TTL of 5 days, it will not pick up the new TTL and will be deleted after 5 days. As far as I know, there is no way to update the TTL of existing data in place.
TTL can be set in two ways: 1) per query, 2) as a table default.
Either update the TTL default on the table, or re-insert the data with the new TTL value.
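As a rough illustration of both options (a sketch using the DataStax Python driver; the keyspace, table, and column names are made up):

```python
from datetime import datetime
from cassandra.cluster import Cluster

SIXTY_DAYS = 60 * 24 * 3600  # 5,184,000 seconds

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# 1) Per-query TTL: applies only to the rows written by this statement.
session.execute(
    "INSERT INTO sensor_data (id, ts, value) VALUES (%s, %s, %s) USING TTL %s",
    ('sensor-1', datetime(2023, 1, 1), 42.0, SIXTY_DAYS),
)

# 2) Table default: applies to future writes that don't set their own TTL.
session.execute("ALTER TABLE sensor_data WITH default_time_to_live = %d" % SIXTY_DAYS)
session.execute("ALTER TABLE sensor_data WITH gc_grace_seconds = %d" % SIXTY_DAYS)

# Rows that were already written keep the TTL they were written with; to
# extend them you have to read them back and re-insert with the new TTL.
```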

How to partition unequally distributed events on a timeline?

I'm working on an event processing system where I have to read my event data from an HBase table.
The events I read are stored based on their timestamp.
When I read in a whole day (24 hours), I find periods of the day where I get 1 million events per hour (e.g. during regular business hours) and other periods where I only get several thousand.
So when I partition a day into equal slices, I get some partitions (and workers) with a lot of work and others with very little.
Is there any concept for partitioning the day so that in the off-peak time I use more hours per partition and for the main hours I use fewer?
This would result in something like:
* from 0-6am use 4 partitions
* from 6am to 6pm use 60 partitions
* from 6pm to 12am use 6 partitions
If you just use the timestamp for the row key, it means you already have problems with region hot-spotting, even before any processing. A simple solution is to add a sharding key before the timestamp.
Row key = (timestamp % number of regions) + timestamp.
This will distribute rows equally across regions.
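A minimal sketch of that salted row key (plain Python; the shard count and key layout are assumptions, not an HBase API):

```python
NUM_REGIONS = 16  # assumed number of regions / shards

def salted_row_key(timestamp_ms: int) -> bytes:
    # Prefix the key with (timestamp % number_of_regions) so that
    # consecutive timestamps land in different regions instead of
    # hot-spotting a single one.
    shard = timestamp_ms % NUM_REGIONS
    return f"{shard:02d}|{timestamp_ms:013d}".encode()

# Readers scan each shard prefix for the wanted time range and merge the
# results; each shard prefix is also a natural unit of parallel work.
print(salted_row_key(1_700_000_000_320))
```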

Excel, MIN for overnight times

My current data is like so:
I use the MIN formula to get the minimum of these times. I am measuring a process time, so the time on day T is actually the earliest, and the time on day T+1 is later. How can I alter my MIN formula so that it treats 11:51 as the minimum time?
I can use MAX for the above problem, but then when the times are 3, 4, 5 AM it will give me 5 AM, while I want the MIN throughout.
One way around this is to add the date to the time. You do not need to display it, but you do need the date as part of the time. JNevill is all over this without coming right out and saying it. You can either do as JNevill suggests and offset all your times by an equal amount so they fall in the same day, or add +1 to any time that crosses the midnight threshold. Adding +1 to a time tells Excel that it is on the following day.
In Excel, a time is stored as a decimal and days are stored as integers. So any time with no date attached will go from 0.xxx to 1.xxx when +1 is added. The cell will still display the time, and the 1 does not show up in the display. However, that leading integer of 0 or 1 is exactly what determines MIN or MAX.
You will probably need to do this in a helper column. Without seeing a column of data it is hard to say whether you only need to add 1, or whether you will need to add 2 or more depending on how many days the data covers.
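To make the mechanism concrete, here is a small sketch of that serial-number arithmetic (plain Python, with made-up sample times rather than the asker's actual data):

```python
# Excel stores times as fractions of a day; whole days are the integer part.
times = {
    "11:51 PM (day T)":   23 / 24 + 51 / 1440,  # ~0.9937
    "12:30 AM (day T+1)":  0 / 24 + 30 / 1440,  # ~0.0208
    "01:15 AM (day T+1)":  1 / 24 + 15 / 1440,  # ~0.0521
}

# Without the date, MIN wrongly picks 12:30 AM.
print(min(times, key=times.get))

# Helper column: add 1 to the times that belong to the following day.
# Now MIN correctly picks 11:51 PM, because 0.9937 < 1.0208 < 1.0521.
adjusted = {label: value + (1 if "T+1" in label else 0)
            for label, value in times.items()}
print(min(adjusted, key=adjusted.get))
```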

Manipulating Cassandra writetime value

Let's say I have an application which periodically receives some measurement data.
I know the exact time the data was measured and I want every piece of data to be deleted 30 days after it was measured.
I'm not inserting the data immediately into the database, but I want to use the time-to-live functionality of Cassandra.
Is there a way to manipulate the internal system timestamp of a row in Cassandra, so that I can set the time-to-live to 60 days but have it measure the lifespan of each row from my timestamp?
E.g. I measure something at 27.08.2014 - 19:00. I insert this data at 27.08.2014 - 20:00 into the database and set the time-to-live value to 1 day. I now want the row to be deleted at 28.08.2014 - 19:00 and not at 28.08.2014 - 20:00 like it normally would be.
Is something like this possible?
I suggest the following approach, based on your example:
* before insertion, calculate Δx = insertTime - measureTime
* set TTL = 1 day - Δx for the inserted row
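A minimal sketch of that adjustment (plain Python; the table and column names in the commented insert are hypothetical):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=1)

measure_time = datetime(2014, 8, 27, 19, 0)  # when the value was measured
insert_time = datetime(2014, 8, 27, 20, 0)   # when it is actually written

delta = insert_time - measure_time                      # Δx = insertTime - measureTime
ttl_seconds = int((RETENTION - delta).total_seconds())  # 1 day - Δx = 23 hours

# e.g. with the DataStax Python driver (hypothetical table/columns):
# session.execute(
#     "INSERT INTO measurements (id, ts, value) VALUES (%s, %s, %s) USING TTL %s",
#     (sensor_id, measure_time, value, ttl_seconds),
# )
print(ttl_seconds)  # 82800 seconds, so the row expires at 28.08.2014 - 19:00
```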
Addition based on a comment:
You can use the Astyanax client with a batch mutation "to simultaneously enter multiple values at once". It is possible to set a TTL on each column or on the whole row at once.

On the SQL tab what is happening during the Offset time?

On the SQL tab in Glimpse, there is a Duration column next to the Records column, which I suppose is the execution time of the command, and then the next column is a time period labeled Offset. What is that actually measuring? Then there is the Duration at the far right, which I was guessing is the total time, but the two detailed columns don't seem to add up to that total.
Thanks!
The first duration column is the duration, in milliseconds, for the command. (Your query).
The offset column is the length of time, in milliseconds, since the beginning of the request.
The second duration column is the duration, in milliseconds, of the open connection time to the database. Often one command will run on one connection, but sometimes you'll see multiple commands happening within the same connection.
