Does Cassandra preserve writetime after a restore?

I would like to know if a query like
select id,field,writetime(field) from mykeyspace.table
will return exactly the same values after a backup/restore operation. I'm not sure if the restore operation will change the internal timestamp handled by Cassandra's "writetime" function.

Yes, the "writetime" is preserved across a Cassandra backup/restore. This is easy to test if the original data was written with a TTL: after a restore, the TTL still counts down from the original write time, not from the restore time.
For example, if you write a record with a short TTL of 5 minutes and then do a backup/restore, the record expires 5 minutes after its original writetime.
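The TTL test described above comes down to simple arithmetic, sketched here in Python. This is a simplified model, not Cassandra's internal code: it assumes writetime(field) returns microseconds since the Unix epoch (the default convention) and that the write timestamp is close to the server clock at write time. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)


def expiry_time(writetime_micros: int, ttl_seconds: int) -> datetime:
    """Return the UTC instant at which a cell written with a TTL expires.

    writetime_micros: value of writetime(field), assumed to be
                      microseconds since the Unix epoch.
    ttl_seconds:      TTL applied when the cell was written.
    """
    written_at = EPOCH + timedelta(microseconds=writetime_micros)
    return written_at + timedelta(seconds=ttl_seconds)


def is_expired(writetime_micros: int, ttl_seconds: int, now: datetime) -> bool:
    """True if the cell is already expired at `now`, regardless of when it was restored."""
    return now >= expiry_time(writetime_micros, ttl_seconds)
```

Because the expiry depends only on the original write time and the TTL, a restore performed ten minutes after a write with a 5-minute TTL brings back data that is already expired.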

Related

Moving a record after TTL expiry

I have two tables, a normal table and its archived version. The rows in the normal table need to be moved to the archived version after the TTL expires on the row. How can I accomplish this?
Is there a native trigger feature in Cassandra that I can use to move the record over to the audit table?
I know how to do this using code, but I thought that a batch process or even an event driven process to move it is unnecessarily complex.
Short answer: no, there is no way to achieve this without writing code for it.
Once the TTL expires, the row is treated as a tombstone, and once the gc grace period has elapsed it is removed from disk during compaction. You have no control over these internal operations and no event is raised for them, so there is no way, triggers included, to tell Cassandra to insert the expired row into another table.

How do we track the impact expired entries have on a time series table?

We process the oldest data first as it comes into the time-series table, and I take care that the oldest entries expire as soon as they are processed. The expectation is that all the deletes end up at the bottom of the TimeUUID clustering column, so a query always reads a time slot without any deleted entries.
Will this scheme work? Are there any impacts of the expired columns that I should be aware of?
Keeping the timeuuid as part of the clustering key guarantees the sort order, so the most recent data can be read first.
If you are on Cassandra 3.1 (DSE 5.x) or above:
Regarding the deletes: avoid manual deletes and use TWCS. Here is how.
Let's say your job processes the data every X minutes, with X = 5 min (in any case less than 24 hours). Set the compaction strategy to TWCS (Time Window Compaction Strategy) and assume a TTL of 24 hours.
WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': '1'
};
This creates 24 buckets per day, each holding one hour of data. These 24 buckets correspond to 24 sstables (after compaction) in your Cassandra data directory. During the 25th hour, the entire first bucket/sstable is dropped automatically because all of its data has expired. So instead of coding the deletes, let Cassandra take care of the cleanup. The beauty of TWCS is that it can expire all the data in an sstable at once.
Reads from your application always go to the most recent bucket, the 24th sstable in this case, so they never have to scan through tombstones (caused by TTL).
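The bucket arithmetic in this answer can be checked with a small Python sketch. It only models how a TTL and a window size translate into a number of time windows. It does not reproduce Cassandra's compaction logic, and the function names are hypothetical.

```python
def window_index(timestamp_s: int, window_size_s: int = 3600) -> int:
    """Index of the time window that a write at `timestamp_s` (epoch seconds) falls into."""
    return timestamp_s // window_size_s


def windows_needed(ttl_s: int, window_size_s: int = 3600) -> int:
    """Number of live windows needed to cover one TTL period (rounded up)."""
    return -(-ttl_s // window_size_s)  # ceiling division on integers


# With a 24-hour TTL and 1-hour windows there are 24 live windows;
# data in the oldest window has fully expired once the 25th hour begins.
live_windows = windows_needed(24 * 3600, 3600)
```

Choosing the window size so that the count of live windows stays modest (here 24) keeps the number of sstables manageable.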
If you are on Cassandra 2.x or DSE 4.x, where TWCS isn't available yet:
A workaround until you upgrade to Cassandra 3.1 or above is to use artificial buckets. Introduce a time bucket as part of the partition key and set its value to the date and hour. Each bucket is then a separate partition, and you can adjust the bucket size to match the job's processing interval.
When you delete, only the processed partition is deleted, and it does not get in the way of reading the unprocessed ones, so tombstone scans are avoided.
It is extra effort on the application side to write to the correct partition based on the current date/time bucket, but it is worth it in production to avoid tombstone scans.
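As a rough illustration of the artificial-bucket idea, the application could derive the bucket value from the event time as below. This is a minimal sketch: the `time_bucket` name and the "YYYY-MM-DD-HH" format are assumptions, and any format that uniquely identifies the date and hour would work.

```python
from datetime import datetime, timezone


def time_bucket(ts: datetime) -> str:
    """Return a date-and-hour bucket (UTC) to use as part of the partition key.

    `ts` must be timezone-aware so that every writer and reader
    computes the same bucket for the same instant.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")


# Example: a write at 13:45 UTC lands in the 13:00 bucket.
bucket = time_bucket(datetime(2024, 1, 15, 13, 45, tzinfo=timezone.utc))
```

Writers and the processing job both call the same function, so once the job has finished an hour it can delete that whole partition by its bucket value.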
You can use TWCS to manage expired data easily, and filter on a timestamp column at query time to make sure your query always gets the latest results.
How do you "take care" of expiring the oldest entries? Cassandra will not return records whose TTL has expired, but they persist in sstables until the next compaction of that sstable. If you delete the rows yourself, you can't be sure that your query will always read the latest records: Cassandra is eventually consistent, so in theory there can be moments when you read stale data (or many such moments, depending on your consistency settings).

Performance - TTL vs Deleting a row in Cassandra

We have a massive set of data that is written into millions of rows in Cassandra. We also have a scheduler that needs to process these records and remove them after processing them successfully.
I was wondering which is better: deleting the row after processing, or marking the row with a TTL (essentially delaying its deletion).
Are there any pros/cons of deletion vs. TTL with respect to Cassandra performance?
Thanks much
_DD
When you use TTL, the record is not removed from storage immediately; once it expires it is treated as a tombstone. It is physically removed only when compaction occurs. Until then the data still consumes resources on the node. When you do a range query, even the deleted (tombstoned) records are scanned by Cassandra. For this reason, using TTL to delete very large numbers of entries is considered an anti-pattern. The recommendation is to use temporary tables so that individual rows need not be removed: just drop the entire table.
From what little information you have given here, it sounds to me like you are using Cassandra as a queue, which is a well-known anti-pattern. You can read more about that here:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
However, to answer your basic question, there is little difference in performance between using TTL and deletes. TTLs in C* are handled as tombstones, the same as a delete. The major difference is that a tombstone is not written for a record whose TTL has expired until that record is read again, whereas a delete creates a tombstone immediately. Tombstones in general cause significant performance problems within C*, and while there are some methods to mitigate the issues they create, having large numbers of them usually points to a poor data model or a poor use case for C*. If you are really looking at using C* as a queue, why not look at something more fit for that purpose, such as Redis?
Based on what I've read, TTL will probably be as fast as your fastest delete process could be. The reason for this is that TTL doesn't have to seek the data in order to mark it with a tombstone. The TTL lives on the record and when the record is read and the TTL has expired, then it is marked with a tombstone.
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html

MemSQL for Last n Days Data

I plan to use MemSQL to store my last 7 days of data for real-time analytics using SQL.
I checked the documentation and found that there is no TTL / expiration feature in MemSQL.
Is there any such feature (in case I missed it)?
Does MemSQL fit the use case if I do a daily delete of data older than 7 days? I am quite curious about the fragmentation.
We tried it on PostgreSQL, where we need to execute the VACUUM command, and it takes a long time to run.
There is no TTL/expiration feature. You can do it by running delete queries. Many customers do this kind of thing, so yes, MemSQL does fit the use case. Fragmentation generally shouldn't be too much of a problem here. What kind of fragmentation are you concerned about?
There is no out-of-the-box TTL feature in MemSQL.
We achieved TTL by adding an additional TS column with the TIMESTAMP(6) datatype to our MemSQL rowstore table.
This column is filled in automatically with the current timestamp when you add a new row to the table.
When querying data from this table, you can apply a simple filter on this TIMESTAMP column to exclude records older than your TTL value.
https://docs.memsql.com/sql-reference/v6.7/datatypes/#time-and-date
You can always have a batch job that runs once a month and deletes the older data.
We have not seen any issues due to fragmentation, but if it is a concern for you, you can do the following once in a while:
MemSQL’s memory allocators can become fragmented over time (especially if a large table is shrunk dramatically by deleting data randomly). There is no command currently available that will compact them, but running ALTER TABLE ADD INDEX followed by ALTER TABLE DROP INDEX will do it.
Warning
Use caution with this workaround. Plans will be rebuilt and the two ALTER queries will move all rows in the table twice, so it should not be used often.
Reference:
https://docs.memsql.com/troubleshooting/latest/troubleshooting/

remove specified ttl in cassandra

I read about updating the TTL and that it is only possible by updating the row.
But I want to remove the TTL. I fear it is the same process, but I did not find any information about it. Is there a way to remove the TTL without updating all rows?
What I do is save the user information with a TTL when the user registers. If the user does not validate his/her mail address, the entry is deleted automatically.
Here's an excerpt from the official docs:
If you want to change the TTL of expiring data, you have to re-insert
the data with a new TTL. In Cassandra, the insertion of data is
actually an insertion or update operation, depending on whether or not
a previous version of the data exists.
TTL data has a precision of one second, as calculated on the server.
Therefore, a very small TTL probably does not make much sense.
Moreover, the clocks on the servers should be synchronized; otherwise
reduced precision could be observed because the expiration time is
computed on the primary host that receives the initial insertion but
is then interpreted by other hosts on the cluster.
This is slightly unpleasant in practice, but it's relatively easy to build a very simple migration tool: iterate through the entire table and re-insert all the records with a new TTL into another table.
If you can afford it in terms of computation and storage, it's probably a more compelling idea to store the records twice, once with a TTL and once without, simply to get around the limitation: you cannot cancel or change the TTL of existing data in Cassandra without rewriting it.
