Cassandra: add TTL to existing entries

How can I update an entire table and set a TTL for every entry?
Current Scenario (Cassandra 2.0.11):
table:
CREATE TABLE external_users (
    external_id text,
    type int,
    user_id text,
    PRIMARY KEY (external_id, type)
)
Currently there are ~40 million entries in this table and I want to add a TTL of, let's say, 86,400 seconds (1 day).
It's no problem for new entries with USING TTL 86400, or for UPDATEs of current entries, but how do I apply a TTL to every already existing entry?
My idea was to select all the data and update every single row with a little script. I was just wondering if there is an easier way to achieve this (because even with batch updates this is going to take a while and be a big effort).
Thanks in advance

There is no way to alter the TTL of existing data in C*. The TTL is just an internal column attribute that is written together with all other column data into an immutable SSTable. A quote from the docs:
If you want to change the TTL of expiring data, you have to re-insert the data with a new TTL. In Cassandra, the insertion of data is actually an insertion or update operation, depending on whether or not a previous version of the data exists.
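
In practice this means re-writing every row. A minimal sketch of the per-row statement for the table above (the key and value literals are placeholders); a script would page through SELECT external_id, type, user_id FROM external_users and issue one of these per row:

-- Re-insert (upsert) one existing row with a one-day TTL
UPDATE external_users USING TTL 86400
SET user_id = 'some-user-id'
WHERE external_id = 'some-external-id' AND type = 1;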

Related

Is it possible to set default time to live for existing Cassandra table and also apply this TTL to all the existing records in the table with CQL

My requirement is to clear out records after 5 days. I have already created the table, but didn't set this time-to-live configuration at table level. Now I want to set it up for the table and also apply it to the existing records in the table.
You can set a default TTL for the table, but there is no way to go back and change a record without a TTL to have one (the default TTL only applies a TTL value to a record that does not specify its own at insert time). In your case, after the default TTL is set at the table level, you will have to find and delete any rows without a TTL manually, with code/CQL/etc., once they're considered "stale". This will create tombstones, and if there is an "overwhelming" number of rows with tombstones, you might see performance issues and failures. Compaction will clean them up eventually, or you can clean them up yourself with a manual compaction (and possibly re-split again if the single generated SSTable is large).
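Setting the table-level default is a one-line schema change; a minimal sketch, assuming your table is called events (5 days = 432,000 seconds):

-- Writes that don't specify their own TTL will now expire after 5 days;
-- rows written before this change have no TTL and will never expire on their own.
ALTER TABLE events WITH default_time_to_live = 432000;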
If this table is an INSERT-only type of table that will always have TTLs, you may want to consider TWCS (TimeWindowCompactionStrategy). It will reduce the compaction workload significantly, and it also offers you other options to clean up data that does not have TTLs.
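For reference, a sketch of switching a table to TWCS (available from Cassandra 3.0.8/3.8 onward; the table name and the one-day window are assumptions chosen to match a daily TTL):

-- Group SSTables into one-day windows; a fully expired window
-- can then be dropped as a whole file instead of being compacted.
ALTER TABLE events WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};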
Hopefully this helps.
-Jim

Update the column I searched for

Is there any way to update a column value in Cassandra that I searched for (it is part of my primary key)?
I have a (huge) list of items with a field called "LastUpdateDateTime", and from time to time I search for columns that haven't been updated for a while.
The reason I search for these columns is that I want to update them, and after I update them I want to set the timestamp to the current date.
How can I do this with Cassandra?
You can't update a primary key column; writing a new value will insert another record.
That's how Cassandra works.
You may have to use the spark-cassandra-connector, or delete the records with the old values and insert new ones.
Note: deleting and inserting is not recommended if you have many records, as it will create a corresponding number of tombstones.
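
A sketch of the delete-and-reinsert pattern, assuming a hypothetical table items with PRIMARY KEY (item_id, last_update); since last_update is part of the key, it cannot be changed in place:

BEGIN BATCH
    -- remove the row keyed by the stale timestamp...
    DELETE FROM items WHERE item_id = 'abc' AND last_update = '2015-01-01 00:00:00+0000';
    -- ...and re-insert the same data under the current timestamp
    INSERT INTO items (item_id, last_update, payload)
    VALUES ('abc', '2015-06-01 12:00:00+0000', 'updated value');
APPLY BATCH;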

Cassandra CQL SELECT/DELETE issue due to primary key constraints

I need to store the latest updates that need to be pushed to users' newsfeed pages in a Cassandra table for later retrieval, and my table's schema is as follows:
CREATE TABLE newsfeed (
    user_name text,
    post_id bigint,
    post_type text,
    favorited boolean,
    shared boolean,
    own boolean,
    date timestamp,
    PRIMARY KEY (user_name, date, post_id, post_type)
);
The three columns user_name, post_id, and post_type in combination build the actual primary key of the table; however, since I wanted to ORDER the SELECT queries on this table by the "date" of the rows, I placed the date column into the primary key fields as the second entry (did I have to do this?).
When I want to delete a row by giving only user_name, post_id, and post_type, as follows:
DELETE FROM newsfeed WHERE user_name='pooria' and post_id=36 and post_type='p';
I will get the following error:
Bad Request: Missing PRIMARY KEY part date since post_id is set
I need the date column to be part of the primary key since I want to use it in my ORDER BY clauses, but on the other hand I have to delete some rows without knowing their "date" values!
So how are such problems tackled in Cassandra? Should I fix my data model and use a different schema for this job?
DataStax's Chief Evangelist Patrick McFadden posted an article demonstrating a few time series modeling patterns. Definitely makes for a good read, and should be of some help to you: Getting Started with Time Series Data Modeling.
I think your table is just fine. Although, with the way that composite primary keys work in Cassandra, you cannot skip primary key components in a query. So if you do end up needing to query data by user_name, post_id, and/or post_type differently (without date), you should create a table specifically for that query (one which does not include date in the primary key).
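A sketch of such a query table (the name newsfeed_by_post is an assumption); the DELETE from the question then works because every primary key component is supplied:

CREATE TABLE newsfeed_by_post (
    user_name text,
    post_id bigint,
    post_type text,
    date timestamp,
    PRIMARY KEY (user_name, post_id, post_type)
);

-- All primary key parts are known here, so this succeeds:
DELETE FROM newsfeed_by_post WHERE user_name = 'pooria' AND post_id = 36 AND post_type = 'p';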
I will however say that, in general, creating a table which will process regular delete operations is not a good idea. In fact, I'm pretty sure that has been classified as a Cassandra "anti-pattern." Data isn't really deleted from Cassandra; it is tombstoned. Tombstones are reconciled at compaction time (assuming that the tombstone threshold time has been met), and having too many of them has been known to cause performance issues.
If you read the article I linked above, go down to the section named "Time Series Pattern 3." You will notice that the INSERT statements are run with the USING TTL clause. This gives the data a time-to-live in seconds, after which it will "quietly disappear." For instance, if you wanted to keep your data around for 24 hours (86400 seconds) you could do something like this:
INSERT INTO newsfeed (...) VALUES (...) USING TTL 86400
Using the TTL feature is a preferable alternative to regular cleansing by DELETE.

how to get last n results by updated time in cassandra?

I want to fetch the last n, say the last 5, updated rows, i.e. ORDER BY updated_time DESC, in Cassandra. Is there a good way of doing it?
The exact use case: whenever an event occurs I want to update its count in the event table, and then fetch the last five events by updated time along with their counts.
table structure:
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve the write time of a cell with writetime(cell_name). But as you have multiple columns, and Cassandra is built for fast reads only, you may consider creating another table (view) providing exactly the data needed in an ordered manner. On that new table you would limit read results and periodically trim it down.
It may be possible to do it with writetime() -- but that is not the Cassandra way, as it would be too slow in production. Another table with just your data is the denormalized Cassandra way of solving it.
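
A sketch of that denormalized table; the names are placeholders, partitions are bucketed per day to keep them bounded, and the counter stays in the original table (in Cassandra, counter columns cannot share a table with regular data columns):

-- Rows cluster newest-first, so the latest updates are a cheap read.
CREATE TABLE events_by_update_time (
    day text,
    updated_time timestamp,
    event_name text,
    PRIMARY KEY (day, updated_time, event_name)
) WITH CLUSTERING ORDER BY (updated_time DESC, event_name ASC);

-- On each event: bump the counter and record the update time.
UPDATE events SET count = count + 1 WHERE event_name = 'login';
INSERT INTO events_by_update_time (day, updated_time, event_name)
VALUES ('2015-06-01', '2015-06-01 12:00:00+0000', 'login');

-- Last five updated events (counts are then read from the counter table):
SELECT event_name, updated_time FROM events_by_update_time
WHERE day = '2015-06-01' LIMIT 5;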

Select TTL for an element in a map in Cassandra

Is there any way to select TTL value for an element in a map in Cassandra with CQL3?
I've tried this, but it doesn't work:
SELECT TTL (mapname['element']) FROM columnfamily
Sadly, I'm pretty sure the answer is that it is not possible as of Cassandra 1.2 and CQL3. You can't query individual elements of a collection. As this blog entry says, "You can only retrieve a collection in its entirety". I'd really love to have the capability to query for collection elements, too, though.
You can still set the TTL for individual elements in a collection. I suppose if you wanted to be assured that your collection elements have a certain TTL, you could read the entire collection and then update it (the entire thing or just a chosen few elements) with your desired TTL. Or, if you absolutely needed to know the TTL of individual data, you might just need to change your schema from collections back to good old dynamic columns, for which the TTL query definitely works.
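For reference, setting a TTL when writing individual map elements does work; a sketch reusing the names from the question (the key column name and values are assumptions):

-- Only the element being written gets the one-day TTL, not the whole map.
UPDATE columnfamily USING TTL 86400 SET mapname['element'] = 'value' WHERE key = 'k';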
Or, a third possibility could be that you add another column to your schema that holds the TTL of your collection. For example:
CREATE TABLE test (
    key text PRIMARY KEY,
    data map<text, text>,
    data_ttl text
) WITH ...
You could then keep track of the TTL of the entire map column 'data' by always writing 'data_ttl' (with the same USING TTL) whenever you update 'data'. Then you can query 'data_ttl' just like any other column:
SELECT ttl(data_ttl) FROM test;
I realize none of these solutions are perfect... I'm still trying to figure out what will work best for me, too.
