Cassandra: lastupdated (auto_now), lastaccessed and created (auto_now_add)

Is there a way to automatically update a row's creation and last-updated/last-accessed timestamps?
We can use the toTimestamp(now()) function to store the creation time. But is there a function like writetime(name) for getting the last-modified time? Is there a similar function for reading the creation and last-accessed times?
Is there a way to have all three timestamps (last updated, last accessed, and created) auto-generated and stored?

Yes, there is a writetime function, but it only operates on non-primary key columns.
aploetz@cqlsh:stackoverflow> SELECT name, writetime(description), description
FROM bookbyname WHERE name='Patriot Games';
name | writetime(description) | description
---------------+------------------------+------------------------------------------------------------------------------------------------
Patriot Games | 1442340092257821 | Jack Ryan saves England's next king, and becomes the target of an IRA splinter terrorism cell.
Cassandra does not keep track of last accessed/read, or anything like that.
In Cassandra the last write wins, so for any single column the last-updated and created times are going to be the same. But if you had one column that you know had changed and one that you know had not changed, you could get the write times of both, and then you'd have your updated and created times.
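To make the comparison of two write times concrete: writetime() returns microseconds since the Unix epoch, so the values can be converted and compared on the client. The following is a minimal Python sketch, not part of the original answer. The helper names and the `title` column with its value are hypothetical, and treating the oldest write time as the creation time only holds if that column was never rewritten.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def writetime_to_datetime(micros):
    """Convert a writetime() value (microseconds since the epoch) to UTC."""
    # timedelta keeps full microsecond precision, unlike float division.
    return EPOCH + timedelta(microseconds=micros)


def created_and_updated(writetimes):
    """Given {column: writetime}, return (oldest, newest) as datetimes.

    The oldest value only stands in for "created" if that column has
    never been overwritten since the row was first inserted.
    """
    values = list(writetimes.values())
    return writetime_to_datetime(min(values)), writetime_to_datetime(max(values))


# "description" uses the writetime from the query output above;
# "title" and its value are made up for illustration.
row = {"description": 1442340092257821, "title": 1442340000000000}
created, updated = created_and_updated(row)
print(created.isoformat(), updated.isoformat())
```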

Related

Remove Duplicates based on latest date in power query

I have a dataset that I am loading into my sheet via Power Query, and I want to transform the data a little before loading it.
For context, I have some IDs, and I would like the rows with older dates removed so that only the row with the newest date for each ID is loaded.
The solution is described at https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/
"Remove Duplicates and Keep the Last Record with Power Query"
In short: sort by date in a buffered table, then remove duplicates on the ID.
Another way, I think, would be to group by ID and take the MAX date, but whether that is practical depends on the data size.
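The sort-then-deduplicate logic can be sketched outside Power Query. The following is a minimal Python model of the idea, not Power Query M code. The function name, the `id`/`date`/`value` field names, and the sample rows are all hypothetical.

```python
from datetime import date


def keep_latest(rows, id_key="id", date_key="date"):
    """Keep, for each id, only the row with the newest date."""
    # Step 1: sort newest first. sorted() returns a new list, which plays
    # the role of the buffered table that pins the sort order.
    newest_first = sorted(rows, key=lambda r: r[date_key], reverse=True)
    # Step 2: remove duplicates on id, keeping the first row seen per id.
    seen = set()
    result = []
    for row in newest_first:
        if row[id_key] not in seen:
            seen.add(row[id_key])
            result.append(row)
    return result


rows = [
    {"id": 1, "date": date(2021, 1, 5), "value": "old"},
    {"id": 1, "date": date(2021, 3, 9), "value": "new"},
    {"id": 2, "date": date(2021, 2, 1), "value": "only"},
]
latest = keep_latest(rows)
print(latest)
```

The group-by alternative would replace both steps with a single pass that tracks the maximum date per ID.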

How to retrieve a very big Cassandra table and delete some unused data from it?

I have created a Cassandra table with 20 million records. Now I want to delete the expired data, where expiry is determined by a non-primary-key column. But Cassandra doesn't support a delete based on that column. So I tried to read the whole table and go through the data row by row to delete the expired rows. Unfortunately, the table is too big to retrieve. I also can't simply delete the whole table. How can I achieve my goal?
Your question is really how to get the data from the table in batches (also called pagination).
You can do that by selecting different slices of your primary key. For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.
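One common way to slice by primary key is to split the partitioner's token range into contiguous sub-ranges and query each one in turn. The following is a minimal Python sketch that only builds the slice boundaries and query strings. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), and the table name `my_keyspace.my_table` and the columns `id` and `expires_at` are hypothetical.

```python
# Token bounds of Murmur3Partitioner (assumed to be the partitioner in use).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def token_slices(n):
    """Split the full token range into n contiguous (low, high] slices."""
    if n < 1:
        raise ValueError("n must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def slice_query(low, high, table="my_keyspace.my_table"):
    """Build a CQL query for one slice (table and columns are hypothetical)."""
    return (
        f"SELECT id, expires_at FROM {table} "
        f"WHERE token(id) > {low} AND token(id) <= {high}"
    )


slices = token_slices(4)
for low, high in slices:
    print(slice_query(low, high))
```

Each query's results can then be checked against the expiry column and the expired rows deleted by primary key before moving on to the next slice.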

how to get last n results by updated time in cassandra?

I want to fetch the last n (say, the last 5) updated rows, i.e. order by updated_time desc, in Cassandra. Is there a good way of doing this?
The exact use case: I want to update the count of an event whenever it occurs in the event table, and fetch the last five events by updated time along with their counts.
Table structure:
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve the time a cell was last written with writetime(cell_name). But since you have multiple columns, and Cassandra is only fast for reads that match how the data is laid out, consider maintaining another table (a view) that provides exactly the data needed, already ordered. On that new table you limit the read results and periodically trim it down.
It may be possible to do it with writetime(), but that is not the Cassandra way, as it is too slow in production. Another table holding just your data is the denormalized, Cassandra way of solving it.
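The bookkeeping such a separate table would carry can be modelled in a few lines. The following is an in-memory Python sketch of the idea only, not Cassandra code. The class and method names are hypothetical, and a real table would be trimmed by a periodic job rather than a method call.

```python
from datetime import datetime, timezone


class RecentEvents:
    """In-memory stand-in for a denormalized 'recent events' table."""

    def __init__(self):
        self._rows = {}  # event_name -> (updated_time, count)

    def record(self, event_name, when):
        """Bump the event's count and remember when it last occurred."""
        _, count = self._rows.get(event_name, (None, 0))
        self._rows[event_name] = (when, count + 1)

    def last_n(self, n):
        """Return up to n (name, updated_time, count) tuples, newest first."""
        ordered = sorted(self._rows.items(), key=lambda kv: kv[1][0], reverse=True)
        return [(name, when, count) for name, (when, count) in ordered[:n]]

    def trim(self, keep):
        """Drop everything except the `keep` most recently updated events."""
        for name, _, _ in self.last_n(len(self._rows))[keep:]:
            del self._rows[name]


t1 = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
t3 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

events = RecentEvents()
events.record("login", t1)
events.record("logout", t2)
events.record("login", t3)
print(events.last_n(5))
```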

Cassandra ttl on a row

I know that there are TTLs on columns in Cassandra. But is it also possible to set a TTL on a row? Setting a TTL on each column doesn't solve my problem, as can be seen in the following use case:
At some point a process wants to delete a complete row with a TTL (let's say row "A" with TTL 1 week). It could do this by replacing all existing columns with the same content but with a TTL of 1 week.
But there may be another process running concurrently on that row "A" which inserts new columns or replaces existing ones without a TTL because that process can't know that the row is to be deleted (it runs concurrently!). So after 1 week all columns of row "A" will be deleted because of the TTL except for these newly inserted ones. And I also want them to be deleted.
So is there or will there be Cassandra support for this use case or do I have to implement something on my own?
Kind Regards
Stefan
There is no way of setting a TTL on a row in Cassandra currently. TTLs are designed for deleting individual columns when their lifetime is known when they are written.
You could achieve what you want by delaying your process - instead of wanting to insert a TTL of 1 week, run it a week later and delete the row. Row deletes have the following semantics: any column inserted just before will get deleted but columns inserted just after won't be.
If columns that are inserted in the future still need to be deleted, you could insert a row delete with a timestamp in the future to ensure this. But be very careful: if you later wanted to insert into that row, you couldn't; columns would just disappear when written to that row (until the tombstone is garbage collected).
You can set ttl for a row in Cassandra 3 using
INSERT INTO Counter(key,eventTime,value) VALUES ('1001',dateof(now()),100) USING ttl 10;
Although I do not recommend it, there is a Cassandra way to fix the problem:
SELECT TTL(value) FROM table WHERE ...;
Get the current TTL of a value first, then use the result to set the TTL in an INSERT or UPDATE:
INSERT ... USING TTL ttl-of-value;
So... I think that SELECT TTL() is slow (from experience with TTL() and WRITETIME() in some of my CQL commands). Not only that: the TTL is correct at the time the select results are generated on the Cassandra node, but by the time the insert happens, it will be off. Cassandra should have offered a time to delete rather than a time to live...
So, as mentioned by Richard, having your own process to delete data after 1 week is probably safer. You should have one column that stores the date of creation or the date when the data becomes obsolete. Then a background process can read that date and, if the data is considered obsolete, drop the entire row.
Other processes can also use that date to know whether the row is considered valid or not! (So even if it has not been deleted yet, you can still treat the row as invalid once the date has passed.)
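The "time to delete" idea can be sketched as client-side logic around a stored expiry date. The following is a minimal Python sketch, with hypothetical function names, that checks validity against that date and derives the TTL a concurrent writer would pass to USING TTL. It returns None once the date has passed, because a TTL of 0 in CQL means "no expiry", not "expire now".

```python
from datetime import datetime, timedelta, timezone


def is_valid(expires_at, now):
    """A row counts as valid only until its stored expiry date."""
    return now < expires_at


def remaining_ttl(expires_at, now):
    """Whole seconds left until expires_at, or None if none are left.

    None signals that the caller should delete the row (or skip the
    write) instead of writing with USING TTL 0, which disables expiry.
    """
    seconds = int((expires_at - now).total_seconds())
    return seconds if seconds > 0 else None


marked_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires_at = marked_at + timedelta(weeks=1)  # the one-week lifetime from the question

two_days_later = marked_at + timedelta(days=2)
print(is_valid(expires_at, two_days_later), remaining_ttl(expires_at, two_days_later))
```

Because every writer derives its TTL from the same stored date, late writes expire together with the rest of the row instead of outliving it.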

how to implement fixed number of (timeuuid) columns in cassandra (with CQL)?

Here is an example use case:
You need to store last N (let's say 1000 as fixed bucket size) user actions with all details in timeuuid based columns.
Normally, each user's actions are already in the "UserAction" column family, with the user id as row key and the actions in timeuuid columns. You may also have an "AllActions" column family which stores all actions, with the same timeuuid as column name and the user id as column value. It's basically a relationship column family, but unfortunately without any details of the user actions. Querying with this column family is expensive, I guess, because of the random partitioner. On the other hand, if you store all details in the "AllActions" CF, then at some point Cassandra can't handle that big row properly. This is why I want to store the last N user actions with all details in a fixed number of timeuuid-based columns.
Maybe you have a better design solution for this use case... I'd like to hear it...
If not, the question is how to implement fixed number of (timeuuid) columns in cassandra (with CQL) effectively?
After insertion we could delete old (overflow) columns if we had some sort of range support in cql's DELETE. AFAIK there is no support for this.
So, any idea? Thanks in advance...
IMHO, this is something that C* must handle itself like compaction. It's not a good idea to handle this on client side.
Maybe, we need some configuration (storage) options for column families to make them suitable for "most recent data".
