What is the most efficient way to update record(s) value when using SummingCombiner? - accumulo

I have a table with a SummingCombiner on minC and majC. Every day I need to update the value for a small number of records. What is the most efficient way to do so?
My current implementation is to create a new record whose value is the amount to increase or decrease by (a new mutation with the same Row, CF, and CQ as the existing record(s)).

Yes, the most efficient way to update the value is to insert a new record and let the SummingCombiner add the new value into the existing value. You probably also want to have the SummingCombiner configured on the scan scope, so that scans will see the updated value right away, before a major compaction has occurred.
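For illustration, a rough sketch with the Accumulo Java client; the table name, column family/qualifier, and the STRING encoding are made up for the example, and the combiner is shown being attached fresh to all three scopes (minc, majc, and scan):

import java.util.Collections;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class DailyDeltaUpdate {
    // conn would come from your Instance (e.g. ZooKeeperInstance.getConnector) in a real client.
    static void configureAndApplyDelta(Connector conn) throws Exception {
        // Attach the SummingCombiner on all scopes (scan included) so readers
        // see the summed value before any compaction has happened.
        IteratorSetting is = new IteratorSetting(10, "sum", SummingCombiner.class);
        LongCombiner.setEncodingType(is, LongCombiner.Type.STRING);
        Combiner.setColumns(is, Collections.singletonList(new IteratorSetting.Column("counts")));
        conn.tableOperations().attachIterator("mytable", is);

        // Daily update: write only the delta; the combiner folds it into the stored value.
        BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());
        Mutation m = new Mutation("row1");                     // same row as the existing record
        m.put("counts", "clicks", new Value("5".getBytes()));  // +5 delta, same CF/CQ
        writer.addMutation(m);
        writer.close();
    }
}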

Related

How to get last inserted data from DynamoDb in lambda function

I have a table in DynamoDB with
timestamp as the partition key,
Status as a normal column.
I insert the timestamp and status into DynamoDB whenever new data comes in.
I want to retrieve the last data added to the table (by timestamp).
How can I do this? (I am using a Lambda function written in Node.js.)
If you have any questions, comment below. Thanks.
You can make a query on your table with these parameters:
Limit: 1,
ScanIndexForward: false
But the query is awkward here because the timestamp is your partition key (a Query needs an equality condition on the partition key).
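For what it's worth, a sketch of that Query with the AWS SDK for Java document API (the original poster is on Node.js, where the same parameters exist). It assumes a hypothetical table or GSI where a constant attribute is the partition key and the timestamp is the sort key, since a Query requires an equality condition on the partition key:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;

public class LatestByQuery {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Table table = dynamo.getTable("Events");       // hypothetical table/GSI name

        for (Item item : table.query(new QuerySpec()
                .withHashKey("pk", "ALL")               // constant partition key (assumption)
                .withScanIndexForward(false)            // newest timestamp first
                .withMaxResultSize(1))) {               // the "Limit: 1" part
            System.out.println(item.getLong("timestamp") + " " + item.getString("Status"));
        }
    }
}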
The other way is to enable a stream on your table so that every new entry triggers a Lambda function.
The Stream->Lambda approach suggested in the previous answer is OK; however, it will not work if you need to retrieve the latest entry more than once. Alternatively, because you always have exactly one newest item, you can simply create another table with exactly one item under a constant hash key and overwrite it every time you do an update. That second table will always contain one record, which will be your newest item. You can update that second table via a stream from your main table and a Lambda, or directly from the application that performs the original write.
[Update] As noted in the comments, using only one record limits throughput to a single partition's limits (1000 WCU). If you need more, use a sharding approach: on write, store the newest data randomly in one of N records; on read, scan all N records and pick the newest one.
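A sketch of that second-table idea in Java (names are made up; the same calls exist in the Node.js document client):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class LatestEntry {
    static final DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());

    // Called on every write to the main table, or from the stream-triggered Lambda:
    // overwrite the single item kept under a constant hash key.
    static void recordLatest(long timestamp, String status) {
        Table latest = dynamo.getTable("LatestEntry");   // hypothetical second table
        latest.putItem(new Item()
                .withPrimaryKey("id", "latest")
                .withLong("timestamp", timestamp)
                .withString("Status", status));
    }

    // Reading the newest entry is then a single GetItem, repeatable as often as needed.
    static Item readLatest() {
        return dynamo.getTable("LatestEntry").getItem("id", "latest");
    }
}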

MongoDB API pagination

Imagine a situation where a client has a feed of objects with a limit of 10.
When the next 10 are required, it sends a request with skip 10 and limit 10.
But what if some new objects were added to (or deleted from) the collection since the first request with offset == 0?
Then on the 2nd request (with offset == 10) the response may have the wrong object order.
Sorting by creation time does not work here, because I have some feeds that are ordered by some numeric field.
You can add a time field like created_at or updated_at. It must be updated whenever the document is created or modified, and the field must be unique.
Then query the DB for a range of time using $gte and $lte, along with a sort on this time field.
This ensures that any changes made outside the time window will not be reflected in the pagination, provided the time field has no duplicates. Most probably, if you include microsecond precision, duplicates won't happen.
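As a sketch of that idea with the MongoDB Java driver (field and collection names are assumptions): freeze the window at the moment the first page was requested, then page inside it.

import java.util.Date;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class TimeWindowPaging {
    // Everything created after firstRequestAt is excluded, so later inserts
    // cannot shift the pages a client is already walking through.
    static FindIterable<Document> page(MongoCollection<Document> feed, Date firstRequestAt, int pageNo) {
        return feed.find(Filters.lte("created_at", firstRequestAt))
                   .sort(Sorts.descending("created_at"))
                   .skip(pageNo * 10)
                   .limit(10);
    }
}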
It really depends on what you want the result to be.
If you want the original objects in their original order regardless of Delete and Add operations then you need to make a copy of the list (or at least of the order) and then page through that. Copy every Id to a new collection that doesn't change once the page has loaded and then paginate through that.
Alternatively, and perhaps more likely, what you want is to see the next 10 after the last one in the current set, including any Delete or Add operations that have taken place since. For this, you can use the sorted order in which you are viewing them and a filter, $gt whatever the last item was. BUT that doesn't work when there are duplicates in the field on which you are sorting. To get around that you will need to index on that field PLUS some other field which is unique per record, for example the _id field. Now you can take the last record in the first set and look for records that are $eq the indexed value and $gt the _id, OR simply $gt the indexed value.
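A sketch of that compound-key approach with the MongoDB Java driver (the numeric sort field is called "score" here purely as an example; a compound index on score and _id is assumed):

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.types.ObjectId;

public class KeysetPaging {
    // Fetch the 10 records strictly after the last one seen, in (score, _id) order,
    // so concurrent inserts/deletes never shift or duplicate already-seen items.
    static FindIterable<Document> nextPage(MongoCollection<Document> feed, int lastScore, ObjectId lastId) {
        return feed.find(Filters.or(
                        Filters.gt("score", lastScore),
                        Filters.and(Filters.eq("score", lastScore), Filters.gt("_id", lastId))))
                   .sort(Sorts.ascending("score", "_id"))
                   .limit(10);
    }
}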

Cassandra add TTL to existing entries

How can I update an entire table and set a TTL for every entry?
Current Scenario (Cassandra 2.0.11):
table:
CREATE TABLE external_users (
external_id text,
type int,
user_id text,
PRIMARY KEY (external_id, type)
)
Currently there are ~40 million entries in this table and I want to add a TTL of, let's say, 86400 seconds (1 day).
It's no problem to write new entries USING TTL 86400, or to UPDATE current entries, but how do I apply a TTL to every entry that already exists?
My idea was to select all the data and update every single row with a little script. I was just wondering if there is an easier way to achieve this (because even with batch updates this is going to take a while and is a big effort).
Thanks in advance
There is no way to alter the TTL of existing data in C*. The TTL is just an internal column attribute which is written together with all the other column data into an immutable SSTable. A quote from the docs:
If you want to change the TTL of expiring data, you have to re-insert the data with a new TTL. In Cassandra, the insertion of data is actually an insertion or update operation, depending on whether or not a previous version of the data exists.
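So the "little script" really is the way to go. A rough sketch with the DataStax Java driver (keyspace name and fetch size are assumptions; a real job over ~40 million rows would use async or parallel writes):

import com.datastax.driver.core.*;

public class AddTtlToExistingRows {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        PreparedStatement reinsert = session.prepare(
            "INSERT INTO external_users (external_id, type, user_id) VALUES (?, ?, ?) USING TTL 86400");

        // Paged scan of the whole table; every row is written back unchanged, now with a TTL.
        Statement scan = new SimpleStatement("SELECT external_id, type, user_id FROM external_users")
            .setFetchSize(1000);
        for (Row row : session.execute(scan)) {
            session.execute(reinsert.bind(
                row.getString("external_id"), row.getInt("type"), row.getString("user_id")));
        }
        cluster.close();
    }
}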

Select TTL for an element in a map in Cassandra

Is there any way to select TTL value for an element in a map in Cassandra with CQL3?
I've tried this, but it doesn't work:
SELECT TTL (mapname['element']) FROM columnfamily
Sadly, I'm pretty sure the answer is that it is not possible as of Cassandra 1.2 and CQL3. You can't query individual elements of a collection. As this blog entry says, "You can only retrieve a collection in its entirety". I'd really love to have the capability to query for collection elements, too, though.
You can still set the TTL for individual elements in a collection. I suppose if you wanted to be assured that a TTL is some value for your collection elements, you could read the entire collection and then update the collection (the entire thing or just a chosen few elements) with your desired TTL. Or, if you absolutely needed to know the TTL for individual data, you might just need to change your schema from collections back to good old dynamic columns, for which the TTL query definitely works.
Or, a third possibility could be that you add another column to your schema that holds the TTL of your collection. For example:
CREATE TABLE test (
key text PRIMARY KEY,
data map<text, text>,
data_ttl text
) WITH ...
You could then keep track of the TTL of the entire map column 'data' by always updating column 'data_ttl' whenever you update 'data'. Then, you can query 'data_ttl' just like any other column:
SELECT ttl(data_ttl) FROM test;
I realize none of these solutions are perfect... I'm still trying to figure out what will work best for me, too.
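To illustrate the third option, a small sketch (DataStax Java driver; keyspace, key, and TTL value are made up) that writes the map element and the marker column in one statement with the same TTL, so ttl(data_ttl) tracks the map's TTL:

import com.datastax.driver.core.*;

public class MapTtlMarker {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        // Same statement, same TTL for both columns.
        session.execute("UPDATE test USING TTL 86400 " +
                        "SET data['element'] = 'value', data_ttl = 'marker' " +
                        "WHERE key = 'somekey'");

        // The remaining TTL can then be read back from the marker column.
        Row row = session.execute("SELECT ttl(data_ttl) FROM test WHERE key = 'somekey'").one();
        System.out.println(row.getInt(0) + " seconds left");
        cluster.close();
    }
}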

Cassandra ttl on a row

I know that there are TTLs on columns in Cassandra. But is it also possible to set a TTL on a row? Setting a TTL on each column doesn't solve my problem as can be seen in the following usecase:
At some point a process wants to delete a complete row with a TTL (let's say row "A" with TTL 1 week). It could do this by replacing all existing columns with the same content but with a TTL of 1 week.
But there may be another process running concurrently on that row "A" which inserts new columns or replaces existing ones without a TTL because that process can't know that the row is to be deleted (it runs concurrently!). So after 1 week all columns of row "A" will be deleted because of the TTL except for these newly inserted ones. And I also want them to be deleted.
So is there or will there be Cassandra support for this use case or do I have to implement something on my own?
Kind Regards
Stefan
There is no way of setting a TTL on a row in Cassandra currently. TTLs are designed for deleting individual columns when their lifetime is known when they are written.
You could achieve what you want by delaying your process - instead of inserting with a TTL of 1 week, run the process a week later and delete the row. Row deletes have the following semantics: any column inserted just before will get deleted but columns inserted just after won't be.
If columns that are inserted in the future still need to be deleted you could insert a row delete with a timestamp in the future to ensure this but be very careful: if you later wanted to insert into that row you couldn't, columns would just disappear when written to that row (until the tombstone is garbage collected).
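For concreteness, a sketch of that future-timestamped delete (DataStax Java driver, hypothetical table and row key; USING TIMESTAMP is in microseconds since the epoch):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class FutureRowDelete {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        // Tombstone dated one week in the future: it will shadow every column in row 'A'
        // written with a normal (current) timestamp before that moment.
        long oneWeekAheadMicros = (System.currentTimeMillis() + 7L * 24 * 3600 * 1000) * 1000;
        session.execute("DELETE FROM mytable USING TIMESTAMP " + oneWeekAheadMicros + " WHERE key = 'A'");
        cluster.close();
    }
}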
You can set a TTL for a row in Cassandra 3 using:
INSERT INTO Counter(key,eventTime,value) VALUES ('1001',dateof(now()),100) USING ttl 10;
Although I do not recommend such, there is a Cassandra way to fix the problem:
SELECT TTL(value) FROM table WHERE ...;
Get the current TTL of a value first, then use the result to set the TTL in an INSERT or UPDATE:
INSERT ... USING TTL ttl-of-value;
So... I think that the SELECT TTL() is slow (from experience with TTL() and WRITETIME() in some of my CQL commands). Not only that, the TTL is correct at the time the select results are generated on the Cassandra node, but by the time the insert happens, it will be off. Cassandra should have offered a time to delete rather than a time to live...
So as mentioned by Richard, having your own process to delete data after 1 week is probably safer. You should have one column to save the date of creation or the date when the data becomes obsolete. Then a background process can read that date and if the data is viewed as obsolete, drop the entire row.
Other processes can also use that date to know whether that row is considered valid or not! (so even if it was not yet deleted, you can still view the row as invalid if the date is passed.)
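A minimal sketch of such a background job (DataStax Java driver; it assumes a table with a text key and a created_at bigint column holding milliseconds since the epoch):

import com.datastax.driver.core.*;

public class WeeklyCleanup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        long cutoff = System.currentTimeMillis() - 7L * 24 * 3600 * 1000;   // one week ago
        Statement scan = new SimpleStatement("SELECT key, created_at FROM mytable").setFetchSize(1000);
        for (Row row : session.execute(scan)) {
            // Rows older than the cutoff are considered obsolete and dropped entirely.
            if (row.getLong("created_at") < cutoff) {
                session.execute("DELETE FROM mytable WHERE key = ?", row.getString("key"));
            }
        }
        cluster.close();
    }
}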
