My problem is that Cassandra creates tombstones when inserting NULL values.
From what I understand, Cassandra doesn't support NULLs, and when NULL is inserted it just deletes the respective column. On one hand this is very space-efficient; on the other hand it creates tombstones, which degrade read performance.
This goes against the NoSQL philosophy, because Cassandra is saving space but degrading read performance. In the NoSQL world space is cheap, while performance matters. I believe this is the philosophy behind saving tables in denormalized form.
I would like Cassandra to use the same technique for inserting NULL as for any other value - use timestamping, and during compaction preserve the latest entry, even if that entry is NULL (or we can call it "unset").
Is there any tweak in the Cassandra config, or any other approach, that would let me achieve upserts with NULLs without creating tombstones?
I came across this issue, however it only allows ignoring NULL values.
My use case:
I have a stream of events, each identified by a causeID. I receive many events with the same causeID and want to store only the latest event for each causeID (using upsert). A property of the event may change from NULL to a specific value, but also from a specific value to NULL. Unfortunately, the latter case generates tombstones and degrades read performance.
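For illustration, a minimal CQL sketch of what I mean (table and column names are hypothetical):

-- Hypothetical table: one row per causeID, upserted on every incoming event
CREATE TABLE events_latest (
    cause_id text PRIMARY KEY,
    payload text
);

-- First event: payload has a value
INSERT INTO events_latest (cause_id, payload) VALUES ('cause-42', 'some value');

-- Later event for the same causeID: payload is now NULL, which writes a tombstone
INSERT INTO events_latest (cause_id, payload) VALUES ('cause-42', null);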
Update
It seems there is no way to avoid the tombstones. Could you advise me on techniques to minimize them, such as setting gc_grace_seconds to a very low value? What are the risks, and what should I do when a node goes down for longer than gc_grace_seconds?
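For reference, a sketch of what lowering gc_grace_seconds would look like (the value is illustrative, and the table is the hypothetical one from above):

-- Illustrative only: tombstones become purgeable after 1 hour instead of the default 10 days
ALTER TABLE events_latest WITH gc_grace_seconds = 3600;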
You can't insert NULL into Cassandra - it has a special meaning there, and leads to the creation of the tombstones you observe. If you want to treat NULL as a special value, why not solve this problem on the application side: when you get a null status, just insert a special value that couldn't otherwise occur in your table, and when you read the data back, check for that special value and return null to the requester...
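A minimal sketch of that idea, with a hypothetical table and an arbitrary sentinel string that can never occur in real data:

-- Application-side convention: '__NULL__' stands for "no value"; no tombstone is written
INSERT INTO events (cause_id, payload) VALUES ('cause-42', '__NULL__');
-- On read, the application maps '__NULL__' back to null before returning it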
When we insert or update rows using null for values that are not specified, even though our intention is simply to leave the value empty, Cassandra represents it as a tombstone, causing unnecessary overhead that degrades read performance.
To avoid such tombstones on save operations, Cassandra has the concept of an unset parameter value.
You can unset a field value while saving to avoid the tombstone overhead. Here is what that looks like in a few different cases:
1). If you are using express-cassandra:
const user = new models.instance.User({
    user_id: 1235,
    user_name: models.datatypes.unset // this will not create a tombstone when we want an empty or null user_name
});
user.save(function(err){
    // user_name is not set, so no unnecessary tombstone overhead is created
});
2). If you are writing a raw CQL query, then for an empty or null field - when you know, say, colC will be null - just leave it out of your query:
INSERT INTO my_table (id, colA, colB) VALUES (idVal, valA, valB); // colC omitted, so no tombstone
3). If you are using the Node.js driver, you can even pass undefined on insert or update, which will avoid the tombstone overhead. For example:
const query = 'INSERT INTO my_table (id, colC) VALUES (?, ?)';
client.execute(query, [ id, undefined ]);
4). If you are using the C# driver:
// Prepare once in your application lifetime
var ps = session.Prepare("INSERT INTO my_table (id, colC) VALUES (?, ?)");
// Bind the unset value in a prepared statement
session.Execute(ps.Bind(id, Unset.Value));
For more detail on express-cassandra, read the subtopic "Null and unset values" of
https://express-cassandra.readthedocs.io/en/latest/datatypes/#cassandra-to-javascript-datatypes
For more detail on the Node.js driver's unset feature, refer to DataStax: https://docs.datastax.com/en/developer/nodejs-driver/4.6/features/datatypes/nulls/
For more detail on the C# driver's unset feature, refer to DataStax: https://docs.datastax.com/en/developer/csharp-driver/3.16/features/datatypes/nulls-unset/
NOTE: I tested this with the Node.js driver against Cassandra 4.0, but the unset feature has been available since Cassandra 2.2.
Hope this will help you or somebody else.
Thanks!
You cannot avoid tombstones if you explicitly mention NULL in your INSERT. C* does not do a lookup before inserting or writing data, which is what makes writes so fast; it just inserts a tombstone to shadow that value later (the latest update wins by timestamp). If you want to avoid tombstones (which is recommended), you have to prepare different combinations of queries, checking each field for NULL before adding it to the INSERT. If you have very few fields to check, it is easy to add some IF-ELSE statements; but if there are lots of them, the code becomes bigger and less readable. In short, you cannot insert NULL without impacting read performance later.
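To make the "different combinations of queries" idea concrete, a sketch with one hypothetical nullable column:

-- Variant used when colC has a value
INSERT INTO my_table (id, colA, colB, colC) VALUES (?, ?, ?, ?);
-- Variant used when colC is null: the column is simply omitted, so no tombstone is written
INSERT INTO my_table (id, colA, colB) VALUES (?, ?, ?);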
Inserting NULL values into Cassandra
I don't think the other answers address the original question, which is how to overwrite a non-null value in Cassandra with null without creating a tombstone. The nearest is Alex Ott's suggestion to use some special value other than null.
However, with a little bit of trickery you can insert an explicit null into Cassandra by exploiting a FROZEN tuple or user-defined type. The FROZEN keyword effectively serialises the user defined type and stores the serialised representation in the column. Crucially, the serialised representation of a UDT containing null values is not itself null.
> CREATE TYPE test_type(value INT);
> CREATE TABLE test(pk INT, cl INT, data FROZEN<test_type>, PRIMARY KEY (pk, cl));
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: 15});
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: null});
> INSERT INTO test (pk, cl) VALUES (0, 1);
> SELECT * FROM test;
pk | cl | data
----+----+---------------
0 | 0 | {value: null}
0 | 1 | null
(2 rows)
Here we wrote 15, then overwrote it with null, and finally added a second row to demonstrate that there is a difference between an unset cell and a cell containing a frozen UDT that itself contains null.
Of course the downside of this approach is that in your application you have to delve into the UDT for the actual value.
On the other hand, if you combine several columns into the UDT you do save a little overhead in Cassandra. (But you can't then read or write them individually. You also can't remove fields, though you can add new ones.)
Related
I have a table in Cassandra with 2 columns, id and date_proc, and I plan to perform a lot of inserts. Is it possible to use something like ON CONFLICT in Postgres to get the previous value when inserting?
Could you tell me another way to avoid 2 requests to Cassandra (select and insert)? Maybe some solution in DataStax?
ddl:
create table test.date_dict (
    id text,
    date_proc text,
    PRIMARY KEY (id));
example of inserting:
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-01'); // return '2020-01-01'
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-05'); // return '2020-01-01'
"Normal" inserts and updates in Cassandra are just appends into the memtable (and then flushed into SSTables) - no read happens during these operations. And it will just overwrite previous data if it has lower timestamp.
Potentially you can use lightweight transactions (LWTs) to achieve what you need - they return previous value if there is a conflict (row exists already when you use IF NOT EXISTS, or value is different than you specify in the IF condition). But LWTs are very bad for performance, so they should be used carefully.
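Applied to your table, a sketch of what the LWT variant returns - when the row already exists, [applied] comes back false together with the existing row:

INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-05') IF NOT EXISTS;

 [applied] | id | date_proc
-----------+----+------------
     False |  1 | 2020-01-01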
I would try to reformulate your task in such a way that it fits the "normal" insert/update behavior.
I have a table - for simplicity, let's say this is its definition:
CREATE TABLE t (pk1 varchar, pk2 varchar, c1 varchar, c2 varchar, PRIMARY KEY(pk1, pk2));
I do multiple actions on it in parallel using the full PK:
INSERT INTO t (pk1, pk2, c1, c2) values (?, ?, ?, ?) IF NOT EXISTS;
DELETE FROM t where pk1 = ? AND pk2 = ?;
UPDATE t set c1 = ? where pk1 = ? AND pk2 = ? IF EXISTS;
Note:
in the INSERT command c2 is never null
in the UPDATE command c2 is not populated
Using these commands, I should never have rows with c2 = null. The problem is that every now and then I do see such rows. I can't easily reproduce it, but it always happens when I stress the system (multiple parallel clients running insert, update, and delete with the same PK).
Edit: my cluster size is 4 with RF=2 (NetworkTopologyStrategy with 1 DC) and I use CL=QUORUM for all queries.
Am I missing something or is there a bug in LWT?
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlLtwtTransactions.html
If lightweight transactions are used to write to a row within a partition, only lightweight transactions for both read and write operations should be used. This caution applies to all operations, whether individual or batched. For example, the following series of operations can fail:
DELETE ...
INSERT .... IF NOT EXISTS
SELECT ....
The following series of operations will work:
DELETE ... IF EXISTS
INSERT .... IF NOT EXISTS
SELECT .....
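Applied to the table from the question, the safe series would look something like this (values are placeholders):

DELETE FROM t WHERE pk1 = 'a' AND pk2 = 'b' IF EXISTS;
INSERT INTO t (pk1, pk2, c1, c2) VALUES ('a', 'b', 'x', 'y') IF NOT EXISTS;
SELECT * FROM t WHERE pk1 = 'a' AND pk2 = 'b';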
Note - the same is true for the INSERT and UPDATE combination as well, per a bug we recently encountered. If you use transactions, then use them for all the related statements. The reason could be related to the slightly different timestamps, explained better here:
https://jira.apache.org/jira/browse/CASSANDRA-14304
DOAN DuyHai added a comment - 24/Mar/18 15:12
Hints:
1) LWT operations use a ballot based on an agreement of the timestamp value between a QUORUM of replicas. It can happen that the timestamp is slightly incremented (by some microseconds) in case of conflict/contention on the cluster. The consequence is that the timestamp used for the LWT can be slightly (again, in microseconds) in the future. Read the source code to check the part of the code responsible for ballot agreement with Paxos.
2) For the DELETE request:
a) it can use the <current> timestamp, which can belong to the "past" with respect to the one used by the LWT, so the SELECT does return a value.
In general, one possibility is that the DELETE is executed in parallel with the UPDATE yet provides no guarantee of having been applied everywhere.
Example (simplified - it may happen in other variations as well):
Assume a cluster of 3 nodes with RF=3, where node1 temporarily loses connectivity to node2 and node3.
The DELETE with CL=ONE is executed on node1 with timestamp T1 (and not applied on node2, node3).
The UPDATE is executed on node2 and node3 with timestamp T2 (T2 > T1).
Connectivity between node1, node2, node3 is restored, and now the tombstone the DELETE introduced removes all the older data (including c2), while the UPDATE only set pk1, pk2, c1 - leaving c2 as null.
If you apply the DELETE using LWT, this should not happen - as long as TTL is not used.
TTL can be set either directly in the insert statements or by default via a table property; you can check for both, as sketched below.
DESCRIBE TABLE will show the default_time_to_live that is set for the table.
A SELECT ttl(c2) ... will return a value if a TTL was set.
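For example (key values are placeholders):

-- look for default_time_to_live in the output
DESCRIBE TABLE t;

-- returns the remaining TTL in seconds for c2, or null if none was set
SELECT ttl(c2) FROM t WHERE pk1 = 'a' AND pk2 = 'b';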
I have a table in Cassandra which stores versions of csv-files. It uses a primary key with a unique id for the version (the partition key) and a row number (the clustering key). When I insert a new version I first execute a delete statement on the partition key I am about to insert, to clean up any incomplete data. Then the data is inserted.
Now here is the issue. Even though the delete and subsequent insert are executed synchronously, one after another in the application, it seems that some level of concurrency still exists in Cassandra, because when I read afterwards, rows from my insert will occasionally be missing - something like 1 in 3 times. Here are some facts:
Cassandra 3.0
Consistency ALL (R+W)
Delete using the Java Driver
Insert using the Spark-Cassandra connector
Number of nodes: 2
Replication factor: 2
The delete statement I execute looks like this:
"DELETE FROM myTable WHERE version = 'id'"
If I omit it, the problem goes away. If I insert a delay between the delete and the insert, the problem is reduced (fewer rows missing). Initially I used a less restrictive consistency level and was sure this was the issue, but it didn't affect the problem. My hypothesis is that for some reason the delete statement is being sent to the replicas asynchronously despite the consistency level of ALL, but I can't see why this would be the case or how to avoid it.
By default, all mutations get a write time assigned by the coordinator for that write. From the docs:
TIMESTAMP: sets the timestamp for the operation. If not specified,
the coordinator will use the current time (in microseconds) at the
start of statement execution as the timestamp. This is usually a
suitable default.
http://cassandra.apache.org/doc/cql3/CQL.html
Since the coordinator for different mutations can be different, clock skew between coordinators can cause mutations to one machine to be skewed relative to another.
Since write time controls C* history, this means you can have a driver which synchronously inserts and deletes, but depending on the coordinator, the delete can happen "before" the insert.
Example
Imagine two nodes, A and B; B is operating with a 5-second clock skew behind A.
At time 0: You insert data to the cluster and A is chosen as the coordinator. The mutation arrives at A and A assigns a timestamp (0)
There is now a record in the cluster
INSERT VALUE AT TIME 0
Both nodes contain this message and the request returns confirming the write was successful.
At time 2: You issue a delete for the data previously inserted and B is chosen as the coordinator. B assigns a timestamp of (-3) because it is clock skewed 5 seconds behind the time in A. This means that we end up with a statement like
DELETE VALUE AT TIME -3
We acknowledge that all nodes have received this record.
Now the global consistent timeline is
DELETE VALUE AT TIME -3
INSERT VALUE AT TIME 0
Since the insertion now occurs after the delete in timestamp order, the value still exists.
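One way to take coordinator clocks out of the picture is to assign timestamps on the client. A sketch with explicit USING TIMESTAMP (column names beyond version are hypothetical; most drivers can also generate client-side timestamps for you automatically):

-- Both mutations ordered by a single client clock (timestamps in microseconds)
INSERT INTO myTable (version, row_number, data) VALUES ('id', 1, 'x') USING TIMESTAMP 1500000000000000;
DELETE FROM myTable USING TIMESTAMP 1500000000000001 WHERE version = 'id';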
I had a similar problem, and I fixed it by enabling lightweight transactions for both INSERT and DELETE requests (for all queries actually, including UPDATE). It makes sure all queries to this partition are serialized through one "thread", so the DELETE won't overwrite the INSERT. For example (assuming instance_id is the primary key):
INSERT INTO myTable (instance_id, instance_version, data) VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS;
UPDATE myTable SET instance_version=1, data='some-updated-data' WHERE instance_id='myinstance' IF instance_version=0;
UPDATE myTable SET instance_version=2, data='again-some-updated-data' WHERE instance_id='myinstance' IF instance_version=1;
DELETE FROM myTable WHERE instance_id='myinstance' IF instance_version=2;
// or:
DELETE FROM myTable WHERE instance_id='myinstance' IF EXISTS;
The IF clauses enable lightweight transactions for each row, so all of them are serialized. Warning: LWTs are more expensive than normal calls, but sometimes they are needed, as in the case of this concurrency problem.
I am trying to convert SSTables to JSON using the sstable2json utility. It works fine, but for counter columns it gives a very long string value.
My create table statement :
CREATE TABLE counters1 (
    value counter,
    name varchar,
    surname varchar,
    PRIMARY KEY (name, surname)
);
Sample data:
Now, after converting to JSON, what I get is:
[ {"key": "hari",
"cells": [["ram:value","0001800086d46a8fd6cb484e9257a02ddd14fe0600000000000000010000000000000001",1452867057744000,"c",-9223372036854775808]]} ]
Q1) Is there a way to get a meaningful value from this? (0001800086d46a8fd6cb484e9257a02ddd14fe0600000000000000010000000000000001)
Q2) How does Cassandra read from the same SSTable and display "1"?
Thanks
Counters changed a lot in 2.1, see http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters, which also has a great explanation of pre-2.1 counters (what you are looking at). The counter context in the sstable is mostly made up of tuples of counter id (timeuuid), shard logical clock, and shard value (a 16-byte id and two longs). This is what's being displayed by sstable2json. There's a little more in the header which describes the local/global element index. Check out https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/db/context/CounterContext.java#L675 for more details.
But I would recommend using 2.1 counters to avoid some issues and have a little more simplicity. It's going to be pretty non-trivial to build your counter value from the sstables manually, though.
I know this is not the best way to use Cassandra, but the nature of my data requires reading all data from the last week. However, when using collection types in CQL3, I ran into certain limitations which prevent me from doing normal date-range queries.
So I have set up Cassandra (currently a single node, probably more in the future) with the following table:
CREATE TABLE cache (tag text, id int, tags map<text,text>,
PRIMARY KEY (tag, id) );
ALTER TABLE cache WITH GC_GRACE_SECONDS = 0;
I am inserting with a TTL of one week to automatically remove the items from the cache.
I tried to follow the suggestions mentioned in this article to avoid reading many tombstones by selecting by "minimum id", which I persist elsewhere to avoid reading old data:
SELECT * FROM cache WHERE tag = ? AND id >= ?
The id is basically some sort of timestamp which is constantly increasing, i.e. I only insert higher values over time and constantly remove older ids from the table.
But I still get warnings about thresholds being reached
WARN 08:59:06,286 Read 5001 live and 5702 tombstoned cells in cache (see tombstone_warn_threshold)
And if I do not run manual compaction/scrubbing regularly I get exceptions and queries fail.
However, based on my understanding of the articles and documentation, I should be avoiding most if not all tombstones here: I query on equality for the tag, which allows Cassandra to look only at those areas, and I use a minimum id, which allows Cassandra to start reading only after most of the tombstones. So why are there still tombstone warnings/exceptions reported?
A map k/v pair is actually a column (name, value and timestamp): so, if you are issuing a lot of deletions of map elements (expiring by TTL is also such a case), this is the source of the warning, because you are still reading full maps (with lots of tombstones in them). Also, the TTL setting on a map is applied on a per-element basis.
Second, this is multiplied by the >= predicate in your SELECT query.
If this is the case, you should remodel your data access pattern to use only EQ relations in the SELECT query and bump the id more often. This access pattern would also allow you to get rid of the clustering part of your PRIMARY KEY.
So, if you do not issue lots of deletions on that map, you can try a tag text, time timeuuid, name text, data text model and slice it precisely by time, as sketched below.
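A sketch of that suggested model (the primary key layout is my assumption):

CREATE TABLE cache2 (
    tag text,
    time timeuuid,
    name text,
    data text,
    PRIMARY KEY (tag, time)
);

-- Slice precisely by time, so old/expired entries are never scanned
SELECT * FROM cache2 WHERE tag = 'mytag' AND time >= minTimeuuid('2014-01-20 00:00:00');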