ON CONFLICT operator in Cassandra - cassandra

I have a table in Cassandra with 2 columns: id and date_proc and plan to insert a lot of inserts. Is it possible to use something like ON CONFLICT in Postgres to get previous value on inserting?
Could you tell me another way to avoid 2 requests to Cassandra (select and insert)? Maybe some solution in DataStax?
ddl:
create table test.date_dict (
id text,
date_proc text,
PRIMARY KEY (id));
example of inserting:
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-01'); // return '2020-01-01'
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-05'); // return '2020-01-01'

"Normal" inserts and updates in Cassandra are just appends into the memtable (and then flushed into SSTables) - no read happens during these operations. And it will just overwrite previous data if it has lower timestamp.
Potentially you can use lightweight transactions (LWTs) to achieve what you need - they return previous value if there is a conflict (row exists already when you use IF NOT EXISTS, or value is different than you specify in the IF condition). But LWTs are very bad for performance, so they should be used carefully.
I would try to reformulate your task such way so it will fit into "normal" inserts/updates behavior.

Related

Cassandra - Shall I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 Statements every time if a user registered. Is that not expensive or bad ?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: Maybe I update users_by_id I have to write 5 Update statments. Is there another solution? Or is that not this bad ?
Cassandra advocates denormalization of your data and creating data model according to your queries. You will have to write your data model such that it satisfies all the queries with good performance. For performance (due to its architecture and design) Cassandra asks for writing and reading using partition key.
It is not expensive to write 5 insertions for same set of data in 5 different tables. Your reads will perform better and as data size increases to web scale, you will thank your decision of creating 5 tables and writing to them.
You can explore materialized views (Materialized View and Datastax Link for Materialized View but remember it is an experimental feature. So you have to understand it properly and also identify open issues with materialized views.
I would recommend you study Cassandra data model that will make things easier to grasp.
Cassandra is designed to be write intensive database so do not hesitate to duplicate your data. One should always design tables for the read queries. If one table satisfies one query, it is a fine design.
Answer to your second question, you should design your tables such a way that you do not have to update table. Always think about inserting new values.
For example, below table design
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
INSERT INTO user_by_email (email, DateTime.Now ........)
In this design, you should get the latest inserted value. Additionally , this design keeps change history for that key.
Think about, how many times we have to update values like user id, email, username? rarely.

Cassandra delete/update a row and get its previous value

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch but that seems to be not allowed.
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items. And I've build several tables that allow to lookup items based on those informations.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id tect, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item and at the same time someone e.g. hits the like button of that same item.
The deleteItem method execute a SELECT query to fetch the current row of the item from the main table
The likeItem method that gets executed at the same times runs an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed and the UPDATE query is executed before the DELETE query.
The deleteItem method deletes the items from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. This data does not yet contain the just added like. The item is therefore deleted, but the just added like remains in the liked_items_by_user table.
You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only insert, update, and delete statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst case situation is zombies, or data that could reappear.
Cassandra uses a grade period mechanism to deal with this, you can find the details here. If for whatever reason, this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at application level to not rely on perfect atomicity, whichever the right trade off is for you. So either you give up some of the performance, or tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it(touch wood).
TLDR:
USE CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now is basically the ANTLR3 Grammar parser for CQL 3, which is not designed to accept to SELECT queries inside batches simply because they are not supported, you can see that here.

Howto avoid cassandra tombstones when inserting NULL values

My problem is that cassandra creates tombstones when inserting NULL values.
From what I understand, cassandra doesn't support NULLs and when NULL is inserted it just deletes the respective column. On one hand this is very space effective, however on the other hand it creates tombstones which degrades read performance.
This goes agains NoSql phillosophy because cassandra is saving space but degrading read performance. In NoSql world the space is cheap, however performance matters. I beleive this is the phillosophy behind saving tables in denormalized form.
I would like cassandra to use the same technique for inserting NULL as for any other value - use timestamping and during compaction preserve the latest entry - even if the entry is NULL (or we can call it "unset").
Is there any tweak in cassandra config or any approach how I would be able to achieve upserts with nulls without having tombstones ?
I came across this issue however it only allows to ignore NULL values
My use case:
I have stream of events, every event identified by causeID. I'm receiving many events with same causeId and I want to store only the latest event for the same causeID (using upsert). The properties of the event may change from NULL to specific value, but also from specific value to NULL. Unfortunatelly the later case generates tombstones and degrades read performance.
Update
It seems there is no way how I could avoid tombstones. Could you advice me on techniques how to minimize them (set gc_grace_seconds to very low value). What are the risks, what to do when a node goes down for a longer period than gc_grace_seconds ?
You can't insert NULL into Cassandra - it has special meaning there, and lead to creation of tombstones that you observe. If you want to treat NULL as special value, why not to solve this problem on application side - when you get null status, just insert any special value that couldn't be used in your table, and when you read data back, check for that special value and output null to requester...
When we want to just insert or update rows using null for values that are not specified, and even though our intention is to leave the value empty, Cassandra represents it as a tombstone causing unnecessary overhead which degrades performance.
To avoid such tombstones for save operations, cassandra has the concept of unset for a parameter value.
So you can do the following to unset a field value while saving to avoid tombstone overhead for example related to different cases:
1). If you are using express-cassandra then :
const user = new models.instance.User({
user_id: 1235,
user_name: models.datatypes.unset // this will not create tombstone when we want empty user_name or null
});
user.save(function(err){
// user_name value is not set and does not create any unnecessary tombstone overhead
});
2). If you are writing cassandra raw query then for empty or null field when you know say colC will be null, then don't use it in your query.
insert into my_table(id,colA,colB) values(idVal,valA,valB) // Avoid colC
3). If you are using Node.Js Driver, you can even pass undefined on insert or update which will avoid tombstone overhead. For example
const query = 'INSERT INTO my_table (id, colC) VALUES (?, ?)';
client.execute(query, [ id, undefined ]);
4). If you are using c# driver then
// Prepare once in your application lifetime
var ps = session.Prepare("INSERT INTO my_table (id, colC) VALUES (?, ?)");
// Bind the unset value in a prepared statement
session.Execute(ps.Bind(id, Unset.Value));
For more detail on express-cassandra read the sub topic Null and unset values of
https://express-cassandra.readthedocs.io/en/latest/datatypes/#cassandra-to-javascript-datatypes
For more detail on Node.js driver unset feature refer datastax https://docs.datastax.com/en/developer/nodejs-driver/4.6/features/datatypes/nulls/
For more detail on Csharp driver unset feature refer datastax https://docs.datastax.com/en/developer/csharp-driver/3.16/features/datatypes/nulls-unset/
NOTE: I tested this on Node.js cassandra 4.0 But unset feature is introduced after cassandra 2.2
Hope this will help you or somebody else.
Thanks!
You cannot avoid tombstones if you particularly mention NULL in your INSERT. C* does not do a lookup before insert or writing a data which makes the writes very faster. For this purpose, C* just inserts a tombstone to avoid that value later (taking the latest update comparing the timestamp). If you want to avoid tombstone (which is recommended), you've to prepare different combinations of queries to check each one for NULL before adding it to the INSERT. If you have very few fields to check then it'll be easy to just add some IF-ELSE statements. But if there are lots of them, the code will be bigger and less readable. Shortly, you cannot insert NULL which will impact read performance later.
Inserting null values into cassandra
I don't think the other answers address the original question, which is how to overwrite a non-null value in Cassandra with null without creating a tombstone. The nearest is Alex Ott's suggestion to use some special value other than null.
However, with a little bit of trickery you can insert an explicit null into Cassandra by exploiting a FROZEN tuple or user-defined type. The FROZEN keyword effectively serialises the user defined type and stores the serialised representation in the column. Crucially, the serialised representation of a UDT containing null values is not itself null.
> CREATE TYPE test_type(value INT);
> CREATE TABLE test(pk INT, cl INT, data FROZEN<test_type>, PRIMARY KEY (pk, cl));
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: 15});
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: null});
> INSERT INTO test (pk, cl) VALUES (0, 1);
> SELECT * FROM test;
pk | cl | data
----+----+---------------
0 | 0 | {value: null}
0 | 1 | null
(2 rows)
Here we wrote 15, then overwrote it with null, and finally added a second row to demonstrate that there is a difference between an unset cell and a cell containing a frozen UDT that itself contains null.
Of course the downside of this approach is that in your application you have to delve into the UDT for the actual value.
On the other hand, if you combine several columns into the UDT you do save a little overhead in Cassandra. (But you can't then read or write them individually. You also can't remove fields, though you can add new ones.)

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters that they can specify to get the most recent row, and then I appent "LIMIT 1" to the end of the CQL statement since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device id's to get back the latest entries for. So, my question is, is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it and under the hood it's making reqs to multiple partitions anyway and it's putting pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify limit, it's for the whole statement, basically you can't pick just the first item out from partitions. The simplest option would be to issue multiple queries to the cluster (every element in IN would become one query) and put a limit 1 to every one of them.
To be honest this was my solution in a lot of the projects and it works pretty much fine. Basically coordinator would under the hood go to multiple nodes anyway but would also have to work more for you to get you all the requests, might run into timeouts etc.
In short it's far better for the cluster and more performant if client asks multiple times (using multiple coordinators with smaller requests) than to make single coordinator do to all the work.
This is all in case you can't afford more disk space for your cluster
Usual Cassandra solution
Data in cassandra is suggested to be ready for query (query first). So basically you would have to have one additional table that would have the same partitioning key as you have it now, and you would have to drop the clustering column activity_timestamp. i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you would write to your table you would also write data to the latest_entry (table without activity_timestamp) Then you can specify the query that you need with in and this table contains the latest entry so you don't have to use the limit 1 because there is only one entry per partitioning key ... that would be the usual solution in cassandra.
If you are afraid of the additional writes, don't worry , they are inexpensive and cpu bound. With cassandra it's always "bring on the writes" I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. Indeed, it is supported on the last field of the primary key or the last field of the clustering key. So you can:
swap your two last fields of the primary key
use one query for each device id

An Approach to Cassandra Data Model

Please note that I am first time using NoSQL and pretty much every concept is new in this NoSQL world, being from RDBMS for long time!!
In one of my heavy used applications, I want to use NoSQL for some part of the data and move out from MySQL where transactions/Relational model doesn't make sense. What I would get is, CAP [Availability and Partition Tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.
Cassandra says, everything in it is simple Map<Key,Value> type! Thinking in terms of Map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as key and store the rest of the data in values!
After reading through User Defined Types in Cassandra, can I use UserDefinedType as value which essentially leverage as One Key and multiple values! Otherwise, Use it as normal column level without UserDefinedType! One idea is to use the same model for different applications across systems where it would be simple logging/activity data can be pushed to the same, since the key varies from application to application and within application each entity will be unique!
No application/business function to access this data without Key, or in simple terms no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp .... );
Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections. e.g. A Person may have a map of addresses: map. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per records. If you do another insert with the same pk, you won't get an error, it'll simply overwrite the existing data. Everything in cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).
Lastly, if your data is time series log data, cassandra is a very very good choice.

Resources