Cassandra - how to update a record with a compound key

I'm in the process of learning Cassandra and using it on a small pilot project at work. I've got one table that is filtered by three fields:
CREATE TABLE webhook (
event_id text,
entity_type text,
entity_operation text,
callback_url text,
create_timestamp timestamp,
webhook_id text,
last_mod_timestamp timestamp,
app_key text,
status_flag int,
PRIMARY KEY ((event_id, entity_type, entity_operation))
);
Then I can pull records like so, which is exactly the query I need for this:
select * from webhook
where event_id = '11E7DEB1B162E780AD3894B2C0AB197A'
and entity_type = 'user'
and entity_operation = 'insert';
However, I have an update query to set the record inactive (soft delete), which would be most convenient to do by webhook_id in the same table. Of course, this isn't possible:
update webhook
set status_flag = 0
where webhook_id = '11e8765068f50730ac964b31be21d64e'
An example of why I'd want to do this, is a simple DELETE from an API endpoint:
http://myapi.com/webhooks/11e8765068f50730ac964b31be21d64e
Naturally, if I update based on the composite key, I'd potentially inactivate more records than I intend to.
Seems like my only choice, doing it the "Cassandra way", is to use two tables: the one I already have and another that tracks status_flag by webhook_id, so I can update based on that id. I'd then have to look the record up and disable it in the first table as well? Otherwise, I'd have to force users to pass all the compound key values in the URL of the API's DELETE request.
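A minimal sketch of that tracking table (the name webhook_by_id and its columns are illustrative, not from the original post); storing the full compound key alongside the id would let the API resolve the key before touching the main table:
CREATE TABLE webhook_by_id (
webhook_id text,
event_id text,
entity_type text,
entity_operation text,
status_flag int,
PRIMARY KEY (webhook_id)
);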
Simple things you take for granted in relational data, seem to get complex very quickly in Cassandraland. Is this the case or am I making it more complicated than it really is?

You can add webhook_id to your primary key.
So your table definition becomes something like this:
CREATE TABLE webhook (
event_id text,
entity_type text,
entity_operation text,
callback_url text,
create_timestamp timestamp,
webhook_id text,
last_mod_timestamp timestamp,
app_key text,
status_flag int,
PRIMARY KEY ((event_id, entity_type, entity_operation), webhook_id)
);
Now let's say you insert two records:
INSERT INTO dev_cybs_rtd_search.webhook(event_id,entity_type,entity_operation,status_flag,webhook_id) VALUES('11E7DEB1B162E780AD3894B2C0AB197A','user','insert',1,'web_id');
INSERT INTO dev_cybs_rtd_search.webhook(event_id,entity_type,entity_operation,status_flag,webhook_id) VALUES('12313131312313','user','insert',1,'web_id_1');
And you can update like the following:
update webhook
set status_flag = 0
where webhook_id = 'web_id' AND event_id = '11E7DEB1B162E780AD3894B2C0AB197A' AND entity_type = 'user'
AND entity_operation = 'insert';
It will only update one record.
However, you have to supply every column defined in your primary key.
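To verify, a select with the full key (a sketch based on the rows inserted above) should return just the single updated record:
select status_flag from webhook
where event_id = '11E7DEB1B162E780AD3894B2C0AB197A'
and entity_type = 'user'
and entity_operation = 'insert'
and webhook_id = 'web_id';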

Related

SyntaxException: line 2:10 no viable alternative at input 'UNIQUE' > (...NOT EXISTS books ( id [UUID] UNIQUE...)

I am trying the following code to create a keyspace and a table inside it:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy',
'replication_factor': 3 };
CREATE TABLE IF NOT EXISTS books (
id UUID PRIMARY KEY,
user_id TEXT UNIQUE NOT NULL,
scale TEXT NOT NULL,
title TEXT NOT NULL,
description TEXT NOT NULL,
reward map<INT,TEXT> NOT NULL,
image_url TEXT NOT NULL,
video_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
But I do get:
SyntaxException: line 2:10 no viable alternative at input 'UNIQUE'
(...NOT EXISTS books ( id [UUID] UNIQUE...)
What is the problem and how can I fix it?
I see three syntax issues. They are mainly related to the fact that CQL != SQL.
The first is that NOT NULL is not valid at column definition time. Cassandra doesn't enforce constraints like that at all, so for this case, just get rid of all of them.
Next, Cassandra CQL does not allow default values, so this won't work:
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
Providing the current timestamp for created_at is something that will need to be done at write-time. Fortunately, CQL has a few built-in functions to make this easier:
INSERT INTO books (id, user_id, created_at)
VALUES (uuid(), 'userOne', toTimestamp(now()));
In this case, I've invoked the uuid() function to generate a Type-4 UUID. I've also invoked now() for the current time. However now() returns a TimeUUID (Type-1 UUID) so I've nested it inside of the toTimestamp function to convert it to a TIMESTAMP.
Finally, UNIQUE is not valid.
user_id TEXT UNIQUE NOT NULL,
It looks like you're trying to make sure that duplicate user_ids are not stored with each id. You can help ensure uniqueness of the data within each partition by adding user_id to the end of the primary key definition as a clustering key:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
...
PRIMARY KEY (id, user_id));
This PK definition will ensure that data for books will be partitioned by id, containing multiple user_id rows.
Not sure what the relationship between books and users is, though. If one book can have many users, then this will work. If one user can have many books, then you'll want to switch the order of the keys to this:
PRIMARY KEY (user_id, id));
In summary, a working table definition for this problem looks like this:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
scale TEXT,
title TEXT,
description TEXT,
reward map<INT,TEXT>,
image_url TEXT,
video_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY (id, user_id));
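For completeness, a write against this table might look like the following sketch, reusing the built-in functions shown earlier (all values are placeholders):
INSERT INTO books (id, user_id, scale, title, description, reward, image_url, video_url, created_at)
VALUES (uuid(), 'userOne', 'full', 'A Title', 'A description.', {1: 'reward one'},
'http://img.example/cover.png', 'http://vid.example/trailer.mp4', toTimestamp(now()));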

Internal network application data model with Cassandra

I'm working on designing an application which will enable users to send requests to connect with each other, see their sent or received requests, make notes during their interactions for later reference if connected, and remove users from their contact lists.
In a RDBMS, the schema would be:
table User with column
uid (a unique string for each user)
table Request with columns:
from - user id
to - user id
Primary Key (from, to)
created - timestamp
message - string
expiry - timestamp
table Connection with columns:
from - user id
to - user id
Primary Key (from, to)
notes - String
created - timestamp
modified - timestamp
isFavourite - to is a favourite of from user, value 0 or 1
isActive - soft delete, value 0 or 1
pairedConnection - shows whether the connection between to and from was deactivated (the to user removed the from user from its contact list), value 0 or 1
The queries I anticipate to be needed are:
find the sent requests for a user
find the received requests for a user
find all the active contacts of a given user
find all the favourites of a user
find all the users who deleted the given from user from their lists
update the notes taken by a user when meeting another user he is connected with
update user as favourite
mark connection for soft deletion
I'm trying to model this in Cassandra, but I'm confused about which keys to choose for maximum efficiency.
So far, I have the following ideas, and would welcome feedback from more experienced Cassandra users:
create table users(
uid text PRIMARY KEY
);
create table requestsByFrom(
from text,
to text,
message text,
created timestamp,
expiry timestamp,
PRIMARY KEY (from,to)
);
create table requestsByTo(
from text,
to text,
message text,
created timestamp,
expiry timestamp,
PRIMARY KEY (to,from)
);
create table connections(
from text,
to text,
notes text,
created timestamp,
modified timestamp,
isFavourite boolean,
isActive boolean,
pairedConnection boolean,
PRIMARY KEY (from,to)
);
create table activeConnections(
from text,
to text,
isActive boolean,
PRIMARY KEY (from,isActive)
);
create table favouriteConnections(
from text,
to text,
isFavourite boolean,
PRIMARY KEY (from, isFavourite)
);
create table pairedConnection(
from text,
to text,
pairedConnection boolean,
PRIMARY KEY ((from,to), pairedConnection)
);
Cassandra has a different paradigm from RDBMS, and this is most evident in the way that data modeling has to be done. You need to keep in mind that denormalization is preferred, and that you'll have repeated data.
The table definitions should be based on the queries that retrieve the data; this is partially stated in the definition of the problem. For instance:
find the sent requests for a user
Taking the initial design of the table requestsByFrom, an alternative will be
CREATE TABLE IF NOT EXISTS requests_sent_by_user(
requester_email TEXT,
recipient_email TEXT,
recipient_name TEXT,
message TEXT,
created TIMESTAMP,
PRIMARY KEY (requester_email, recipient_email)
) WITH default_time_to_live = 864000;
Note that from is a reserved keyword in CQL. The expiry information can be set with the default_time_to_live clause (TTL), which removes the record after the defined time; the value is the number of seconds after the record is inserted, and the example is 10 days (864,000 seconds).
The primary key is suggested to be the email address, but it could also be a UUID. A name is not recommended, as multiple people can share the same name (like James Smith), or the same person can have multiple ways of writing their name (following the example, Jim Smith, J. Smith and j smith may all refer to the same person).
The column recipient_name is also added, as it is most likely that you'll want to display it; any other information that will be displayed/used with the query should be added as well.
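As an aside, the same expiry can also be set per write with the USING TTL clause, which overrides the table default; a sketch with placeholder values:
INSERT INTO requests_sent_by_user (requester_email, recipient_email, recipient_name, message, created)
VALUES ('test@email.com', 'john.smith@test.com', 'John Smith', 'Hi!', toTimestamp(now()))
USING TTL 864000;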
find the received requests for a user
CREATE TABLE IF NOT EXISTS requests_received_by_user(
recipient_email TEXT,
requester_email TEXT,
requester_name TEXT,
message TEXT,
created TIMESTAMP,
PRIMARY KEY (recipient_email, requester_email)
) WITH default_time_to_live = 864000;
It is preferable to add records to requests_sent_by_user and requests_received_by_user at the same time using a batch, which ensures consistency of the information between both tables; the TTL (expiration of the data) will also be the same.
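A sketch of such a batch, assuming the two table definitions above (values are placeholders):
BEGIN BATCH
INSERT INTO requests_sent_by_user (requester_email, recipient_email, recipient_name, message, created)
VALUES ('test@email.com', 'john.smith@test.com', 'John Smith', 'Hi!', toTimestamp(now()));
INSERT INTO requests_received_by_user (recipient_email, requester_email, requester_name, message, created)
VALUES ('john.smith@test.com', 'test@email.com', 'Test User', 'Hi!', toTimestamp(now()));
APPLY BATCH;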
storing contacts
In the question there are 4 tables of connections: connections, active_connections, favourite_connections, paired_connections. What will be the difference between them? Are they going to have different rules/use cases? If that is the case, it makes sense to have them as different tables:
CREATE TABLE IF NOT EXISTS connections(
requester_email TEXT,
recipient_email TEXT,
recipient_name TEXT,
notes TEXT,
created TIMESTAMP,
last_update TIMESTAMP,
is_favourite BOOLEAN,
is_active BOOLEAN,
is_paired BOOLEAN,
PRIMARY KEY (requester_email, recipient_email)
);
CREATE TABLE IF NOT EXISTS active_connections(
requester_email TEXT,
recipient_email TEXT,
recipient_name TEXT,
last_update TIMESTAMP,
PRIMARY KEY (requester_email, recipient_email)
);
CREATE TABLE IF NOT EXISTS favourite_connections(
requester_email TEXT,
recipient_email TEXT,
recipient_name TEXT,
last_update TIMESTAMP,
PRIMARY KEY (requester_email, recipient_email)
);
CREATE TABLE IF NOT EXISTS paired_connections(
requester_email TEXT,
recipient_email TEXT,
recipient_name TEXT,
last_update TIMESTAMP,
PRIMARY KEY (requester_email, recipient_email)
);
Note that the boolean flag is removed; the logic is that if a record exists in active_connections, it is assumed to be an active connection.
When a new connection is created, it may require several records in different tables; to bundle all those inserts or updates, it is preferable to use a batch.
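For instance, a sketch of creating a new connection in a single batch (placeholder values, assuming the tables above):
BEGIN BATCH
INSERT INTO connections (requester_email, recipient_email, recipient_name, created, last_update, is_favourite, is_active, is_paired)
VALUES ('test@email.com', 'john.smith@test.com', 'John Smith', dateof(now()), dateof(now()), false, true, false);
INSERT INTO active_connections (requester_email, recipient_email, recipient_name, last_update)
VALUES ('test@email.com', 'john.smith@test.com', 'John Smith', dateof(now()));
APPLY BATCH;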
find all the active contacts of a given user
Based on the proposed tables, if the requester's email is test@email.com:
SELECT * FROM active_connections WHERE requester_email = 'test@email.com';
update user as favourite
It will be a batch updating the record in connections and adding the new record to favourite_connections:
BEGIN BATCH
UPDATE connections
SET is_favourite = true, last_update = dateof(now())
WHERE requester_email = 'test@email.com'
AND recipient_email = 'john.smith@test.com';
INSERT INTO favourite_connections (
requester_email, recipient_email, recipient_name, last_update
) VALUES (
'test@email.com', 'john.smith@test.com', 'John Smith', dateof(now())
);
APPLY BATCH;
mark connection for soft deletion
The information of the connection can be kept in connections with all the flags disabled, while the corresponding records are removed from active_connections, favourite_connections and paired_connections:
BEGIN BATCH
UPDATE connections
SET is_active = false, is_favourite = false,
is_paired = false, last_update = dateof(now())
WHERE requester_email = 'test@email.com'
AND recipient_email = 'john.smith@test.com';
DELETE FROM active_connections
WHERE requester_email = 'test@email.com'
AND recipient_email = 'john.smith@test.com';
DELETE FROM favourite_connections
WHERE requester_email = 'test@email.com'
AND recipient_email = 'john.smith@test.com';
DELETE FROM paired_connections
WHERE requester_email = 'test@email.com'
AND recipient_email = 'john.smith@test.com';
APPLY BATCH;

How to avoid Cassandra ALLOW FILTERING?

I have the following data model:
campaigns {
id int PRIMARY KEY,
scheduletime text,
SchduleStartdate text,
SchduleEndDate text,
enable boolean,
actionFlag boolean,
.... etc
}
Here I need to fetch the data based on start date and end date, without ALLOW FILTERING.
I've received suggestions to redesign the schema to fulfil the requirement, but I cannot filter the data based on id, since I need the data between the dates.
Can someone give me a good suggestion to fulfil this scenario and execute the following query:
select * from campaigns WHERE startdate='XXX' AND endDate='XXX'; // without the ALLOW FILTERING clause
CREATE TABLE campaigns (
SchduleStartdate text,
SchduleEndDate text,
id int,
scheduletime text,
enable boolean,
PRIMARY KEY ((SchduleStartdate, SchduleEndDate),id));
You can run the queries below against the table:
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx'; -- answers the question above
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx' and id = 1; -- if you want to filter the data further for specific ids
Here SchduleStartdate and SchduleEndDate are used as the partition key, and id is used as the clustering key to make sure the entries are unique.
This way, you can filter based on start date, end date, and then id if needed.
One downside is that filtering by id alone won't be possible, as you need to restrict the partition key first.
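If lookups by id alone are also required, the usual remedy (a sketch, not part of the original answer) is a second, denormalized table keyed by id and written alongside the first:
CREATE TABLE campaigns_by_id (
id int,
SchduleStartdate text,
SchduleEndDate text,
scheduletime text,
enable boolean,
PRIMARY KEY (id));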

Updating denormalized data in Cassandra

I'm trying to build a news feed system using Cassandra. I was thinking of using a fan-out approach wherein, if a user creates a new post, I'll write a new record to each of his friends' feed tables. The table structure looks like:
CREATE TABLE users (
user_name TEXT,
first_name TEXT,
last_name TEXT,
profile_pic TEXT,
PRIMARY KEY (user_name)
);
CREATE TABLE user_feed (
user_name TEXT,
posted_time TIMESTAMP,
post_id UUID,
posted_by TEXT, //posted by username
posted_by_profile_pic TEXT,
post_content TEXT,
PRIMARY KEY ((user_name), posted_time)
) WITH CLUSTERING ORDER BY(posted_time desc);
Now, I can get the feed for a particular user in a single query, all fine. But what if the user who posted to a feed updates his profile pic? How do I go about updating the data in the user_feed table?
You can use batch statements to achieve atomicity for your updates. In this case you can create a batch with updates to the users and user_feed tables. Note that posted_time is a clustering column of user_feed, so the UPDATE there must specify the full primary key; each feed row carrying the old picture has to be addressed individually:
BEGIN BATCH
UPDATE users SET profile_pic = ? WHERE user_name = ?;
UPDATE user_feed SET posted_by_profile_pic = ? WHERE user_name = ? AND posted_time = ?;
APPLY BATCH;
Take a look at the CQL BATCH documentation.

Way to handle autoincrement ID with counter on Cassandra?

This is not a question about using an autoincrement integer instead of UUIDs for the primary key on Cassandra; in this case I want to produce an autoincrement effect like PostgreSQL's on Cassandra, and it doesn't need to be especially scalable. I'm using a UUID as the primary key for entries in a table, but I need to generate a bitly-like shortid for those entries. So I came up with an application that grabs an index for a specific entry, generates a shortid based on that index, and then sets the shortid on the entry.
So I'm trying to do something like this on Cassandra:
CREATE TABLE photo (
id uuid,
shortid text,
title text,
PRIMARY KEY (id)
);
CREATE TABLE shortid (
shortid text,
family text,
longid uuid,
index bigint,
created_at timestamp,
PRIMARY KEY ((shortid, family))
) WITH COMPACT STORAGE;
CREATE TABLE shortid_reverse (
longid uuid,
family text,
shortid text,
PRIMARY KEY ((longid, family))
) WITH COMPACT STORAGE;
CREATE TABLE shortid_last_index (
family text,
last_index counter,
last_long_id uuid,
PRIMARY KEY (family)
);
So, in the application that will handle the shortid: when the application starts, it'll get the last index for that family and then increment the value in the application itself, as this application will run on Node.js and Node.js can manage that.
Application.js
var Hashids = require("hashids"); // assuming the hashids npm package
var index = lastIndexFromCassandra++ // last index previously read from Cassandra, e.g. 5
  , hashids = new Hashids("this is my salt")
  , shortid = hashids.encrypt(index); // dDae3KDDj4Q
After the application increases the index and generates the shortid, it'll persist them to Cassandra:
UPDATE shortid_last_index SET last_index = last_index+1, last_long_id = fabac1f0-7f88-11e3-baa7-0800200c9a66 WHERE family = 'photo';
INSERT INTO shortid (shortid, family, longid, index, created_at) VALUES ('dDae3KDDj4Q', 'photo', fabac1f0-7f88-11e3-baa7-0800200c9a66, 5, NOW());
INSERT INTO shortid_reverse (longid, family, shortid) VALUES (fabac1f0-7f88-11e3-baa7-0800200c9a66, 'photo', 'dDae3KDDj4Q');
UPDATE photo SET shortid = 'dDae3KDDj4Q' WHERE id = fabac1f0-7f88-11e3-baa7-0800200c9a66;
So, is there really no better way to do this in Cassandra than building an application that exists just for that? Couldn't I just do something PostgreSQL-like on Cassandra:
UPDATE shortid_last_index SET last_index = last_index+1, last_long_id = ? WHERE family = 'photo' RETURNING last_index;
In comparison, if the statement above worked it would probably lock the row; but increasing and grabbing the index in the application itself and then safely incrementing the counter in Cassandra wouldn't lock the row, would it? How scalable would the application be?
If you need short incremental id generation, please take a look at Snowflake or one of the other countless clones/inspirations.
What you are attempting to do is a bad idea on multiple counts: a counter table cannot mix counter and regular columns, so shortid_last_index as defined is invalid; counters cannot be read and incremented atomically, so two concurrent clients can read the same last_index and mint duplicate shortids; and counter increments are not idempotent, so a timed-out retry can skip or repeat values.
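If the sequence must live in Cassandra, one alternative sketch (not from the original answer) is a lightweight transaction on a regular, non-counter column, which gives an atomic compare-and-set at the cost of a Paxos round trip per increment; shortid_sequence is a hypothetical table:
CREATE TABLE shortid_sequence (
family text PRIMARY KEY,
last_index bigint
);
INSERT INTO shortid_sequence (family, last_index) VALUES ('photo', 0) IF NOT EXISTS;
-- read last_index, then attempt a conditional increment; retry when [applied] is false
UPDATE shortid_sequence SET last_index = 6 WHERE family = 'photo' IF last_index = 5;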
