Cassandra Schema for retrieving date-ordered records

Folks,
I would like to solve the following with one table in Cassandra. The service tracks when users open an asset. On subsequent events for the same asset, we simply overwrite the accessDate.
example record:
{ userId: "string", assetId: "string", accessDate: unixTimestamp }
With that said, we need to fulfill the following access requirements (each requirement is listed separately for readability):
Be able to return all assets a user has opened, and at what time.
This is easy to achieve, table could look like:
CREATE TABLE user_assets_tracker (
userId uuid,
accessDate timestamp,
assetId uuid,
PRIMARY KEY (userid, accessDate, assetId)
);
This allows us to query for all assets, and when each was last accessed.
SELECT *
FROM user_assets_tracker
WHERE userId = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC;
Dandy. Now the harder bits, which I am unsure about. I was hoping you folks could chime in:
Show me all the assets user added in the past 30 days.
Naturally the LIMIT here is not what we need. Also, we may need to have 2 tables to achieve this.
SELECT *
FROM user_assets_tracker
WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC
LIMIT 10; -- ?????
Show me the last accessed item for the user. I think this one is easier, the LIMIT 1 solves that.
This is probably straightforward with this schema:
CREATE TABLE user_assets_tracker (
userId uuid,
accessDate timestamp,
assetId uuid,
PRIMARY KEY (userid, accessDate, assetId)
);
SELECT *
FROM user_assets_tracker
WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC
LIMIT 1;
Retrieve the full record for a particular userId + assetId
Since accessDate comes before assetId in our schema, I am not sure how to do this either. Another table?
Thanks!!
PS It seems that SASI Index could be the solution

Since you always select assets ordered by accessDate descending, define your schema with that clustering order:
CREATE TABLE user_assets_tracker (
userid uuid,
accessdate timestamp,
assetid uuid,
PRIMARY KEY (userid, accessdate, assetid)
) WITH CLUSTERING ORDER BY (accessdate DESC, assetid ASC);
Now you don't need to specify ORDER BY accessDate DESC every time. Rows within a partition are returned ordered by accessDate descending by default.
Show me all the assets user added in the past 30 days.
First, compute the timestamp for 30 days ago.
Let's say the timestamp 30 days ago is 2017-02-05 12:00:00+0000.
Now you can query:
SELECT * FROM user_assets_tracker WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 AND accessdate >= '2017-02-05 12:00:00+0000';
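The cutoff can be computed client-side before running the query. Here is a minimal Python sketch of that step; the helper name cutoff_literal and the fixed "now" are my own assumptions for illustration, and in real code you would more likely bind the value as a parameter through your driver than format it into the statement.

```python
from datetime import datetime, timedelta, timezone

def cutoff_literal(now, days=30):
    """Format the instant `days` days before `now` as a CQL timestamp literal."""
    cutoff = now - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S%z")

# A fixed "now" keeps the example reproducible; in practice use datetime.now(timezone.utc).
now = datetime(2017, 3, 7, 12, 0, 0, tzinfo=timezone.utc)
cutoff = cutoff_literal(now)

# Same shape as the query above, with the computed cutoff substituted in.
query = (
    "SELECT * FROM user_assets_tracker "
    "WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 "
    f"AND accessdate >= '{cutoff}';"
)
print(query)
```

With the fixed date above, the computed cutoff is the same 2017-02-05 12:00:00+0000 value used in the example query.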
Retrieve the full record for a particular userId + assetId
If you are using Cassandra 3.0 or above, you can use a materialized view.
Create the materialized view:
CREATE MATERIALIZED VIEW user_assets AS
SELECT *
FROM user_assets_tracker
WHERE userid IS NOT NULL AND assetid IS NOT NULL AND accessdate IS NOT NULL
PRIMARY KEY (userid, assetid, accessdate);
Now, to get all the data for a given userid and assetid, use this query:
SELECT * FROM user_assets WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 AND assetid = 1d45e6c2-02a1-11e7-aac5-b9ab92bee74c;
One more thing: if a huge amount of data is inserted for a single user, you should add a time bucket to the partition key alongside userid. For more, see this answer: https://stackoverflow.com/a/41857183/2320144
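To make the time-bucket idea a bit more concrete, here is a rough Python sketch. It assumes a monthly bucket stored as text (for example a hypothetical bucket column, with a partition key of (userid, bucket)); the granularity and the helper names are my assumptions, not something taken from the linked answer. A range query such as "last 30 days" then has to visit every bucket the range touches.

```python
from datetime import datetime, timezone

def month_bucket(ts):
    """Bucket value stored alongside userid in the partition key, e.g. '2017-02'."""
    return ts.strftime("%Y-%m")

def buckets_between(start, end):
    """All month buckets touched by the range [start, end], newest first."""
    buckets = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return buckets

start = datetime(2016, 12, 15, tzinfo=timezone.utc)
end = datetime(2017, 2, 5, tzinfo=timezone.utc)
buckets = buckets_between(start, end)
print(buckets)  # one query per (userid, bucket) partition
```

The trade-off is that reads spanning several buckets need several queries, in exchange for partitions that stay bounded in size.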

Related

SyntaxException: line 2:10 no viable alternative at input 'UNIQUE' > (...NOT EXISTS books ( id [UUID] UNIQUE...)

I am trying the following code to create a keyspace and a table inside of it:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy',
'replication_factor': 3 };
CREATE TABLE IF NOT EXISTS books (
id UUID PRIMARY KEY,
user_id TEXT UNIQUE NOT NULL,
scale TEXT NOT NULL,
title TEXT NOT NULL,
description TEXT NOT NULL,
reward map<INT,TEXT> NOT NULL,
image_url TEXT NOT NULL,
video_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
But I do get:
SyntaxException: line 2:10 no viable alternative at input 'UNIQUE'
(...NOT EXISTS books ( id [UUID] UNIQUE...)
What is the problem and how can I fix it?
I see three syntax issues. They mainly come down to CQL != SQL.
The first is that NOT NULL is not valid at column definition time. Cassandra doesn't enforce constraints like that at all, so for this case, just get rid of all of them.
Next, Cassandra CQL does not allow default values, so this won't work:
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
Providing the current timestamp for created_at is something that will need to be done at write-time. Fortunately, CQL has a few built-in functions to make this easier:
INSERT INTO books (id, user_id, created_at)
VALUES (uuid(), 'userOne', toTimestamp(now()));
In this case, I've invoked the uuid() function to generate a Type-4 UUID. I've also invoked now() for the current time. However now() returns a TimeUUID (Type-1 UUID) so I've nested it inside of the toTimestamp function to convert it to a TIMESTAMP.
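If you would rather generate these values in the application than in CQL, the same idea looks roughly like this in Python. This is only a sketch: new_book_row and timeuuid_to_unix_seconds are hypothetical helper names, and the second one just illustrates why a Type-1 UUID can be converted to a timestamp at all (it carries one inside).

```python
import time
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_unix_seconds(u):
    """Extract the timestamp embedded in a Type-1 UUID as Unix seconds."""
    return (u.time - UUID_EPOCH_OFFSET) / 1e7

def new_book_row(user_id):
    """Build write-time values client-side instead of calling uuid() / now() in CQL."""
    return {
        "id": uuid.uuid4(),                        # random Type-4 UUID
        "user_id": user_id,
        "created_at": datetime.now(timezone.utc),  # current time as a timestamp
    }

row = new_book_row("userOne")
time_uuid = uuid.uuid1()                           # Type-1 UUID, similar in kind to now()
embedded = timeuuid_to_unix_seconds(time_uuid)
print(row["id"].version, time_uuid.version, round(embedded))
```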
Finally, UNIQUE is not valid.
user_id TEXT UNIQUE NOT NULL,
It looks like you're trying to make sure that duplicate user_ids are not stored with each id. You can help to ensure uniqueness of the data in each partition by adding user_id to the end of the primary key definition as a clustering key:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
...
PRIMARY KEY (id, user_id));
This PK definition will ensure that data for books will be partitioned by id, containing multiple user_id rows.
I'm not sure what the relationship between books and users is, though. If one book can have many users, then this will work. If one user can have many books, then you'll want to switch the order of the keys to this:
PRIMARY KEY (user_id, id));
In summary, a working table definition for this problem looks like this:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
scale TEXT,
title TEXT,
description TEXT,
reward map<INT,TEXT>,
image_url TEXT,
video_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY (id, user_id));
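To see why the primary key stands in for UNIQUE here, it may help to think of the table as a map keyed by the full primary key: a second write with the same key overwrites the row rather than creating a duplicate. The Python below is only a toy in-memory model of that behavior (the insert helper and sample values are made up), not how Cassandra is implemented.

```python
# Toy model: a table is a mapping from the full primary key to the row's columns.
books = {}

def insert(book_id, user_id, **columns):
    """Mimic a CQL INSERT: same primary key means overwrite (upsert), not a duplicate."""
    books.setdefault((book_id, user_id), {}).update(columns)

insert("book-1", "userOne", title="First title")
insert("book-1", "userOne", title="Second title")  # same key: overwrites the title
insert("book-1", "userTwo", title="Another row")   # new clustering value: new row

print(len(books))
print(books[("book-1", "userOne")]["title"])
```

Note that this means an INSERT never fails on a "duplicate"; the last write simply wins.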

Is using all fields as partition keys in a table a drawback in Cassandra?

My aim is to get the msgAddDate based on the query below:
select max(msgAddDate)
from sampletable
where reportid = 1 and objectType = 'loan' and msgProcessed = 1;
Design 1 :
Here reportid, objectType and msgProcessed may not be unique. To add uniqueness I have added msgAddDate and msgProcessedDate (an additional unique value).
I use this design because I don't perform range queries.
Create table sampletable ( reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate))
);
Design 2 :
create table sampletable (
reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
);
Please advise which one to use, and what the performance pros and cons of each are.
Design 2 is the one you want.
In Design 1, the whole primary key is the partition key. This means you need to provide all the attributes (reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate) to be able to query your data with a SELECT statement, which wouldn't be useful, as you would not retrieve any attributes beyond the ones you already provided in the WHERE clause.
In Design 2, your partition key is (reportid, msgProcessed, objectType), which are the three attributes you want to query by. Great. msgAddDate is the first clustering column, so rows within a partition are automatically sorted by it. You don't even need to run a max, since the data is already sorted. All you need to do is use LIMIT 1:
SELECT msgAddDate FROM sampletable WHERE reportid = 1 and objectType = 'loan' and msgProcessed = 1 LIMIT 1;
Of course, make sure to define a DESC sort order on msgAddDate, since the default clustering order is ascending.
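To illustrate why the sort direction matters for LIMIT 1, here is a small Python model of reading one partition. It is only a sketch with made-up dates and a hypothetical read_partition helper: with descending clustering order the first row is the maximum, with the default ascending order it is the minimum.

```python
from datetime import datetime

# msgAddDate values written to one partition, in arbitrary arrival order.
written = [datetime(2017, 1, 5), datetime(2017, 3, 1), datetime(2017, 2, 10)]

def read_partition(values, descending, limit=None):
    """Return values in clustering order, optionally truncated the way LIMIT would."""
    ordered = sorted(values, reverse=descending)
    return ordered if limit is None else ordered[:limit]

newest = read_partition(written, descending=True, limit=1)   # DESC order + LIMIT 1
oldest = read_partition(written, descending=False, limit=1)  # default ASC + LIMIT 1
print(newest[0], oldest[0])
```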
Hope it helps!

Suggestion for Cassandra Data Model for Chat application

I am currently developing a chat application on top of Cassandra.
A conversation
can happen between one or more users.
can have more than one message.
will be marked read if all the messages are read.
In an extreme case, a conversation can have up to 100 users.
I want to solve the following query requirements.
Show top n recent conversations for a given user.
Show count of unread conversations (not messages) for a given user.
Any suggestions on Data Modelling?
You can start with this structure :
CREATE TABLE conversation (
conversation_id timeuuid,
user_from varchar,
user_to varchar,
message text,
message_read boolean,
message_date timestamp,
conversation_read boolean static,
PRIMARY KEY ((conversation_id, user_to), message_date)
)
WITH CLUSTERING ORDER BY (message_date ASC);
All your queries will be based on conversation_id and user_to. Messages will be ordered by creation date. I think this structure can support the main purpose of a chat.
For the two queries, you need to have other denormalized tables like :
1) Show top n recent conversations for a given user.
CREATE TABLE user_message (
user varchar,
message text,
message_date timestamp,
PRIMARY KEY ((user), message_date)
)
WITH CLUSTERING ORDER BY (message_date DESC);
SELECT message
FROM user_message
WHERE user = 'some user'
LIMIT 10;
2) Show count of unread conversations (not messages) for a given user.
CREATE TABLE user_conversations (
user varchar,
conversation_id timeuuid,
conversation_read boolean,
PRIMARY KEY((user), conversation_read, conversation_id)
);
SELECT COUNT(1)
FROM user_conversations
WHERE user = 'some user'
AND conversation_read = false;
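One thing to keep in mind with this table: conversation_read is part of the primary key, and primary key columns cannot be changed with an UPDATE. Marking a conversation as read therefore means deleting the unread row and inserting a read one. A rough Python sketch of building those two statements follows; the mark_conversation_read helper is hypothetical, and the %s placeholder style is an assumption that depends on your driver.

```python
import uuid

def mark_conversation_read(user, conversation_id):
    """Statements that move one conversation from the unread slice to the read slice."""
    key = (user, conversation_id)
    return [
        ("DELETE FROM user_conversations "
         "WHERE user = %s AND conversation_read = false AND conversation_id = %s", key),
        ("INSERT INTO user_conversations (user, conversation_id, conversation_read) "
         "VALUES (%s, %s, true)", key),
    ]

conversation_id = uuid.uuid1()  # stands in for a real timeuuid conversation id
statements = mark_conversation_read("some user", conversation_id)
for cql, params in statements:
    print(cql, params)
```

Both statements target the same partition (the user), so they could reasonably be sent together in one batch.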
If you can use Cassandra 3.x, you can use a MATERIALIZED VIEW to manage data denormalization.
Hope this can help you.

Cassandra + Fetch the last records using in query

I am new to Cassandra and am using it with Node.js.
I have a user_activity table. Rows are inserted into it based on user activity.
I also have a list of users. I need to fetch the last record for each user in that list.
I would rather not run the query in a for loop. Is there another way to achieve this?
Example Code:
var userlist = ["12", "34", "56"];
var query = 'SELECT * FROM user_activity WHERE userid IN ?';
server.user.execute(query, [userlist], {
prepare : true
}, function(err, result) {
if (err) { return console.error(err); }
console.log(result);
});
How do I get the last record for each user in the list?
Example:
user id = 12 - need to get last record;
user id = 34 - need to get last record;
user id = 56 - need to get last record;
I need to get these 3 records.
Table Schema:
CREATE TABLE test.user_activity (
userid text,
ts timestamp,
clientid text,
clientip text,
status text,
PRIMARY KEY (userid, ts)
)
It is not possible if you use the IN filter.
If you filter on a single userid, you can apply ORDER BY. Of course, you need a column for the inserted/updated time; in your table that is the ts clustering column. So the query will look like this:
SELECT * FROM user_activity WHERE userid = '12' ORDER BY ts DESC LIMIT 1;
You can set the LIMIT to N to get the N most recent records for that user:
SELECT * FROM user_activity WHERE userid = '12' ORDER BY ts DESC LIMIT N;
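If you do end up fetching rows for several users at once, the "last record per user" step can be done client-side. Below is a minimal Python sketch of that reduction over already-fetched rows; the sample rows and the latest_per_user helper are made up for illustration. It does mean reading more rows than you keep, so it only suits modest histories. On newer versions (3.6 and later, as far as I know) the PER PARTITION LIMIT clause may also be worth checking.

```python
from datetime import datetime

def latest_per_user(rows):
    """Keep only the row with the greatest ts for each userid."""
    latest = {}
    for row in rows:
        current = latest.get(row["userid"])
        if current is None or row["ts"] > current["ts"]:
            latest[row["userid"]] = row
    return latest

# Rows as they might come back for userid IN ('12', '34', '56').
rows = [
    {"userid": "12", "ts": datetime(2017, 3, 1, 9, 0), "status": "login"},
    {"userid": "12", "ts": datetime(2017, 3, 1, 9, 30), "status": "logout"},
    {"userid": "34", "ts": datetime(2017, 2, 27, 18, 0), "status": "login"},
    {"userid": "56", "ts": datetime(2017, 3, 2, 8, 15), "status": "login"},
    {"userid": "56", "ts": datetime(2017, 3, 1, 7, 0), "status": "logout"},
]

result = latest_per_user(rows)
for userid in sorted(result):
    print(userid, result[userid]["ts"], result[userid]["status"])
```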

Cassandra Schema for a Chat Application

I have gone through this article and here is the schema I got from it. This is helpful for my application for maintaining the statuses of a user, but how can I extend it to maintain a one-to-one chat archive and relations between users? By relations I mean that people belong to a specific group. I am new to this and need an approach.
Requirements :
I want to store messages between user-user in a table.
Whenever a user want to load messages by a user. I want to retrieve them back and send it to user.
I want to retrieve all the messages from different users to the user when user has requested.
And also want to store class of users. I mean for example user1 and user2 belong to "family" user3, user4, user1 belong to friends etc... This group can be custom name given by the user.
This is what I have tried so far:
CREATE TABLE chatarchive (
chat_id uuid PRIMARY KEY,
username text,
body text
)
CREATE TABLE chatseries (
username text,
time timeuuid,
chat_id uuid,
PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time ASC)
CREATE TABLE chattimeline (
to text,
username text,
time timeuuid,
chat_id uuid,
PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time ASC)
Below is the schema that I currently have:
CREATE TABLE users (
username text PRIMARY KEY,
password text
)
CREATE TABLE friends (
username text,
friend text,
since timestamp,
PRIMARY KEY (username, friend)
)
CREATE TABLE followers (
username text,
follower text,
since timestamp,
PRIMARY KEY (username, follower)
)
CREATE TABLE tweets (
tweet_id uuid PRIMARY KEY,
username text,
body text
)
CREATE TABLE userline (
username text,
time timeuuid,
tweet_id uuid,
PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC)
CREATE TABLE timeline (
username text,
time timeuuid,
tweet_id uuid,
PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC)
With C* you need to store data in the way you'll use it.
So let's see what this would look like for this case:
I want to store messages between user-user in a table.
Whenever a user want to load messages by a user. I want to retrieve them back and send it to user.
CREATE TABLE chat_messages (
message_id uuid,
from_user text,
to_user text,
body text,
class text,
time timeuuid,
PRIMARY KEY ((from_user, to_user), time)
) WITH CLUSTERING ORDER BY (time ASC);
This will allow you to retrieve a timeline of messages between two users. Note that a composite partition key is used, so that one wide row (partition) is created for each ordered (from_user, to_user) pair.
SELECT * FROM chat_messages WHERE from_user = 'mike' AND to_user = 'john' ORDER BY time DESC ;
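Because the partition key is the ordered pair (from_user, to_user), messages from mike to john and from john to mike live in different partitions. To show a full two-way conversation you would query both directions and merge the results by time. A small Python sketch of that merge follows; it assumes each result list is already in ascending time order, and the integer times are only stand-ins for timeuuid ordering.

```python
import heapq

def merge_conversation(a_to_b, b_to_a):
    """Merge two time-ascending message lists into one time-ascending conversation."""
    return list(heapq.merge(a_to_b, b_to_a, key=lambda m: m["time"]))

# Results of the two directional queries, each already sorted by time ascending.
mike_to_john = [
    {"time": 1, "from_user": "mike", "body": "hi john"},
    {"time": 4, "from_user": "mike", "body": "see you then"},
]
john_to_mike = [
    {"time": 2, "from_user": "john", "body": "hey mike"},
    {"time": 3, "from_user": "john", "body": "lunch at noon?"},
]

conversation = merge_conversation(mike_to_john, john_to_mike)
for message in conversation:
    print(message["time"], message["from_user"], message["body"])
```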
I want to retrieve all the messages from different users to the user when user has requested.
CREATE INDEX chat_messages_to_user ON chat_messages (to_user);
This allows you to do:
SELECT * FROM chat_messages WHERE to_user = 'john';
And also want to store class of users. I mean for example user1 and user2 belong to "family" user3, user4, user1 belong to friends etc... This group can be custom name given by the user.
CREATE INDEX chat_messages_class ON chat_messages (class);
This will allow you to do:
SELECT * FROM chat_messages WHERE class = 'family';
Note that in this kind of database, DENORMALIZED DATA IS A GOOD PRACTICE. This means that using the name of the class again and again is not a bad practice.
Also note that I haven't used a 'chat_id' nor a 'chats' table. We could easily add this but I feel that your use case didn't require it as it has been put forward. In general, you cannot do joins in C*. So, using a chat id would imply two queries.
EDIT: Secondary indexes are inefficient. A materialised view will be a better implementation with C* 3.0
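Another common alternative to a secondary index is to write each message to a second table that is partitioned the way the second query needs. The Python sketch below just builds the two INSERT statements for one message; the chat_messages_by_recipient table (imagined as partitioned by to_user) and the %s placeholder style are assumptions for illustration, not part of the schema above.

```python
import uuid

INSERT_TEMPLATE = (
    "INSERT INTO {table} (from_user, to_user, time, body) VALUES (%s, %s, %s, %s)"
)

def fan_out_inserts(from_user, to_user, body, time_id):
    """Build one INSERT per query table so each read hits a single partition."""
    params = (from_user, to_user, time_id, body)
    tables = ["chat_messages", "chat_messages_by_recipient"]  # second table is hypothetical
    return [(INSERT_TEMPLATE.format(table=table), params) for table in tables]

time_id = uuid.uuid1()  # stands in for the timeuuid clustering value
statements = fan_out_inserts("mike", "john", "hi john", time_id)
for cql, params in statements:
    print(cql, params)
```

The cost is one extra write per message; the benefit is that reading all messages for a recipient becomes a single-partition query.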
There is a chat application created by Alan Chandler on github that has the features you request:
MBchat
It uses a 2-phase authentication. First the user is validated in the forums and then, the user is validated on the chat database.
Here's the first validation part of the schema (schema located in inc/user.sql):
BEGIN;
CREATE TABLE users (
uid integer primary key autoincrement NOT NULL,
time bigint DEFAULT (strftime('%s','now')) NOT NULL,
name character varying NOT NULL,
role text NOT NULL DEFAULT 'R', -- A (CEO), L (DIRECTOR), G (DEPT HEAD), H (SPONSOR) R(REGULAR)
cap integer DEFAULT 0 NOT NULL, -- 1 = blind, 2 = committee secretary, 4 = admin, 8 = mod, 16 = speaker 32 = can't whisper( OR of capabilities).
password character varying NOT NULL, -- raw password
rooms character varying, -- a ":" separated list of rooms nos which define which rooms the user can go in
isguest boolean DEFAULT 0 NOT NULL
);
CREATE INDEX userindex ON users(name);
-- Below here you can add the specific users for your set up in the form of INSERT Statements
-- This list is test users to cover the complete range of functions. Note names are converted to lowercase, so only put lowercase names in here
INSERT INTO users(uid,name,role,cap,password,rooms,isguest) VALUES
(1,'alice','A',4,'password','7',0), -- CEO class user alice
(2,'bob','L',3,'password','8',0), -- DIRECTOR class user bob
(3,'carol','G',2,'password','7:8:9',0), -- DEPT HEAD class user carol
And here's the second validation part of the schema (schema located in data/chat.sql):
CREATE TABLE users (
uid integer primary key NOT NULL,
time bigint DEFAULT (strftime('%s','now')) NOT NULL,
name character varying NOT NULL,
role char(1) NOT NULL default 'R',
rid integer NOT NULL default 0,
mod char(1) NOT NULL default 'N',
question character varying,
private integer NOT NULL default 0,
cap integer NOT NULL default 0,
rooms character_varying
);
The following is the schema of the chat rooms you can see the user classes and the examples of it:
CREATE TABLE rooms (
rid integer primary key NOT NULL,
name varchar(30) NOT NULL,
type integer NOT NULL -- 0 = Open, 1 = meeting, 2 = guests can't speak, 3 moderated, 4 members(adult) only, 5 guests(child) only, 6 creaky door
) ;
INSERT INTO rooms (rid, name, type) VALUES
(1, 'The Forum', 0),
(2, 'Operations Gallery', 2), -- Guests Can't Speak
(3, 'Dungeon Club', 6), -- creaky door
(4, 'Auditorium', 3), -- Moderated Room
(5, 'Blue Room', 4), -- Members Only (in Melinda's Backups this is Adults)
(6, 'Green Room', 5), -- Guest Only (in Melinda's Backups this is Juveniles AKA Baby Backups)
(7, 'The Board Room', 1), -- Various meeting rooms - need to be on users room list
The users have another table to indicate the participation of the conversation:
CREATE table wid_sequence ( value integer);
INSERT INTO wid_sequence (value) VALUES (1);
CREATE TABLE participant (
uid integer NOT NULL REFERENCES users (uid) ON DELETE CASCADE ON UPDATE CASCADE,
wid integer NOT NULL,
primary key (uid,wid)
);
And the archives are recorded as follows:
CREATE TABLE chat_log (
lid integer primary key,
time bigint DEFAULT (strftime('%s','now')) NOT NULL,
uid integer NOT NULL REFERENCES user (uid) ON DELETE CASCADE ON UPDATE CASCADE,
name character varying NOT NULL,
role char(1) NOT NULL,
rid integer NOT NULL,
type char(2) NOT NULL,
text character varying
);
Edit: However, this type of data modeling is not very suitable for Cassandra. In Cassandra, data is distributed across many machines, so joins are not available, and denormalizing data is the practical choice. Below is a denormalized version of the chat_log table:
CREATE TABLE chat_log (
lid uuid,
time timestamp,
sender text,
receiver text,
room text,
sender_role varchar,
receiver_role varchar,
rid decimal,
status varchar,
message text,
-- CQL has no NOT NULL constraint; time is a clustering column so messages are ordered within each (receiver, room)
PRIMARY KEY (sender, receiver, room, time)
-- PRIMARY KEY (sender, receiver, time) if you don't want the messages to be separated by the rooms
) WITH CLUSTERING ORDER BY (receiver ASC, room ASC, time ASC);
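In practice, denormalizing means the application copies the values it would otherwise have joined in (here, the roles) into every row at write time. A minimal Python sketch of that step is below; the in-memory users lookup and the build_chat_log_row helper are made up for illustration, with the roles taken from the sample users shown earlier.

```python
from datetime import datetime, timezone

# Stand-in for a lookup against the users data (alice is 'A', bob is 'L' in the sample data).
users = {"alice": {"role": "A"}, "bob": {"role": "L"}}

def build_chat_log_row(sender, receiver, room, message, sent_at):
    """Copy both users' roles into the row so reads never need a join."""
    return {
        "sender": sender,
        "receiver": receiver,
        "room": room,
        "sender_role": users[sender]["role"],
        "receiver_role": users[receiver]["role"],
        "message": message,
        "time": sent_at,
    }

row = build_chat_log_row(
    "bob", "alice", "The Forum", "hello", datetime(2017, 3, 7, 12, 0, tzinfo=timezone.utc)
)
print(row)
```

The flip side is that if a user's role changes later, rows already written keep the old role unless you rewrite them.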
Now in order to retrieve data you'd use the following queries:
Whenever a user want to load messages by a user. I want to retrieve them back and send it to user.
SELECT * FROM chat_log WHERE sender = 'bob';
I want to retrieve all the messages from different users to the user when user has requested.
SELECT * FROM chat_log WHERE receiver = 'alice';
I want to store and retrieve class of users.
SELECT * FROM chat_log WHERE sender_role = 'A'; -- messages sent by CEOs
SELECT * FROM chat_log WHERE receiver_role = 'A'; -- messages received by CEOs
After modeling the data, you need to create indexes for querying on the non-partition-key columns, as follows:
For retrieving all messages from different users to the user (sender is already the partition key, so it needs no index):
CREATE INDEX chat_log_receiver ON chat_log (receiver);
For retrieving all messages by user class:
CREATE INDEX chat_log_sender_role ON chat_log (sender_role);
CREATE INDEX chat_log_receiver_role ON chat_log (receiver_role);
I believe these examples will give you the approach you need.
If you'd like to learn more about Cassandra data modeling you can check down below:
Cassandra Data Modeling Best Practices, Part 1
Cassandra Data Modeling Best Practices, Part 2
Cassandra Data Modeling Best Practices Slide
Data Modeling Example
