How to update denormalized tables in Cassandra - cassandra

I'm new to Cassandra world. Actually I'm trying to create a chat application for my users (traders). I have a table as below to find traders by a room_id.
CREATE TABLE IF NOT EXISTS traders_by_room (
room_id uuid,
trader_id uuid,
trader_username text,
trader_mongo_id text,
trader_profile_url text,
trader_profile_image text,
PRIMARY KEY (room_id)
)
CREATE TABLE IF NOT EXISTS traders (
trader_id uuid,
trader_username text,
trader_mongo_id text,
trader_profile_url text,
trader_profile_image text,
PRIMARY KEY (trader_id)
)
Now the issue is lets say that a particular trader updates his/her trader_profile_image in traders table, then the same needs to be updated in trader_by_room table. But for updating the trader_by_room table, we must provide room_id in WHERE clause, which might not be known, as we are just updating the profile pic for the trader. Couldn't really think of a solution.
Any help would mean a lot.

Related

CQL query delete if not in list

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?
CQL supports only IN condition (see docs). You need to explicitly specify which primary key or partition keys to delete, you can't use condition not in, because it's potentially could be a huge amount of data. If you need to do that, you need to generate the list of entries to delete - you can do that using Spark Cassandra Connector, for example.

Cassandra Order By Updated At

I'm trying to build a cassandra schema to represent chat.
The one thing i can't seem to figure out is how to query most recently updated rooms (similar to most chat app list view)
Fields desired in list view ordered by updated_at desc
*room id
room title
room image
*user
*updated_at
*message entry
*message type
*metadata
Current Tables
Create TYPE user(
id uuid,
name text,
avatar text
);
CREATE TABLE rooms(
id uuid,
"name" text,
image text,
users set<user>,
archived boolean,
created_at timestampz,
updated_at timestampz,
PRIMARY KEY(id)
);
CREATE TABLE messages(
room_id uuid,
message_id timeuuid,
user user,
message_type int,
entry text,
metadata map<text, text>,
PRIMARY KEY(room_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE TABLE rooms_by_user(
user_id uuid,
room_id uuid,
PRIMARY KEY(user_id, room_id)
);
Possible solutions that i can come up with.
Duplicate all room details to each message
allows easy query with SELECT * FROM messages PER PARTITION LIMIT 1
this would be a lot of duplicate data per message...
Query latest messages which user belongs to get room ids then query rooms
This doesn't seem to be the cassandra way?
Is there a better way to model my data?
By looking at the schema it looks like you need relational database.
In Cassandra usually you use one table per query, it means you you should design your table by how you will structure query.
Also you can query by partition key or clustering column (second one should be partition key + clustering column).
So in order to query by updater_at, you need to make that column as clustering column. And keep in mind that in Cassandra you cannot alter keys.

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get data would have a where userId = xxx order by updateTimestamp. Hence the schema had updateDateTimestamp.
Problem comes in updating the column of table.The query is: Update the album information for user where user id = xxx. But as per specs,for update query I would need the exact value of updateDateTimestamp which in normal world scenario, an application would never send.
What should be the answer to such problems since I believe this a very common use case where select query requires ordering on timestamp. Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
If your usecase is such that each partition will contain exactly one row,
then you can model your table like:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
modelling the table this way enables Update query to be done in following way:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz'
Hope this helps.

How to model cassandra in this particular situations?

if I have table structure below, how can i query by
"source = 'abc' and created_at >= '2016-01-01 00:00:00'"?
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY (id)
)
I would like to model my system according to this:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Edit:
What we are doing is very similar to what you are proposing. The difference is our primary key doesn't have brackets around source:
PRIMARY KEY (source, created_at, id). We also have two other indexes:
CREATE INDEX articles_id_idx ON crawler.articles (id);
CREATE INDEX articles_url_idx ON crawler.articles (url);
Our system is really slow like this. What do you suggest?
Thanks for your replies!
Given the table structure
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You can issue the following queries:
SELECT * FROM articles WHERE source=xxx // Give me all article given the source xxx
SELECT * FROM articles WHERE source=xxx AND created_at > '2016-01-01 00:00:00'; // Give me all articles whose source is xxx and created after 2016-01-01 00:00:00
The couple (created_at,id) in the primary key is here to guarantee article unicity. Indeed, it is possible to have, at the same created_at time, 2 different articles
Given the knowledge from previous question you posted where I said index is slowing down your query you need to solve two things:
Write article only if it does not already exist
Query article based on source and range query on created at
Based on those two I would go with two tables:
Reverse index table
CREATE TABLE article_by_id (
id text,
source text,
created_at timestamp,
PRIMARY KEY (id) ) WITH comment = 'Article by id.';
This table will be used to insert articles when they first arrive. Based on return statement after INSERT ... IF NOT EXISTS you will know if article is existing or new and if it is new you will write to second table. Also this table can serve to find all key parts for second table based on article id. If you need full article data you can add to this table as well all fields (category, channel etc.). This will be skinny row holding only single article in one partition.
Example of INSERT:
INSERT INTO article_by_id(id, source, created_at) VALUES (%s,%s, %s) IF NOT EXISTS;
Java driver returns true or false whether this query was applied or not. Probably it is same in python driver but I did not use it.
Table for range queries and queries by source
As doanduyhai suggested you create a second table:
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
In this table you will write only if first INSERT returned true meaning you have new article, not existing one. This table will serve range queries and queries by source.
Improvement suggestion
By using timeuuid instead of timestamp for created_at you are sure no two article can have same created_at and you can loose id all together and rely on timeuuid. However from second question I can see you rely on external id so wanted to mention this as a sidenote.

What do you do change in the such a data model Cassandra?

I have task to create a social feed(news feed). I think no need to explain the standard functionality - all are how as FB.
I chose solution apache cassandra and designed a data column Posts for storing information about posts users:
CREATE TABLE Posts (
post_id uuid,
post_at timestamp,
user_id text,
name varchar,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY ((post_id, user_id), post_at)
)
WITH CLUSTERING ORDER BY (post_at DESC) COMPACT STORAGE;
The next table contains id user posts:
CREATE TABLE posts_user (
post_id bigint,
post_at timestamp,
user_id bigint,
PRIMARY KEY ((post_id), post_at, user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;
How do you think, is it good? What do you do change in the such a data model?
There are a couple of questions and a couple of improvements that jump out.
COMPACT STORAGE is deprecated now (if you want to take advantage of CQL 3 features). I do not think that you can create your table Posts as you have defined above since it uses CQL 3 features (collections) with COMPACT STORAGE as well as declaring more than one column that is not part of the primary key.
posts_user has completely different key types than Posts does. I am not clear on what the relationship between the two tables is, but I imagine that post_id is supposed to be consistent between them, whereas you have it as a uuid in one table and a bigint in the other. There are also discrepancies with the other fields.
Assuming post_id is unique and represents the id of an individual post, it is strange to have it as the first part of a compound primary key in the Posts table since if you know the post_id then you can already uniquely access the record. Furthermore, as it is part of the partition key it also prevents you from doing wider selects of multiple posts and taking advantage of your post_at ordering.
The common method to fix this is to create a dedicated index table to sort the data the way you want.
E.g.
CREATE TABLE posts (
id uuid,
created timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (id)
);
CREATE TABLE posts_by_user_index (
user_id uuid,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id,post_at,post_id)
WITH CLUSTERING ORDER BY (post_at DESC)
);
Or more comprehensively:
CREATE TABLE posts_by_user_sort_index (
user_id uuid,
post_id uuid,
sort_field text,
sort_value text,
PRIMARY KEY ((user_id,sort_field),sort_value,post_id)
);
However, in your case if you only wish to select the data one way, then you can get away with using your posts table to do the sorting:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (user_id,post_at,id)
WITH CLUSTERING ORDER BY (post_at DESC)
);
It will just make it more complicated if you wish to add additional indexes later since you will need to index each post not just by its post id, but by its user and post_at fields as well.

Resources