I have 2 questions:
First: I heard Cassandra is not good with the UPDATE statement and that updates should not happen often. Is that true? If yes, why?
Second: A user can create a post that other users can like, comment on, etc.
Now I want to order posts by most likes. How do I do that in Cassandra, and how do I store it? I heard that frequent updates are not good, so how can I denormalize this?
Table posts
CREATE TABLE posts_by_id (
user_id UUID,
post_id UUID,
post_text TEXT,
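comments map<text, text>, -- value type assumed; the original gave only map<text>
likes map<text, text>, -- value type assumed; the original gave only map<text>
createdAt timestamp,
PRIMARY KEY ((post_id, user_id)) -- now how can I sort by highest likes?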
)
That statement isn't correct. Updates are a normal operation in Cassandra, just like in any other database.
If you provide your source, I will be happy to review it and update my answer. Cheers!
I want to make a table where users follow other users. In this table I need a userID for the following user and a userID for the followed user. Some users may be followed by hundreds of thousands of users, which is too many to store efficiently in a collection.
How can I model it so that all of the followers of a single user don't end up in the same partition? Before realizing this problem I wanted to use either the following user ID or the followed user ID as the partition key, but then there would be a hot-partition problem.
Could I use a composite partition key with the following user ID and the followed user ID to solve the hot-partition problem? What else could solve this problem?
My 2 main queries are to get all following users of a user and to get all followed users of a user.
If I use a composite partition key with those 2 IDs, can I even run the queries above? Or would I need to query with both IDs, making it impossible?
Thank you for your help.
This is a common problem in social graphs where certain personalities have millions more followers than everyone else. I tend to use Taylor Swift and Barack Obama as examples.
A lot of social platforms handle this "super-nodes" issue by isolating them in a separate data store so that the main store doesn't queue up when someone traverses a super-node.
This does mean that you need to handle it in your app: the app checks the user against a reference table of super-users/super-nodes and, for those users, does lookups in the sub-table(s) instead of the main table. Cheers!
You should look at a bucketing solution. With bucketing, you introduce an additional column as part of your partition key. For example, your data model could look like this:
CREATE TABLE user_followers (
user_id int,
bucket_id int,
follower_id int,
user_name text,
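-- follower_id is a clustering column so that one partition can hold many followers
PRIMARY KEY ((user_id, bucket_id), follower_id));
Here user_id and bucket_id together form the partition key, so one user's followers are spread across several partitions. To read all of a user's followers you have to query every bucket, so you need to know the bucket_ids beforehand.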
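As a rough illustration of how an application might choose and enumerate buckets, here is a minimal Python sketch. It assumes a fixed number of buckets and derives the bucket from follower_id; NUM_BUCKETS and both helper names are hypothetical and are not part of the answer above.

```python
from typing import List, Tuple

NUM_BUCKETS = 16  # assumed fixed bucket count; tune to the expected follower volume


def bucket_for(follower_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a follower to a bucket deterministically so writes spread across partitions."""
    return follower_id % num_buckets


def partitions_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> List[Tuple[int, int]]:
    """Return every (user_id, bucket_id) partition key that must be read to list all followers."""
    return [(user_id, bucket) for bucket in range(num_buckets)]
```

With a fixed bucket count, a write goes to the single partition given by bucket_for, while reading all followers means issuing one query per key returned by partitions_for.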
I'm trying to create a chat app that uses the Cassandra database, and I want to get the messages sent between two users (private chat). What is the best way to do this CQL query? What are your ideas?
My user message table structure:
Everything that has been said is true.
Base your data model on your queries.
DS220 on academy.datastax.com is a full-fledged course that explains how to do this.
Now, regarding your sample chat, here is a proposal:
DROP TABLE IF EXISTS sample_chat;
CREATE TABLE IF NOT EXISTS sample_chat (
fromuser text,
touser text,
message_id timeuuid,
body text,
PRIMARY KEY ((fromuser, touser), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Cedrick', 'Hasan', now(), 'Hi Hasan do you like Cassandra');
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Hasan', 'Cedrick', now(), 'Yeah Cassandra rocks');
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Cedrick', 'Hasan', now(), 'Take ds220 and give us some feedback');
select fromuser, touser, body from sample_chat;
Rationale:
You want to retrieve a chat based on fromuser and touser, so together they make a good partition key. A chat between 2 users should not have more than 100,000 rows, which is the commonly recommended maximum number of rows for a partition.
You want your items to be ordered by time with the latest first (to display only the last messages in the chat). You also want your messages to be unique. A timeuuid is therefore a good type for message_id: you can easily extract the time from it, and it ensures uniqueness.
You want to avoid using time as a column name, or any other term that collides with an existing CQL keyword.
In Cassandra, the table design should be based on the query or queries that you are going to execute. In this case, how are you going to request the data: based on the fromuser column, the touser column, or both?
DataStax Academy has the course DS220, which can be a good starting point for learning data modeling for Cassandra.
In Cassandra, you have to work on the data model and optimise your queries based on the DataStax or Cassandra recommendations. Depending on your app, you should also tune the Cassandra configuration for heavy writes and reads. To get good performance from your app, consider not only the database but also the OS, the network, etc.
I am making a website and I want to store all users' posts in one table, ordered by the time they were posted. The Cassandra data model that I made is this:
CREATE TABLE Posts(
ID uuid,
title text,
insertedTime timestamp,
postHour int,
contentURL text,
userID text,
PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);
The question I'm facing is this: when a user visits the posts page, it fetches the most recent posts by querying
SELECT * FROM Posts WHERE postHour = ?;
where ? is the current hour.
So far, when the user scrolls down, AJAX requests are made to get more posts from the server. JavaScript keeps track of the postHour of the last fetched item and sends it back to the server along with the Cassandra PagingState when requesting new posts.
But this approach will query more than one partition when the user scrolls down.
I want to know whether this model would perform without problems, and whether there is any other model that I could follow.
Someone please point me in the right direction.
Thank You.
That's a good start but a few pointers:
You'll probably need more than just the postHour as the partition key. I'm guessing you don't want to store all the posts for a given hour together regardless of the day and then page through them. What you're probably after here is:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
But there's still a problem. Your PRIMARY KEY has to uniquely identify a row (in this case a post). I'm going to guess it's possible, although not likely, that two users might make a post with the same insertedTime value. What you really need then is to add the ID to make sure they are unique:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
At this point, I'd consider just combining your ID and insertedTime columns into a single ID column of type timeuuid. With those changes, your final table looks like:
CREATE TABLE Posts(
ID timeuuid,
postYear int,
postMonth int,
postDay int,
postHour int,
title text,
contentURL text,
userID text,
PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
Whatever programming language you're using should have a way to generate a timeuuid from the inserted time and then extract that time from a timeuuid value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions for doing the converting.)
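As one possible illustration of the extraction step, here is a minimal sketch using only Python's standard uuid module. A timeuuid is a version 1 UUID, so uuid.uuid1() stands in for the ID generated at insert time; the helper name timeuuid_to_datetime and the constant name are hypothetical, and a driver's own helper functions could be used instead.

```python
import uuid
from datetime import datetime, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Extract the embedded timestamp from a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    unix_seconds = (value.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


post_id = uuid.uuid1()                      # time-based UUID for "now"
created_at = timeuuid_to_datetime(post_id)  # timestamp to show in the UI
```

Note that uuid.uuid1() always uses the current time; building a timeuuid for an arbitrary past timestamp needs a driver helper or custom code.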
As to your question about querying multiple partitions, yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48 hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)
There are a couple things you could do to try and mitigate that:
Make your partitions less granular. For example, instead of doing posts by hour, make it posts by day, or posts by month. If you know that those partitions won't get too large (i.e. users won't make so many posts that the partition gets huge), that's probably the easiest solution.
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
CREATE TABLE post_hours (
postYear int,
postMonth int,
postDay int,
postHour int,
PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
You'd then insert into this table (using a Batch) anytime a user adds a new post. You can then query this table first before you query the Posts table to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).
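The lookup step can be sketched on the application side as follows. This is only an illustration under stated assumptions: the set non_empty stands in for rows already read from post_hours, and the names HourKey, hour_key and partitions_newest_first are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

HourKey = Tuple[int, int, int, int]  # (postYear, postMonth, postDay, postHour)


def hour_key(ts: datetime) -> HourKey:
    """Partition key of the Posts table for a given timestamp."""
    return (ts.year, ts.month, ts.day, ts.hour)


def partitions_newest_first(non_empty: Iterable[HourKey], before: datetime) -> List[HourKey]:
    """Order the non-empty hour partitions newest first, ignoring hours after `before`."""
    limit = hour_key(before)
    # Tuples compare field by field, so (year, month, day, hour) sorts chronologically.
    return sorted((key for key in non_empty if key <= limit), reverse=True)


non_empty = {(2024, 1, 1, 9), (2024, 1, 3, 10), (2024, 1, 4, 8)}
to_query = partitions_newest_first(non_empty, datetime(2024, 1, 3, 12, 30))
```

The application would then query the Posts table once per key in to_query, in order, until it has enough posts for the page.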
What is the best approach for updating an un-indexed regular column (one that is not part of the primary key) across all the tables that contain it as a duplicate?
For example, the user posts something and that post is duplicated in many tables for fast retrieval. But when that post changes (with an edit), it needs to be updated throughout the database, in all tables that contain that post (tables that have different and unknown primary keys).
Solutions I'm thinking of:
Have a mapper table to track down the primary keys in all those tables, but it seems to lead to tables explosion (post is not the only property that might change).
Use Solr to do the mapping, but I fear I would be using it for the wrong purpose.
Any enlightenments would be appreciated.
EDIT (fictional schema).
What if the post changes? or even the user's display_name?
CREATE TABLE users (
id uuid,
display_name text,
PRIMARY KEY ((id))
);
CREATE TABLE posts (
id uuid,
post text,
poster_id uuid,
poster_display_name text,
tags set<text>,
statistics map<int, bigint>,
PRIMARY KEY ((id))
);
CREATE TABLE posts_by_user (
user_id uuid,
created timeuuid,
post text,
post_id uuid,
tags set<text>,
statistics map<int, bigint>,
PRIMARY KEY ((user_id), created)
);
It depends on the frequency of the updates. For instance, if users only update their names infrequently (a handful of times per user account), then it may be OK to use a secondary index (2i). Just know that a query on a 2i is a scatter-gather across nodes, so you'll see performance issues if it's a common operation. In those cases, you'll want to use a materialized view (either the built-in ones introduced in 3.0 or one you manage yourself) to get the list of all the posts for a given user, then update the user's display name in each of them.
I recommend doing this in a background job, and giving the user a message like "it may take [some unit of time] for the change in your name to be reflected everywhere".
I am new to Apache Cassandra. I have installed Cassandra and am using cqlsh on my laptop.
I created a table using:
create table userpageview( created_at timestamp, hit int, userid int, variantid int, primary key (created_at, hit, userid, variantid) );
I inserted several rows into the table, but when I tried to select with a condition on each column (one at a time), I got an error.
Maybe my data model is wrong. Can anyone tell me how to do data modeling in Cassandra?
Thanks.
You need to read about partition keys and clustering keys. Cassandra works much differently than relational databases and the types of queries you can do are much more restricted.
Some information to get you started: here and here.