Am I using Cassandra efficiently?

I have these tables:
CREATE TABLE user_info (
userId uuid PRIMARY KEY,
userName varchar,
fullName varchar,
sex varchar,
bizzCateg varchar,
userType varchar,
about text,
joined bigint,
contact text,
job set<text>,
blocked boolean,
emails set<text>,
websites set<text>,
professionTag set<text>,
location frozen<location>
);
create table publishMsg
(
rowKey uuid,
msgId timeuuid,
postedById uuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
esIndx boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
create table publishMsg_by_user
(
rowKey uuid,
msgId timeuuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
CREATE TABLE followers
(
rowKey UUID,
followedBy uuid,
time bigint,
PRIMARY KEY(rowKey, orderKey)
);
I am doing 3 INSERT statements in a BATCH to put data into the publishMsg, publishMsg_by_user and followers tables.
To show a single message I have to run three SELECT queries on different tables:
publishMsg - to get the details of a published message, given rowKey and msgId
user_info - to get fullName based on postedById
followers - to know whether postedById is following a given topic or not
Is this a suitable way of using Cassandra? Will it be efficient, given that the data in this scenario can't fit in a single table?

Sorry to ask this in an answer but I don't have the rep to comment.
Ignoring the tables for now, what information does your application need to ask for? Ideally in Cassandra, you will only have to execute one query on one table to get the data you need to return to the client. You shouldn't need to execute 3 queries to get what you want.
Also, your followers table appears to be missing the orderKey field.
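To illustrate the one-query-per-table idea, here is a minimal Python sketch that uses plain dictionaries as stand-ins for Cassandra tables (no driver involved). It assumes the poster's full name is copied into the message row at write time; the helper names and the postedByName field are hypothetical and not part of the schema in the question.

```python
import uuid

# In-memory stand-ins for two tables; keys play the role of primary keys.
user_info = {}      # userId -> {"fullName": ...}
publish_msg = {}    # (rowKey, msgId) -> denormalized message row


def publish(row_key, posted_by_id, title, details):
    """Write a message, copying the author's full name into the row."""
    msg_id = uuid.uuid1()  # time-based UUID, like a CQL timeuuid
    publish_msg[(row_key, msg_id)] = {
        "postedById": posted_by_id,
        # Hypothetical denormalized column: duplicated from user_info.
        "postedByName": user_info[posted_by_id]["fullName"],
        "title": title,
        "details": details,
    }
    return msg_id


def show_message(row_key, msg_id):
    """A single lookup now returns everything the page needs."""
    return publish_msg[(row_key, msg_id)]


author = uuid.uuid4()
topic = uuid.uuid4()
user_info[author] = {"fullName": "Jane Doe"}
mid = publish(topic, author, "Hello", "First post")
row = show_message(topic, mid)
```

The trade-off is that the duplicated name goes stale if the user renames themselves, so the application has to either rewrite old rows or accept the old name on old messages.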

Related

Cassandra schema for Reddit posts: top posts, new posts

I am new to Cassandra and trying to implement a Reddit mock with limited functionality. I am not considering subreddits and comments as of now. There is a single home page that displays 'Top' posts and 'New' posts. By clicking any post I can navigate into the post.
1) Is this a correct schema design?
2) If I want to show all-time top posts, how can that be achieved?
Table for Post Details
CREATE TABLE main.post (
user_id text,
post_id text,
timeuuid timeuuid,
downvoted_user_id list<text>,
img_ids list<text>,
islocked boolean,
isnsfw boolean,
post_date date,
score int,
upvoted_user_id list<text>,
PRIMARY KEY ((user_id, post_id), timeuuid)
) WITH CLUSTERING ORDER BY (timeuuid DESC);
Table for Top & New Posts
CREATE TABLE main.posts_by_year (
post_year text,
timeuuid timeuuid,
score int,
img_ids list<text>,
islocked boolean,
isnsfw boolean,
post_date date,
post_id text,
user_id text,
PRIMARY KEY (post_year, timeuuid, score)
) WITH CLUSTERING ORDER BY (timeuuid DESC, score DESC);

Cassandra: what would the query look like?

I have seen this data model:
CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
user_id uuid,
post_id uuid,
message_text text,
created_on timestamp,
deleted boolean,
user_full_name text,
PRIMARY KEY ((user_id, created_on))
);
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
follower_id uuid,
post_id uuid,
user_id uuid,
location_name text,
user_full_name text,
created_on timestamp,
PRIMARY KEY ((user_id, created_on))
);
CREATE TABLE IF NOT EXISTS social_media.post_counts (
likes_count counter,
view_count counter,
comments_count counter,
post_id uuid,
PRIMARY KEY (post_id)
);
My question now is:
If I want to show a post with its likes, how do I query it? I can't join the post_counts table, so how do I do it? Should it be part of the posts_by_user query, or am I wrong?
Output as User Interface:
--username
--profilimage
--likes
--follow-user
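Cassandra has no joins, and counter columns have to live in a table of their own, so a common approach is to read the posts first, then read the counters for those post ids, and combine the two in the application. Below is a minimal Python sketch of the combining step only, with plain dictionaries in place of query results. The function name and sample values are made up for illustration.

```python
def attach_counts(posts, counts_by_post_id):
    """Combine post rows with their counter rows on post_id."""
    merged = []
    for post in posts:
        counts = counts_by_post_id.get(post["post_id"], {})
        merged.append({
            "username": post["user_full_name"],
            "message": post["message_text"],
            # A post with no counter row yet is shown with zero likes.
            "likes": counts.get("likes_count", 0),
        })
    return merged


# Rows as they might come back from posts_by_user and post_counts.
posts = [
    {"post_id": "p1", "user_full_name": "Ann", "message_text": "hi"},
    {"post_id": "p2", "user_full_name": "Ann", "message_text": "bye"},
]
counts = {"p1": {"likes_count": 3, "view_count": 10}}

timeline = attach_counts(posts, counts)
```

This keeps posts_by_user free of counters, at the cost of a second read per page of posts.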

Is Cassandra suitable for storing analytics?

I want to develop an open-source analytics project which will store visits, referers, devices (by kind, family etc.).
I'm fairly new to the Cassandra world, so I have a lot of questions about modeling with it.
I have read a lot of documentation about it; here is a part of my data model:
create table visits(
id UUID,
remote_addr VARCHAR,
method VARCHAR,
user_agent VARCHAR,
status_code INT,
host VARCHAR,
protocol VARCHAR,
path VARCHAR,
data VARCHAR,
headers VARCHAR,
query_string VARCHAR,
referer_id UUID,
device_id UUID,
browser_id UUID,
platform_id UUID,
created_at TIMEUUID,
PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
create table referers(
id UUID PRIMARY KEY,
host VARCHAR,
path VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table browsers(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table platforms(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
With this model, if I want for example "all visits from status_code 200" I will have to create a secondary index, same for referers, devices, etc.
So do I need to create individual tables "visits_by_referers", "visits_by_devices" like so:
create table visits_by_referers(
visit_id UUID,
device_id UUID,
PRIMARY KEY (visit_id, device_id)
);
or am I completely wrong and Cassandra is not suitable for this?
Thank you :)
Until 3.0 comes out with Materialized Views (https://issues.apache.org/jira/browse/CASSANDRA-6477), which will be HUGE for this type of use case, you need to create individual tables for things like 'visits by referrer' if you plan on doing direct querying.
What a lot of people tend to do is use a single large table, and then overlay something like Spark to actually read the data into memory and do much more complicated querying.
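As a rough illustration of the "one table per query" approach, the Python sketch below writes each visit to several lookup structures at once, with dictionaries standing in for Cassandra tables. The table name visits_by_status_code and the sample data are hypothetical. In a real cluster a partition keyed only by status_code would grow very large, so you would probably add a time bucket to the partition key.

```python
from collections import defaultdict

# Each dictionary stands in for one Cassandra table, keyed by its partition key.
visits_by_referer = defaultdict(list)      # referer_id -> visits
visits_by_status_code = defaultdict(list)  # status_code -> visits


def record_visit(visit):
    """Write the same visit to every table that serves a query."""
    visits_by_referer[visit["referer_id"]].append(visit)
    visits_by_status_code[visit["status_code"]].append(visit)


record_visit({"id": 1, "referer_id": "r1", "status_code": 200, "path": "/"})
record_visit({"id": 2, "referer_id": "r2", "status_code": 404, "path": "/x"})
record_visit({"id": 3, "referer_id": "r1", "status_code": 200, "path": "/a"})

# "All visits with status_code 200" is now a read of a single partition.
ok_visits = visits_by_status_code[200]
```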

How to design a Cassandra table for one query with ordering and a limit?

I have created a table:
CREATE TABLE posts_by_user(
user_id bigint,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id,post_id)
);
I want to select the last 10 rows using the IN operator on user_id, ordered by the post_at field.
I also read a good article:
http://planetcassandra.org/blog/the-in-operator-in-cassandra-cql/
I cannot use the query WHERE post_at = time AND user_id IN (1,2), because I need all posts, not only those for a specific date.
How can I change my schema design? Thank you.
I changed it to:
CREATE TABLE posts_by_user (
user_id bigint,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id, post_at)
) WITH CLUSTERING ORDER BY (post_at DESC);
I think this is good...
How about using this approach: http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use-slice-partition.html
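With the second table, each user's posts are already stored newest first, so one possible way to get the last 10 across several users is to run one query per user_id (each limited to 10 rows) and merge the results in the application. Here is a minimal Python sketch of the merge step only, using integers in place of timestamps. The function latest_posts is hypothetical.

```python
import heapq
from itertools import islice


def latest_posts(per_user_rows, limit=10):
    """Merge per-user lists (each newest first) and keep the newest `limit`."""
    merged = heapq.merge(*per_user_rows, key=lambda r: r["post_at"], reverse=True)
    return list(islice(merged, limit))


# One list per user_id, each already sorted by post_at descending,
# as a table with CLUSTERING ORDER BY (post_at DESC) would return them.
user_1 = [{"user_id": 1, "post_at": 50}, {"user_id": 1, "post_at": 20}]
user_2 = [{"user_id": 2, "post_at": 40}, {"user_id": 2, "post_at": 30}]

newest = latest_posts([user_1, user_2], limit=3)
```

Because every input list is already sorted, the merge never has to sort the combined result and stops as soon as it has produced `limit` rows.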

Non-EQ relation error in Cassandra: how to fix the primary key?

I created a table, posts. When I run this SELECT request:
return $this->db->query('SELECT * FROM "posts" WHERE "id" IN(:id) LIMIT '.$this->limit_per_page, ['id' => $id]);
I get this error:
PRIMARY KEY column "id" cannot be restricted (preceding column
"post_at" is either not restricted or by a non-EQ relation)
My table dump is:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id bigint,
name text,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY (user_id,post_at,id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
I read some articles about primary and clustering keys, and understood that when the primary key has several columns, I need to use the = operator together with IN. In my case, I cannot use a single-column primary key. What would you advise changing in the table structure so that the error disappears?
My dummy table structure:
CREATE TABLE posts (
id timeuuid,
post_at timestamp,
user_id bigint,
PRIMARY KEY (id,post_at,user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
After inserting some dummy data, I ran this query:
select * from posts where id in (timeuuid1,timeuuid2,timeuuid3);
I was using Cassandra 2.0 with CQL 3.0.
