Is Cassandra suitable for storing analytics data? - cassandra

I want to develop an open-source analytics project that will store visits, referers, and devices (by kind, family, etc.).
I'm fairly new to the Cassandra world, so I have a lot of questions about data modeling with it.
I have read a lot of documentation about it; here is part of my data model:
create table visits(
id UUID,
remote_addr VARCHAR,
method VARCHAR,
user_agent VARCHAR,
status_code INT,
host VARCHAR,
protocol VARCHAR,
path VARCHAR,
data VARCHAR,
headers VARCHAR,
query_string VARCHAR,
referer_id UUID,
device_id UUID,
browser_id UUID,
platform_id UUID,
created_at TIMEUUID,
PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
create table referers(
id UUID PRIMARY KEY,
host VARCHAR,
path VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table browsers(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table platforms(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
With this model, if I want, for example, "all visits with status_code 200", I will have to create a secondary index, and the same goes for referers, devices, etc.
So do I need to create individual tables such as "visits_by_referers" and "visits_by_devices", like so:
create table visits_by_referers(
visit_id UUID,
device_id UUID,
PRIMARY KEY (visit_id, device_id)
);
Or am I completely wrong, and Cassandra is not suitable for this?
Thank you :)

Until 3.0 comes out with Materialized Views (https://issues.apache.org/jira/browse/CASSANDRA-6477), which will be a huge help for this type of use case, you need to create individual tables for things like "visits by referer" if you plan on querying Cassandra directly.
What a lot of people tend to do instead is use a single large table and then overlay something like Spark to read the data into memory and run much more complicated queries.
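As a rough sketch of the "one table per query" approach, a table serving "all visits with status_code 200" might look like the following. The table name, the `day` bucket column, and the choice of columns to copy are assumptions for illustration and not part of the original model. The bucket is there only to keep a single status code from growing into one unbounded partition.

```cql
-- Hypothetical query table: one partition per (status_code, day).
CREATE TABLE visits_by_status_code (
    status_code INT,
    day VARCHAR,          -- assumed bucket, e.g. '2015-06-01'
    created_at TIMEUUID,
    visit_id UUID,
    path VARCHAR,         -- copy whichever visit columns the query must return
    PRIMARY KEY ((status_code, day), created_at, visit_id)
) WITH CLUSTERING ORDER BY (created_at DESC, visit_id ASC);

-- Latest 100 visits that returned 200 on a given day, read from one partition.
SELECT visit_id, path, created_at
FROM visits_by_status_code
WHERE status_code = 200 AND day = '2015-06-01'
LIMIT 100;
```

With this layout the application writes each visit twice, once to visits and once to the query table, which is the usual price of modeling around queries.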

Related

Am I using cassandra efficiently?

I have these tables:
CREATE TABLE user_info (
userId uuid PRIMARY KEY,
userName varchar,
fullName varchar,
sex varchar,
bizzCateg varchar,
userType varchar,
about text,
joined bigint,
contact text,
job set<text>,
blocked boolean,
emails set<text>,
websites set<text>,
professionTag set<text>,
location frozen<location>
);
create table publishMsg
(
rowKey uuid,
msgId timeuuid,
postedById uuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
esIndx boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
create table publishMsg_by_user
(
rowKey uuid,
msgId timeuuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
CREATE TABLE followers
(
rowKey UUID,
followedBy uuid,
time bigint,
PRIMARY KEY(rowKey, orderKey)
);
I run 3 INSERT statements in a BATCH to put data into the publishMsg, publishMsg_by_user, and followers tables.
To show a single message, I have to run three SELECT queries on different tables:
publishMsg - to get the details of a published message, given rowKey and msgId.
user_info - to get fullName based on postedById.
followers - to know whether postedById is following a given topic or not.
Is this a suitable way of using Cassandra? Will it be efficient, given that in this scenario the data can't fit in a single table?
Sorry to ask this in an answer but I don't have the rep to comment.
Ignoring the tables for now, what information does your application need to ask for? Ideally in Cassandra, you will only have to execute one query on one table to get the data you need to return to the client. You shouldn't need to execute 3 queries to get what you want.
Also, your followers table appears to be missing the orderKey column: the primary key refers to it, but it is never declared.
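One way to get closer to "one query on one table" is to copy the poster's name into the message table at write time. This is only a sketch: the postedByName column is hypothetical, and the corrected followers key assumes followedBy was the column meant by orderKey.

```cql
-- Hypothetical denormalized column: store the author's fullName with the message.
ALTER TABLE publishMsg ADD postedByName text;

-- The message and its author's name now come from a single read
-- (bind markers as used in a prepared statement).
SELECT title, details, postedById, postedByName
FROM publishMsg
WHERE rowKey = ? AND msgId = ?;

-- followers with a primary key that only uses declared columns
-- (assumes followedBy was the intended clustering column).
CREATE TABLE followers (
    rowKey uuid,
    followedBy uuid,
    time bigint,
    PRIMARY KEY (rowKey, followedBy)
);
```

The trade-off is that a changed fullName has to be rewritten wherever it was copied, so this fits best when names change rarely.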

How to design the cassandra table for one query with a ordering and limit?

I have created a table:
CREATE TABLE posts_by_user(
user_id bigint,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id,post_id)
);
I want to select the last 10 rows, using the IN operator on user_id and ordering by the post_at field.
I also read a good article:
http://planetcassandra.org/blog/the-in-operator-in-cassandra-cql/
I cannot use the query WHERE post_at = time AND user_id IN (1,2), because I need all entries, not only those for a specific date.
How can I change my schema design? Thank you.
I changed it to:
CREATE TABLE posts_by_user (
user_id bigint,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id, post_at)
) WITH CLUSTERING ORDER BY (post_at DESC);
I think this is good...
How about using this approach: http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use-slice-partition.html
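A possible variation on the revised table, offered as a sketch rather than a definitive design: with PRIMARY KEY (user_id, post_at), two posts by the same user with the same post_at would overwrite each other, so post_id is added here as a tie-breaking clustering column. Querying one user at a time and merging the results in the application is an assumption about how the "last 10" would be assembled.

```cql
CREATE TABLE posts_by_user (
    user_id bigint,
    post_at timestamp,
    post_id uuid,
    PRIMARY KEY (user_id, post_at, post_id)
) WITH CLUSTERING ORDER BY (post_at DESC, post_id ASC);

-- Newest 10 posts of each user; rows within a partition come back
-- already sorted by post_at DESC, so no ORDER BY is needed.
SELECT post_id, post_at FROM posts_by_user WHERE user_id = 1 LIMIT 10;
SELECT post_id, post_at FROM posts_by_user WHERE user_id = 2 LIMIT 10;
```

The application then merges the two result sets by post_at and keeps the first 10.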

Non-EQ relation error Cassandra - how fix primary key?

I created a single table, posts. When I run this SELECT request:
return $this->db->query('SELECT * FROM "posts" WHERE "id" IN(:id) LIMIT '.$this->limit_per_page, ['id' => $id]);
I get this error:
PRIMARY KEY column "id" cannot be restricted (preceding column
"post_at" is either not restricted or by a non-EQ relation)
My table dump is:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id bigint,
name text,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY (user_id,post_at,id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
I read some articles about primary and clustering keys and understood that when the primary key has several columns, I need to use the = operator together with IN. In my case, I cannot use a single-column primary key. What would you advise changing in the table structure so that the error disappears?
Here is my dummy table structure:
CREATE TABLE posts (
id timeuuid,
post_at timestamp,
user_id bigint,
PRIMARY KEY (id,post_at,user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
After inserting some dummy data, I ran the query select * from posts where id in (timeuuid1,timeuuid2,timeuuid3);
I was using Cassandra 2.0 with CQL 3.0.
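To spell out why the original table rejects the query: with PRIMARY KEY (user_id, post_at, id), user_id is the partition key and post_at and id are clustering columns, so id can only be restricted once user_id and post_at are fixed with =. A sketch of queries that fit that rule, using made-up literal values:

```cql
-- Against the original table, PRIMARY KEY (user_id, post_at, id):
-- restrict the partition key and the preceding clustering column with =,
-- then restrict id.
SELECT * FROM posts
WHERE user_id = 1
  AND post_at = '2015-06-01 12:00:00+0000'
  AND id IN (550e8400-e29b-41d4-a716-446655440000);

-- Or drop the id restriction and read a user's newest posts directly.
SELECT * FROM posts WHERE user_id = 1 LIMIT 10;
```

If lookups by id alone are the main access path, making id the partition key, as in the dummy table above, is the other option.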

CQL: Bad Request: Missing CLUSTERING ORDER for column

What is the problem with this CQL query?
cqlsh> create table citybizz.notifications(
... userId varchar,
... notifId UUID,
... notification varchar,
... time bigint,read boolean,
... primary key (userId, notifId,time)
... ) with clustering order by (time desc);
It throws Bad Request: Missing CLUSTERING ORDER for column notifid. I am using Cassandra 1.2.2.
You need to specify the order for notifId too:
create table citybizz.notifications(
userId varchar,
notifId UUID,
notification varchar,
time bigint,
read boolean,
primary key (userId, notifId,time)
) with clustering order by (notifId asc, time desc);
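Cassandra doesn't assume a default ordering (asc) for notifId here: the CLUSTERING ORDER BY clause has to list the clustering columns in primary key order, and notifId comes before time, so you need to specify it.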
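Note that with (notifId asc, time desc), rows are sorted by notifId first, so the time ordering only applies among rows sharing the same notifId. If the goal is "newest notifications first" for a user, which is an assumption about the intent, a sketch with time moved ahead of notifId in the primary key would be:

```cql
-- Assumes the goal is to read a user's notifications newest first.
CREATE TABLE citybizz.notifications (
    userId varchar,
    time bigint,
    notifId uuid,
    notification varchar,
    read boolean,
    PRIMARY KEY (userId, time, notifId)
) WITH CLUSTERING ORDER BY (time DESC, notifId ASC);

-- 'some-user' is a placeholder value.
SELECT notification, time
FROM citybizz.notifications
WHERE userId = 'some-user'
LIMIT 20;
```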

Can a Cassandra / CQL3 column family have a composite partition key?

CQL 3 allows for a "compound" primary key using a definition like this:
CREATE TABLE timeline (
user_id varchar,
tweet_id uuid,
author varchar,
body varchar,
PRIMARY KEY (user_id, tweet_id)
);
With a schema like this, the partition key (storage engine row key) will consist of the user_id value, while the tweet_id will be compounded into the column name. What I am looking for, instead, is for the partition key (storage engine row key) to have a composite value like user_id:tweet_id. Obviously I could do something like key = user_id + ':' + tweet_id in my application, but is there any way to have CQL 3 do this for me?
Actually, yes you can. That functionality was added in this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-4179
The format for you would be:
CREATE TABLE timeline (
user_id varchar,
tweet_id uuid,
author varchar,
body varchar,
PRIMARY KEY ((user_id, tweet_id))
);
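A short sketch of what the double parentheses change when querying; the literal values are made up, and `day` in the last comment is a hypothetical column:

```cql
-- With PRIMARY KEY ((user_id, tweet_id)), both columns form the partition key,
-- so a lookup has to supply both of them.
SELECT author, body
FROM timeline
WHERE user_id = 'alice'
  AND tweet_id = 550e8400-e29b-41d4-a716-446655440000;

-- A composite partition key can still be followed by clustering columns, e.g.
-- PRIMARY KEY ((user_id, day), tweet_id)
```

The trade-off is that listing all tweets for a user by user_id alone is no longer a single-partition read.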
Until 1.2 comes out, the answer is no. The partition key will always be the first component. As you said, the way to do this would be to create the composite key yourself. You shouldn't shy away from this as it's actually quite common.
