Some queries in Cassandra - cassandra

I created database "Movies" with three column families:
CREATE TABLE movies (
movie_id int primary key,
title text,
avg_rating decimal,
total_ratings int,
genres set<text>
);
-- shows all ratings for a specific movie
CREATE TABLE ratings_by_movie (
movie_id int,
user_id int,
rating decimal,
ts int,
primary key(movie_id, user_id)
);
-- shows all ratings of a specific user
CREATE TABLE ratings_by_user (
user_id int,
movie_id int,
rating decimal,
ts int,
primary key(user_id, movie_id)
);
Is it possible to make the following queries?
Show the movie with the most reviews
Show all movies with the average rating >= 4
Show 100 best movies based on their ratings

Cassandra = No Joins. Your model is 100% relational, so you need to rethink it for Cassandra. I would advise you to take a look at these slides, which dig deep into how to model data for Cassandra. There is also a webinar covering the topic. Stop thinking in terms of foreign keys and joining tables: if you need relations, Cassandra isn't the tool for the job.
But Why?
Because then you would need to check consistency and do many of the other things that relational databases do, and you would lose the performance and scalability that Cassandra offers.
What can I do?
DENORMALIZE! Lots of data in one table? But the table will have too many columns!
So? Cassandra can handle a very large number of columns in a table.
For more details check: How to do a join queries with 2 or more tables in cassandra cql
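As an illustration of what denormalizing could look like for the "100 best movies" and "average rating >= 4" queries from the question, here is a minimal sketch of a query-specific table. The table name movies_by_rating and the bucket column are assumptions made for this example and are not part of the original schema. A single constant bucket keeps all rows in one partition so they can be sorted by rating, which is only reasonable for a modest number of movies.

```cql
-- Hypothetical table: one row per movie, sorted by rating inside a partition.
-- "bucket" is an illustrative constant partition key (always 0 in this sketch).
CREATE TABLE movies_by_rating (
    bucket int,
    avg_rating decimal,
    movie_id int,
    title text,
    total_ratings int,
    PRIMARY KEY (bucket, avg_rating, movie_id)
) WITH CLUSTERING ORDER BY (avg_rating DESC, movie_id ASC);

-- 100 best movies by average rating
SELECT movie_id, title, avg_rating
FROM movies_by_rating
WHERE bucket = 0
LIMIT 100;

-- All movies with an average rating >= 4
SELECT movie_id, title, avg_rating
FROM movies_by_rating
WHERE bucket = 0 AND avg_rating >= 4;
```

Because avg_rating is part of the primary key here, a changed rating cannot be updated in place: the application would have to delete the old row and insert a new one. The "most reviews" query would need a similar table clustered on total_ratings instead.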

Related

How do I retrieve the ranking of a user from a materialized view?

I am using Cassandra for storing contest data.
Currently I have a contest table like this (table contest_score):
And I created a materialized views for ranking users in a contest (table contest_ranking):
To get the top 10 users of a contest I can simply query select top 10 from contest_ranking;
But how can I get the ranking of a specific user? For example, user_id = 4 should have rank 2.
The principal philosophy of data modelling in Cassandra is that you need to design a CQL table for each application query. It is a one-to-one mapping between app queries and CQL tables.
Since you have a completely different application query, you need to create a separate table for it. Here's an example schema:
CREATE TABLE rank_by_userid (
user_id int,
rank int,
PRIMARY KEY(user_id)
);
You can then get the rank of a user with this query:
SELECT rank FROM rank_by_userid WHERE user_id = ?;
You have to manually create and maintain this new table because you won't be able to populate it with materialized views. Cheers!
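As a rough sketch of what maintaining that table by hand could look like: the application recomputes the ranks whenever scores change and writes them back. The values below reuse the user_id = 4, rank 2 example from the question; how and when the ranks are recomputed is an assumption left to the application.

```cql
-- INSERT in Cassandra is an upsert: writing an existing user_id
-- simply overwrites the previously stored rank.
INSERT INTO rank_by_userid (user_id, rank) VALUES (4, 2);

-- Read the stored rank back for that user.
SELECT rank FROM rank_by_userid WHERE user_id = 4;
```

Keep in mind that a single score change can shift the rank of many users, so one update in contest_score may require rewriting many rows in this table.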

Cassandra secondary index vs materialized view

I'm modeling my table for Cassandra 3.0+. The objective is to build a table that stores users' activities. Here is what I've done so far:
(userid comes from another database, MySQL)
CREATE TABLE activity (
userid int,
type int,
remoteid text,
time timestamp,
imported timestamp,
visibility int,
title text,
description text,
img text,
customfields MAP<text,text>,
PRIMARY KEY (userid, type, remoteid, time, imported));
These are the main queries that I use:
SELECT * FROM activity WHERE userid = ? AND remoteid = ?;
SELECT * FROM activity WHERE userid = ? AND type = ? LIMIT 10;
Now I need to add the visibility column to the second query. From what I've learned, I can choose between a secondary index and a materialized view.
These are the facts:
I have one partition per user, and inside each partition there are thousands of rows (activities).
I always use the partition key (userid) in all my queries to access the data.
The total number of activities is 30 million and growing.
The visibility column has low cardinality (just 3 values) and could be updated, but rarely.
So what should I choose, a materialized view or an index? I know that an index on a low-cardinality column is a bad choice, but my query always includes the partition key and a limit, so maybe it is not that bad.
If you are always going to use the partition key, I recommend using a secondary index.
Materialized views are better when you do not know the partition key.
References:
Principal Article!
• Cassandra Secondary Index Preview #1
Here is a comparison between materialized views and secondary indexes:
• Materialized View Performance in Cassandra 3.x
And here is an explanation of why an index is more effective when the partition key is known:
• Cassandra Native Secondary Index Deep Dive
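For illustration, the secondary-index option for the activity table from the question could be sketched as follows. The index name activity_visibility_idx is an assumption. ALLOW FILTERING is included defensively: depending on the Cassandra version, combining an indexed column with a clustering-column restriction may or may not be accepted without it, and since the partition key is fixed the work stays confined to a single partition.

```cql
-- Hypothetical index name; indexes the low-cardinality visibility column.
CREATE INDEX activity_visibility_idx ON activity (visibility);

-- Second query from the question, extended with visibility.
-- The partition key (userid) is always supplied, so only one
-- partition is read on the replicas that own it.
SELECT * FROM activity
WHERE userid = ? AND type = ? AND visibility = ?
LIMIT 10
ALLOW FILTERING;
```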

Apache Ignite with Apache Cassandra

I am exploring Apache Ignite on top of Cassandra as a possible tool for running ad-hoc queries on Cassandra tables. Using Ignite, is it possible to search or query on any column in the underlying Cassandra tables, like in an RDBMS? Or can the join columns and search columns only be partition and clustering columns?
If using Ignite, is there still a need to create indexes on Cassandra? Also, how does Ignite treat materialized views? Will there be a need to create materialized views?
Any insights into how upgrades to new Cassandra releases can or will be handled by Ignite would also be very helpful.
I will elaborate my question further:
Customer table:
CREATE TABLE customer (
customer_id INT,
joined_date date,
name text,
address TEXT,
is_active boolean,
created_by text,
updated_by text,
last_updated timestamp,
PRIMARY KEY(customer_id, joined_date)
);
Product table:
CREATE TABLE PDT_BY_ID (
device_id uuid,
"desc" text,
serial_number text,
common_name text,
customer_id int,
manu_name text,
last_updated timestamp,
model_number text,
price double,
PRIMARY KEY((device_id), serial_number)
) WITH CLUSTERING ORDER BY (serial_number ASC);
A join is possible on these tables using Apache Ignite.
But is the join possible on non-primary-key columns?
Is it possible, for example, to run queries on the product table like 'where customer_id = ... AND model_number LIKE '%ABC%'', etc.?
Is it possible to run RDBMS-like queries with conditions on any column?
Can I run ad-hoc queries on the tables?
This is discussed on Apache Ignite forum: http://apache-ignite-users.70518.x6.nabble.com/Newbie-Questions-on-Ignite-over-cassandra-td10264.html

Cassandra data modeling for a social network

We are using DataStax Cassandra for our social network and are modeling the tables we need. It is confusing for us: we don't know how to design some of the tables and we have run into a few problems.
As we understand it, every query needs its own table. For example, user A is following users C and B.
Now, in Cassandra we have a table called posts_by_user:
user_id | post_id | text | created_on | deleted | view_count | likes_count | comments_count | user_full_name
We also have a table called user_timeline. When a user posts, we insert the post's info into user_timeline once for each follower, and when a follower visits the home page we read the posts from the user_timeline table.
Here is the user_timeline table:
follower_id | post_id | user_id (who posted) | likes_count | comments_count | location_name | user_full_name
First, is this data model correct for a follow-based (follower, following actions) social network?
Next, we want to count the likes of a post. As you can see, we store the number of likes in both tables (user_timeline, posts_by_user). If one user has 1000 followers, then for each like action we have to update all 1000 rows in user_timeline and 1 row in posts_by_user, which does not make sense.
So my second question is: how should it be done? I mean, what should the like (favorite) table look like?
Think of using posts_by_user as metadata for a post's information. This would allow you to house user_id, post_id, message_text, etc, but you would abstract the view_count, likes_count, and comments_count into a counter table. This would allow you to fetch either a post's metadata or counters as long as you had the post_id, but you would only have to update the counter_record once.
DSE Counter Documentation:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
However, the article below is a really good starting point for data modeling in Cassandra. There are a few things to take into consideration when answering this question, many of which will depend on the internals of your system and how your queries are structured.
The first two rules are stated as:
Rule 1: Spread Data Evenly Around the Cluster
Rule 2: Minimize the Number of Partitions Read
Take a moment to consider the "user_timeline" table:
user_id and created_on as a COMPOUND KEY* - This would be ideal if you wanted to query for posts by a certain user, assuming you have a decent number of users. This would distribute records evenly, and your queries would only hit one partition at a time.
user_id and a hash_prefix as a COMPOUND KEY* - This would be ideal if you had a small number of users with a large number of posts, which would allow your data to be spread evenly across the cluster. However, you run the risk of having to query across multiple partitions.
follower_id and created_on as a COMPOUND KEY* - This would be ideal if you wanted to query for posts being followed by a certain follower. The records would be distributed and you would minimize queries across partitions.
These were 3 examples for 1 table, and the point I wanted to convey is to design your tables around the queries you want to execute. Also, don't be afraid to duplicate your data across multiple tables that are set up to handle various queries; this is the way Cassandra was meant to be modeled. Take some time to read the article below and watch the DataStax Academy Data Modeling Course to familiarize yourself with the nuances. I also included an example schema below to cover the basic counter schema I was pointing out earlier.
* The reason for the compound key is due to the fact that your PRIMARY KEY has to be unique, otherwise an INSERT with an existing PRIMARY KEY will become an UPDATE.
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
https://academy.datastax.com/courses
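CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
user_id uuid,
post_id uuid,
message_text text,
created_on timestamp,
deleted boolean,
user_full_name text,
-- user_id is the partition key, created_on the clustering column,
-- so all posts of one user can be read from a single partition
PRIMARY KEY (user_id, created_on)
);
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
follower_id uuid,
post_id uuid,
user_id uuid,
location_name text,
user_full_name text,
created_on timestamp,
-- follower_id is the partition key, so a follower's timeline
-- can be read from a single partition
PRIMARY KEY (follower_id, created_on)
);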
CREATE TABLE IF NOT EXISTS social_media.post_counts (
likes_count counter,
view_count counter,
comments_count counter,
post_id uuid,
PRIMARY KEY (post_id)
);
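A short usage sketch for the counter table above, assuming prepared statements with post_id bound at execution time: a like then touches exactly one row, no matter how many followers see the post in their timeline.

```cql
-- Counter columns are modified with UPDATE only (there is no INSERT for counters);
-- the row is created implicitly on the first increment.
UPDATE social_media.post_counts
SET likes_count = likes_count + 1
WHERE post_id = ?;

-- An "unlike" is a decrement on the same row.
UPDATE social_media.post_counts
SET likes_count = likes_count - 1
WHERE post_id = ?;

-- Fetch all counters for a post when rendering it.
SELECT likes_count, view_count, comments_count
FROM social_media.post_counts
WHERE post_id = ?;
```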

How to design cassandra schema so additional columns can be easily added later?

I have defined the table structure as shown below:
CREATE TABLE sensor_data (
asset_id text,
event_time timestamp,
sensor_type int,
temperature int,
humidity int,
voltage int,
co2_percent int,
PRIMARY KEY(asset_id, event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);
This table captures data coming from sensors. Depending on the type of sensor (column sensor_type), some columns will have a value and others will not. For example, temperature only applies to a temperature sensor, humidity only applies to a humidity sensor, and so on.
As I work with more and more sensors, my intention is to simply add additional columns using the ALTER TABLE command. Is this a correct strategy to follow, or are there better ways to design this table for future use?
I answered a similar question a few hours ago: here
Assuming you're on Cassandra 2.x, your situation is easier to handle. To do what you need, I'd use a map:
CREATE TABLE sensor_data (
asset_id text,
event_time timestamp,
sensor_type int,
sensor_info map<text, int>,
PRIMARY KEY(asset_id ,event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);
The advantage is that your schema will remain the same even if new sensors come into your world. The disadvantage is that you won't be able to retrieve a specific value from your collection; you will always retrieve the collection in its entirety. If you're on Cassandra 2.1, secondary indexes on collections might help.
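To make the map approach concrete, here is a minimal sketch against the sensor_data table above. The asset id, timestamp, readings and the index name sensor_info_keys_idx are made-up values for illustration; the index on map keys assumes Cassandra 2.1 or later.

```cql
-- A temperature sensor only writes the key it knows about.
INSERT INTO sensor_data (asset_id, event_time, sensor_type, sensor_info)
VALUES ('asset-1', '2015-01-01 00:00:00+0000', 1, {'temperature': 21});

-- A new kind of reading is just a new map key; no ALTER TABLE is needed.
UPDATE sensor_data
SET sensor_info['humidity'] = 40
WHERE asset_id = 'asset-1' AND event_time = '2015-01-01 00:00:00+0000';

-- Cassandra 2.1+: index the map keys to find rows that carry a given reading.
CREATE INDEX sensor_info_keys_idx ON sensor_data (KEYS(sensor_info));

SELECT * FROM sensor_data WHERE sensor_info CONTAINS KEY 'humidity';
```

One design consequence of map<text, int> is that every reading must be an int; a sensor reporting fractional values would need a different value type.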
HTH,
Carlo
