I'm new to Cassandra and trying to create an application in which I have an entity 'student' consisting of 4 columns, as given below:
student_id
student_name
dob
course_name
create table student(student_id uuid, student_name text, dob date, course_name text, PRIMARY KEY(student_id));
I have to search students by course_name. According to Cassandra data modeling, to search students by course name I need to create another table, student_by_course_name, which consists of two columns:
course_name
student_id
where course_name will be the partition key and student_id will be the clustering key, as given below:
create table student_by_course_name(course_name text, student_id uuid, PRIMARY KEY(course_name, student_id));
The problem arises when a student changes his course. Now I want to update the course name in the student_by_course_name table, but it throws an error because the course_name column is a partition key. How do I resolve this? Or please suggest if I'm using Cassandra data modeling wrongly.
In this case you have to delete the old entry first and then add a new entry to student_by_course_name with the new course.
Your model looks good.
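A minimal sketch of that delete-then-insert (the course names and UUID below are made up for illustration):
// remove the row under the old course partition...
DELETE FROM student_by_course_name WHERE course_name = 'math' AND student_id = 123e4567-e89b-12d3-a456-426655440000;
// ...then write it under the new course partition
INSERT INTO student_by_course_name (course_name, student_id) VALUES ('physics', 123e4567-e89b-12d3-a456-426655440000);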
The best way is indeed as Alex suggested. Delete and then update.
There are a couple of problems that you might need to be aware of:
If your course has a LOT of students, it will generate big partitions (for this specific case it might not be an issue).
Deleting entries will create tombstones, and as such you should be prepared to handle them (e.g. use a low gc_grace_seconds; if you think a lot will be generated, set unchecked_tombstone_compaction on the table).
Cassandra isn't the best at deleting or updating data in place. I believe you have to use a batch statement to keep the tables in sync.
You can take two approaches. The first is to delete the existing student ID/course name combination. This will create a tombstone, but if it doesn't happen often, it won't be a big deal. The second option is to keep the original table and create a secondary index on course_name. This allows the course name to be both updated and queried, but it may not perform well over time.
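If you go the batch route mentioned above, a hedged sketch of keeping the student table and the lookup table in sync (values are illustrative only):
BEGIN BATCH
UPDATE student SET course_name = 'physics' WHERE student_id = 123e4567-e89b-12d3-a456-426655440000;
DELETE FROM student_by_course_name WHERE course_name = 'math' AND student_id = 123e4567-e89b-12d3-a456-426655440000;
INSERT INTO student_by_course_name (course_name, student_id) VALUES ('physics', 123e4567-e89b-12d3-a456-426655440000);
APPLY BATCH;
// note: a multi-partition logged batch guarantees all writes eventually apply, but costs more than a single-partition batch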
I have recently started learning No-SQL and Cassandra through this article. The author explains the data model through this diagram:
The author also gives the below column family example:
Book {
key: 9352130677 { name: "Hadoop The Definitive Guide", author: "Tom White", publisher: "Oreilly", priceInr: 650, category: "hadoop", edition: 4 },
key: 8177228137 { name: "Hadoop in Action", author: "Chuck Lam", publisher: "manning", priceInr: 590, category: "hadoop" },
key: 8177228137 { name: "Cassandra: The Definitive Guide", author: "Eben Hewitt", publisher: "Oreilly", priceInr: 600, category: "cassandra" },
}
But in that tutorial, and every other tutorial I have gone through, they end up creating regular tables in Cassandra. I am unable to connect the Cassandra model with what I am creating.
For example, I created a column family called Employee as below:
create columnfamily Employee(empid int primary key, empName text, age int);
Now I inserted some data and my column family looks like this:
For me this looks like a regular relational table and not like the data model the author has explained. How do I create an Employee column family where each row represents an employee with different attributes? Something like:
Employee {
101: { name: Emp1, age: 20 },
102: { name: Emp2, salary: 1000 },
102: { manager_name: Emp3, age: 45 }
}
You need to understand that while the CQL representation may look like a regular relational table, the internal structure of the rows in Cassandra is completely different. It is saving a different set of attributes for each employee, and the nulls you can see while querying with CQL are just a representation of empty/nonexistent cells.
What you are trying to achieve is an unstructured data model. Cassandra started with that model, and everything worked as described in the tutorial you've read, but there is an opinion that unstructured data design is unhealthy for development and creates more problems than it solves. So, after some time, Cassandra moved to a "structured" data model (and from Thrift to CQL). It doesn't mean that you have to store all attributes for all keys/rows, and it doesn't mean that all rows have the same number of attributes; it just means that you have to declare attributes before you use them.
You can achieve some kind of unstructured data modeling using the Map, List, Set, etc. data types, UDTs (user-defined types), or just saving your data as a JSON string and parsing it on the application side.
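A minimal sketch of that last option (the table and column names here are made up for illustration):
create table employee_docs (
empid int primary key,
doc text // e.g. '{"name":"Emp1","age":20}'; Cassandra sees opaque text, the application parses the structure
);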
What you have understood is correct. Just believe it. Internally Cassandra stores columns exactly like the image in your question.
Now, what you are expecting is to insert a column which was not defined while creating the Employee table. For dynamic columns, you can always use the Map data type.
For example
create table Employee(
empid int primary key,
empName text,
age int,
attributes Map<text,text>);
To add new attributes you can use a query like the one below (note that text map keys and values must be quoted, and attributes + {...} appends to the map rather than replacing it):
UPDATE Employee SET attributes = attributes + { 'manager_name' : 'Emp3', 'age' : '45' } WHERE empid = 102;
Update -
Another way to create a dynamic column model is as below:
create table Employee(
empid int,
empName text,
attribute text,
attributevalue text,
primary key (empid, empName, attribute)
);
Let's take a few inserts:
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','age','25') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','manager','emp2') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','department','hr') ;
This data structure will create a wide row and behaves like dynamic columns. You can see that the primary key parts empid and empName are common to all three rows; only the attribute and value change, as the query below shows.
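To read all attributes of one employee back (a sketch against the table above):
SELECT empName, attribute, attributevalue FROM Employee WHERE empid = 102;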
Hope this will help
Cassandra uses a special primary key called a composite key. This is the representation of the partitions, and it is also one reason why Cassandra scales well: the composite key is used to determine the nodes on which the rows are stored.
The result in your console may be a result set of rows, but the internal organization of Cassandra is different from that. Have you ever tried to query a table without a primary key? You will quickly see that you can't query that flexibly (because of the partitioning).
After that you will understand why we have to use a query-first design approach for Cassandra. This is completely different from an RDBMS.
I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say ('ID', 'birthday')), I cannot perform this query because part of the primary key is missing.
How can i approach this by designing my column family differently ?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident, but because such a feature (e.g. update users set status=1 where id>10) would allow a user to update all data in a table at once, which can be very, very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans within multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4) (with timeuuid values in place of the integers shown). But queries with really large sets of keys inside an IN clause may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get some kind of even partition distribution in the cluster. But the general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic in a way that avoids updating/deleting data at all:
instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
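A sketch of how you would then read the current status instead of updating in place (the ? is a prepared-statement bind marker; you could also declare WITH CLUSTERING ORDER BY (updated_at DESC) on the table to get the newest first by default):
SELECT status FROM user_statuses WHERE id = ? ORDER BY updated_at DESC LIMIT 1;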
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.
So I'm designing this data model for product price tracking.
A product can be followed by many users and a user can follow many products, so it's a many-to-many relation.
The products are under constant tracking, but a new price is inserted only if it has varied from the previous one.
The users have set an upper price limit for their followed products, so every time a price varies, the preferences are checked and the users will be notified if the price has dropped below their threshold.
So initially I thought of the following product model:
However "subscriberEmails" is a list collection that will handle up to 65536 elements. But being a big data solution, it's a boundary that we don't want to have. So we end up writing a separate table for that:
So now "usersByProduct" can have up to 2 billion columns, fair enough. And the user preferences are stored in a "Map" which is again limited but we think it's a good maximum number of products to follow by user.
Now the problem we're facing is the following:
Every time we want to update a product's price we would have to make a query like this:
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87); // Example only
But INSERT operations don't accept conditional clauses other than IF NOT EXISTS, and that isn't what we want. We need to update the price only if it's different from the previous one, so this forces us to make two queries (one to read the current value and another to update it if necessary).
PS: UPDATE operations do have IF conditions, but that doesn't fit our case because we need an INSERT.
UPDATE products SET date = dateof(now()) WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd IF price != 20.3; // example only
Don't try to apply a normalized relational model to a Cassandra database. It may work, but you'll end up with terrible performance and scalability.
The recommended approach to Cassandra data modeling is to first figure out your read queries against the database and structure your data so that these reads are cheap. You'll probably need to duplicate writes somewhat but it's OK because writes are pretty cheap in Cassandra.
For your specific use case, the key query seems to be able to get all users interested in a price change in a product, so you create a table for this, for example:
create table productSubscriptions (
productId uuid,
priceLimit float,
createdAt timestamp,
email text,
primary key (productId,priceLimit,createdAt)
);
but since you also need to know all product subscriptions for a user, you also need a user-keyed table of the same data:
create table userProductSubscriptions (
email text,
productId uuid,
priceLimit float,
primary key (email, productId)
)
With these 2 tables, I guess you can see that all your main queries can be done with a single-partition select, and your inserts/deletes are straightforward but will require you to modify both tables in sync, as in the sketch below.
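A hedged sketch of such a synchronized write, using a logged batch (the UUID and values are illustrative only):
BEGIN BATCH
INSERT INTO productSubscriptions (productId, priceLimit, createdAt, email) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, 19.99, dateof(now()), 'user@example.com');
INSERT INTO userProductSubscriptions (email, productId, priceLimit) VALUES ('user@example.com', 7dacedd2-c09b-46c5-8686-00c2a03c71dd, 19.99);
APPLY BATCH;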
Obviously, you'll need to flesh out the schema a bit more for your complete needs, but this should give you an example of how to think about your Cassandra schema.
Conditional update issue
For your conditional insert issue, the easiest answer is: do it with an UPDATE if you really need it (UPDATE and INSERT are nearly identical in CQL), but it's a very expensive operation, so avoid it if you can.
For your use case, I would split your product table in three :
create table products (
category uuid,
productId uuid,
url text,
price float,
primary key (category, productId)
)
create table productPricingAudit (
productId uuid,
date timestamp,
price float,
primary key (productId, date)
)
create table priceScheduler (
day text,
checktime timestamp,
productId uuid,
url text,
primary key (day, checktime, productId) // productId in the clustering key keeps two products checked at the same instant from overwriting each other
)
the products table can hold the full catalog, optionally split into categories (so that listing all products in a single category is a single-partition select)
productPricingAudit gets an insert with the latest retrieved price, whatever it is, since this will let you debug any pricing issue you may have
priceScheduler holds all the checks to be made for a given day, ordered by check time. Your scheduler simply has to make a column range query on a single partition whenever it runs, as sketched below.
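A sketch of that scheduler query (the format of the day text column is an assumption; use whatever convention you pick):
SELECT checktime, productId, url FROM priceScheduler WHERE day = '2015-06-01' AND checktime <= '2015-06-01 12:00:00';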
With such a schema, you don't care about conditional updates; you simply issue 3 inserts whenever you update a product price, even if it doesn't change.
Okay, I will try to answer my own question: conditional inserts other than IF NOT EXISTS are not supported in Cassandra to date, period.
The closest thing is a conditional update, but that doesn't work in our scenario. So there's one simple option left: application-side logic. This means that you have to read the previous entry and make the decision in your application. The obvious downside is that 2 queries are performed (one SELECT and one INSERT), which adds latency.
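A sketch of that read-then-decide pattern, reusing the example statement from earlier (the comparison between old and new price happens in application code, not in CQL):
// 1. read the current price for the product
SELECT price FROM products WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd;
// 2. only if the application sees a different price, write the new row
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87);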
However, this suits our application, because every time we perform a query to enqueue all items that should be checked, we can select the items' URLs and their current prices too. So the workers that check the latest price can then decide whether or not to insert, because they have the current price to compare with.
So... A query similar to this would be performed every X minutes:
SELECT id, url, price FROM products WHERE "nextCheckTime" < now();
// example only, wouldn't even work if nextCheckTime is not part of the PK or index
This is a very costly operation to perform on a Cassandra cluster, because it has to go through all rows, which are stored randomly on different nodes by default. Another downside is that we need some advanced and specific statistics regarding products and users.
So we've decided that a relational database will serve us better than Cassandra in this particular case.
We sadly leave behind all of Cassandra's advantages (fast inserts, easy scaling, built-in sharding...) and look towards a MySQL Cluster or master-slave implementation.
I'm a newbie at designing Cassandra data models and I need some help to think outside the box.
Basically I need a hierarchical table, something pretty standard when modeling employees.
You have an employee, say Big Boss, who has a list of employees under him.
Something like:
create table employee(id timeuuid, name text, employees list<employee>, primary key(id));
So, is there a way to model a hierarchical structure in Cassandra by using the table's own type as a column, or is there another approach?
When trying the line above, it gives me:
Bad Request: line 1:61 no viable alternative at input 'employee'
EDITED
I was thinking about 2 possibilities:
Store a UUID instead, and in my Java application look up each employee by UUID when bringing up the "boss".
Work with a Map<uuid, text>, where the UUID is the id itself and the text is the entire row; then in my Java application get the map, convert each "text" employee into an Employee entity, and finally return the whole object.
It really depends on your queries...one particular model would only be good for a set of queries, but not others.
You can store ids and look them up again at the client side. This means n extra queries for each "query". This may or may not be a problem, as queries that hit a partition are fast. Using a map from id to name is also an option. This means you do extra work and denormalise the names into the map values. That's also valid. A third option is to use a UDT (user-defined type). You could then have a list, set, or even map of them. In Cassandra 2.1, you can index map keys/values as well, allowing for some quite flexible querying.
https://www.datastax.com/documentation/cql/3.1/cql/cql_using/cqlUseUDT.html
One more approach could be to store a person's details as an id plus static columns for their attributes, and have their "children" as columns in wide-row format.
This could look like
create table person(
id int,
name text static,
age int static,
employee_id int,
employee frozen<employeeudt>, // UDT columns must be frozen in Cassandra 2.1; static columns require a clustering column, hence employee_id
primary key (id, employee_id)
);
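The employeeudt referenced above is not defined in the original; a minimal sketch of what it could look like (field names are assumptions):
create type employeeudt (
name text,
age int
);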
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
Querying this will give you rows with the static properties repeated, but on disk, it's still held once. You can resolve the rest client side.
I need to store the latest updates that need to be pushed to users' newsfeed pages in a Cassandra table for later retrieval, and my table's schema is as follows:
CREATE TABLE newsfeed (
user_name text,
post_id bigint,
post_type text,
favorited boolean,
shared boolean,
own boolean,
date timestamp,
PRIMARY KEY (user_name, date, post_id, post_type)
);
The first three columns (user_name, post_id, and post_type) in combination build the actual primary key of the table; however, since I wanted to ORDER the SELECT queries on this table by the "date" of the rows, I placed the date column into the primary key fields as the second entry (did I have to do this?).
When I want to delete a row by giving only "user_name, post_id, and post_type" as follow:
DELETE FROM newsfeed WHERE user_name='pooria' and post_id=36 and post_type='p';
I will get the following error:
Bad Request: Missing PRIMARY KEY part date since post_id is set
I need the date-column to be part of the primary key since I want to use it in my ORDER BY clauses and on the other hand I have to delete some rows without knowing their "date" values!
So how are such problems tackled in Cassandra? Should I fix my data model and use a different schema for this job?
DataStax's Chief Evangelist Patrick McFadden posted an article demonstrating a few time series modeling patterns. Definitely makes for a good read, and should be of some help to you: Getting Started with Time Series Data Modeling.
I think your table is just fine. Although, with the way that composite primary keys work in Cassandra, you cannot skip primary key components in a query. So if you do end up needing to query data by user_name, post_id, and/or post_type differently (without date), you should create a table specifically for that query (one which does not include date in the primary key).
I will however say that, in general, creating a table which will process regular delete operations is not a good idea. In fact, I'm pretty sure that has been classified as a Cassandra "anti-pattern." Data isn't really deleted from Cassandra; it is tombstoned. Tombstones are reconciled at compaction time (assuming that the tombstone threshold time has been met), and having too many of them has been known to cause performance issues.
If you read the article I linked above, go down to the section named "Time Series Pattern 3." You will notice that the INSERT statements are run with the USING TTL clause. This gives the data a time-to-live in seconds, after which it will "quietly disappear." For instance, if you wanted to keep your data around for 24 hours (86400 seconds) you could do something like this:
INSERT INTO newsfeed (...) VALUES (...) USING TTL 86400
Using the TTL feature is a preferable alternative to regular cleansing by DELETE.
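A concrete sketch against the newsfeed schema above (the values are made up for illustration; the column names come from the question):
INSERT INTO newsfeed (user_name, date, post_id, post_type, favorited, shared, own) VALUES ('pooria', dateof(now()), 36, 'p', false, false, true) USING TTL 86400;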