Cassandra Table Modeling - cassandra

Imagine a table with thousands of columns, where most data in the row record is null. One of the columns is an ID, and this ID is known upfront.
select id,SomeRandomColumn
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
SomeRandomColumn is one of thousands, and in most cases the only column with data. SomeRandomColumn is NOT known upfront as the one that contains data.
Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless a single CQL query can do something like this:
select ColumnHint.{DataColumnName}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
In MongoDB I would just have a collection, and the document I got back would have a "Type" attribute describing the data. So perhaps my real question is: how do I replicate what I can do with MongoDB in Cassandra? My Cassandra journey so far has been to create a UDT for each unique document type, and then alter the table to add that new UDT as a column. My starter table looks like this, where ColumnDataName is the hint:
CREATE TABLE IF NOT EXISTS WideProductInstance (
Id uuid,
ColumnDataName text,
PRIMARY KEY (Id)
);
Thanks

Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
No, you cannot do that, and the reason is easy to explain. To know whether a column contains data, Cassandra has to read it. Once it has read the data, the disk effort is already spent, so it simply returns the data to the client.
The only saving you would get if Cassandra were able to filter out null columns is network bandwidth ...
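Since the filtering cannot happen in CQL, it can be done in the client after a plain SELECT * on the id. The following is a minimal sketch, assuming the driver hands back the row as a dict of column name to value; the row contents and the names othercolumn1 and othercolumn2 are made up for illustration.

```python
def non_null_columns(row):
    """Return only the columns of a row that actually hold a value."""
    return {name: value for name, value in row.items() if value is not None}


# Hypothetical row, shaped like the result of:
#   SELECT * FROM LotsOfColumnsTable WHERE id = 92e72b9e-...;
row = {
    "id": "92e72b9e-7507-4c83-9207-c357df57b318",
    "somerandomcolumn": "payload",
    "othercolumn1": None,
    "othercolumn2": None,
}

populated = non_null_columns(row)
print(populated)
```

This does not save any work on the server; it only hides the empty cells from the rest of the application.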
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless a single CQL query can do something like this:
Your idea amounts to storing, in another table, a list of the columns that actually contain data rather than null. That resembles a JOIN, which is bad and not supported. And if you need to read this reference table before reading the original table, you will have to read from several places, which is going to be expensive.
So perhaps my real question is how do I replicate what I can do with MongoDB in Cassandra.
Don't try to replicate a Mongo feature in Cassandra. The two databases have fundamentally different architectures. Instead, reason about your functional use case: "How do I want to fetch my data from Cassandra?" and design a proper data model from there. A Cassandra data model is designed around its queries.
The best advice I can give is to watch some of the Cassandra data modeling videos (they are free) at http://academy.datastax.com

Related

Understanding Cassandra Data Model

I have recently started learning No-SQL and Cassandra through this article. The author explains the data model through this diagram:
The author also gives the below column family example:
Book {
key: 9352130677{ name: “Hadoop The Definitive Guide”, author: “Tom White”, publisher: “Oreilly”, priceInr: 650, category: “hadoop”, edition: 4},
key: 8177228137{ name: “Hadoop in Action”, author: “Chuck Lam”, publisher: “manning”, priceInr: 590, category: “hadoop”},
key: 8177228137{ name: “Cassandra: The Definitive Guide”, author: “Eben Hewitt”, publisher: “Oreilly”, priceInr: 600, category: “cassandra”},
}
But in that tutorial and every other tutorial I have gone through, they end up creating regular tables in Cassandra. I am unable to connect the Cassandra model with what I am creating.
For example, I created a column family called Employee as below:
create columnfamily Employee(empid int primary key,empName text,age int);
Now I inserted some data and my column family looks like this:
To me this looks like a regular relational table and not like the data model the author has explained. How do I create an Employee column family where each row represents an employee with different attributes? Something like:
Employee{
101:{name:Emp1,age:20}
102:{name:Emp2,salary:1000}
102:{manager_name:Emp3,age:45}
}
You need to understand that although the CQL representation may look like a regular relational table, the internal structure of rows in Cassandra is completely different. Cassandra saves a different set of attributes for each employee, and the nulls you see when querying with CQL are just a representation of empty/nonexistent cells.
What you are trying to achieve is an unstructured data model. Cassandra started with this model, and everything worked as described in the tutorial you read, but there is an opinion that unstructured data design is unhealthy for development and creates more problems than it solves. So, after some time, Cassandra moved to a "structured" data model (and from Thrift to CQL). This doesn't mean that you have to store all attributes for all keys/rows, and it doesn't mean that all rows have the same number of attributes; it just means that you have to declare attributes before you use them.
You can achieve a kind of unstructured data modeling using collection types (Map, List, Set, etc.), UDTs (user-defined types), or by simply saving your data as a JSON string and parsing it on the application side.
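The JSON-string option can be sketched on the application side as follows. This is only an illustration: the dict named stored stands in for a table with a single text column per employee, and the helper names are hypothetical.

```python
import json


def encode_attributes(attributes):
    """Serialize an employee's free-form attributes for a single text column."""
    return json.dumps(attributes, sort_keys=True)


def decode_attributes(text):
    """Parse the text column back into a dict; an empty cell gives an empty dict."""
    return json.loads(text) if text else {}


# Stand-in for the table: empid -> JSON text column (illustrative data only).
stored = {
    101: encode_attributes({"name": "Emp1", "age": 20}),
    102: encode_attributes({"name": "Emp2", "salary": 1000}),
}

employee = decode_attributes(stored[102])
print(employee)
```

The trade-off is that Cassandra sees one opaque string, so individual attributes cannot be queried or updated without rewriting the whole value.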
What you have understood is correct. Internally, Cassandra stores columns exactly like the image in your question.
Now, what you are expecting is to insert a column that was not defined when the Employee table was created. For dynamic columns, you can use the Map data type.
For example
create table Employee(
empid int primary key,
empName text,
age int,
attributes Map<text,text>);
To add new attributes you can use a query like the one below. Map keys and values of type text must be quoted, and attributes + {...} adds entries to the existing map rather than replacing it.
UPDATE Employee SET attributes = attributes + { 'manager_name' : 'Emp3', 'age' : '45' } WHERE empid = 102;
Update -
Another way to create a dynamic column model is shown below:
create table Employee(
empid int,
empName text,
attribute text,
attributevalue text,
primary key (empid,empName,attribute)
);
Let's take a few inserts -
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','age','25') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','manager','emp2') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','department','hr') ;
This data structure will create a wide row and behaves like dynamic columns. You can see that the primary key columns empid and empName are common to all three rows; only the attribute and its value change.
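Reading such a wide row back gives one result row per attribute, so the client usually folds them into a single record. A minimal sketch, assuming rows arrive as dicts with lower-cased column names; the data mirrors the three inserts above.

```python
from collections import defaultdict


def pivot_attributes(rows):
    """Fold (empid, empname, attribute, attributevalue) rows into one dict per employee."""
    employees = defaultdict(dict)
    for row in rows:
        key = (row["empid"], row["empname"])
        employees[key][row["attribute"]] = row["attributevalue"]
    return dict(employees)


# Rows as they might come back from: SELECT * FROM Employee WHERE empid = 102;
rows = [
    {"empid": 102, "empname": "Emp1", "attribute": "age", "attributevalue": "25"},
    {"empid": 102, "empname": "Emp1", "attribute": "department", "attributevalue": "hr"},
    {"empid": 102, "empname": "Emp1", "attribute": "manager", "attributevalue": "emp2"},
]

employees = pivot_attributes(rows)
print(employees)
```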
Hope this helps.
Cassandra uses a special kind of primary key called a composite key. Its first part, the partition key, identifies the partition and is used to determine the nodes on which the rows are stored. This is also one reason why Cassandra scales well.
The result in your console may be a result set of rows, but the internal organization of Cassandra is different from that. Have you ever tried to query a table without specifying the primary key? You will quickly see that you can't query that flexibly (because of the partitioning).
After that you will understand why we have to use a query-first design approach for Cassandra. This is completely different from an RDBMS.

How we can do CRUD operations on complex data models in Cassandra?

How can we do CRUD operations on complex data models in Cassandra?
I have a project using NoSQL.
I have a column family for my customers.
At first, the column family has just an "id" column.
Then it will be updated by adding new columns with ALTER.
The number and type of columns could be different for each customer.
Also, each column can include sub-columns with ids of their own, and those would be altered too, so they should be indexed. Documents are not useful for this problem.
I've read about NoSQL and decided to use Cassandra. I would be thankful if you could answer these questions:
Is the above possible?
How can we create this column family and run CRUD operations on it?
If the answer to the previous question is yes, what is the type of a query result?
Will it return several rows for each primary key (id)?
How can we manage that, so that the table is accessed with no redundancy? I don't know whether this summarizing should be handled on the database side or in code.
Thank you for your help.

Query in Cassandra that will sort the whole table by a specific field

I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers on StackOverflow, and some examples created a single partition with one id (bucket) as the primary key, but I don't want that because I want my data spread across the nodes by category.
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
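The client-side merge can be sketched with Python's heapq.merge, which combines already-sorted inputs lazily instead of re-sorting everything. The partition contents below are invented, and each list stands in for the result of one per-category query.

```python
import heapq


def tagged(category, names):
    """Yield (name, category) pairs for one partition, preserving its order."""
    for name in names:
        yield name, category


def merge_sorted_partitions(partitions):
    """Merge per-category name lists (each already sorted ASC) into one sorted list."""
    streams = [tagged(category, names) for category, names in partitions.items()]
    return list(heapq.merge(*streams))


# Each value is what "SELECT name FROM my_table WHERE category = ?" would return,
# already ordered by the clustering column.
partitions = {
    "fruit": ["apple", "cherry"],
    "veg": ["beet", "carrot"],
}

merged = merge_sorted_partitions(partitions)
print(merged)
```

Because every input is already sorted, the merge only compares the current head of each partition, which is why it is cheaper than a full sort.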
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
Always model Cassandra tables around the access patterns (relational databases and Cassandra fill different needs).
Up to Cassandra 2.x, one had to model a new column family (table) for each access pattern. So if your access pattern needs a specific column to be sorted, model a table with that column in the partition/clustering key. The code then has to insert into both the master table and the projection table. Note that, depending on your business logic, these may be difficult to keep synchronised under concurrent updates, especially when an update has to be performed after a read on the projection.
With Cassandra 3.x there are now materialized views, which give you a similar feature but are maintained internally by Cassandra. I'm not sure they fit your problem, as I haven't played much with 3.x, but they may be worth investigating.
There is more on materialized views on their blog.

Cassandra Hierarchy Data Model

I'm a newbie at designing Cassandra data models and I need some help thinking outside the box.
Basically I need a hierarchical table, something pretty standard when talking about employees.
You have an employee, say Big Boss, who has a list of employees under him.
Something like:
create table employee(id timeuuid, name text, employees list<employee>, primary key(id));
So, is there a way to model a hierarchy in Cassandra by using the table type itself as a column type, or is there another approach?
When I try the line above, it gives me
Bad Request: line 1:61 no viable alternative at input 'employee'
EDITED
I was thinking about 2 possibilities:
Store a uuid instead, and in my Java application look up each Employee by uuid when loading the "boss".
Work with a Map, where the key is the uuid and the text value is the entire row; then in my Java application get the map, convert each "text" employee into an Employee entity, and finally return the whole object.
It really depends on your queries...one particular model would only be good for a set of queries, but not others.
You can store ids, and look them up again at the client side. This means n extra queries for each "query". This may or may not be a problem, as queries that hit a partition are fast. Using a map from id to name is also an option. This means you do extra work and denormalise the names into the map values. That's also valid. A third option is to use a UDT (user defined type). You could then have a list or set or even map. In cassandra 2.1, you could index the map keys/ values as well, allowing for some quite flexible querying.
https://www.datastax.com/documentation/cql/3.1/cql/cql_using/cqlUseUDT.html
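The first option (store ids, look them up again on the client) can be sketched like this. The dict EMPLOYEE_TABLE and the column name employee_ids are hypothetical stand-ins for the real table, and fetch_row represents one single-partition query per id.

```python
# Hypothetical stand-in for the employee table: each row stores only the ids
# of the direct reports, not the employees themselves.
EMPLOYEE_TABLE = {
    "boss": {"id": "boss", "name": "Big Boss", "employee_ids": ["e1", "e2"]},
    "e1": {"id": "e1", "name": "Emp1", "employee_ids": []},
    "e2": {"id": "e2", "name": "Emp2", "employee_ids": ["e3"]},
    "e3": {"id": "e3", "name": "Emp3", "employee_ids": []},
}


def fetch_row(employee_id):
    """Stand-in for one query by primary key (one partition read)."""
    return EMPLOYEE_TABLE[employee_id]


def load_employee(employee_id, fetch=fetch_row):
    """Rebuild the hierarchy under one employee; assumes the hierarchy has no cycles."""
    row = fetch(employee_id)
    return {
        "id": row["id"],
        "name": row["name"],
        "employees": [load_employee(child, fetch) for child in row["employee_ids"]],
    }


boss = load_employee("boss")
print(boss)
```

Each level of the hierarchy costs one extra query per child, which is the "n extra queries" mentioned above.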
One more approach could be to store a person's details as id, static columns for their attributes, and have "children" as columns in wide row format.
This could look like
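create table person(
id int,
name text static,
age int static,
employee_id int, -- clustering column, one row per child (illustrative name); static columns require one
employee frozen<employeeudt>,
primary key (id, employee_id)
);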
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
Querying this will give you rows with the static properties repeated, but on disk, it's still held once. You can resolve the rest client side.
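Resolving the rest client side could look like the sketch below. It assumes each returned row repeats the static name and age and carries one child; the column names employee_id and employee_name are illustrative and not part of the original schema.

```python
def collapse_person(rows):
    """Fold the rows of one partition into a single person with a list of children."""
    if not rows:
        return None
    first = rows[0]
    # Static columns are identical on every row, so take them from the first one.
    person = {"id": first["id"], "name": first["name"], "age": first["age"], "employees": []}
    for row in rows:
        if row["employee_id"] is not None:
            person["employees"].append(
                {"id": row["employee_id"], "name": row["employee_name"]}
            )
    return person


# Illustrative result of selecting one partition (id = 1).
rows = [
    {"id": 1, "name": "Big Boss", "age": 50, "employee_id": 2, "employee_name": "Emp2"},
    {"id": 1, "name": "Big Boss", "age": 50, "employee_id": 3, "employee_name": "Emp3"},
]

person = collapse_person(rows)
print(person)
```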

Handling the following use case in Cassandra?

I've been given the task of modelling a simple in Cassandra. Coming from an almost solely SQL background, though, I'm having a bit of trouble figuring it out.
Basically, we have a list of feeds that we're listening to that update periodically. This can be in RSS, JSON, ATOM, XML, etc (depending on the feed).
What we want to do is periodically check for new items in each feed, convert the data into a few formats (i.e. JSON and RSS) and store that in a Cassandra store.
So, in an RDBMS, the structure would be something akin to:
Feed:
feedId
name
URL
FeedItem:
feedItemId
feedId
title
json
rss
created_time
I'm confused as to how to model that data in Cassandra to facilitate simple things such as getting x amount of items for a specific feed in descending created order (which is probably the most common query).
I've heard of one strategy that involves a composite key storing, in this example, the created_time as a time-based UUID together with the feed item ID, but I'm still a little confused.
For example, let's say I have a series of rows whose key is basically the feedId. Inside each row, I store a range of columns as mentioned above. The question is, where does the actual data go (i.e. JSON, RSS, title)? Would I have to store all the data for that 'record' as the column value?
I think I'm confusing wide rows and narrow (short?) rows as I like the idea of the composite key but I also want to store other data with each record and I'm not sure how to meld the two together...
You can store everything in one column family. However, if the data for each FeedItem is very large, you can split the data for each FeedItem into another column family.
For example, you can have one column family for Feed, where the columns of each key are FeedItem ids, something like:
Feeds # column family
FeedId1 #key
time-stamp-1-feed-item-id1 #columns have no value, or values are enough info
time-stamp-2-feed-item-id2 #to show summary info in a results list
The Feeds column family lets you quickly get the last N items from a feed, and querying for the last N items of a feed doesn't require fetching all the data for each FeedItem: either nothing is fetched, or just a summary.
Then you can use another column family to store the actual FeedItem data,
FeedItems # column family
feed-item-id1 # key
rss # 1 column for each field of a FeedItem
title #
...
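The two-column-family layout can be mimicked with plain Python dicts to show how a "last N items" read would flow. This is only an in-memory illustration: the integer timestamps, titles, and RSS snippets are made up, and each dict lookup stands in for a query.

```python
# Feeds column family: feed id -> {(created_time, feed_item_id): summary}.
# The column name sorts by time first, like the time-stamp-N-feed-item-idN columns above.
FEEDS = {
    "FeedId1": {
        (1, "feed-item-id1"): "first title",
        (2, "feed-item-id2"): "second title",
        (3, "feed-item-id3"): "third title",
    }
}

# FeedItems column family: feed item id -> one column per field.
FEED_ITEMS = {
    "feed-item-id1": {"title": "first title", "rss": "<item>1</item>"},
    "feed-item-id2": {"title": "second title", "rss": "<item>2</item>"},
    "feed-item-id3": {"title": "third title", "rss": "<item>3</item>"},
}


def latest_item_ids(feed_id, n):
    """Return the ids of the n newest items of a feed, newest first."""
    columns = sorted(FEEDS[feed_id], reverse=True)
    return [item_id for _, item_id in columns[:n]]


def latest_items(feed_id, n):
    """Fetch the full data only for the n newest items."""
    return [FEED_ITEMS[item_id] for item_id in latest_item_ids(feed_id, n)]


newest_ids = latest_item_ids("FeedId1", 2)
newest = latest_items("FeedId1", 2)
print(newest_ids)
```

The index read touches only the Feeds entry; the heavier FeedItems data is loaded just for the ids that were selected.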
Using CQL should be easier for you to understand, given your SQL background.
Cassandra (and NoSQL in general) is very fast, so you get no real benefit from using a related table for feeds, and in any case you will not be able to do JOINs. You can still create two tables if that's more comfortable for you, but you will have to manage the linking of data in your application code.
You can use something like:
CREATE TABLE FeedItem (
feedItemId ascii PRIMARY KEY,
feedId ascii,
feedName ascii,
feedURL ascii,
title ascii,
json ascii,
rss ascii,
created_time ascii );
Here I used ascii fields for everything. You can choose different data types for feedItemId or created_time; the available data types can be found here. Depending on which language and client you are using, they may work transparently or require some more work.
You may want to add some secondary indexes, for example if you want to search for feed items from a specific feedId, something like:
SELECT * FROM FeedItem where feedId = '123';
To create the index:
CREATE INDEX FeedItem_feedId ON FeedItem (feedId);
Sorting / ordering, alas, is not easy in Cassandra. Reading here and here may give you some clues about where to start looking; it also really depends on the Cassandra version you're going to use.
