Am I violating the data modelling rule in Cassandra?

I understand that we should not create 'N' partitions under a single table, because then a query has to fan out to the N nodes where those partitions live.
(I have modified the example for clarity and security.)
If I have a table like 'user'
CREATE TABLE user(
user_id int PRIMARY KEY,
user_name text,
user_phone varint
);
where user_id is unique.
Example - To get all the users from the table, I use the query :
select * from user;
This means the query goes to every node that holds a partition of this table. Since I used user_id as the partition / primary key here, the rows are scattered across all the nodes based on their partition token.
Is it fine? Or Is there a better way to design this in Cassandra?
Edit:
Keeping everything under a single partition keyed by a constant 'uniquekey', sorted by user_name, has the advantage that all users live in one partition. Is that a better design than the one above?
CREATE TABLE user(
    bucket text,  -- holds the constant 'uniquekey'; a literal cannot appear in a primary key, so a constant-valued column (here called bucket) is needed
    user_id int,
    user_name text,
    user_phone varint,
    PRIMARY KEY (bucket, user_name)
);
SELECT * FROM user WHERE bucket = 'uniquekey';

A fundamental table design rule in Cassandra is query-driven design: you should understand what you are trying to query before you design the table schema.
If you just want to return all the rows (SELECT *) in the table (which is not a common use case for Cassandra, since Cassandra aims to store very, very large amounts of data), whatever you designed is fine. But Cassandra might not be the best choice in that case.
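To make "query-driven" concrete, a minimal sketch using the table above: if the query you actually need is "fetch one user by id", the original design already serves it with a single-partition read:

-- single-partition read: the coordinator hashes user_id to a token
-- and contacts only the replicas owning that partition
SELECT user_name, user_phone FROM user WHERE user_id = 42;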
How to ensure a good table design in Cassandra?
Ref: Basic Rules of Cassandra Data Modeling

Related

Cassandra - Shall I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Isn't that expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, I have to write 5 update statements. Is there another solution? Or is that not so bad?
Cassandra advocates denormalizing your data and creating your data model according to your queries. You will have to design your data model so that it satisfies all your queries with good performance. For performance (due to its architecture and design), Cassandra asks that you write and read using the partition key.
It is not expensive to write 5 insertions of the same set of data into 5 different tables. Your reads will perform better, and as your data size increases to web scale, you will thank your decision of creating 5 tables and writing to them.
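If you want the five writes applied together, a logged batch is a common way to keep denormalized tables in sync. A minimal sketch, assuming column names (user_id, username, email), since the actual table definitions are not shown here:

BEGIN BATCH
    INSERT INTO users_by_id (user_id, username, email) VALUES (1, 'alice', 'alice@example.com');
    INSERT INTO users_by_email (email, user_id, username) VALUES ('alice@example.com', 1, 'alice');
    INSERT INTO users_by_username (username, user_id, email) VALUES ('alice', 1, 'alice@example.com');
    -- users_by_likes and users_by_followers would follow the same pattern
APPLY BATCH;

Note that a logged batch buys atomicity (all statements eventually apply), not speed; for raw throughput, five separate asynchronous inserts are usually faster.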
You can explore materialized views (see Materialized View and the Datastax link for Materialized View), but remember it is an experimental feature. So you have to understand it properly and also identify the open issues with materialized views.
I would recommend you study Cassandra data model that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. One should always design tables for the read queries; if one table satisfies one query, it is a fine design.
To answer your second question: design your tables in such a way that you do not have to update them. Always think about inserting new values.
For example, below table design
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
INSERT INTO user_by_email (email, timestamp, name, fullname, userId)
VALUES ('alice@example.com', toTimestamp(now()), 'Alice', 'Alice Smith', 'u1');  -- illustrative values
With this design, the first row you read back is the latest inserted value. Additionally, this design keeps a change history for that key.
Think about it: how often do we have to update values like user id, email, or username? Rarely.
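For instance, fetching the current state of a user is a single-partition LIMIT 1 read, because the DESC clustering order puts the newest row first (illustrative value):

SELECT name, fullname, userId
FROM user_by_email
WHERE email = 'alice@example.com'
LIMIT 1;  -- newest timestamp first, i.e. the latest version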

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using both Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives an example of "user lookup": we want to look up users by their username or their email, and according to him the best approach is to have two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
)
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
)
However, things get confusing with Ignite on top of Cassandra. Unfortunately, I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
age int
)
You should probably have one cache per Cassandra table.
If your original key is compound, the Ignite key should be compound too.
You will need to use secondary indexes in Ignite to query by more than one field, and this means you will have to hold all data in Ignite (which is NOT necessary for a pure caching scenario). This means enabling readThrough and writeThrough, doing loadCache, and always doing all updates through Ignite. You will have to choose between "Ignite as a cache for Cassandra" (stick to Cassandra's data layout; can hold partial data) and "Ignite as a DB backed by Cassandra" (you can use a layout optimal for Ignite, with secondary indexes).

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key on queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the work in your application: query every (id, secondary_id) pair and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task; a sketch follows below.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
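A sketch of that approach (the tracking table and its name are assumptions, not part of the original schema): record which secondary_ids exist per id, then fan the reads out with one fully-keyed query per pair:

CREATE TABLE secondary_ids_by_id (
    id int,
    secondary_id int,
    PRIMARY KEY (id, secondary_id)
);

-- step 1: find the secondary_ids for this id (a single partition)
SELECT secondary_id FROM secondary_ids_by_id WHERE id = 1;

-- step 2: issue one query per (id, secondary_id) pair, ideally
-- asynchronously and in parallel
SELECT * FROM time_data WHERE id = 1 AND secondary_id = 42;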

Can a Cassandra table be queried using only a part of the composite partition key?

Consider a table like this to store a user's contacts -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
    -- ^-- Note the composite partition key
);
The composite partition key results in a separate partition per (user, contact) pair.
Let's say there are 100 million users and every user has a few hundred contacts.
I can look up a particular user's particular contact's data by using
SELECT contact_data FROM contacts WHERE user_name='foo' AND contact_name='bar'
However, is it also possible to look up all contact names for a user using something like,
SELECT contact_name FROM contacts WHERE user_name='foo'
Could the WHERE clause contain only some of the columns that form the primary key?
EDIT -- I tried this and Cassandra doesn't allow it. So my question now is: how would you model the data to support two queries -
Get data for a specific user & contact
Get all contact names for a user
I can think of two options -
Create another table containing user_name and contact_name, with only user_name as the primary key. But then, if a user has too many contacts, could that become a wide-row issue?
Create an index on user_name. But given 100M users with only a few hundred contacts per user, would user_name be considered a high-cardinality value and hence bad for use in an index?
In an RDBMS, the query planner might be able to create an efficient plan for that kind of query. Cassandra cannot: it would have to do a full table scan. Cassandra tries hard not to allow you to make those kinds of queries, so it rejects this one.
No, you cannot. If you look at the mechanism of how Cassandra stores data, you will understand why you cannot query by part of a composite partition key.
Cassandra distributes data across nodes based on the partition key. The coordinator of a write request generates a hash token from the partition key using the murmur3 algorithm and sends the write request to that token's owner (each node owns a token range). During a read, the coordinator again calculates the hash token from the partition key and sends the read request to the token's owner node.
Since you are using a composite partition key, all components of the key (user_name, contact_name) are used to generate the hash token during a write. The owner node of this token has the entire row. During a read, you have to provide all components of the key to calculate the token and issue the read request to the correct owner of that token. Hence, Cassandra requires you to provide the entire partition key.
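You can see this reflected in CQL itself: the token() function accepts exactly the full set of partition key columns, which is what the coordinator hashes (a sketch against the contacts table from the question):

SELECT token(user_name, contact_name), contact_id
FROM contacts
WHERE user_name = 'foo' AND contact_name = 'bar';
-- token(user_name) alone would be rejected; it is not the full partition key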
You could use two tables with the same structure but different partition keys:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE TABLE contacts_by_users (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name), contact_id)
);
With this structure you have data duplication, and you have to maintain both tables manually, for example by wrapping the writes in a batch as sketched below.
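A minimal sketch of that dual write (values are illustrative); a logged batch keeps the two tables consistent with each other:

BEGIN BATCH
    INSERT INTO contacts (user_name, contact_name, contact_id, contact_data)
    VALUES ('foo', 'bar', 1, 0x0a);
    INSERT INTO contacts_by_users (user_name, contact_name, contact_id, contact_data)
    VALUES ('foo', 'bar', 1, 0x0a);
APPLY BATCH;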
If you are using Cassandra 3.0 or later, you can also use materialized views:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE MATERIALIZED VIEW contacts_by_users
AS SELECT *
FROM contacts
WHERE user_name IS NOT NULL
    AND contact_name IS NOT NULL
    AND contact_id IS NOT NULL
PRIMARY KEY ((user_name), contact_name, contact_id)
WITH CLUSTERING ORDER BY (contact_name ASC);
In this case, you only have to maintain the contacts table; the view is updated automatically.
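Both original queries are then served from a single partition each (illustrative values):

-- data for a specific user & contact: the base table
SELECT contact_data FROM contacts
WHERE user_name = 'foo' AND contact_name = 'bar';

-- all contact names for a user: the view
SELECT contact_name FROM contacts_by_users
WHERE user_name = 'foo';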

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, coming from an RDBMS background for a long time!
In one of my heavily used applications, I want to use NoSQL for the part of the data where transactions and the relational model don't make sense, and move it out of MySQL. What I would get is the AP of CAP [Availability and Partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of the application is similar to activity logging!
I would like to move this to NoSQL as per my requirements, separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking in map terms,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as the value.
After reading about User Defined Types in Cassandra, can I use a UDT as the value, essentially leveraging one key with multiple values? Otherwise, I could use normal columns without a UDT. One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity is unique.
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to read data randomly.
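To make that concrete, here is a sketch of what such a table could look like (the table name and key layout are assumptions derived from the columns above):

CREATE TABLE activity_log (
    entity_id int,
    entity_type text,
    created_on timestamp,
    version int,
    entity_data text,
    PRIMARY KEY ((entity_id, entity_type), created_on)
) WITH CLUSTERING ORDER BY (created_on DESC);

-- every read supplies the full partition key; no random access needed
SELECT entity_data FROM activity_log
WHERE entity_id = 42 AND entity_type = 'ORDER';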
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least a part of it). You create tables like so:
create table event(
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp)  -- further clustering columns could follow timestamp
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys.
To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
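For example, against the event table above, a typical query pins the partition and then ranges over the clustering key (dates are illustrative; minTimeuuid/maxTimeuuid convert timestamps to timeuuid bounds):

SELECT * FROM event
WHERE id = 123e4567-e89b-12d3-a456-426614174000
    AND timestamp >= minTimeuuid('2020-01-01')
    AND timestamp < maxTimeuuid('2020-02-01');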
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; that seems to be what you're looking to do.
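For example, a sketch of a UDT inside a map (the names are illustrative; note that UDTs nested in collections must be frozen):

CREATE TYPE address (
    street text,
    city text
);

CREATE TABLE person (
    id uuid PRIMARY KEY,
    name text,
    addresses map<text, frozen<address>>  -- e.g. 'home' -> { street: ..., city: ... }
);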
One thing to watch out for: your primary key uniquely identifies a record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for an existing row.
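A quick illustration against the event table above (the literals are illustrative): both inserts target the same primary key, so the second silently replaces the first:

INSERT INTO event (id, timestamp, some_column)
VALUES (123e4567-e89b-12d3-a456-426614174000, 50554d6e-29bb-11e5-b345-feff819cdc9f, 'first');

INSERT INTO event (id, timestamp, some_column)
VALUES (123e4567-e89b-12d3-a456-426614174000, 50554d6e-29bb-11e5-b345-feff819cdc9f, 'second');

-- reading this key back returns 'second'; no error is raised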
You mentioned that querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources, so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.
