How to understand the primary key in Apache Cassandra? - cassandra

I am new to Apache Cassandra. I have installed Cassandra and am using cqlsh on my laptop.
I created a table using:
create table userpageview( created_at timestamp, hit int, userid int, variantid int, primary key (created_at, hit, userid, variantid) );
I inserted several rows into the table, but when I tried to select with a condition on each column (one at a time), I got an error.
Maybe my data model is wrong. Can anyone tell me how to do data modelling in Cassandra?
Thanks.

You need to read about partition keys and clustering keys. Cassandra works much differently than relational databases and the types of queries you can do are much more restricted.
Some information to get you started: here and here.
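As a rough illustration of the restriction the asker is running into, the sketch below encodes a simplified version of the rule for the table above, where created_at is the partition key and hit, userid and variantid are clustering columns in that order. It is written in Python rather than CQL, the function name can_query is made up for this example, and the rule is an approximation: it only considers equality conditions and ignores secondary indexes, IN, range conditions and ALLOW FILTERING.

```python
def can_query(partition_keys, clustering_keys, restricted):
    """Simplified check: can equality conditions on the `restricted`
    columns be served without ALLOW FILTERING?"""
    restricted = set(restricted)
    # Every partition key column must be restricted.
    if not set(partition_keys) <= restricted:
        return False
    remaining = restricted - set(partition_keys)
    # Clustering columns must be restricted in declaration order, with no gaps.
    for column in clustering_keys:
        if column in remaining:
            remaining.discard(column)
        elif remaining:
            # A later (or non-key) column is restricted while this one is skipped.
            return False
        else:
            break
    # Anything left over is not part of the primary key at all.
    return not remaining


PARTITION = ["created_at"]
CLUSTERING = ["hit", "userid", "variantid"]

print(can_query(PARTITION, CLUSTERING, {"created_at"}))            # True
print(can_query(PARTITION, CLUSTERING, {"created_at", "hit"}))     # True
print(can_query(PARTITION, CLUSTERING, {"hit"}))                   # False
print(can_query(PARTITION, CLUSTERING, {"created_at", "userid"}))  # False
```

Under this simplified rule, selecting on hit, userid or variantid alone fails because the partition key created_at is missing from the condition.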

Related

CassandraQL allow filtering

I am creating a table in cassandra database but I am getting an allow filtering error:
CREATE TABLE device_check (
device_id int,
checked_at bigint,
is_power boolean,
is_locked boolean,
PRIMARY KEY ((device_id), checked_at)
);
When I run the query
SELECT * FROM device_check where checked_at > 1234432543
it gives an allow filtering error. I tried removing the brackets from device_id, but it gives the same error. Even when I tried setting only checked_at as the primary key, it still won't work with the > operator. With the = operator it works.
A PRIMARY KEY in Cassandra contains two types of keys:
Partition key
Clustering key
It is expressed as 'PRIMARY KEY ((partition key), clustering keys)'.
Cassandra is a distributed database in which a row can live on any node, depending on its partition key. To find data quickly, Cassandra expects the query to include the partition key so that it can identify the node where the data resides and query only that node. If you don't give the partition key, Cassandra has to search all the nodes, so it rejects the query with an error saying that ALLOW FILTERING is required.
The same reasoning explains why > is not supported on the partition key: a range search on it makes Cassandra scan all the nodes to respond, which is not the right way to use Cassandra.
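To make the "one node versus all nodes" point concrete, here is a toy Python model of the device_check table. It is only a sketch under stated assumptions: the Cluster class, NODE_COUNT and the md5-modulo placement are invented for illustration and are not how Cassandra assigns tokens, but the shape of the two lookups is the same.

```python
import hashlib
from bisect import bisect_right

NODE_COUNT = 3  # assumed cluster size, for illustration only


def node_for(partition_key):
    """Pick a node by hashing the partition key (a stand-in for token ranges)."""
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % NODE_COUNT


class Cluster:
    def __init__(self):
        # One dict per node: partition key -> list of (clustering key, row),
        # kept sorted by clustering key.
        self.nodes = [dict() for _ in range(NODE_COUNT)]

    def insert(self, device_id, checked_at, row):
        partition = self.nodes[node_for(device_id)].setdefault(device_id, [])
        partition.append((checked_at, row))
        partition.sort(key=lambda item: item[0])

    def query_partition(self, device_id, checked_after):
        """WHERE device_id = ? AND checked_at > ?: one node, one sorted partition."""
        partition = self.nodes[node_for(device_id)].get(device_id, [])
        keys = [checked_at for checked_at, _ in partition]
        start = bisect_right(keys, checked_after)
        return [row for _, row in partition[start:]], 1  # rows, nodes touched

    def query_all(self, checked_after):
        """WHERE checked_at > ? alone: every partition on every node is scanned."""
        rows = []
        for node in self.nodes:
            for partition in node.values():
                rows.extend(row for checked_at, row in partition
                            if checked_at > checked_after)
        return rows, len(self.nodes)  # rows, nodes touched


cluster = Cluster()
cluster.insert(1, 10, "a")
cluster.insert(1, 20, "b")
cluster.insert(2, 15, "c")
print(cluster.query_partition(1, 10))  # rows for device 1 only, 1 node touched
print(cluster.query_all(10))           # matching rows from every node, 3 nodes touched
```

With the partition key in the condition, the range on checked_at is a slice of one sorted partition. Without it, every node has to be visited, which is what ALLOW FILTERING asks you to accept explicitly.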

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives a "user lookup" example: we want to look up users by their username or their email, and according to him the best approach is to have two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
);
However, things get confusing with Ignite on top of Cassandra. Unfortunately, I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
age int
);
You should probably have one cache per Cassandra table.
If your original key is compound, the Ignite key should be compound too.
You will need to use secondary indexes in Ignite to query by more than one field, which means you will have to hold all the data in Ignite (this is NOT necessary for a pure caching scenario). That means enabling readThrough and writeThrough, doing loadCache, and always doing all updates through Ignite. You will have to choose between "Ignite as a cache for Cassandra" (stick to Cassandra's data layout, can hold partial data) and "Ignite as a DB backed by Cassandra" (you can use a layout optimal for Ignite, with secondary indexes).

Am I violating the data modelling rule in Cassandra?

I understand that we should not create 'N' partitions under a single table, because a query then has to read from the N nodes where those partitions live.
(Modifying the example for understanding and security)
If I have a table like 'user'
CREATE TABLE user(
user_id int PRIMARY KEY,
user_name text,
user_phone varint
);
where user_id is unique.
Example - To get all the users from the table, I use the query :
select * from user;
This means the query goes to all the nodes where partitions for 'user_id' are stored. Since I used user_id as the partition / primary key here, the rows will be scattered across all the nodes based on the partition key.
Is that fine? Or is there a better way to design this in Cassandra?
Edited:
Keeping a single partition keyed by 'uniquekey', with rows sorted by user_name, has the advantage that all rows live in one partition. Is that a better design compared to the one above?
CREATE TABLE user(
user_id text,
user_name text,
user_phone varint,
primary key (user_id, user_name));
select * from user where user_id = 'uniquekey';
A fundamental table design rule in Cassandra is to be query-driven, which means you should know which queries you need to run before you define the table schema.
If you simply want to return all the rows (select *) in the table (which is not a common use case for Cassandra, since Cassandra aims to store very large amounts of data), whatever you designed is fine. But Cassandra might not be the best choice in this case.
How to ensure a good table design in Cassandra?
Ref: Basic Rules of Cassandra Data Modeling
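The query-driven rule can be pictured as keeping one lookup structure per query you intend to run. The sketch below is a Python analogy only, not Cassandra code: the dictionaries users_by_id and users_by_name and the helper add_user are hypothetical stand-ins for two tables that hold the same rows under different partition keys.

```python
users_by_id = {}    # stands in for a table partitioned by user_id
users_by_name = {}  # stands in for a table partitioned by user_name


def add_user(user_id, user_name, user_phone):
    """Write the same row to both structures, as a denormalized model would."""
    row = {"user_id": user_id, "user_name": user_name, "user_phone": user_phone}
    users_by_id[user_id] = row
    # Names are not unique, so each name maps to a list of rows.
    users_by_name.setdefault(user_name, []).append(row)


add_user(1, "alice", 5550100)
add_user(2, "bob", 5550101)

print(users_by_id[2]["user_name"])           # bob
print(users_by_name["alice"][0]["user_id"])  # 1
```

Each lookup touches exactly one key, at the cost of writing every row twice. That is the trade-off a query-driven schema makes.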

Issue with NoSql data model

As a newbie, I am facing issues with data modelling in Cassandra. We are planning to use Cassandra for reporting. In the reports we need to filter data by multiple parameters. Let's say we have a column family
Create table cf_data
(
Date varchar,
Attribute1 varchar,
Attribute2 varchar,
Attribute3 varchar,
Attribute4 varchar,
Attribute5 varchar,
Attribute6 varchar,
Primary Key(Date)
)
We need to support queries like
Select * from cf_data where date = '2015-02-02' and Attribute1 in ('asdf','assf','asdf') and Attribute1 in ('wewer','werwe') and Attribute2 in ('sdfsd','werwe') and Attribute3 in ('weryewu','ghjghjh')
I know we need to respect the primary key restrictions while querying the column family. Cassandra's internal storage works like
SortedMap<String,SortedMap<Key,Value>>
NoSQL works on the principle of storing denormalized data according to the access pattern. If I need to satisfy the above query, how should I model the column family? From the report UI, the user can select values for Attribute1, Attribute2, Attribute3, etc. from drop-downs. One option could be using Spark on top of the Cassandra nodes to support SQL queries, but it would be better to model the column family as Cassandra expects.
Any pointers?
From the Datastax CQL documentation:
"Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried."
If you need to use Spark to support SQL queries, you may be better off using a proper SQL database. Just because NoSQL is a fad, you don't need to follow it. Not all data can be efficiently modeled in all NoSQL DBs.
Another, inefficient, option is to query without the attribute conditions and do the filtering in the application, at the risk of high response latency. If the reports do not have to be created in real time or near real time, that should be good enough.
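A minimal sketch of that last option, in Python, assuming the rows have already been fetched with a condition on the partition key only and loaded into memory as dictionaries. The function filter_rows and the sample rows are made up for illustration; the attribute values are borrowed from the query in the question.

```python
def filter_rows(rows, allowed):
    """Keep rows whose value for each attribute is in the allowed set.

    `rows` is assumed to be the result of a query on the partition key only,
    already loaded into memory as a list of dicts.
    """
    return [
        row for row in rows
        if all(row.get(attr) in values for attr, values in allowed.items())
    ]


rows_for_date = [
    {"date": "2015-02-02", "attribute1": "asdf", "attribute2": "werwe"},
    {"date": "2015-02-02", "attribute1": "zzz", "attribute2": "werwe"},
    {"date": "2015-02-02", "attribute1": "assf", "attribute2": "other"},
]

selected = filter_rows(rows_for_date, {
    "attribute1": {"asdf", "assf"},
    "attribute2": {"sdfsd", "werwe"},
})
print(selected)  # only the first row satisfies both conditions
```

The cost is that every row for the partition crosses the network before being discarded, which is where the extra latency comes from.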

Hector support for CQL3 specific features (Partition & Clustering keys) and Compact Storage option

I'm trying to leverage a specific feature of Apache Cassandra CQL3: partition and clustering keys for tables that are created with the compact storage option.
For example:
CREATE TABLE EMPLOYEE(id uuid, name text, field text, value text, primary key(id, name , field )) with compact storage;
I've created the table via CQL3 and I'm able to insert rows successfully using the Hector API.
But I couldn't find the right set of options in the Hector API to create the table itself as I require.
To elaborate a little more:
In ColumnFamilyDefinition.java I couldn't see an option for setting the storage option (compact storage), and in ColumnDefinition.java I couldn't find an option to say that a column is part of the partition and clustering keys.
Could you please tell me whether I can use Hector for this (i.e. creating the table) and, if I can, which options I need to provide?
If you are not tied to Hector, you could look into the DataStax Java Driver which was created to use CQL3 and Cassandra's binary protocol.
