Issue with NoSql data model - apache-spark

As a newbie, I am facing issues with data modelling in Cassandra. We are planning to use Cassandra for reporting, and in the reports we need to filter data by multiple parameters. Let's say we have a column family:
Create table cf_data
(
Date varchar,
Attribute1 varchar,
Attribute2 varchar,
Attribute3 varchar,
Attribute4 varchar,
Attribute5 varchar,
Attribute6 varchar,
Primary Key(Date)
)
We need to support query like
Select * from cf_data where date = '2015-02-02' and Attribute1 in ('asdf','assf','asdf') and Attribute1 in ('wewer','werwe') and Attribute2 in ('sdfsd','werwe') and Attribute3 in ('weryewu','ghjghjh')
I know we need to respect the primary key restrictions while querying the column family. Cassandra internal storage works like
SortedMap<String,SortedMap<Key,Value>>
NoSQL works on the principle of storing denormalized data according to the access pattern. If I need to satisfy the above query, how should I model the column family? From the report UI, the user can select values for Attribute1, Attribute2, Attribute3, etc. from drop-downs. One option could be to use Spark on top of the Cassandra nodes to support SQL queries, but it would be better to model the column family the way Cassandra expects.
Any pointers?

From the Datastax CQL documentation:
"Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried."
If you need to use Spark to support SQL queries, you may be better off using a proper SQL database. Just because NoSQL is a fad, you don't need to follow it. Not all data can be efficiently modeled in all NoSQL DBs.
Another, less efficient, option is to query by date only, without the attribute restrictions, and do the filtering in the application, at the risk of high response latency. If the reports do not have to be generated in real time or near real time, that should be good enough.
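The application-side filtering mentioned above could look roughly like the sketch below. It is only an illustration: `filter_rows` and the sample rows are hypothetical, and it assumes the query by date returns many rows, which would need a clustering column that the cf_data table in the question does not define.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]


def filter_rows(rows: Iterable[Row], allowed: Dict[str, Set[str]]) -> List[Row]:
    """Keep rows whose value for every listed column is in that column's allowed set."""
    return [
        row
        for row in rows
        if all(row.get(col) in values for col, values in allowed.items())
    ]


# Hypothetical rows standing in for the result of
# SELECT * FROM cf_data WHERE date = '2015-02-02'
rows = [
    {"date": "2015-02-02", "attribute1": "asdf", "attribute2": "sdfsd"},
    {"date": "2015-02-02", "attribute1": "zzzz", "attribute2": "sdfsd"},
]

# Values the user picked in the report drop-downs (also hypothetical).
selected = filter_rows(
    rows,
    {"attribute1": {"asdf", "assf"}, "attribute2": {"sdfsd", "werwe"}},
)
```

Every row for the date still has to be transferred to the application before it is filtered, which is where the latency risk comes from.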

Related

Cassandra - Shall I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Isn't that expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
My second question: if I update users_by_id, I have to write 5 UPDATE statements. Is there another solution, or is that not so bad?
Cassandra advocates denormalizing your data and building the data model around your queries. You have to design your data model so that it satisfies all the queries with good performance. Because of its architecture and design, Cassandra performs well when you write and read by partition key.
Writing the same set of data to 5 different tables with 5 inserts is not expensive. Your reads will perform better, and as the data size grows to web scale you will be glad you created 5 tables and wrote to all of them.
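To keep the five writes in one place in application code, a small helper can build the statements from a single user record. This is only a sketch: the table names come from the question, while `build_inserts`, the column names and the `?` placeholders (prepared-statement style) are assumptions, and actually sending the statements is left to whatever driver you use.

```python
from typing import Dict, List, Sequence, Tuple

# Table names from the question.
USER_TABLES = (
    "users_by_id",
    "users_by_username",
    "users_by_email",
    "users_by_likes",
    "users_by_followers",
)


def build_inserts(
    user: Dict[str, object], tables: Sequence[str] = USER_TABLES
) -> List[Tuple[str, tuple]]:
    """Return one (statement, values) pair per table for the same user record."""
    columns = sorted(user)  # fixed column order so statements are reproducible
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    values = tuple(user[c] for c in columns)
    return [
        (f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", values)
        for table in tables
    ]


# Hypothetical user record.
statements = build_inserts({"id": 1, "email": "a@example.com"})
```

Each table would normally need only the columns its query uses, so in practice the column list may differ per table.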
You can explore materialized views (see Materialized View and the Datastax link for Materialized View), but remember that it is an experimental feature, so you have to understand it properly and check the open issues with materialized views.
I would recommend studying the Cassandra data model; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. Always design tables for the read queries. If one table satisfies one query, it is a fine design.
To answer your second question: design your tables in such a way that you do not have to update them. Always think in terms of inserting new values.
For example, below table design
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
INSERT INTO user_by_email (email, DateTime.Now ........)
In this design, you should get the latest inserted value. Additionally, this design keeps the change history for that key.
Think about how often we have to update values like user id, email, or username: rarely.
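The insert-only idea can be illustrated with a small in-memory model of the user_by_email table above. It is not driver code: the `UserByEmail` class is hypothetical and only mimics one partition per email with rows kept newest first, as CLUSTERING ORDER BY (timestamp DESC) does.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Tuple

RowT = Tuple[datetime, str]  # (timestamp, name)


class UserByEmail:
    """In-memory stand-in for user_by_email: one partition per email, newest row first."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[RowT]] = defaultdict(list)

    def insert(self, email: str, timestamp: datetime, name: str) -> None:
        # Never update in place: every change is a new row in the partition.
        rows = self._partitions[email]
        rows.append((timestamp, name))
        rows.sort(key=lambda row: row[0], reverse=True)

    def latest(self, email: str) -> Optional[RowT]:
        # Corresponds to reading the partition with LIMIT 1.
        rows = self._partitions.get(email)
        return rows[0] if rows else None

    def history(self, email: str) -> List[RowT]:
        return list(self._partitions.get(email, []))


table = UserByEmail()
table.insert("a@example.com", datetime(2020, 1, 1), "old name")
table.insert("a@example.com", datetime(2021, 1, 1), "new name")
```

Reading the first row of the partition gives the current value, and the remaining rows are the change history.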

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives a "user lookup" example: we want to look up users by their username or their email, and according to him the best approach is to have two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
)
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
)
However things get confusing with Ignite on the top of Cassandra. Unfortunately I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
age int
)
You should probably have one cache per Cassandra table.
If your original key is compound, so should the Ignite key be.
You will need to use secondary indexes in Ignite to query by more than one field, and this means you will have to hold all the data in Ignite (which is NOT necessary for a pure caching scenario). This means enabling readThrough and writeThrough, doing loadCache, and always doing all updates through Ignite. You will have to choose between "Ignite as cache for Cassandra" (stick to Cassandra's data layout, can hold partial data) and "Ignite as DB backed by Cassandra" (you can use a layout optimal for Ignite, with secondary indexes).

Am I violating the data modelling rule in Cassandra?

I understand that we should not create 'N' partitions under a single table, because in that case a query has to go to the N nodes where those partitions live.
(Modifying the example for understanding and security)
If I have a table like 'user'
CREATE TABLE user(
user_id int PRIMARY KEY,
user_name text,
user_phone varint
);
where user_id is unique.
Example: to get all the users from the table, I use the query:
select * from user;
This means it goes to all the nodes where the partitions for 'user_id' are stored. Since I used user_id as the partition / primary key here, the rows will be scattered across all the nodes based on the partition key.
Is that fine, or is there a better way to design this in Cassandra?
Edited :
Using a single constant partition key value such as 'uniquekey' and sorting by user_name has the advantage that all rows end up in a single partition. Is that a better design compared to the one above?
CREATE TABLE user(
user_id int,
user_name text,
user_phone varint,
primary key ('uniquekey', user_name));
select * from user where user_id = 'uniquekey';
A fundamental table design rule in Cassandra is to be query-driven, which means you should understand what you are going to query before you create the table schema.
If you simply want to return all the rows (select *) in the table (which is not a common use case for Cassandra, since Cassandra aims to store very, very large amounts of data), whatever you designed is fine. But Cassandra might not be the best choice in this case.
How to ensure a good table design in Cassandra?
Ref: Basic Rules of Cassandra Data Modeling

How to understand primary key in Apache Cassandra?

I am new to Apache Cassandra. I have installed Cassandra and use cqlsh on my laptop.
I created a table using:
create table userpageview( created_at timestamp, hit int, userid int, variantid int, primary key (created_at, hit, userid, variantid) );
I inserted several rows into the table, but when I tried to select with a condition on each column (one column at a time), I got an error.
Maybe my data modelling is wrong. Can anyone tell me how to do data modelling in Cassandra?
Thanks.
You need to read about partition keys and clustering keys. Cassandra works much differently than relational databases and the types of queries you can do are much more restricted.
Some information to get you started: here and here.
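As a rough illustration of why the single-column selects fail: in the userpageview table, created_at is the partition key and hit, userid, variantid are clustering columns in that order. The sketch below is a deliberately simplified model of the restriction rules (the hypothetical `is_queryable` helper ignores ALLOW FILTERING, secondary indexes, and range or IN restrictions): the partition key must be restricted, and clustering columns can only be restricted from left to right without gaps.

```python
from typing import Sequence, Set


def is_queryable(
    partition_key: Sequence[str], clustering: Sequence[str], restricted: Set[str]
) -> bool:
    """Simplified check: may a WHERE clause restrict exactly these columns by equality?"""
    # Every partition key column must be restricted.
    if not all(col in restricted for col in partition_key):
        return False
    # Clustering columns must be restricted as a prefix, in declared order.
    remaining = set(restricted) - set(partition_key)
    for col in clustering:
        if col not in remaining:
            break
        remaining.discard(col)
    # Anything left over is a gap in the clustering order or a non-key column.
    return not remaining


PARTITION_KEY = ["created_at"]
CLUSTERING = ["hit", "userid", "variantid"]

by_hit_only = is_queryable(PARTITION_KEY, CLUSTERING, {"hit"})
by_created_at = is_queryable(PARTITION_KEY, CLUSTERING, {"created_at"})
by_created_at_and_hit = is_queryable(PARTITION_KEY, CLUSTERING, {"created_at", "hit"})
skipping_hit = is_queryable(PARTITION_KEY, CLUSTERING, {"created_at", "userid"})
```

Under this model only the restrictions that start with created_at and follow the clustering order are accepted, which matches the error seen when filtering on hit, userid or variantid alone.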

How to do join queries with 2 or more tables in Cassandra CQL

I am new to Cassandra. I have two tables, EVENTS and TOWER, and I need to join them for some queries, but I'm not able to do it.
Structure of EVENTS table:
eid int PRIMARY KEY,
a_end_tow_id text,
a_home_circle text,
a_home_operator text,
a_imei text,
a_imsi text,
Structure of TOWER table:
tid int PRIMARY KEY,
tower_address_1 text,
tower_address_2 text,
tower_azimuth text,
tower_cgi text,
tower_circle text,
tower_id_no text,
tower_lat_d text,
tower_long_d text,
tower_name text,
Now, I want to join these tables on EID and TID so that I can fetch the data of both tables.
Cassandra = No Joins. Your model is 100% relational, so you need to rethink it for Cassandra. I would advise you to take a look at these slides; they dig deep into how to model data for Cassandra. There is also a webinar covering the topic. But stop thinking in terms of foreign keys and joining tables, because if you need relations, Cassandra isn't the tool for the job.
But Why?
Because then you would need to check consistency and do many of the other things that relational databases do, and so you would lose the performance and scalability that Cassandra offers.
What can I do?
DENORMALIZE! Lots of data in one table? But the table will have too many columns! So? Cassandra can handle a very large number of columns in a table.
The other thing you can do is to simulate the join in your client application. Match the two datasets in your code, but this will be very slow because you'll have to iterate over all your information.
Another way is to carry out multiple queries. Select the event you want, then the matching tower.
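A client-side join along these lines might look like the sketch below. It assumes both result sets have already been fetched as lists of dictionaries and, as in the question, that an event's eid matches a tower's tid; `join_events_with_towers` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def join_events_with_towers(
    events: Iterable[Dict[str, object]], towers: Iterable[Dict[str, object]]
) -> List[Dict[str, object]]:
    """Match each event to the tower whose tid equals the event's eid."""
    # Index the towers once so each event needs only a dictionary lookup.
    towers_by_id = {tower["tid"]: tower for tower in towers}
    joined = []
    for event in events:
        tower = towers_by_id.get(event["eid"])
        if tower is not None:
            joined.append({**event, **tower})
    return joined


# Hypothetical rows using a few of the columns from the question.
events = [
    {"eid": 1, "a_imei": "imei-1"},
    {"eid": 2, "a_imei": "imei-2"},
]
towers = [
    {"tid": 1, "tower_name": "north"},
    {"tid": 3, "tower_name": "south"},
]

joined = join_events_with_towers(events, towers)
```

The matching takes one pass over each dataset, but both datasets still have to be read in full, so this only makes sense for modest amounts of data.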
There are a couple of ways that you can join tables together in Cassandra and query them. But of course you have to rethink the data model part.
Use Apache Spark’s SparkSQL™ with Cassandra (either open source or in DataStax Enterprise – DSE).
Use DataStax provided ODBC connectors with Cassandra and DSE.
PlayOrm is a good option for doing joins on scalable systems with a special Scalable SQL language in which you can join partitions (i.e. you never want to join 1 billion rows with another billion rows). It has tons of NoSQL patterns and is a complete break from Hibernate and JPA to mimic NoSQL patterns with client-side joins when needed.

Resources