One-to-many mapping in Cassandra

I am new to Cassandra and would like to model a one-to-many mapping between a User and its Vehicles. One user may have multiple vehicles. My User table will contain user details such as name and surname, and my Vehicle table will contain vehicle details.
My select query will fetch all vehicle details for a particular user.
How should I design this in Cassandra?

You can easily model this in a single table:
CREATE TABLE userVehicles (
userid text,
vehicleid text,
name text static,
surname text static,
vehicleMake text,
vehicleModel text,
vehicleYear text,
PRIMARY KEY (userid,vehicleid)
);
This way you can query all vehicles for a single user in one shot, and your user data can be static, so that it is stored once at the partition key level. As long as the number of vehicles per user isn't too large (say, a user with 1000 vehicles), this should work just fine.
The case I have considered above is very simple. But what if my User has a lot of details, around 20 to 30 fields, and the same goes for Vehicle? Would you still suggest a single table, copying the user data for every vehicle?
It depends. Would your use case require returning all of them? If so, then yes, I would still recommend this approach. The way to get the best query performance out of Cassandra is to model your tables to fit your queries. Cassandra works best when it can read a single row by a specific key, or a range of rows (stored sequentially). You want to avoid performing multiple queries, or writing queries that force Cassandra to perform random reads.
What are the consequences of having two separate tables, User and Vehicle, where the Vehicle table has (User_Id, Vehicle_Id) as its primary key?
In a distributed system, network time is the enemy. By having two tables, you are now making two queries, assuming a 1-to-1 ratio of users to vehicles. But if your user has 8 vehicles, you now need 9 queries to build your result. With the design above you can build your result set in 1 query (minimizing network time). Also, with userid as the partition key, that query is guaranteed to be served by one node, as opposed to additional queries for vehicle data, which will most likely require contacting multiple nodes.
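To picture why the single-table design answers the query in one read, here is a toy in-memory sketch in Python. It is not a Cassandra client: the `partitions` dictionary and the helper names are hypothetical stand-ins for the userVehicles table, with the static columns stored once per partition and the vehicle rows keyed by vehicleid.

```python
from collections import defaultdict

# Toy in-memory stand-in for the userVehicles table (not a Cassandra client).
# Each partition holds one copy of the static columns plus its clustered rows.
partitions = defaultdict(lambda: {"static": {}, "rows": {}})


def upsert_user(userid, name, surname):
    # Static columns are stored once per partition key.
    partitions[userid]["static"] = {"name": name, "surname": surname}


def upsert_vehicle(userid, vehicleid, make, model, year):
    # Regular columns are stored once per (userid, vehicleid) row.
    partitions[userid]["rows"][vehicleid] = {
        "vehicleMake": make,
        "vehicleModel": model,
        "vehicleYear": year,
    }


def select_by_user(userid):
    # Mimics: SELECT * FROM userVehicles WHERE userid = ?
    # One partition lookup; rows come back ordered by the clustering key.
    part = partitions.get(userid)
    if part is None:
        return []
    return [
        {"userid": userid, "vehicleid": vid, **part["static"], **cols}
        for vid, cols in sorted(part["rows"].items())
    ]
```

Every returned row carries the user's name and surname, although they were written only once for the partition.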

This seems as simple as having two tables, one holding all of your vehicles data and another one for satisfying your query:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
);
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
-- vehicle_id keeps rows unique when a user owns several vehicles of one type
PRIMARY KEY (user_id, vehicle_type, vehicle_id)
);
Then you would query by:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
or something like this to get all vehicles of a specific type belonging to a particular user:
SELECT * FROM vehicles_to_users WHERE user_id = 9 AND vehicle_type = 1;
This is a solution with denormalized data, and you should always consider that approach instead of having something like:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
);
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
PRIMARY KEY (user_id, vehicle_id)
);
because that design belongs to the relational database world, and you'd have to run N+1 queries to satisfy your requirements: one to get all the ids belonging to a particular user, and then N queries to get the information for each vehicle:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
SELECT * FROM vehicles WHERE vehicle_id = 115;
SELECT * FROM vehicles WHERE vehicle_id = 116;
SELECT * FROM vehicles WHERE vehicle_id = ...;
And don't be tempted to use the IN clause like this:
SELECT * FROM vehicles WHERE vehicle_id IN (115,116,....);
because it would perform even worse, due to the extra work that the coordinator node has to do.
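The N+1 arithmetic above can be written out as a small Python sketch. The function name `queries_needed` is hypothetical and only counts SELECT round trips; it does not talk to a cluster.

```python
def queries_needed(vehicle_count, denormalized):
    """Number of SELECT round trips needed to list one user's vehicles."""
    if denormalized:
        # vehicles_to_users already carries the vehicle columns: one read.
        return 1
    # One read for the ids, then one read per vehicle (the N+1 pattern).
    return 1 + vehicle_count


for n in (1, 8, 100):
    print(n, queries_needed(n, denormalized=False), queries_needed(n, denormalized=True))
```

The normalized count grows with the number of vehicles, while the denormalized design stays at a single read.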

Related

Why am I getting this error when I run the query?

When attempting to perform this query:
select race_name from sport_app.month_category_runner where race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC';
I get the following error:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
It is an exercise, so I am not allowed to use ALLOW FILTERING.
So I have created two indexes in this way:
create index raceTypeIndex ON sport_app.month_category_runner(race_type);
create index clubIndex ON sport_app.month_category_runner(club);
But I keep getting the same error. Am I missing something, or is there an alternative?
Table Structure:
CREATE TABLE month_category_runner (month text,
category text,
runner_id text,
club text,
race_name text,
race_type text,
race_date timestamp,
total_runners int,
net_time time,
PRIMARY KEY (month, category, runner_id, race_name, net_time));
Note that if you add ALLOW FILTERING, the query will run on all nodes of the Cassandra cluster and can have a large impact on all of them.
The recommendation is to add the partition key as a condition in your query, so that the query is executed only on the nodes that hold that partition.
Example:
select race_name from month_category_runner where month = 'may' and club = 'CORNELLA ATLETIC';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC' ALLOW FILTERING;
Your primary key is composed of (month, category, runner_id, race_name, net_time), and the column month is the partition key, so this column must be in your query filter, as shown in the example.
For the query you want, which filters on two columns that are not in the primary key, you need ALLOW FILTERING even though an index exists on each column, and that can have a performance impact.
The other option is to create a new table whose primary key contains these columns.

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((product_id ), updated_at , user_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out the way to only get the last 2 rows of each user in a given time window.
Any advice on query or changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table will serve one query. So to serve the query you are talking about (last two updated rows per user), you should build a table specifically designed to serve it:
CREATE TABLE ratings_by_user_by_time (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((user_id ), updated_at, product_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC );
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 42 LIMIT 2;
Note that you'll need to keep the two ratings tables in sync yourself; using a batch statement is a good way to accomplish that.
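As a rough sketch of that batch, the Python helper below builds one logged batch that writes the same rating to both tables. The helper name `rating_batch` is hypothetical, and the `%s` placeholders assume a driver that binds positional parameters in that style; executing the statement is left out.

```python
from datetime import datetime, timezone


def rating_batch(product_id, user_id, updated_at, rating):
    """Build one CQL batch that writes a rating to both tables."""
    cql = (
        "BEGIN BATCH\n"
        "  INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating)"
        " VALUES (%s, %s, %s, %s);\n"
        "  INSERT INTO ratings_by_user_by_time (user_id, updated_at, product_id, rating)"
        " VALUES (%s, %s, %s, %s);\n"
        "APPLY BATCH"
    )
    # Parameter order follows the column lists of the two INSERTs above.
    params = (
        product_id, updated_at, user_id, rating,
        user_id, updated_at, product_id, rating,
    )
    return cql, params


cql, params = rating_batch(7, 42, datetime(2020, 1, 1, tzinfo=timezone.utc), 5)
print(cql)
```

Both inserts share the same updated_at value, so the row in each table describes the same rating change.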

How to model word search in Cassandra

My model saves search words chosen from checkboxes, and it must support updating the search word and its status, as well as soft deletes. My old model used a uuid (the id of the search word) as the primary key and an index on status (enable, disable, deleted).
But I don't want to index the status column (I think it is very bad to index a column that gets updated), and I don't want to change the database.
Is there a better way to model this?
Sorry for my English grammar.
You should not create an index on a very low cardinality column such as status:
Avoid very low cardinality index e.g. index where the number of distinct values is very low. A good example is an index on the gender of an user. On each node, the whole user population will be distributed on only 2 different partitions for the index: MALE & FEMALE. If the number of users per node is very dense (e.g. millions) we’ll have very wide partitions for MALE & FEMALE index, which is bad
Source : https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
The best ways to handle this type of case:
Create a separate table for each status value
Or use status together with a known parameter (year, month, etc.) as the partition key
Example of the second option:
CREATE TABLE save_search (
year int,
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY((year, status), uuid)
);
Here I have made a composite partition key from year and status because of the low cardinality issue. If you expect a huge amount of data in a single status, you should also add month to the composite partition key.
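A minimal Python sketch of how an application might derive that partition key before writing a row. The function name `partition_key` and the `by_month` flag are hypothetical; the month variant assumes a matching month column has been added to the table.

```python
from datetime import datetime


def partition_key(created_at, status, by_month=False):
    """Return the partition key tuple for a saved search."""
    if by_month:
        # Finer bucket for statuses that collect a very large number of rows.
        return (created_at.year, created_at.month, status)
    # Mirrors PRIMARY KEY ((year, status), uuid).
    return (created_at.year, status)


print(partition_key(datetime(2017, 3, 5), 0))
print(partition_key(datetime(2017, 3, 5), 0, by_month=True))
```

Reading all rows for a status then means querying one partition per year (or per month) that the application cares about.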
If your dataset is small, you can just remove the year field:
CREATE TABLE save_search (
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY(status, uuid)
);
Or
If you are using Cassandra version 3.x or above, you can use a materialized view:
CREATE MATERIALIZED VIEW search_by_status AS
SELECT *
FROM your_main_table
WHERE uuid IS NOT NULL AND status IS NOT NULL
PRIMARY KEY (status, uuid);
You can query with status like :
SELECT * FROM search_by_status WHERE status = 0;
Cassandra propagates every insert, update, and delete on your main table to the materialized view.

Modeling MultiTenant in Cassandra

I have several customers each represented by a "tenant"
I would like to know the best way to model this concept. I did a lot of research and found this topic: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Modeling-multi-tenanted-Cassandra-schema-td7591311.html
I know there are several possibilities
One keyspace by tenant
One table (column family) by tenant
One field representing the tenant in all tables
I chose solution 3, but I'm not sure I have the best schema for the best performance.
This is my profile schema
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY(id, tenant)
);
CREATE INDEX ON profiles(datasources);
CREATE INDEX ON profiles(email);
My PARTITION KEY is "id" for uniqueness, and my CLUSTERING KEY is "tenant".
I need to be able to execute these queries as quickly as possible:
SELECT * FROM profiles WHERE id = x
SELECT * FROM profiles WHERE tenant = x
SELECT * FROM profiles WHERE email = x
SELECT * FROM profiles WHERE datasources CONTAINS x
The queries work, but I wondered if it would be better to have "tenant" as the PARTITION KEY instead of "id", and use "id" as the CLUSTERING KEY:
CREATE TABLE profiles (
...
PRIMARY KEY(tenant, id)
);
In my application "tenant" is always a required field, so making the same queries this way would not be a problem (but is it faster or slower?):
SELECT * FROM profiles WHERE tenant = y
SELECT * FROM profiles WHERE tenant = y AND id = x
SELECT * FROM profiles WHERE tenant = y AND email = x
SELECT * FROM profiles WHERE tenant = y AND datasources CONTAINS x
Bonus advantage: the ability to sort profiles by creation date (ORDER BY id).
If I understand correctly, with tenant as the PARTITION KEY Cassandra will physically store all elements of the same tenant in the same row, which could hold up to 2 billion values. What would happen if one of my customers exceeded that number? I also read that we could use a composite key, for example by putting the current date (20150313) in the second part of the key, to group into one row only the new profiles of that day for the tenant:
CREATE TABLE profiles (
...
date text,
PRIMARY KEY((tenant, date), id)
);
but with this solution it is not possible to query all the data (without a date in the query).
Also, as you can see in my schema, I use secondary indexes on the "email" and "datasources" fields. But I read here http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html that using a secondary index on a huge table to return a small number of results (one in my case) is bad practice. In my schema "datasources" is a set containing, for example, facebookId, twitterId, etc.
If you have any ideas, I'm really interested :) ! I'm pretty new to Cassandra, so if there are things I do not understand, please tell me.
thanks,
Donovan
Data duplication is not a problem with Cassandra, so you should start the data modelling process from your queries.
So, I'm thinking about something like this:
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((id, tenant))
);
Assuming that tenant is known at the application level, this model will make the following query run fast:
SELECT * FROM profiles WHERE id = x and tenant = y
CREATE TABLE profiles_emails (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((email, tenant))
);
SELECT * FROM profiles_emails WHERE email = x and tenant = y
CREATE TABLE profiles_tenants (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant, id))
);
SELECT * FROM profiles_tenants WHERE tenant = x and id = y
CREATE TABLE tenants (
id timeuuid,
tenant text,
date text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY(tenant, date, id)
);
SELECT * FROM tenants WHERE tenant = x and date < y
or you may look at http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
For "datasources"-based search, you may use a different system such as Elasticsearch or Solr. Or, if the set is limited in values, you may maintain a separate table for each of them.
Cassandra is fast at write operations and data duplication is not a problem, so you may write to all those tables in a batch.
You also have to take the consistency level into consideration; it has an impact on READ performance. It really depends on your use case.
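For the date-bucketed key ((tenant, date), id) raised in the question, one possible way to read across days is to have the application enumerate the day buckets and issue one query per bucket. The Python sketch below only generates the bucket values in the question's 20150313 format; the function name `day_buckets` is hypothetical and the per-bucket queries are not shown.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Yield 'YYYYMMDD' strings for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)


# Each value would be bound as the date part of the partition key, e.g.
# SELECT * FROM profiles WHERE tenant = ? AND date = ?
for bucket in day_buckets(date(2015, 3, 12), date(2015, 3, 14)):
    print(bucket)
```

The cost is one query per day in the range, so this suits bounded time windows better than a full scan of a tenant's history.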

Cassandra data model with obsolete data removal possibility

I'm new to Cassandra and would like to ask what the correct model design pattern would be for such a task.
I would like to model data that can be removed in the future.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
I will also need to find transaction details by id.
All the data will be irrelevant after about 30 days, so I need to find a way to delete outdated rows.
As far as I have found, TTLs expire column values, not rows.
So far I have come up with this model, and as I understand it, it will imply really wide rows:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY ((transaction_date, user_name), transaction_id)
);
CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);
but this model does not allow deletions by transaction_date.
It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.
So what would be the correct model for this task?
EDIT:
An ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution. I'm still looking for the right data model.
CREATE TABLE user_transactions_YYYYMMDD (
user_name text,
transaction_id text,
transaction_time timestamp,
transaction_type int,
PRIMARY KEY (user_name)
);
YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
Maybe you should denormalize your data model. For example, to query by user_name you can use a table (column family) like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for the id you can use a second table like this (named transactions_by_id here only to keep it distinct from the first one):
CREATE TABLE transactions_by_id (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (transaction_id)
);
so the query could be something like this:
SELECT * FROM transactions_by_id WHERE transaction_id = 'ID';
This way you don't need indexes.
As for the TTL, maybe you could programmatically ensure that you update all the columns in the row at the same time (in the same CQL statement).
Perhaps my answer will be of some use.
I would have done it like this:
CREATE TABLE user_transactions (
date timestamp,
user_name text,
id text,
type int,
PRIMARY KEY (id)
);
CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
There is no need for 'transaction_time timestamp', because Cassandra records a write time for each column, which can be fetched with the WRITETIME(column name) function. Because you write all the columns simultaneously, you can call this function on any non-primary-key column.
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously, so you do not need to worry about deleting rows. See here: Expiring columns.
But as far as I know, you cannot delete an entire row this way: the key column still remains, and NULL is written to the other columns.
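Note that 86400 seconds is one day, while the question asks for roughly 30 days of retention. A small Python sketch of the conversion follows; the helper names are hypothetical, and the `%s` placeholders assume a driver that binds positional parameters in that style.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400, the TTL value used in the INSERT above


def ttl_seconds(days):
    """Convert a retention period in days to a TTL value in seconds."""
    return days * SECONDS_PER_DAY


def insert_with_ttl(days):
    # Column list follows the user_transactions table from this answer.
    return (
        "INSERT INTO user_transactions (id, date, user_name, type) "
        "VALUES (%s, %s, %s, %s) USING TTL " + str(ttl_seconds(days))
    )


print(ttl_seconds(30))
print(insert_with_ttl(30))
```

Because the TTL is applied by the INSERT to every column it writes, all values of the row expire together.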
If you want to delete the rows manually, or just want an estimate of the rows to be deleted by a TTL, then I recommend the Astyanax driver: AllRowsReader All rows query.
And indeed, as a driver for working with Cassandra, I recommend Astyanax.

Resources