I have several customers, each represented by a "tenant".
I would like to know the best way to model this concept. I did a lot of research and found this topic: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Modeling-multi-tenanted-Cassandra-schema-td7591311.html
I know there are several possibilities:
One keyspace per tenant
One table (column family) per tenant
One field representing the tenant in all tables
I chose solution 3, but I'm not sure I have the best schema for performance.
This is my profile schema:
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY(id, tenant)
);
CREATE INDEX ON profiles(datasources);
CREATE INDEX ON profiles(email);
My PARTITION KEY is "id", for uniqueness, and my CLUSTERING KEY is "tenant".
I need to be able to execute these queries as quickly as possible:
SELECT * FROM profiles WHERE id = x
SELECT * FROM profiles WHERE tenant = x
SELECT * FROM profiles WHERE email = x
SELECT * FROM profiles WHERE datasources CONTAINS x
The queries work, but I wonder whether it would be better to have "tenant" as the PARTITION KEY instead of "id", and use "id" as the CLUSTERING KEY:
CREATE TABLE profiles (
...
PRIMARY KEY(tenant, id)
);
In my application "tenant" is always a required field, so running the same queries this way would not be a problem (but would it be faster or slower?):
SELECT * FROM profiles WHERE tenant = y
SELECT * FROM profiles WHERE tenant = y AND id = x
SELECT * FROM profiles WHERE tenant = y AND email = x
SELECT * FROM profiles WHERE tenant = y AND datasources CONTAINS x
Bonus advantage: the ability to sort profiles by creation date (ORDER BY id)
If I understand correctly, with tenant as the PARTITION KEY Cassandra will physically store all elements of the same tenant in the same row (partition), which can hold up to 2 billion cells. What would happen if one of my customers exceeded that number? I also read that we could use a composite partition key, for example by putting the current date (20150313) in the second part of the key, so that one row groups only the profiles created that day for the tenant:
CREATE TABLE profiles (
...
date text,
PRIMARY KEY((tenant, date), id)
);
but with this solution it is not possible to query all the data (that is, without a date in the query).
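To make the trade-off of the (tenant, date) key concrete, here is a minimal application-side sketch, assuming the day bucket is stored as a YYYYMMDD string as in the example above. Reading a date range then means issuing one query per bucket. The helper names day_buckets and bucket_queries are hypothetical, and the ? placeholders stand for bound parameters.

```python
from datetime import date, timedelta
from typing import List, Tuple


def day_buckets(start: date, end: date) -> List[str]:
    """Return the YYYYMMDD bucket strings from start to end, inclusive."""
    if end < start:
        return []
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def bucket_queries(tenant: str, start: date, end: date) -> List[Tuple[str, Tuple[str, str]]]:
    """Build one (statement, parameters) pair per day bucket.

    Each pair targets exactly one partition: (tenant, date).
    """
    cql = "SELECT * FROM profiles WHERE tenant = ? AND date = ?"
    return [(cql, (tenant, bucket)) for bucket in day_buckets(start, end)]


if __name__ == "__main__":
    for statement, params in bucket_queries("acme", date(2015, 3, 12), date(2015, 3, 14)):
        print(statement, params)
```

The number of queries grows with the length of the date range, which is the price paid for keeping each partition small.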
Also, as you can see in my schema, I use secondary indexes for the "email" and "datasources" fields. But I read here http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html that using a secondary index on a huge table to return a small number of results (one in my case) is bad practice. In my schema "datasources" is a set containing, for example, facebookId, twitterId, etc.
If you have any ideas I'm really interested :) ! I'm pretty new to Cassandra, so if there are things I have misunderstood, please tell me.
thanks,
Donovan
Data duplication is not a problem in Cassandra, so you should start the data modelling process from your queries.
So, I'm thinking about something like this:
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((id, tenant))
);
Assuming that tenant is known at the application level, this model will make the following query run fast:
SELECT * FROM profiles WHERE id = x and tenant = y
CREATE TABLE profiles_emails (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((email, tenant))
);
SELECT * FROM profiles_emails WHERE email = x and tenant = y
CREATE TABLE profiles_tenants (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant, id))
);
SELECT * FROM profiles_tenants WHERE tenant = x and id = y
CREATE TABLE tenants (
id timeuuid,
tenant text,
date timestamp,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant), date, id)
);
SELECT * FROM tenants WHERE tenant = x and date < y
or you may look at http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
For "datasources"-based search, you may use a different system such as Elasticsearch or Solr. Or, if the set has a limited number of values, you may maintain a separate table for each of them.
Cassandra is fast at write operations and data duplication is not a problem, so you may write to all those tables in a batch.
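A minimal sketch of what writing to all those tables in a batch could look like from the application side. It only builds the CQL text and the parameter list; executing it is left to whatever driver you use. The table names come from the schemas above, the column list is shortened for illustration, and build_logged_batch is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Table names are taken from the schemas above; the column list is shortened
# for illustration only.
TABLES = ["profiles", "profiles_emails", "profiles_tenants"]
COLUMNS = ["id", "tenant", "email"]


def build_logged_batch(row: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return a CQL BATCH string with ? placeholders and the flat parameter list."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    column_list = ", ".join(COLUMNS)
    statements = [
        f"  INSERT INTO {table} ({column_list}) VALUES ({placeholders});"
        for table in TABLES
    ]
    cql = "BEGIN BATCH\n" + "\n".join(statements) + "\nAPPLY BATCH;"
    # The same row values are repeated once per table, in statement order.
    params = [row[c] for c in COLUMNS] * len(TABLES)
    return cql, params


if __name__ == "__main__":
    cql, params = build_logged_batch(
        {"id": "<timeuuid>", "tenant": "acme", "email": "a@example.com"}
    )
    print(cql)
    print(params)
```

Keep in mind that a batch touching several partitions is about keeping the copies in sync rather than about speed, since the coordinator has extra work to do.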
You also have to take the consistency level into consideration, since it has an impact on read performance. It really depends on your use case.
Hi I am new to Cassandra.
We are working on an IoT project where car sensor data will be stored in Cassandra.
Here is an example of one table, where I am going to store the data from one of the sensors.
This is some sample data.
The way I want to partition the data is based on the organization_id so that different organization data is partitioned.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However, all my queries will look like the one below:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries on all the tables will have a similar or identical WHERE clause.
I am getting an error that asks me to add "ALLOW FILTERING".
Kindly let me know how to partition the table and define the right primary key and indexes so that I don't have to add "ALLOW FILTERING" to the query.
Apologies for this basic question, but I'm just starting to use Cassandra (Apache Cassandra 3.11.12).
The WHERE clause has to follow the partition and clustering keys defined in your DDL: you must restrict every partition key column, and you cannot skip one part of the primary key and then restrict the next one. Based on the query pattern you have described, you can try the DDL below:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
The two definitions above are different in Cassandra. In case 1, your data will be partitioned by the combination of vin_number and organization_id, while last_updated will act as the ordering (clustering) key. In case 2, your data will be partitioned only by vin_number, while organization_id and last_updated will act as ordering keys. So you need to figure out which case suits your use case.
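The rule can be sketched as a small checker. This is a simplified model written for illustration, not Cassandra's actual validation logic: it only knows equality and range restrictions and ignores secondary indexes, IN, and other special cases. The function name needs_allow_filtering is hypothetical.

```python
from typing import Dict, List


def needs_allow_filtering(partition_key: List[str], clustering: List[str],
                          where: Dict[str, str]) -> bool:
    """where maps a column name to 'eq' or 'range'.

    Simplified rule: every partition key column must be restricted by equality;
    clustering columns must be restricted in order with no gaps, and nothing
    may be restricted after a range; any other column requires filtering.
    """
    if any(where.get(col) != "eq" for col in partition_key):
        return True
    closed = False  # becomes True after a gap or a range restriction
    for col in clustering:
        op = where.get(col)
        if op is None:
            closed = True
        elif closed:
            return True
        elif op == "range":
            closed = True
    known = set(partition_key) | set(clustering)
    return any(col not in known for col in where)


if __name__ == "__main__":
    query = {"vin_number": "eq", "organization_id": "eq", "last_updated": "range"}
    # Original table: PRIMARY KEY ((id, organization_id), vin_number)
    print(needs_allow_filtering(["id", "organization_id"], ["vin_number"], query))
    # Case 1: PRIMARY KEY ((vin_number, organization_id), last_updated)
    print(needs_allow_filtering(["vin_number", "organization_id"], ["last_updated"], query))
    # Case 2: PRIMARY KEY ((vin_number), organization_id, last_updated)
    print(needs_allow_filtering(["vin_number"], ["organization_id", "last_updated"], query))
```

Under this simplified model the original table needs filtering for the query from the question, while both case 1 and case 2 do not.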
I am new to Cassandra and would like to model a one-to-many mapping between a User and its Vehicles. One user may have multiple vehicles. My User table will contain user details like name, surname, etc., and the Vehicle table will hold the vehicle details.
My select query will fetch all Vehicle details for particular User.
How should I design this in Cassandra?
You can easily model this in a single table:
CREATE TABLE userVehicles (
userid text,
vehicleid text,
name text static,
surname text static,
vehicleMake text,
vehicleModel text,
vehicleYear text,
PRIMARY KEY (userid,vehicleid)
);
This way you can query vehicles for a single user in one shot, and your user data can be static so that it is stored at the partition key level. As long as the cardinality of user to vehicle isn't too big (as-in, like a user has 1000 vehicles) this should work just fine.
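As a rough illustration of how an application might consume that one-shot result: every returned row repeats the static user columns next to one vehicle, so the rows can be folded into a single user object. The sketch assumes rows arrive as plain dicts (many drivers can return dict-like rows) and that column names come back lowercased, since unquoted identifiers are case-insensitive in CQL. fold_user_vehicles and the sample values are made up for illustration.

```python
from typing import Dict, List

# Partition key plus the static columns of userVehicles.
USER_COLUMNS = ("userid", "name", "surname")


def fold_user_vehicles(rows: List[Dict[str, str]]) -> Dict[str, object]:
    """Collapse the rows of one userVehicles partition into a nested dict."""
    if not rows:
        return {}
    # Static columns hold the same value on every row, so the first row is enough.
    user: Dict[str, object] = {col: rows[0][col] for col in USER_COLUMNS}
    user["vehicles"] = [
        {k: v for k, v in row.items() if k not in USER_COLUMNS} for row in rows
    ]
    return user


if __name__ == "__main__":
    sample = [
        {"userid": "u1", "name": "Ann", "surname": "Lee", "vehicleid": "v1",
         "vehiclemake": "Ford", "vehiclemodel": "Focus", "vehicleyear": "2012"},
        {"userid": "u1", "name": "Ann", "surname": "Lee", "vehicleid": "v2",
         "vehiclemake": "Fiat", "vehiclemodel": "Panda", "vehicleyear": "2009"},
    ]
    print(fold_user_vehicles(sample))
```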
The case I described above is very simple. But what if my User has a lot of details, around 20 to 30 fields, and the same goes for Vehicle? Would you still suggest having a single table and copying the User data for every vehicle?
It depends. Would your use case require returning all of them? If so, then "yes" I would still recommend this approach. The way to get the best query performance out of Cassandra, is to model your tables to fit your queries. Cassandra works best when it can read a single row by a specific key, or a range of rows (stored sequentially). You want to avoid performing multiple queries or writing queries that force Cassandra to perform random reads.
What are the consequences of having two different tables, User and Vehicle, where the Vehicle table has User_Id and Vehicle_Id as its primary key?
In a distributed system network time is the enemy. By having two tables, you are now making two queries...assuming a 1 to 1 ratio of users to vehicles. But if your user has 8 vehicles, you now need 9 queries to achieve your result. With the design above you can build your result set in 1 query (minimizing network time). Also with userid as a partition key, that query is guaranteed to be served by one node, as opposed to additional queries for vehicle data which will most likely require contacting multiple nodes.
This seems as simple as having two tables, one holding all of your vehicles data and another one for satisfying your query:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
);
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (user_id, vehicle_type, vehicle_id)
);
Then you would query by:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
or something like this to get all vehicles of a specific type belonging to a particular user:
SELECT * FROM vehicles_to_users WHERE user_id = 9 AND vehicle_type = 1;
This is a solution with denormalized data, and you should always consider that approach instead of having something like:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
);
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
PRIMARY KEY (user_id, vehicle_id)
);
because it belongs to the relational databases world and you'd have to run N+1 queries to satisfy your requirements: one to get all the ids belonging to a particular user, and then N queries to get all the information for each vehicle:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
SELECT * FROM vehicles WHERE vehicle_id = 115;
SELECT * FROM vehicles WHERE vehicle_id = 116;
SELECT * FROM vehicles WHERE vehicle_id = ...;
And don't be tempted to use the IN clause like this:
SELECT * FROM vehicles WHERE vehicle_id IN (115,116,....);
because it would perform even worse due to the extra work that the coordinator node has to do.
I am writing a messaging chat system, similar to FB messaging. I have not found a way to store the conversation list effectively (one row per partner user, with the most recently sent message on top). Suppose I list conversations from this table:
CREATE TABLE "conversation_list" (
"user_id" int,
"partner_user_id" int,
"last_message_time" time,
"last_message_text" text,
PRIMARY KEY ("user_id", "partner_user_id")
)
I can select from this table conversations for any user_id. When new message is sent, we can simply update the row:
UPDATE conversation_list SET last_message_time = '...', last_message_text='...' WHERE user_id = '...' AND partner_user_id = '...'
But it is sorted by the clustering key, of course. My question: how can I create a list of conversations that is sorted by last_message_time, while partner_user_id stays unique for a given user_id?
If last_message_time is the clustering key and we delete the row and insert a new one (to keep partner_user_id unique), I will end up with a great many tombstones in the table.
Thank you.
A slight change to your original model should do what you want:
CREATE TABLE conversation_list (
user_id int,
partner_user_id int,
last_message_time timestamp,
last_message_text text,
PRIMARY KEY ((user_id, partner_user_id), last_message_time)
) WITH CLUSTERING ORDER BY (last_message_time DESC);
I combined "user_id" and "partner_user_id" into one partition key. "last_message_time" can be the single clustering column and provide sorting. I reversed the default sort order with the CLUSTERING ORDER BY to make the timestamps descending. Now you should be able to just insert any time there is a message from a user to a partner id.
The select now will give you the ability to look for the last message sent. Like this:
SELECT last_message_time, last_message_text
FROM conversation_list
WHERE user_id= ? AND partner_user_id = ?
LIMIT 1
I'm new to Cassandra and would like to ask what the correct model design pattern would be for this kind of task.
I would like to model data that can be removed later.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
Also I will need to find transaction details by its id.
All the data will be irrelevant after say about 30 days, so need to find a way to delete outdated rows.
As far as I have found, TTLs expire column values, not rows.
So far I have come up with this model, and as I understand it, it will imply really wide rows:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY ((transaction_date, user_name), transaction_id)
);
CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);
but this model does not allow deletions by transaction_date.
It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.
So what would be the correct model for this task?
EDIT:
The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution. I'm still looking for the right data model.
CREATE TABLE user_transactions_YYYYMMDD (
user_name text,
transaction_id text,
transaction_time timestamp,
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
Maybe you should denormalize your data model. For example, to query by user_name you can use a cf (column family) like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for the id you can use a second cf like this (in practice it needs its own name, since both cannot be called user_transactions):
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (transaction_id)
);
so the query could be something like this:
SELECT * FROM user_transactions WHERE transaction_id = 'ID';
This way you don't need indexes.
About the TTL, maybe you could programmatically ensure that you update all the columns in the row at the same time (in the same CQL statement).
Perhaps my answer will be of some use.
I would do it like this:
CREATE TABLE user_transactions (
date timestamp,
user_name text,
id text,
type int,
PRIMARY KEY (id)
);
CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
There is no need for 'transaction_time timestamp', because Cassandra records a write time for each column, which can be fetched with the WRITETIME(column name) function. Because you write all the columns simultaneously, you can call this function on any non-primary-key column.
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously, so you do not need to worry about deleting rows. See here: Expiring columns.
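The TTL is given in seconds, so 86400 above is one day. For the retention of about 30 days mentioned in the question, the value can be computed rather than hard-coded. A small sketch, where ttl_seconds and insert_with_ttl are hypothetical helper names and the statement is only built as text:

```python
from datetime import timedelta
from typing import List


def ttl_seconds(retention_days: int) -> int:
    """Convert a retention period in days to a TTL in seconds."""
    return int(timedelta(days=retention_days).total_seconds())


def insert_with_ttl(table: str, columns: List[str], retention_days: int) -> str:
    """Build an INSERT with ? placeholders that expires after retention_days."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    ttl = ttl_seconds(retention_days)
    return f"INSERT INTO {table} ({cols}) VALUES ({marks}) USING TTL {ttl};"


if __name__ == "__main__":
    print(insert_with_ttl("user_transactions", ["id", "user_name", "date", "type"], 30))
```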
But as far as I know, you cannot delete an entire row this way: the key column still remains, and the other columns will read as NULL.
If you want to delete the rows manually, or just want an estimate of the rows to be deleted by a TTL, then I recommend the Astyanax driver: AllRowsReader All rows query.
And indeed, as a driver for working with Cassandra, I recommend Astyanax.
I want to query data filtering by composite keys other than Row Key in CQL3.
These are my queries:
CREATE TABLE grades (id int,
date timestamp,
subject text,
status text,
PRIMARY KEY (id, subject, status, date)
);
When I try and access the data,
SELECT * FROM grades where id = 1098; //works fine
SELECT * FROM grades where subject = 'English' ALLOW FILTERING; //works fine
SELECT * FROM grades where status = 'Active' ALLOW FILTERING; //gives an error
Bad Request: PRIMARY KEY part status cannot be restricted (preceding part subject is either not restricted or by a non-EQ relation)
Just to experiment, I shuffled the keys around, always keeping 'id' as my primary row key. I am ONLY ever able to query using either the primary row key or the second key. Considering the example above, if I swap subject and status in the primary key list, I can then query by status, but I get a similar error if I try to query by subject or by date.
Am I doing something wrong? Can I not query data using any other composite key in CQL3?
I'm using Cassandra 1.2.6 and CQL3.
That all looks like normal behavior according to the Cassandra composite key model (http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT). The Cassandra data model aims (and this is a general NoSQL way of thinking) to guarantee that queries are performant. That comes at the expense of "restrictions" on the way you store and index your data, and then on how you query it; namely, you always need to restrict the preceding part of the primary key.
You cannot swap elements of the primary key list in your queries (that is more of a SQL way of thinking). You always need to constrain/restrict the previous element of the primary key if you are going to use multiple elements of the composite key. This means that if you have the composite key (id, subject, status, date) and want to query "status", you will need to restrict "id" and/or "subject" ("or" is possible if you use "allow filtering", i.e., you can restrict only "subject" and do not need to restrict "id"). So, if you want to query on "status", you will be able to do it in two different ways:
select * from grades where id = 1093 and subject = 'English' and status = 'Active';
Or
select * from grades where subject = 'English' and status = 'Active' allow filtering;
The first is for a specific "student", the second for all the "students" on the subject in status = "Active".
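One way to picture why the preceding part has to be restricted: within a partition, rows are kept sorted by the clustering columns in primary key order, here (subject, status, date). A restricted prefix maps to one contiguous run of rows, while status on its own does not. The sketch below models that with a sorted Python list; it is an analogy rather than Cassandra's storage code, and the sample rows are made up.

```python
from bisect import bisect_left, bisect_right

# Clustering values of one partition (one id), kept sorted by
# (subject, status, date), the order given in the primary key.
rows = sorted([
    ("English", "Active", "2013-01-10"),
    ("English", "Inactive", "2013-02-01"),
    ("Math", "Active", "2013-01-15"),
    ("Math", "Active", "2013-03-02"),
    ("Science", "Inactive", "2013-01-20"),
])


def slice_by_prefix(sorted_rows, prefix):
    """Rows whose leading clustering values equal prefix form one contiguous run."""
    n = len(prefix)
    keys = [row[:n] for row in sorted_rows]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return sorted_rows[lo:hi]


def scan_by_status(sorted_rows, status):
    """Restricting only status gives no contiguous run, so every row is checked."""
    return [row for row in sorted_rows if row[1] == status]


if __name__ == "__main__":
    print(slice_by_prefix(rows, ("English",)))
    print(slice_by_prefix(rows, ("Math", "Active")))
    print(scan_by_status(rows, "Active"))
```

The full scan in scan_by_status is roughly the kind of work that "allow filtering" asks Cassandra to accept.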