Hi, I'm new to Apache Cassandra and I found an article about the Basic Rules of Cassandra Data Modeling. In example 1, two tables are created:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
);
These tables contain the same data (username, email and age). What I don't understand is how to insert data into the two tables. I think I have to execute two separate inserts: one for table users_by_username and one for table users_by_email. But how do I maintain data consistency between the tables? For example, what if I insert data into the first table and forget to insert it into the second one, or the other way around?
It's your job as a developer to make sure that the data is in sync. However, you can use things like materialized views to generate another "table" with a slightly different primary key (there are some rules on what can be changed). For your case, for example, you can have the following:
CREATE TABLE users_by_username (username text PRIMARY KEY,
email text, age int);
create MATERIALIZED VIEW users_by_email as SELECT * from
users_by_username where email is not null and
username is not null primary key (email, username);
and if you insert data as
insert into users_by_username (username, email, age)
values ('test', 'test#domain.com', 30);
you can query the materialized view in addition to querying by username:
SELECT * from users_by_username where username = 'test' ;
username | age | email
----------+-----+-----------------
test | 30 | test#domain.com
SELECT * from users_by_email where email = 'test#domain.com';
email | username | age
-----------------+----------+-----
test#domain.com | test | 30
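If you would rather keep the two hand-maintained tables from the question instead of a materialized view, one common way to reduce the risk of forgetting one of the writes is to send both inserts in a single logged batch. This is a minimal sketch, assuming users_by_username and users_by_email are both plain tables exactly as defined in the question; the values are the sample ones used above.

```cql
-- Both inserts travel together; if one is applied, the other eventually is too
BEGIN BATCH
  INSERT INTO users_by_username (username, email, age)
  VALUES ('test', 'test#domain.com', 30);
  INSERT INTO users_by_email (email, username, age)
  VALUES ('test#domain.com', 'test', 30);
APPLY BATCH;
```

A logged batch gives you atomicity across the two tables, not isolation, so a reader may briefly see one table updated before the other. Updates and deletes need the same treatment as inserts.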
For example. I have a map under the column 'users' in a table called 'table' with primary key 'Id'.
If the map looks like this, {{'Phone': '1234567899'}, {'City': 'Dublin'}}, I want to get the value from key 'Phone' for specific 'Id', in Cassandra database.
Yes, that's possible to do with CQL when using a MAP collection.
To test this, I created a simple table using the specifications and data you mentioned above:
> CREATE TABLE stackoverflow.usermap (
id text PRIMARY KEY,
users map<text, text>);
> INSERT INTO usermap (id,users)
VALUES ('1a',{'Phone': '1234567899','City': 'Dublin'});
> SELECT * FROM usermap WHERE id='1a';
id | users
----+-------------------------------------------
1a | {'City': 'Dublin', 'Phone': '1234567899'}
(1 rows)
Then, I queried with the same WHERE clause, but altering my SELECT to pull back the user's phone only:
> SELECT users['Phone'] FROM usermap WHERE id='1a';
users['Phone']
----------------
1234567899
(1 rows)
Let's say I have this table:
CREATE TABLE "users" (
username text,
created_at timeuuid,
email text,
firstname text,
groups list<text>,
is_active boolean,
lastname text,
"password" text,
roles list<text>,
PRIMARY KEY (username, created_at)
)
I want to order users by username, which is not possible because ordering is only available on clustering columns. How can I order my users by username?
I need to query users by username, which is why username is the partition key.
What is the right approach here?
If you absolutely must have the username sorted, and return all usernames in one query then you will need to create another table for this effect:
CREATE TABLE "users" (
field text,
value text,
PRIMARY KEY (field, value)
)
Unfortunately, this will put all the usernames in just one partition, but it's the only way of keeping them sorted. On the other hand, you could expand the table to store other values that you need to retrieve in the same way. For instance, the partition field="username" would hold all the usernames, and you could create another partition field="Surname" to hold all the surnames, sorted.
Cassandra is NoSQL, so duplication of data can be expected.
Cassandra distributes data by hashing the partition key value.
So when the data is returned, rows are ordered by the hash values and not by the data itself. Thus, you can't order on the partition key.
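One way to see this ordering for yourself is to select the token next to the partition key. This is a small sketch, assuming the "users" table from the question and the default Murmur3Partitioner.

```cql
-- Rows come back ordered by token(username), not by username itself
SELECT username, token(username) FROM "users";
```

ORDER BY is only accepted on clustering columns, and only when the partition key is restricted with = or IN, which is why sorting across all usernames needs a different table layout.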
Coming back to your question, I'm not sure about what kind of data it is and what kind of query you would want to run. Assuming multiple users per email I'd create the following table:
CREATE TABLE "users" (
username text,
created_at timeuuid,
email text,
firstname text,
groups list<text>,
is_active boolean,
lastname text,
"password" text,
roles list<text>,
PRIMARY KEY (email, username)
)
I have several customers, each represented by a "tenant".
I would like to know the best way to model this concept. I did a lot of research and found this topic: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Modeling-multi-tenanted-Cassandra-schema-td7591311.html
I know there are several possibilities:
1. One keyspace per tenant
2. One table (column family) per tenant
3. One field representing the tenant in all tables
I chose solution 3, but I'm not sure I have the best schema for performance.
This is my profile schema
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY(id, tenant)
);
CREATE INDEX ON profiles(datasources);
CREATE INDEX ON profiles(email);
My PARTITION KEY is "id" for the uniqueness and CLUSTERING KEY "tenant".
I need to be able to execute these queries as quickly as possible:
SELECT * FROM profiles WHERE id = x
SELECT * FROM profiles WHERE tenant = x
SELECT * FROM profiles WHERE email = x
SELECT * FROM profiles WHERE datasources CONTAINS x
The queries work, but I wonder whether it would be better to have "tenant" as the PARTITION KEY instead of "id", and use "id" as the CLUSTERING KEY:
CREATE TABLE profiles (
...
PRIMARY KEY(tenant, id)
);
In my application "tenant" is always a required field, so making the same queries this way would not be a problem (but would it be faster or slower?):
SELECT * FROM profiles WHERE tenant = y
SELECT * FROM profiles WHERE tenant = y AND id = x
SELECT * FROM profiles WHERE tenant = y AND email = x
SELECT * FROM profiles WHERE tenant = y AND datasources CONTAINS x
Bonus advantage: the ability to sort profiles by creation date (ORDER BY id)
If I understand correctly, with tenant as the PARTITION KEY Cassandra will physically store all elements of the same tenant in the same row, which could potentially hold up to 2 billion cells. In that case, what would happen if one of my customers exceeds that number? I also read that we could use a composite key, for example by putting the current date (20150313) in the second part of the key, so that one row groups only the new profiles of the day for the tenant:
CREATE TABLE profiles (
...
date text,
PRIMARY KEY((tenant, date), id)
);
but with this solution it is not possible to query all the data (without a date in the query).
Also, as you can see in my schema, I use secondary indexes on the "email" and "datasources" fields. But I read here http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html that using a secondary index on a huge table for queries that return a small number of results (one in my case) is bad practice. In my schema "datasources" is a set containing, for example, facebookId, twitterId, etc.
If you have any ideas, I'm really interested :) ! I'm pretty new to Cassandra, so if there are things I have misunderstood, please tell me.
thanks,
Donovan
Data duplication is not a problem in Cassandra, so you should start the data modelling process from your queries.
So, I'm thinking about something like this:
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((id, tenant))
);
Assuming that tenant is known at the application level, this model makes the following query fast:
SELECT * FROM profiles WHERE id = x and tenant = y
CREATE TABLE profiles_emails (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((email, tenant))
);
SELECT * FROM profiles_emails WHERE email = x and tenant = y
CREATE TABLE profiles_tenants (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant, id))
);
SELECT * FROM profiles_tenants WHERE tenant = x and id = y
CREATE TABLE tenants (
id timeuuid,
tenant text,
date text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
-- date is a clustering column here so that the range query below is valid
PRIMARY KEY(tenant, date, id)
);
SELECT * FROM tenants WHERE tenant = x and date < y
Or you may look at paging: http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
For "datasources"-based search, you may use a different system such as Elasticsearch or Solr. Or, if the set is limited in values, you may maintain a separate table for each of them.
Cassandra is fast at writes and data duplication is not a problem, so you may write to all those tables in a batch.
You also have to take the consistency level into consideration; it has an impact on read performance, depending on your use case.
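Writing to all those tables in a batch could look roughly like the sketch below. It assumes the profiles, profiles_emails and profiles_tenants tables defined above. The tenant name, email and timeuuid literal are hypothetical sample values, and only a few columns are shown.

```cql
-- Hypothetical sample values; the same id is reused in every statement
BEGIN BATCH
  INSERT INTO profiles (id, tenant, email)
  VALUES (fd24b190-072d-11e3-a1c4-97db6b0653ce, 'tenant1', 'user0#somedomain.com');
  INSERT INTO profiles_emails (email, tenant, id)
  VALUES ('user0#somedomain.com', 'tenant1', fd24b190-072d-11e3-a1c4-97db6b0653ce);
  INSERT INTO profiles_tenants (tenant, id, email)
  VALUES ('tenant1', fd24b190-072d-11e3-a1c4-97db6b0653ce, 'user0#somedomain.com');
APPLY BATCH;
```

Generate the id once in the application rather than calling now() in each statement, since each call produces a different value. Because the statements touch different partitions, the batch is there to keep the tables consistent with each other, not to make the writes faster.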
This may be a naive question, but does the primary key have to be unique in Cassandra?
For example in the following table:
CREATE TABLE users (
name text,
surname text,
age int,
adress text,
PRIMARY KEY(name, surname)
);
So is it possible to have 2 persons in my database with the same name and surname but different ages? That would mean the same primary key.
Yes, the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key.
In your case, you can have 2 rows with the same name or with the same surname, but not both.
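In practice, Cassandra does not reject a second insert with the same primary key; it overwrites the existing row. This is a short sketch of that behaviour, assuming the users table exactly as defined in the question (including its adress column) and hypothetical sample values.

```cql
INSERT INTO users (name, surname, age, adress)
VALUES ('John', 'Smith', 25, 'address1');

-- Same (name, surname): this silently overwrites age and adress of the row above
INSERT INTO users (name, surname, age, adress)
VALUES ('John', 'Smith', 40, 'address2');

-- Only applied if no row with this primary key exists yet
INSERT INTO users (name, surname, age, adress)
VALUES ('John', 'Smith', 40, 'address2') IF NOT EXISTS;
```

The IF NOT EXISTS form is a lightweight transaction: its result includes an [applied] column telling you whether the insert took effect. It costs more than a plain insert.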
By definition, the primary key has to be unique. But that doesn't mean you can't accomplish your goals. You just need to change your approach/terminology.
First of all, if you relax your goal of having the name+surname be a primary key, you can do the following:
CREATE TABLE users ( name text, surname text, age int, address text, PRIMARY KEY((name, surname),age) );
insert into users (name,surname,age,address) values ('name1','surname1',10,'address1');
insert into users (name,surname,age,address) values ('name1','surname1',30,'address2');
select * from users where name='name1' and surname='surname1';
name | surname | age | address
-------+----------+-----+----------
name1 | surname1 | 10 | address1
name1 | surname1 | 30 | address2
If, on the other hand, you wanted to ensure that the address is shared as well, then you probably just want to store a collection of ages in the user record. That could be achieved by:
CREATE TABLE users2 ( name text, surname text, age set<int>, address text, PRIMARY KEY(name, surname) );
insert into users2 (name,surname,age,address) values ('name1','surname1',{10,30},'address2');
select * from users2 where name='name1' and surname='surname1';
name | surname | address | age
-------+----------+----------+----------
name1 | surname1 | address2 | {10, 30}
So it comes back to what you actually need to accomplish. Hopefully the above examples give you some ideas.
The primary key is unique. With your data model, you can only have one age per (name, surname) combination.
Yes, as mentioned in the comments above, you can have a composite key with name, surname, and age to achieve your goal, but that still won't fully solve the problem. Instead, consider adding a new column userId and making it the primary key. Then, even if name, surname and age are all duplicated, you don't have to revisit your data model.
CREATE TABLE users (
userId int,
name text,
surname text,
age int,
adress text,
PRIMARY KEY(userid)
);
I would state specifically that the partition key should be unique. I could not find this stated in one place, but it follows from these statements:
Cassandra needs all the partition key columns to be able to compute
the hash that will allow it to locate the nodes containing the
partition.
The partition key has a special use in Apache Cassandra beyond
showing the uniqueness of the record in the database.
Please note that there will not be any error if you insert same
partition key again and again as there is no constraint check.
Queries that you'll run equality searches on should be in a partition
key.
References
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
how Cassandra chooses the coordinator node and the replication nodes?
Insert query replaces rows having same data field in Cassandra clustering column
Is there a way to revoke a columnfamily update command? I tried to update a column but ended up running "update columnfamily dev ; " and now I see only the ids when I query. But the data seems to still be there when I run nodetool status. I tried to restore a snapshot, but even that did not help.
So if I get you correctly you've erased your column metadata and you now get something like this:
cqlsh:test> select * from user;
uuid
--------------------------------------
fd24b190-072d-11e3-a1c4-97db6b0653ce
054a43d0-072e-11e3-a1c4-97db6b0653ce
0aa71920-072e-11e3-a1c4-97db6b0653ce
07fda400-072e-11e3-a1c4-97db6b0653ce
while you wanted something like this:
uuid | email | name
--------------------------------------+----------------------+-------
fd24b190-072d-11e3-a1c4-97db6b0653ce | user0#somedomain.com | User0
054a43d0-072e-11e3-a1c4-97db6b0653ce | user1#somedomain.com | User1
0aa71920-072e-11e3-a1c4-97db6b0653ce | user3#somedomain.com | User3
07fda400-072e-11e3-a1c4-97db6b0653ce | user2#somedomain.com | User2
You can get the data back by adding the information about the columns.
Given the original table was defined like this:
CREATE TABLE user(
uuid timeuuid PRIMARY KEY,
name varchar,
email varchar
);
You can add missing column information using CQL:
cqlsh:test> ALTER TABLE user ADD email varchar;
cqlsh:test> ALTER TABLE user ADD name varchar;