We are using DataStax Cassandra for our social network, and we are designing/data modeling the tables we need. It is confusing for us, we don't know how to design some tables, and we have run into a few small problems.
As we understand it, in Cassandra we have to create a separate table for each query. For example, say user A is following users B and C.
Now, in Cassandra we have a table called posts_by_user:
user_id | post_id | text | created_on | deleted | view_count | likes_count | comments_count | user_full_name
We also have a table driven by users' followers: whenever someone posts, we insert the post's info into a table called user_timeline, one row per follower, so that when followers visit the first web page we can read their posts from the user_timeline table.
Here is the user_timeline table:
follower_id | post_id | user_id (who posted) | likes_count | comments_count | location_name | user_full_name
First, is this data modeling correct for a follow-based (follower/following) social network?
Now we want to count the likes of a post. As you can see, we keep the number of likes in both tables (user_timeline, posts_by_user). Imagine one user has 1,000 followers: for every like action we would have to update all 1,000 rows in user_timeline plus 1 row in posts_by_user, and that is not reasonable!
So my second question is: how should this be modeled? That is, what should the like (favorite) table look like?
Think of posts_by_user as holding a post's metadata. It would house user_id, post_id, message_text, etc., while you abstract view_count, likes_count, and comments_count into a separate counter table. You could then fetch either a post's metadata or its counters as long as you had the post_id, but each like would only have to update one counter record.
DSE Counter Documentation:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
However, the article below is a really good starting point for Cassandra data modeling. There are a few things to take into consideration when answering this question, many of which depend on the internals of your system and how your queries are structured.
The first two rules are stated as:
Rule 1: Spread Data Evenly Around the Cluster
Rule 2: Minimize the Number of Partitions Read
Take a moment to consider the user_timeline table. There are several reasonable ways to key it:
user_id and created_on as a COMPOUND KEY*: This would be ideal if you wanted to query for posts by a certain user, assuming you have a decent number of users. Records would be distributed evenly, and your queries would only hit one partition at a time.
user_id and a hash_prefix as a COMPOUND KEY*: This would be ideal if you had a small number of users with a large number of posts, since the prefix spreads the data evenly across the cluster. However, you run the risk of having to query across multiple partitions.
follower_id and created_on as a COMPOUND KEY*: This would be ideal if you wanted to query for the posts a certain follower is following. Records would be distributed, and queries across partitions would be minimized.
These were 3 examples for 1 table, and the point I want to convey is: design your tables around the queries you want to execute. Also, don't be afraid to duplicate your data across multiple tables that are set up to handle different queries; this is the way Cassandra was meant to be modeled. Take a bit to read the article below and watch the DataStax Academy Data Modeling Course to familiarize yourself with the nuances. I have also included an example schema below to cover the basic counter layout I pointed out earlier.
* The reason for the compound key is that your PRIMARY KEY has to be unique; otherwise an INSERT with an existing PRIMARY KEY becomes an UPDATE.
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
https://academy.datastax.com/courses
CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
user_id uuid,
post_id uuid,
message_text text,
created_on timestamp,
deleted boolean,
user_full_name text,
-- partition on user_id so one user's posts can be read from a single partition;
-- created_on as a clustering column gives uniqueness and time ordering
PRIMARY KEY ((user_id), created_on)
);
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
follower_id uuid,
post_id uuid,
user_id uuid,
location_name text,
user_full_name text,
created_on timestamp,
-- partition on follower_id so each follower's timeline is one partition,
-- ordered by created_on
PRIMARY KEY ((follower_id), created_on)
);
CREATE TABLE IF NOT EXISTS social_media.post_counts (
post_id uuid,
-- in a counter table, every non-key column must be a counter
likes_count counter,
view_count counter,
comments_count counter,
PRIMARY KEY (post_id)
);
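With the counters split out, a like is a single increment no matter how many followers a user has. A minimal sketch of the writes and reads involved (the UUID literal is just a placeholder):
-- one like == one counter update, regardless of follower count
UPDATE social_media.post_counts SET likes_count = likes_count + 1
WHERE post_id = 9b1de3f0-0e1a-11e6-b512-3e1d05defe78;
-- fetch all counters for a post by its id
SELECT likes_count, view_count, comments_count FROM social_media.post_counts
WHERE post_id = 9b1de3f0-0e1a-11e6-b512-3e1d05defe78;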
I have the following table structure:
CREATE TABLE test_keyspace.persons (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (id, country, address)
);
My main scenario is to get a person by id. But sometimes I also want to get all the cities in a country, and all the persons in a city.
I know that Cassandra must have at least one partition key and zero or more clustering keys, but I don't understand how to organize them so that this works effectively (and works at all).
Can anybody give me advice?
So it sounds like you want to be able to query by both id and country. Typically in Cassandra, the way to build your data models is a "one table == one query" approach. In that case, you would have two tables, just keyed differently:
CREATE TABLE test_keyspace.persons_by_id (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (id));
TBH, you don't really need to cluster on country and address, unless a person can have multiple addresses. But a single PK is a completely legit approach.
For the second table:
CREATE TABLE test_keyspace.persons_by_country (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (country, city, id));
This will allow you to query by country, with persons grouped/sorted by city and sorted by id. In theory, you could also serve the query by id approach here, as long as you also had the country and city. But that might not be possible in your scenario.
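For example, both lookup patterns this table serves (values are placeholders):
SELECT * FROM test_keyspace.persons_by_country WHERE country = 'US';
SELECT * FROM test_keyspace.persons_by_country WHERE country = 'US' AND city = 'Boston';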
Duplicating data in Cassandra (NoSQL) to help queries perform better is ok. The trick becomes keeping the tables in sync, but you can use the BATCH functionality to apply writes to both tables atomically.
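A minimal sketch of such a batch, with placeholder values. Note that the same id must land in both tables, so use one literal (or a bound value from your driver) rather than calling uuid() twice:
BEGIN BATCH
INSERT INTO test_keyspace.persons_by_id (id, country, city, address, phone_number)
VALUES (1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'US', 'Boston', '123 Main St', '555-0100');
INSERT INTO test_keyspace.persons_by_country (id, country, city, address, phone_number)
VALUES (1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'US', 'Boston', '123 Main St', '555-0100');
APPLY BATCH;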
In case you haven't already, you might benefit from DataStax's (free) course on data modeling - Data Modeling with Apache Cassandra and DataStax Enterprise.
Let's say I have customer order data coming into my service, and I would like to do some reporting on it. All customer orders are saved in a Cassandra table so that I can get all orders for a given customer:
CREATE TABLE customer_orders (
store_id uuid,
customer_id text,
order_id text,
order_amount int,
order_date timestamp,
-- order_id as a clustering column so a customer can have many orders
PRIMARY KEY ((store_id), customer_id, order_id)
);
But I would also like to find all the customers with a given number of orders. Ideally, I would like to have this in a ready-to-query table in Cassandra. For example: "get all customers who have 1 order".
Therefore I have a table like this:
CREATE TABLE order_count_to_customer (
store_id uuid,
order_count int,
customer_id text,
PRIMARY KEY ((store_id, order_count), customer_id)
);
So the idea is that when an order arrives, both of these tables are updated.
So I create a third table:
CREATE TABLE customer_to_orders_count (
store_id uuid,
customer_id text,
orders_count counter,
-- a counter column cannot be part of the PRIMARY KEY;
-- key by customer so each customer has one counter row per store
PRIMARY KEY ((store_id), customer_id)
);
When an order arrives:
I save it in the first table.
Then I update the counter in the third table by incrementing it by 1.
Then I read the counter in the third table and insert a new record into the second table.
When I need to find all the customers with a given number of orders, I just query the second table.
The problem with this is that counters are not atomic and consistent. If I update the counter to, say, 3, there is no guarantee that when I next read it in order to update the second table it will still be 3; it could be 2. Even if I read the counter before doing the update, it could be a value from several steps back. So there is no guarantee either way.
Please note that I am aware of the limitations of the counters in Cassandra and I am not asking how to solve the issue with the counters.
I am rather giving this example in order to ask for general advice on how to model the data so as to be able to do aggregate counting on it. I could of course use Spark to run aggregate queries directly against the first table in my example. But it seems to me that there could be a more clever way to do this, and Spark would also involve bringing the whole table's data into memory.
Have you thought about using the CQL BATCH command? https://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
You can use this to group your steps into one logical, atomic unit, where either they all succeed or they all fail. However, this functionality does come with a performance penalty.
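One caveat worth noting: counter updates cannot be mixed with regular writes in the same logged batch; counters get their own COUNTER BATCH. A rough sketch with placeholder values (and the read in your step 3 still happens outside any batch, which is why the race you describe remains):
-- regular writes grouped atomically
BEGIN BATCH
INSERT INTO customer_orders (store_id, customer_id, order_id, order_amount, order_date)
VALUES (1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'cust-1', 'order-42', 100, '2016-01-01 00:00:00+0000');
INSERT INTO order_count_to_customer (store_id, order_count, customer_id)
VALUES (1b4e28ba-2fa1-11d2-883f-0016d3cca427, 3, 'cust-1');
APPLY BATCH;
-- counter updates go in a separate counter batch
BEGIN COUNTER BATCH
UPDATE customer_to_orders_count SET orders_count = orders_count + 1
WHERE store_id = 1b4e28ba-2fa1-11d2-883f-0016d3cca427 AND customer_id = 'cust-1';
APPLY BATCH;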
I have to design a web page where a group leader can invite people to join his/her group. My requirements are really simple.
No sending duplicate emails out, if person was already contacted.
Show the group leader a list of invites sorted by invite date in ascending order.
Seems easy. I created this table.
CREATE TABLE invites (
email_address text,
invite_date timeuuid,
PRIMARY KEY (email_address, invite_date)
) WITH CLUSTERING ORDER BY (invite_date ASC);
Problem 1: LWT is no use with invite_date as a clustering column.
I figured I'd use LWT to ensure email_address is unique, only to find out IF NOT EXISTS works against the whole PRIMARY KEY, so LWT in C* does not work for me.
Problem 2: I cannot get an ordered list of invites back to save my life, even with invite_date as a clustering column.
If I take invite_date out, I cannot issue an ORDER BY in CQL. That said, having invite_date out of the PK lets me use LWT...
I can't even get a 2 column table to fulfill 2 easy requirements! Any help on data modeling design for this problem is much appreciated.
Update, Dec. 4, 2015:
In addition to the business requirements, I have a technical requirement:
I want to make sure I model this correctly in Cassandra, so that it allows me to use CQL's LIMIT and the pagingState capabilities of the Java driver. This means I cannot just read all the rows in, sort them on the Java side, and return the results.
Problem 1:
I think that the easiest way to handle this might be to have two separate tables, one for emails_in_group and one for invites_by_group. This will allow each query to be fulfilled independently. The emails_in_group table would look something like this:
CREATE TABLE emails_in_group (
email_address text,
group_id text,
PRIMARY KEY (email_address, group_id));
Then this, combined with the table defined in Problem 2 below, could be updated using a conditional batch statement, as shown here:
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use-batch-static.html
Problem 2:
So the basic problem here is that, as you currently have your data modeled, each email_address value sits in its own partition, and within that partition the invite_date values are ordered. @bydsky is right in saying that you need to add something like a group_id to your table and make it the partition key portion of your PRIMARY KEY. If you do this and then add invite_date as a clustering column, all records for that group_id will be stored in the same partition and the ORDER BY will work. ORDER BY only works within a single partition, not across partitions, which is what you were asking it to do.
CREATE TABLE invites_by_group (
group_id text,
email_address text,
invite_date timestamp,
-- email_address as a final clustering column keeps two invites
-- sent at the same timestamp from overwriting each other
PRIMARY KEY ((group_id), invite_date, email_address));
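A sketch of the write path with these two tables (placeholder values). Since a batch with conditions cannot span multiple tables, the uniqueness check runs first, and the invite row is written only if the LWT applied:
-- step 1: claim the address; IF NOT EXISTS makes this a lightweight transaction
INSERT INTO emails_in_group (email_address, group_id)
VALUES ('a@example.com', 'group-1')
IF NOT EXISTS;
-- step 2: if step 1 applied, record the invite for ordered listing
INSERT INTO invites_by_group (group_id, invite_date, email_address)
VALUES ('group-1', '2015-12-04 10:00:00+0000', 'a@example.com');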
I think you should use group_id and email_address as the primary key. For ordering, maybe you could handle it in your application.
CREATE TABLE invites (
group_id text,
email_address text,
invite_date timestamp,
PRIMARY KEY (group_id, email_address));
For Cassandra data modeling, the DS220 course is a good place to start.
I am trying to understand whether it will be a performance issue if I choose:
Option 1: a very high cardinality column (order_id) as the partition key, with secondary indexes on store_id and status. (I can query on order_id, store_id, status, or both store_id and status, and, importantly, update status based on order_id.)
Option 2: store_id as the partition key and the very high cardinality column (order_id) as the clustering key, with a secondary index on status so that I can filter on status. (I can query on store_id, store_id and order_id, or store_id and status, and also update status based on store_id and order_id.)
I would like to know what the performance issues in the above scenarios would be, and which one is the better option. Thank you very much for your help and time.
Option 1 is interesting, but you need to be careful with your indexes. See your other question for more information (especially the bit concerning querying multiple secondary indexes at the same time). That may be alleviated with tables purpose-built for your index lookups (discussed further below).
The advantage of the highly unique partition key is that data will be more evenly distributed around your cluster. The downside is that when you perform a request with WHERE store_id = 'foo', all nodes in the cluster need to be queried, as there is no restriction on the partition key.
Option 2 you must be careful with. If your partition key is just store_id, then every order will be placed within the same partition. For each order, n columns are added to the single store partition, one per attribute on the order. In terms of data location, all orders for a given store will be placed on the same Cassandra node.
In both cases, why not pursue a lookup table for orders by status? This removes your need for a secondary index on that field, especially given its relatively small cardinality.
CREATE TABLE orders_by_store_id_status (
store_id VARCHAR,
status VARCHAR,
order_id VARCHAR,
... <additional order fields needed to satisfy your query> ...
PRIMARY KEY ((store_id, status), order_id)
);
This would allow you to query for all orders with a given store_id and status.
SELECT * FROM orders_by_store_id_status WHERE store_id = 'foo' AND status = 'open';
The read is fast as the partition key limits the number of nodes we perform the query against.
I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example, if a User column family has id, name, and birthday columns, I would like to update all the users who were born after some specific day.
Even if I add birthday to the primary key (let's say (id, birthday)), I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident, but because such a feature (e.g. UPDATE users SET status = 1 WHERE id > 10) would let a user update all the data in a table at once, which can be extremely expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing users' primary keys, you can run queries like UPDATE users SET status = 'foo' WHERE id IN (id1, id2, id3). But queries with really large sets of keys inside an IN clause may cause performance issues on C*.
But how can you run an efficient range query like SELECT id FROM some_table WHERE dob > '2000-01-01 00:00:01'? There are two options available, and neither of them is really attractive:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound partition+clustering primary key, and use multiple queries like SELECT * FROM dob_index WHERE year = 2014 AND dob < '2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get reasonably even partition distribution in the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic in a way that avoids updating/deleting data at all:
Instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
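With this model, a status change is just an append, and the latest status is simply the newest row in the partition. A minimal sketch (the timeuuid literal is a placeholder):
-- append a new status instead of updating in place
INSERT INTO user_statuses (id, updated_at, status)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, '2015-07-01 10:00:00+0000', 'active');
-- read the most recent status for a user
SELECT status FROM user_statuses
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
ORDER BY updated_at DESC LIMIT 1;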
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.