Table layout for a social app in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I was trying to see if we can avoid data de-normalization by using YB's secondary indexes. The primary table is something like below:
CREATE TABLE posts_by_user (
    user_id bigint,
    post_id bigserial,
    group_ids bigint[] null,
    tag_ids bigint[] null,
    content text null,
    ....
    PRIMARY KEY (user_id, post_id)
)
-- there can be multiple group IDs (up to 20) that a user can select to publish their post in
-- there can be multiple tag IDs (up to 20) that a user can select to publish their post with
This structure makes fetching by user_id easy. But suppose I want to fetch by group_id(s) or tag_id(s). Then I would either need to de-normalize the data into secondary tables using a YB transaction, which requires additional app logic and could also lead to performance issues because the data will be written to multiple nodes based on the hash primary keys (group_ids and tag_ids).
Or I could use a secondary index to avoid writing that additional logic. I have the following doubts regarding that:
YB stable version 2.8 does not allow creating a secondary index on array columns using GIN. Are there any rough timelines for when this will be available in a stable release?
Will this also suffer from the same performance issue, since multiple index entries will be updated at the time of the client call, on multiple nodes, based on the partition keys group_id(s) or tag_id(s)?
Other ideas for storing the data to enable fast queries by user_id(s), group_id(s), and tag_id(s) in a scalable way are also most welcome.

The problem with the GIN index is that it won't be sorted on disk by the timestamp.
You have to create an index on (user_id, datetime DESC).
For groups, you can maintain a separate table with a primary key of (group_id DESC, datetime DESC, post_id DESC). The same goes for tags.
Then, on each feed request, you can make multiple queries for, say, 5 posts per user_id or group_id, and merge them in the application layer.
This will be the most efficient approach, since all records will be sorted on disk and in memory at write time.
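For concreteness, here is a rough YSQL sketch of that layout. It assumes a created_at timestamp column, which is not part of the original posts_by_user schema, and it hash-partitions the per-group table on group_id as one reasonable reading of the advice above; all names are illustrative:

-- per-user feed index, newest first (created_at is an assumed column)
CREATE INDEX posts_by_user_time
    ON posts_by_user (user_id, created_at DESC);

-- per-group feed table, kept sorted at write time
CREATE TABLE posts_by_group (
    group_id bigint,
    created_at timestamptz,
    post_id bigint,
    user_id bigint,
    PRIMARY KEY (group_id HASH, created_at DESC, post_id DESC)
);

-- fetch the 5 most recent posts of one group, then merge in the app
SELECT post_id, user_id, created_at
FROM posts_by_group
WHERE group_id = 42
ORDER BY created_at DESC, post_id DESC
LIMIT 5;

A posts_by_tag table would follow the same shape, and the application would run one such LIMIT 5 query per followed user/group/tag and merge the results.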

Related

Cassandra - Do I have to do so many writes?

I have 5 tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Isn't that expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, I have to write 5 UPDATE statements. Is there another solution? Or is it not that bad?
Cassandra advocates denormalizing your data and creating your data model according to your queries. You will have to design your data model such that it satisfies all the queries with good performance. For performance (due to its architecture and design), Cassandra asks for writing and reading using the partition key.
It is not expensive to write 5 insertions of the same set of data into 5 different tables. Your reads will perform better, and as your data grows to web scale, you will thank your decision to create 5 tables and write to them.
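If you also want the five writes to succeed or fail as a unit, they can be grouped in a logged batch. A minimal sketch, assuming a reduced column set (real tables would carry more fields; only three of the five tables are shown):

BEGIN BATCH
    INSERT INTO users_by_id (id, username, email)
        VALUES (7f3a6d8e-0000-0000-0000-000000000001, 'jdoe', 'jdoe@example.com');
    INSERT INTO users_by_username (username, id, email)
        VALUES ('jdoe', 7f3a6d8e-0000-0000-0000-000000000001, 'jdoe@example.com');
    INSERT INTO users_by_email (email, id, username)
        VALUES ('jdoe@example.com', 7f3a6d8e-0000-0000-0000-000000000001, 'jdoe');
APPLY BATCH;

The remaining two tables would get the same INSERT pattern inside the batch.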
You can explore materialized views (see the Cassandra and DataStax documentation on materialized views), but remember it is an experimental feature. So you have to understand it properly and also identify the open issues with materialized views.
I would recommend you study the Cassandra data model; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. One should always design tables for the read queries. If one table satisfies one query, it is a fine design.
As for your second question: you should design your tables in such a way that you do not have to update them. Always think about inserting new values.
For example, consider the table design below:
CREATE TABLE user_by_email (
    email text,
    timestamp timestamp,
    name text,
    fullname text,
    userId text,
    PRIMARY KEY (email, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- example values; each change inserts a new row rather than updating
INSERT INTO user_by_email (email, timestamp, name, fullname, userId)
VALUES ('jdoe@example.com', toTimestamp(now()), 'Jane', 'Jane Doe', 'u123');
With this design, you always get the latest inserted value first. Additionally, this design keeps the change history for that key.
Think about how many times we have to update values like user ID, email, or username? Rarely.
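For example, because rows are clustered by timestamp DESC, the current value for a key is simply the first row (values below are illustrative):

-- latest value wins: the newest row sorts first within the partition
SELECT name, fullname, userId
FROM user_by_email
WHERE email = 'jdoe@example.com'
LIMIT 1;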

Cassandra secondary vs extra table and read

I'm facing a dilemma that my limited knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data for an item (a notification) using an external ID. However, the data contained in that table (in this case the status of the notification) is modified, so I need to update the index table as well. Here is the table design:
CREATE TABLE notification_by_external_id (
    external_id text,
    partition_key_date text,
    id uuid,
    status text,
    ...
    PRIMARY KEY (external_id, partition_key_date, id)
);
CREATE TABLE notification (
    partition_key_date text,
    status text,
    id uuid,
    ...
    PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with 2 solutions, neither of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about something around 5-10 to one, tops.
Solution 2
Create a table:
CREATE TABLE external_id_notification (
    notification_id uuid,
    external_id text,
    PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and of course maintaining another table), which I understand is also a bad practice.
The thing to understand about secondary indexes is that their scalability issue is not with the number of rows in the table, but with the number of nodes in your cluster. A SELECT on an indexed column means that every single node will have to process it and respond to it, even though each node by itself will be able to process the SELECT efficiently.
Use secondary indexes for administrative purposes (i.e., ad-hoc queries from cqlsh) only. Do not use them for production purposes.
That being said, you could duplicate all the information into your external_id_notification table. That would alleviate the need for an extra read operation. I know that relational databases taught you that duplicate data is bad (what if it differs?), and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra you should always go for it, unless you absolutely cannot.
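A sketch of that duplicated table, with the mutable columns copied in so the lookup needs no second read (the column choices are assumptions based on the schemas above):

CREATE TABLE external_id_notification (
    notification_id uuid,
    external_id text,
    partition_key_date text,   -- duplicated from the notification table
    status text,               -- duplicated, so no extra read is needed
    PRIMARY KEY (notification_id, external_id)
);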

How to maintain data consistency across multiple tables in cassandra?

I'm having trouble figuring out how to propagate attribute updates across multiple tables to ensure data consistency.
For example, suppose I have a many-to-many relationship between actors and fans. A fan can support many actors, and an actor can have many fans. I made several tables to support my queries:
CREATE TABLE fans (
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY ((fan_id))
);
CREATE TABLE actors (
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY ((actor_id))
);
CREATE TABLE actors_by_fan (
    fan_id uuid,
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY (fan_id, actor_id)
);
CREATE TABLE fans_by_actor (
    actor_id uuid,
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY (actor_id, fan_id)
);
Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1 to a different value.
On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.
However, I cannot change my fan_attr_1 in fans_by_actor without first querying for the actor_ids tied to the fan.
This problem occurs any time you want to update any attribute of either fans or actors.
I've tried looking online for people experiencing similar problems, but I couldn't find any. For example, in DataStax's data modeling course they use examples with actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discussed modeling tables after queries, but didn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video IDs beforehand, but I read elsewhere that reads before writes are an antipattern in Cassandra.
What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?
Data Modeling
Cassandra is not a relational database, and there are certain basic rules that need to be followed in data modeling. At a high level, the following goals need to be met by our data model:
1) Spread data evenly around the cluster
2) Minimize the number of partitions read
Moreover, we should go for a single big table rather than breaking it into multiple tables and adding relationships between them. With this approach, duplication of records will occur. Duplication of records is not a costly operation, since it only takes a little more disk space rather than CPU, memory, disk IOPS, or network.
Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64 KB. The maximum column value size is 2 GB. But because there is no streaming and the whole value is fetched into heap memory when requested, limit the size to only a few MBs.
More Info:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
CQL
Maintaining consistency across tables can be done using BATCH statements or materialized views. Materialized views are available from version 3.0.
Please see: How to ensure data consistency in Cassandra on different tables?
My preference would be to change the data model, design it according to our queries, and if possible make it a single big table.
Hope it helps!
Materialized views are probably the best choice (note that CQL requires the IS NOT NULL filters on all primary key columns):
CREATE MATERIALIZED VIEW actors_by_fan AS
    SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
    FROM fans
    WHERE fan_id IS NOT NULL AND actor_id IS NOT NULL
    PRIMARY KEY (fan_id, actor_id);
CREATE MATERIALIZED VIEW fans_by_actor AS
    SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
    FROM actors
    WHERE actor_id IS NOT NULL AND fan_id IS NOT NULL
    PRIMARY KEY (actor_id, fan_id);
In versions prior to 3.0, create secondary indexes and evaluate whether their performance is acceptable. Later, after upgrading to 3.x, just drop the secondary indexes and create materialized views.
The way you solve these kinds of problems is to manually update all the changed records.
Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:
Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
Select all the actor_ids from the actors_by_fan by issuing SELECT actor_id ... WHERE fan_id = xxx.
Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...), or alternatively loop over the actor_ids and run each update async.
As long as you have a small number of actor_ids in step 2, say fewer than 20, you can group all the queries and maintain strong consistency between the tables by running them in a single BATCH. Otherwise, you need to guarantee consistency between the tables in some other way.
This can be as inefficient as it sounds, but I don't think there are smarter solutions. By the way, you are issuing one read (step 2) and multiple writes (steps 1 and 3). This won't be the end of the world, especially if you don't change attributes too often (e.g., every 10 milliseconds).
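Put together, the three steps look roughly like this in CQL (the UUIDs and the value 42 are placeholders):

-- step 1: update the fan's own row
UPDATE fans SET fan_attr_1 = 42
WHERE fan_id = 11111111-1111-1111-1111-111111111111;

-- step 2: find the actors tied to this fan
SELECT actor_id FROM actors_by_fan
WHERE fan_id = 11111111-1111-1111-1111-111111111111;

-- step 3: fan out the change, batched while the actor list stays small
BEGIN BATCH
    UPDATE fans_by_actor SET fan_attr_1 = 42
        WHERE actor_id = 22222222-2222-2222-2222-222222222222
        AND fan_id = 11111111-1111-1111-1111-111111111111;
    UPDATE fans_by_actor SET fan_attr_1 = 42
        WHERE actor_id = 33333333-3333-3333-3333-333333333333
        AND fan_id = 11111111-1111-1111-1111-111111111111;
APPLY BATCH;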

Updates in cassandra

Cassandra data modeling respects the principle that "denormalization and duplication of data is a fact of life with Cassandra". But one of the cons of denormalized data is that it makes updates very hard. For example, if I have three tables catering to different queries, selecting is fine. However, if in my app I want to update a username, do I need to update all three tables? The update on the first table looks OK. How about the latter two? Are the updates going to be very expensive? How should I handle this case?
CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
);
CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
);
CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    hash_prefix int,
    PRIMARY KEY ((groupname, hash_prefix), username)
);
This is a typical problem I see when people try to put a relational model into Cassandra that gets updated over time. Cassandra is a great database, and for what it does, it works wonders. There are many features that enable all kinds of different data models, and you can cover almost all use cases. When you look at your use case, the question is: why would you use Cassandra for a relational model?
If you really want to make Cassandra cover your use case, you will have to do a lot of different operations at the application level just to execute updates and keep your data in a consistent state.
After watching a few YouTube clips, it looks like Cassandra's update is a simple write that appends a record to the commit log in the file system. The data is then put into a memtable on the Cassandra server, and an acknowledgement is sent to the client straight away, at which point the update call finishes. This makes updates fast from the client's perspective.
The whole compaction process happens afterwards, including flushing, sequential writing, and merging based on timestamps.

Cassandra/Redis: Way to create feed without Cassandra 'IN' secondary index?

I'm having a bit of an issue integrating my application's functionality with Cassandra. I'm trying to create a content feed for my users. Users can create posts which, in turn, have the field user_id. I'm using Redis for the entire social graph and using Cassandra columns solely for objects. In Redis, user 1 has a set named user:1:followers containing all of his/her follower IDs. These follower IDs correspond to the Cassandra IDs in the users table and the user_ids in the posts table.
My goal was originally to simply plug all of the user_ids from this Redis set into a query that would use FROM posts WHERE user_id IN (user_ids here) and grab all of the posts via the secondary index on user_id. The issue is that Cassandra purposely does not support the IN operator on secondary indexes, because such a query would force Cassandra to search ALL of its nodes for that value. I'm left with only two options as far as I can see: either create a Redis list user:1:follow_feed for the post IDs and then search Cassandra's primary index for those posts in a single query, or keep it the way I have it now and run an individual query for every user_id in the user:1:followers set.
I'm really leaning against the first option because I already have tons and tons of graph data in Redis, and this option would add a new list for every user. The second way is far worse: I would put a massive read load on Cassandra, and it would take a long time to run individual queries for a set of IDs. I'm kind of stuck between a rock and a hard place, as far as I can see. Is there any way to query secondary indexes with multiple values? If not, is there a more efficient way (RAM- and speed-wise) to load these content feeds, compared to the options of more Redis lists or multiple Cassandra queries? Thanks in advance.
Without knowing the schema of the posts table (and preferably the others as well), it's really hard to make any useful suggestions.
It's unclear to me why you need user_id to be a secondary index, as opposed to part of your primary key.
In general it's quite useful to key content like posts off of the user that created it, since it allows you to do things like retrieve all posts (optionally over a given range, assuming they are chronologically sorted) very efficiently.
With Cassandra, if you find that a table can effectively answer some of the queries that you want to perform but not others, you are usually best off denormalizing that table and creating another table with a different structure, in order to keep your queries to a single CQL partition and node.
CREATE TABLE posts (
    user_id int,
    post_id int,
    post_text text,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
This table can answer queries such as:
select * from posts where user_id = 1234;
select * from posts where user_id = 1 and post_id = 53;
select * from posts where user_id = 1 and post_id > 5321 and post_id < 5400;
The reverse clustering on post_id makes retrieving the most recent posts the most efficient, by placing them at the beginning of the partition physically within the SSTable.
In that example, user_id being a partition column means "all CQL rows with this user_id will be hashed to the same partition, and hence the same physical nodes, and eventually the same SSTables". That's why it's possible to:
1) retrieve all posts with that user_id, as they are stored contiguously
2) retrieve a slice of them by doing a range query on post_id
3) retrieve a single post by supplying both the partition column (user_id) and the clustering column (post_id)
In effect, this becomes a hashmap-of-a-hashmap lookup. The one major caveat, though, is that when using partition and clustering columns, you always need to supply all columns from left to right in your query, without skipping any. So in this case, that means you can't retrieve an individual post without knowing the user_id that the post_id belongs to. That is addressable in user code (by storing a reverse mapping and doing the lookup when necessary, or by encoding the user_id into the post_id that is passed around your application), but it is definitely something to take into consideration.
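A minimal sketch of that reverse mapping, with illustrative names:

-- look up the owning user_id for a bare post_id
CREATE TABLE user_by_post (
    post_id int PRIMARY KEY,
    user_id int
);

SELECT user_id FROM user_by_post WHERE post_id = 53;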
