Cassandra materialized view partition key update performance

I am trying to update a column in the base table which is a partition key in the materialized view, and I am trying to understand the performance implications in a production environment.
Base Table:
CREATE TABLE IF NOT EXISTS data.test (
    foreignid uuid,
    id uuid,
    kind text,
    version text,
    createdon timestamp,
    certid text,
    PRIMARY KEY (foreignid, createdon, id)
);
Materialized view:
CREATE MATERIALIZED VIEW IF NOT EXISTS data.test_by_certid AS
    SELECT * FROM data.test
    WHERE id IS NOT NULL AND foreignid IS NOT NULL
        AND createdon IS NOT NULL AND certid IS NOT NULL
    PRIMARY KEY (certid, foreignid, createdon, id);
So, certid is the new partition key in our materialized view.
What takes place:
1. When we first insert into the test table, the certid is usually empty, so we replace it with the string "none" before inserting into the test base table.
2. The row gets inserted into the materialized view as well.
3. When the user provides us with a certid, the row gets updated in the test base table with the new certid.
4. The action gets mirrored, and the row is updated in the materialized view, where the partition key certid changes from "none" to the new value (see the sketch below).
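In CQL, the flow looks roughly like this (a sketch; the literal values and placeholders are illustrative):

INSERT INTO data.test (foreignid, id, kind, version, createdon, certid)
VALUES (uuid(), uuid(), 'kind1', 'v1', toTimestamp(now()), 'none');

-- later, once the user supplies a certid:
UPDATE data.test SET certid = 'real-cert-id'
WHERE foreignid = ? AND createdon = ? AND id = ?;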
Questions:
1. What is the performance implication of updating the partition key certid in the materialized view?
2. For my use case, is it better to create a new table with certid as the partition key (inserting only when certid is non-empty) and manually maintain all CRUD operations on that table, or should I use the MV and let Cassandra do the bookkeeping?
It should be noted that performance is an important criterion, since this will be used in a production environment.
Thanks

Updating a table on which one or more views exist is always more expensive than updating a table with no views, due to the overhead of performing a read-before-write and locking the partition to ensure concurrent updates play well with the read-before-write. You can read more about the internals of materialized views in Cassandra in ScyllaDB's wiki.
If changing the certid is a one-time operation, then the performance impact shouldn't be too much of a worry. Regardless, it is always better to let Cassandra deal with updating the MV, because it will take care of anomalies (such as what happens when the node storing the view is partitioned away and the update cannot propagate) and eventually ensure consistency.
If you are worried about performance, consider replacing Cassandra with Scylla.

Related

Cassandra - Shall I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Is that not expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, do I have to write 5 UPDATE statements? Is there another solution? Or is that not so bad?
Cassandra advocates denormalizing your data and creating your data model according to your queries. You will have to design your data model so that it satisfies all the queries with good performance. For performance (due to its architecture and design), Cassandra asks that you write and read using the partition key.
It is not expensive to write 5 insertions of the same set of data into 5 different tables. Your reads will perform better, and as the data size grows to web scale, you will thank your decision to create 5 tables and write to them.
You can explore materialized views (see the Cassandra and DataStax documentation on materialized views), but remember it is an experimental feature, so you have to understand it properly and also identify the open issues with materialized views.
I would recommend you study the Cassandra data model; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. One should always design tables for the read queries; if one table satisfies one query, it is a fine design.
To answer your second question: design your tables in such a way that you do not have to update them. Always think about inserting new values.
For example, consider the table design below:
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- illustrative values
INSERT INTO user_by_email (email, timestamp, name, fullname, userId)
VALUES ('user@example.com', toTimestamp(now()), 'Jane', 'Jane Doe', 'user-1');
With this design, you get the latest inserted value first. Additionally, this design keeps a change history for that key.
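For example (a sketch), since rows are clustered newest-first, a LIMIT 1 read returns the latest value:

SELECT * FROM user_by_email WHERE email = 'user@example.com' LIMIT 1;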
Think about it: how many times do we have to update values like user id, email, or username? Rarely.

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table in Cassandra:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify full partition key on queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the job in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
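A sketch of the tracking-table approach (the table and its name are my own invention):

CREATE TABLE time_data_ids (
    id int,
    secondary_id int,
    PRIMARY KEY (id, secondary_id)
);

-- 1. find every secondary_id known for this id
SELECT secondary_id FROM time_data_ids WHERE id = 1;

-- 2. query each full partition, ideally async and in parallel
SELECT * FROM time_data WHERE id = 1 AND secondary_id = 42;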

How to maintain data consistency across multiple tables in cassandra?

I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.
For example, suppose I have a many-to-many relationship between actors and fans. A fan can support many actors, and an actor has many fans. I made several tables to support my queries:
CREATE TABLE fans (
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY ((fan_id))
);

CREATE TABLE actors (
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY ((actor_id))
);

CREATE TABLE actors_by_fan (
    fan_id uuid,
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY (fan_id, actor_id)
);

CREATE TABLE fans_by_actor (
    actor_id uuid,
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY (actor_id, fan_id)
);
Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1 to a different value.
On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.
However, I cannot change my fan_attr_1 in fans_by_actor without first querying for the actor_ids tied to the fan.
This problem occurs any time you want to update any attribute of either fans or actors.
I've tried looking online for people experiencing similar problems, but I couldn't find them. For example, in DataStax's Data Modeling course they use examples with actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discusses modeling tables after queries, but doesn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video ids beforehand, but I read elsewhere that reads before writes are an antipattern in Cassandra.
What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?
EDIT:
- Fixed sentence stubs
- Added context and prior research
Data Modeling
Cassandra is not a relational database, and there are certain basic rules that need to be followed in data modeling. At a high level, the following goals need to be met by our data model:
1) Spread data evenly around the cluster
2) Minimize the number of partitions read
Moreover, we should go for a single big table rather than breaking the data into multiple tables and adding relationships between them. In this approach, duplication of records will occur. Duplicating records is not a costly operation, since it takes only a little more disk space rather than CPU, memory, disk IOPS, or network.
Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64 KB. The maximum column value size is 2 GB, but because there is no streaming and the whole value is fetched into heap memory when requested, limit the size to only a few MB.
More Info:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
CQL
Maintaining consistency across tables can be done using batches (sketched below) or materialized views. Materialized views are available from version 3.0.
Please see
How to ensure data consistency in Cassandra on different tables?
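For example, a logged batch can apply the same change to two tables atomically (a sketch; the uuid literals are placeholders):

BEGIN BATCH
    UPDATE fans SET fan_attr_1 = 10
        WHERE fan_id = 11111111-1111-1111-1111-111111111111;
    UPDATE fans_by_actor SET fan_attr_1 = 10
        WHERE actor_id = 22222222-2222-2222-2222-222222222222
          AND fan_id = 11111111-1111-1111-1111-111111111111;
APPLY BATCH;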
My preference would be to change the data model, design it according to our queries, and if possible make it a single big table.
Hope it helps!
Materialized Views are probably the best choice. Note that a view can only select columns that exist in its base table and must filter every view primary key column with IS NOT NULL, so the base tables would need to carry the relationship columns (the definitions below assume they do):
CREATE MATERIALIZED VIEW actors_by_fan AS
    SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
    FROM fans
    WHERE fan_id IS NOT NULL AND actor_id IS NOT NULL
    PRIMARY KEY (fan_id, actor_id);
CREATE MATERIALIZED VIEW fans_by_actor AS
    SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
    FROM actors
    WHERE actor_id IS NOT NULL AND fan_id IS NOT NULL
    PRIMARY KEY (actor_id, fan_id);
In versions prior to 3.0, create secondary indexes and evaluate whether their performance is acceptable. Later, after upgrading to 3.x, just drop the secondary indexes and create the materialized views.
The way you solve these kinds of problems is to manually update all the changed records.
Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:
Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
Select all the actor_ids from actors_by_fan by issuing SELECT actor_id ... WHERE fan_id = xxx.
Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...), or alternatively loop over the actor_ids and run each update async (sketched below).
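A concrete sketch of the sequence (the uuid literals are placeholders):

-- step 1: update the source-of-truth table
UPDATE fans SET fan_attr_1 = 42
    WHERE fan_id = 11111111-1111-1111-1111-111111111111;

-- step 2: find the actors tied to this fan
SELECT actor_id FROM actors_by_fan
    WHERE fan_id = 11111111-1111-1111-1111-111111111111;

-- step 3: update each denormalized row, once per actor_id returned
UPDATE fans_by_actor SET fan_attr_1 = 42
    WHERE actor_id = 22222222-2222-2222-2222-222222222222
      AND fan_id = 11111111-1111-1111-1111-111111111111;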
As long as you have a small number of actor_ids in step 2, say fewer than 20, you can group all the queries and maintain strong consistency between tables by running them in a single BATCH. Otherwise you need to guarantee the consistency between tables in some other way.
This can be as inefficient as it sounds, but I don't think there are smarter solutions. By the way, you are issuing one read (step 2) and multiple writes (steps 1 and 3). This won't be the end of the world, especially if you don't change attributes too often (e.g. every 10 milliseconds).

select older versions of data after update in Cassandra

This is my use-case.
I have inserted a row of data in Cassandra with the following query:
INSERT INTO TableWide1 (UID, TimeStampCol, Value, DateCol) VALUES ('id1','2016-03-24 17:54:36',45,'2015-03-24 00:00:00');
I update one row to have a new value.
update TableWide1 set Value = 46 where uid = 'id1' and datecol='2015-03-24 00:00:00' and timestampcol='2016-03-24 17:54:36';
Now I would like to see all versions of this data in Cassandra. I know this is pretty straightforward in HBase, but is it even possible in Cassandra?
I explored a bit using writetime(), but it just gives the write time of the latest update, and it cannot be used in a WHERE clause either.
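For example, the write time can be selected but not filtered on (a sketch):

SELECT Value, writetime(Value) FROM TableWide1
    WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00'
      AND TimeStampCol = '2016-03-24 17:54:36';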
This is what my schema looks like:
CREATE TABLE TableWide1(
UID varchar,
TimeStampCol timestamp,
Value double,
DateCol timestamp,
PRIMARY KEY ((UID,DateCol), TimeStampCol)
);
So is this technically possible, given the fact the old data still exists in Cassandra?
If your partitions won't get too wide, you could exclude the time partitioning:
CREATE TABLE table_wide (
UID varchar,
TimeStampCol timestamp,
Value double,
PRIMARY KEY ((UID), TimeStampCol)
);
That's generally bad though, since eventually you will hit the limits of a partition.
But really you had it right. You won't be able to make a single statement, but under the covers you can't stream the entire set over anyway, and it will have to page through it. So you can just iterate through the results of each day, one at a time. If your dataset has days with no data and you don't want to waste reads, you can keep an additional table around to mark which days have data:
CREATE TABLE table_wide_partition_list (
    UID varchar,
    DateCol timestamp,
    PRIMARY KEY (UID, DateCol)
);
And make one query to it first.
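Something like this (a sketch, reusing the question's example values):

-- which days have data for this id?
SELECT DateCol FROM table_wide_partition_list WHERE UID = 'id1';

-- then page through each day's partition
SELECT * FROM TableWide1
    WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00';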
Really, if you want HBase-like behavior for scans, you are probably looking for a more OLAP style of workload instead of normal C* usage. For this it is almost universally recommended to use Spark with Cassandra currently.
Cassandra does not retain old data when updated.
The old value is superseded by the newer write and gets removed when compaction happens.
HBase was not made for handling real-time applications and hot data from/for an application server, though things have improved since the old days of HBase.
People use HBase mainly because they already have a Hadoop cluster.
Another noticeable and important difference is that Cassandra is very fast at retrieving single or multiple records by key, but not at range queries over the key, because data is distributed based on the hashed key. HBase, on the other hand, stores data in sorted order and is an ideal candidate for range queries.
Anyway, since Cassandra doesn't retain old data, you cannot retrieve it.

ranking where the score is a function of time

I would like to migrate my database, which is currently on MySQL, to C*. At the moment I have a table that I have trouble imagining how to "migrate".
Entity:
    id
    score (s)
    hotscore
where hotscore = f(s, t) = log10(s) + t/45000, with s the score and t the timestamp since epoch.
Essentially, what I would be querying is the top 20 of that entity. With MySQL and a cron job I'm updating the hotscore every minute. For that reason the hotscore cannot be used as a partition key. I'm trying to see if I can make this happen before moving to C*. As far as I know, a primary key like (id, hotscore) wouldn't be good, because it means C* has to scan every entry.
You'll soon be able to handle this use case with materialized views when Cassandra 3.0 is released.
See an example of ordering rows in a materialized view here and here.
The way it works is that in your base table you don't use the score as a clustering column, but you do use it as a clustering column in the materialized view. Then, when you update the base table, the ordering in the view is automatically updated.
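A sketch under assumed names (every name here is hypothetical; the bucket column gives the view a single partition to order within):

CREATE TABLE entity (
    bucket text,
    id uuid,
    score int,
    hotscore double,
    PRIMARY KEY (bucket, id)
);

CREATE MATERIALIZED VIEW entity_by_hotscore AS
    SELECT * FROM entity
    WHERE bucket IS NOT NULL AND hotscore IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (bucket, hotscore, id)
    WITH CLUSTERING ORDER BY (hotscore DESC);

-- top 20:
SELECT * FROM entity_by_hotscore WHERE bucket = 'all' LIMIT 20;

Note that the single 'all' bucket concentrates the ranking in one partition, which is the usual trade-off for a global leaderboard.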
