What is better in Cassandra: many tables of the same structure or one table with many rows?

Suppose that I have 1000 entities with exactly the same structure. For example all entities have three fields:
String id;
String name;
int amount;
I also expect that there will be a huge number of records for every type of entity in the system.
So I have two variants right now:
For each entity, create a separate table which looks like:
CREATE TABLE <SOME_ENTITY_NAME> (
id text PRIMARY KEY,
name text,
amount int
)
Or I'll create only one table, but with a composite primary key:
CREATE TABLE ALL_ENTITIES_TABLE (
entity_name text,
id text,
name text,
amount int,
PRIMARY KEY ((entity_name, id))
);
Of course, maintaining only one table is simpler, but what about performance?
So, the question is: which variant is better in terms of performance, taking into account that each type of entity will have millions (maybe billions) of records?

There is a limitation on the number of tables that can be created in a Cassandra cluster. The usual recommendation is to keep this number lower than 200, with ~500 as a "hard stop"...
The reason for this is that every table requires the allocation of additional memory and other resources to keep auxiliary data, like key/row caches, bloom filters, etc. Depending on the Cassandra version, every table may require 1-2 MB of memory.
So in your case, the 2nd design is better because you keep all data in a single table, and your composite partition key will spread the data evenly between the nodes of the cluster.
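For illustration, a lookup in the single-table design stays a single-partition read; the entity name and id here are made-up values:
SELECT name, amount
FROM ALL_ENTITIES_TABLE
WHERE entity_name = 'invoice' AND id = '42';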

In my opinion the first approach is incorrect in terms of maintainability. That many dynamically created tables would be tough to maintain. Also, if you choose your partitioning/clustering order properly (as per the needs of data retrieval), it should be easy and efficient to query. And if you are using a 3.x version of Cassandra, secondary indexes can come in handy.
NOTE: Secondary indexes don't allow sorting.
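For instance, on the single-table design from the question, that could look like this (the index name is made up):
CREATE INDEX entities_by_name ON ALL_ENTITIES_TABLE (name);
-- equality lookups on the indexed column then work,
-- but the results come back unsorted:
SELECT * FROM ALL_ENTITIES_TABLE WHERE name = 'Alice';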

Cassandra was designed around the fact that disk space is the cheapest resource among all. You must build your data model around the queries that you will be using the most regardless of whether this model would consume more disk space or not - as long as it serves the purpose of your queries in the most efficient way. I wouldn't be able to answer your question without taking a look at the queries you will be using. In general, you must feel free to create as many tables as needed as long as it serves the purpose of your queries. I would recommend having a look here.


Regarding Cassandra's (sloppy, still confusing) documentation on keys, partitions

I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is a (int: clientId, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find-by-PK, it fails to define what it means to "query between the fields" or what the difference is between a built-in index, a secondary index, and the primary_key+clustering subphrases in a CREATE TABLE command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC);
insert into record (id,version,clientid,payload) values
(d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values
(d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
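One way to check this in cqlsh is the token() function:
SELECT clientid, token(clientid) FROM record;
-- the two rows return different token values,
-- i.e. they live in different partitions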
Turning to the big point. If one were looking for a single row given the clientId and UUID ---AND--- Cassandra allowed you to skip specifying the clientId, so it wouldn't know which node(s) to search, then sure, that find could be slow. But it doesn't:
select * from record where id=
d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: "... despite the performance unpredictability, use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude that Cassandra handles high-cardinality table searches that return "very few results" just fine?
Anything that requires reading the entire contents of the database won't work, which is the case with scanning on id, since any of your clientid partitions may contain one. Walking through potentially thousands of sstables per host, and walking through each partition of each of those to check, will not work. If you are having a hard time with the data model and not totally getting the difference between partition keys and clustering keys, I would recommend you walk through some introductory classes (i.e. DataStax Academy), YouTube videos, or a book, etc. before designing your schema. This is not a relational database, and designing around your data instead of your queries will get you into trouble. When moving from Oracle you should not just copy your tables over and move the data, or it will not work as well.
The clustering key defines the order in which the data within a partition is stored on disk, which is what the documentation is referring to as the "built-in index". Each sstable has an index component that contains the partition key locations for that sstable. This also includes an index of the clustering keys, sampled every 64kb (by default at least) within each partition, that can be searched on. The clustering keys that exist between each of these indexed points are unknown, so they all have to be checked. A long time ago there was a bloom filter of clustering keys kept as well, but it was such a rare use case where it helped vs the overhead that it was removed in 2.0.
Secondary indexes are difficult to scale well, which is where the warning about cardinality comes from. I would strongly recommend just denormalizing the data and not using an index in any form, as large scatter-gather queries across a distributed system are going to have availability and performance issues. If you really need it, check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data right (not worth it in my opinion).
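For example, a denormalized lookup table for the record schema above might look like this (the table name is made up; your application writes to both tables on every insert):
CREATE TABLE record_by_id (
id uuid,
clientid int,
version int,
payload text,
PRIMARY KEY ((id), clientid, version)
);
-- lookup by id alone is now a single-partition read:
SELECT * FROM record_by_id WHERE id = d5ca94dd-1002-4c51-9854-554256a5b9e5;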

Secondary index on a low-cardinality clustering column

Using Cassandra as db:
Say we have this schema
primary_key((id1),id2,type) with index on type, because we want to query by id1 and id2.
Is a query like
SELECT * FROM my_table WHERE id1=xxx AND type='some type'
going to perform well?
I wonder if we have to create and manage another table for this situation?
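Written out as CQL, the schema would be something like this (the column types are illustrative guesses):
CREATE TABLE my_table (
id1 int,
id2 int,
type text,
PRIMARY KEY ((id1), id2, type)
);
CREATE INDEX my_table_type_idx ON my_table (type);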
The way you are planning to use the secondary index is ideal (which is rare). Here is why:
- You specify the partition key (id1) in your query. This ensures that only the relevant partition (node) will be queried, instead of hitting all the nodes in the cluster (which is not scalable).
- You are (presumably) indexing an attribute of low cardinality (I can imagine you have maybe a few hundred types?), which is the sweet spot when using secondary indexes.
Overall, your data model should perform well and scale. Yet, if you are looking for optimal performance, I would suggest you use an additional table keyed as ((id1), type, id2), sketched below.
Final note: if you have a limited number of types, you might consider using solely ((id1), type, id2) as a single table. When querying by id1-id2, just issue a few parallel queries against the possible values of type.
The final decision needs to take into account your target latency, the disk usage (duplicating a table with a different primary key is sometimes too expensive), and the frequency of each of your queries.
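For reference, that additional table could be declared as follows (non-key columns and types assumed):
CREATE TABLE my_table_by_type (
id1 int,
type text,
id2 int,
PRIMARY KEY ((id1), type, id2)
);
-- id1 + type is now a plain primary-key query, no index needed:
SELECT * FROM my_table_by_type WHERE id1 = 1 AND type = 'some type';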

How to maintain data consistency across multiple tables in cassandra?

I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.
For example, suppose I have a many-to-many relationship between actors and fans. A fan can support many actors, and an actor has many fans. I made several tables to support my queries:
CREATE TABLE fans (
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int,
PRIMARY KEY ((fan_id))
);
CREATE TABLE actors (
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int,
PRIMARY KEY ((actor_id))
);
CREATE TABLE actors_by_fan (
fan_id uuid,
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int,
PRIMARY KEY (fan_id, actor_id)
);
CREATE TABLE fans_by_actor (
actor_id uuid,
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int,
PRIMARY KEY (actor_id, fan_id)
);
Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1 to a different value.
On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.
However, I cannot change my fan_attr_1 in fans_by_actor without first querying for the actor_ids tied to the fan.
This problem occurs any time you want to update any attribute of either fans or actors.
I've tried looking online for people experiencing similar problems, but I couldn't find anything. For example, in DataStax's Data Modeling course they use examples with actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discussed modeling tables after queries, but didn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video ids beforehand, but I read elsewhere that reads before writes are an antipattern in Cassandra.
What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?
EDIT:
- Fixed sentence stubs
- Added context and prior research
Data Modeling
Cassandra is not a relational database, and there are certain basic rules that need to be followed in data modeling. At a high level, the following goals need to be met by our data model:
1) Spread data evenly around the cluster
2) Minimize the number of partitions read
Moreover, we should go for a single big table rather than breaking it into multiple tables and adding relationships between the tables. With this approach, duplication of records will occur. Duplication of records is not a costly operation, since it takes only a little more disk space rather than CPU, memory, disk IOPS, or network.
Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64 KB. The maximum column value size is 2 GB, but because there is no streaming and the whole value is fetched into heap memory when requested, limit the size to only a few MB.
More Info:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
CQL
Maintaining consistency across tables can be done using Batch or Materialized Views. Materialized views are available from version 3.0.
Please see
How to ensure data consistency in Cassandra on different tables?
My preference would be to change the data model, design it according to our queries, and if possible make it a single big table.
Hope it Helps!
Materialized Views are probably the best choice:
-- Note: every column in a view's PRIMARY KEY must exist in the base
-- table and be restricted with IS NOT NULL, so this assumes base
-- tables that carry both ids (unlike the fans/actors tables above).
CREATE MATERIALIZED VIEW actors_by_fan AS
SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
FROM fans
WHERE fan_id IS NOT NULL AND actor_id IS NOT NULL
PRIMARY KEY (fan_id, actor_id);
CREATE MATERIALIZED VIEW fans_by_actor AS
SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
FROM actors
WHERE actor_id IS NOT NULL AND fan_id IS NOT NULL
PRIMARY KEY (actor_id, fan_id);
In versions prior to 3.0, create secondary indices and evaluate if their performance is acceptable. Later, after upgrading to 3.x, just drop the secondary indexes and create materialized views.
The way you solve these kinds of problems is to manually update all the changed records.
Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:
Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
Select all the actor_ids from the actors_by_fan table by issuing SELECT actor_id ... WHERE fan_id = xxx.
Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...), or alternatively loop over the actor_ids and run each update async.
As long as you have a small number of actor_ids in step 2, say fewer than 20, you can group all the queries and maintain strong consistency between tables by running them in a single BATCH (sketched below). Otherwise you need to guarantee consistency between the tables in some other way.
This can be as inefficient as it sounds, but I don't think there are smarter solutions. By the way, you are issuing one read (step 2) and multiple writes (steps 1 and 3). This won't be the end of the world, especially if you don't change attributes very often (e.g. every 10 milliseconds).
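For illustration, here is what steps 1 and 3 could look like grouped into a single logged batch (the ? are driver bind markers; add one fans_by_actor update per actor_id returned by step 2):
BEGIN BATCH
UPDATE fans SET fan_attr_1 = ? WHERE fan_id = ?;
-- repeated for every actor_id from step 2:
UPDATE fans_by_actor SET fan_attr_1 = ? WHERE actor_id = ? AND fan_id = ?;
APPLY BATCH;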

Real time complex queries on Cassandra

We're looking for a tool (preferably open source) which helps us perform complex queries (advanced filtering and joins, no need for full SQL) in real time.
Assume that all the data needed fits in memory, and we want to avoid, if possible, the overhead of map reduce tools.
To be more specific, we need to load n partitions of a single table, and join them by clustering column.
Variables Table:
Variable ID: Partition key
Person ID: Clustering key
Variable Value
Desired output columns:
Person ID, Variable 1 Value, Variable 2 Value, ..., Variable N Value
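In CQL, the Variables table would be something like this (types assumed):
CREATE TABLE variables (
variable_id int,
person_id int,
value text,
PRIMARY KEY ((variable_id), person_id)
);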
We can achieve it by an in-memory load-filter-join process, but we were wondering if there's any tool out there with this use case covered out of the box and with a fair performance.
We've tested Spark, but the partitioning of the Spark C* connector is based on the partition key, so each Variable ID would be loaded on a different Spark node, and the join process would be really slow (all the data would travel all over the Spark cluster).
Any tips? known tools?
I believe that you have a number of options to perform this task:
Rethink your database schema and denormalize it. var_id:person_id:value rows are not the best table schema if you want to query by person_id (and it smells really bad, like the entity-attribute-value db antipattern):
EAV gives a flexibility to the developer to define the schema as needed and this is good in some circumstances. On the other hand it performs very poorly in the case of an ill-defined query and can support other bad practices. In other words, EAV gives you enough rope to hang yourself and in this industry, things should be designed to the lowest level of complexity because the guy replacing you on the project will likely be an idiot.
You can use a schema with multiple columns (Cassandra can handle a lot of them):
create table person_data (
person_id int primary key,
var1 text,
var2 text,
var3 text,
var4 text,
....
);
If you don't have a predefined set of variables, you can use CQL3 collections like map for storing the data in a more flexible way.
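A sketch of that collection-based variant (table and column names made up):
CREATE TABLE person_data_flex (
person_id int PRIMARY KEY,
variables map<text, text>
);
-- set or overwrite a single variable without a read-before-write:
UPDATE person_data_flex SET variables['var1'] = 'some value' WHERE person_id = 1;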
Create a secondary index on person_id (even though it's a clustering key already). You can then query all data for a specific user without using joins, but with some issues:
As your query will hit multiple partitions, it will require not a single disk seek but a series of them, so your query latency may be higher than you expect.
Secondary indexes are not free: C* must perform more work under the hood if you insert a row into a table with indexed columns.
Use an external index like ElasticSearch/Solr if you plan to have a lot of complex queries which do not fit well into CQL3.

Is a read with one secondary index faster than a read with multiple in cassandra?

I have this structure where I want a user to see other users' feeds.
One way of doing it is to fan out an action to all interested parties' feeds.
That would result in a query like select from feeds where userid=
Otherwise I could avoid writing so much data, and since I am already doing a read I could do:
select from feeds where userid IN (list of friends).
Is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big, writing code to test a single node is not worth it, so I ask for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId (
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key made up of userid as the partitioning key, you will then be able to run your SELECT/WHERE/IN query mentioned above and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. If that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example also assumes that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.
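For example, the IN query would then look like this (placeholder version-4 UUIDs):
SELECT * FROM feedsByUserId
WHERE userid IN (5f2ab7a6-2c1e-4d05-9e93-6a1c9f0b3c77, 0d6a2c4e-7b3f-4a18-8c2d-e5b90f1a6d34);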
