Cassandra replication - replicate some data but keep some local - cassandra

How do you configure Cassandra so that some tables are NOT replicated at all but others are? Is this actually a good use case for Cassandra?
I have a group of customers (max. 50) that will all supply data on a daily basis (~50,000 records per customer per day, ~200 fields per record). I need to pre-process the data to obfuscate sensitive information locally, then combine the data centrally for analysis and then allow reporting against the combined data set. I am planning on each customer having a local Cassandra node for the raw data load (several flat files), but I don't want this replicated until the obfuscation is complete. Can I do this with different tables spaces and replication factors? The data can be keyed using customer ID as a PK, if that helps.

You could have a keyspace for the customer raw data with a replication factor of 1 and keep the raw data tables in there and then have the obfuscated data tables in a separate keyspace with a replication factor > 1.

Related

Is it hacky to do RF=ALL + CL=TWO for a small frequently used Cassandra table?

I plan to enhance the search for our retail service, which is managed by DataStax. We have data of about 500KB in raw from our wheels and tires and could be compressed and encrypted to about 20KB. This table is frequently used and changes about every day. We send the data to the frontend, which will be processed with Next.js later. Now we want to store this data in a single row table in a separate keyspace with a consistency level of TWO and RF equal to all nodes, replicating the table to all of the nodes.
Now the question: Is this solution hacky or abnormal? Is any solution rather this that fits best in this situation?
The quick answer to your question is yes, it is a hacky solution to do RF=ALL.
The table is very small so there is no benefit to replicating it to all nodes in the cluster. In practice, the tables are so small that the data will be cached anyway.
Since you are running with DataStax Enterprise (DSE), you might as well take advantage of the DSE In-Memory feature which allows you to keep data in RAM to save from disk seeks. Since your table can easily fit in RAM, it is a perfect use case for DSE In-Memory.
To configure the table to run In-Memory, set the table's compaction strategy to MemoryOnlyStrategy:
CREATE TABLE inmemorytable (
...
PRIMARY KEY ( ... )
) WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
To alter the configuration of an existing table:
ALTER TABLE inmemorytable
WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
Note that tables configured with DSE In-Memory are still persisted to disk so you won't lose any data in the event of a power outage or service disruption. In-Memory tables operate the same as regular tables so the same backup and restore processes still apply with the only difference being that a copy of the data is kept in memory for faster read performance.
For details, see DataStax Enterprise In-Memory. Cheers!

Client data isolation: can Cassandra store data in different partitions in separate file sets?

Suppose I have a Cassandra table with an integer partition key.
Question: is it possible to arrange for Cassandra to store the table data and indexes for that table in a sets of files by partition value? Alternative approaches like per partition keyspaces or duplicating tables Account1 (for partition key 1), Account2 (for partition key 2) is deemed to undercut Cassandra performance.
The desired outcome is to reduce the possibility of selecting sensitive client data for partition 1 getting other partitions in the process. If the data is kept separate (and searched separately) this risk is reduced --- obviously not eliminated. Essentially it shifts the responsibility of using the right partition key at the right time somewhat onto Cassandra from the application code.
It's not possible in the Cassandra itself, until you separate data into tables/keyspaces, but as you mentioned - it will lead to bad performance.
DataStax Enterprise (DSE) has functionality called Row Level Access Control that allows you to set permissions based on the value of partition key (or part of partition key).
If you need to stick to plain Cassandra, then you need to do it on the application level.

How to determine cassandra partition for a given PK on the client?

I'm trying Cassandra to replace mysql at a large dataset I have (2.5Tb/5 billion rows) that I can't scale more in a single server.
I insert/update a few million rows every hour. Currently, I'm inserting and querying one by one in cassandra because I don't know which partition has the data, and grouping them seem to be slower. But one by one, I can't match the speed of a single mysql server even with 3 cassandra nodes.
In mysql, I can batch because I know it stores all in the same server. Is it possible, using the value of the primary key, to determine the partition on client side, so I can group the queries more effectively with BATCH or SELECT..IN?
I mean, given a group of PKs like 1, 2, 3, 4, 5, 6 ... and N servers, i'd like to know that say, rows 1 3, 5 are in the same partition, so I can group then in my queries. Is this possible with cassandra?
If you're performing queries with WHERE on partition key, then most of time drivers take care of most effective routing of data to replicas that have this data (only if you didn't change load balancing policy - by default all drivers use so-called TokenAware policy) by calculating token for given partition key, and find replica(s) for it.
If you need to fetch multiple entries, then running N queries in parallel via async API & merging results on client side will be more effective than performing query with IN.
P.S. In Cassandra BATCH has slightly different semantic than in relational databases. Please check this documentation for recommended patterns.

Cassandra partition keys organisation

I am trying to store the following structure in cassandra.
ShopID, UserID , FirstName , LastName etc....
The most of the queries on it are
select * from table where ShopID = ? , UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to docu the default partitioning key by Cassandra is the first column of primary key - for my case it's ShopID, but I want to distribute the data uniformly on Cassandra cluster, I can not allow that all data from one shopID are stored only in one partition, because some of shops have 10M records and some only 1k.
I can setup (ShopID, UserID) as partitioning keys then I can reach the uniform distribution of records in the Cassandra cluster . But after that I can not receive all users that belong to some shopid.
select *
from table
where ShopID = ?
Its obvious that this query demand full scan on the whole cluster but I have no any possibility to do it. And it looks like very hard constraint.
My question is how to reorganize the data to solve both problem (uniform data partitioning, possibility to make full scan queries) in the same time.
In general you need to make user id a clustering column and add some artificial information to your table and partition key during saving. It allows to break a large natural partition to multiple synthetic. But now you need to query all synthetic partitions during reading to combine back natural partition. So the goal is find a reasonable trade-off between number(size) of synthetic partitions and read queries to combine all of them.
Comprehensive description of possible implementations can be found here and here
(Example 2: User Groups).
Also take a look at solution (Example 3: User Groups by Join Date) when querying/ordering/grouping is performed by clustering column of date type. It can be useful if you also have similar queries.
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from row's partition key using hashing and sends the record to node whose token range includes this token. Different records can have the same token and they are grouped in partitions. For simplicity we can assume that each cassandra nodes stores the same number of partitions. And we also want that partitions will be equal in size for uniformly distribution between nodes. If we have a too huge partition that means that one of our nodes needs more resources to process it. But if we break it in multiple smaller we increase the chance that they will be evenly distirbuted between all nodes.
However distribution of token ranges between nodes doesn't related with distribution of records between partitions. When we add a new node it just assumes responsibility for even portion of token ranges from other nodes and as the result the even number of partitions. If we had 2 nodes with 3 GB of data, after adding a third node each node stores 2 GB of data. That's why scalability isn't affected by partitioning and you don't need to change your historical data after adding a new node.

Change replication factor of selected objects

Is there any cloud storage system (i.e Cassandra, Hazelcast, Openstack Swift) where we can change the replication factor of selected objects? For instance lets say, we have found out hotspot objects in the system so we can increase the replication factor as a solution?
Thanks
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
Then if you use for example the Cassandra java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin, and token aware policy. So if you have multiple replicas of the the table holding the objects, then the load of accessing the object could be set to round robin on just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, then this would spread out the read load.
So the granularity of this is not at the object level, but at the table level. If you had all your objects stored in one table, then changing the replication factor would change it for all objects in that table and not just one. You could have multiple keyspaces with different replication factors and keep high demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. The rows are accessed on nodes by the compound partition key, so one field of the partition key could be a "copy_number" value, and when you go to read the object, you randomly set a copy_number value (from 0 to the number of copy rows you have) so that the load of reading the object will likely hit a different node for each read (since rows are hashed across the cluster based on the partition key). This approach would give you more granularity at the object level compared to changing the replication factor for the whole table, at the cost of more programming work to manage randomly reading different rows.
In Infinispan, you can also set number of owners (replicas) on each cache (equivalent to Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka consistent hash table) does not contain all keys but splits the hashCode() 32-bit range to variable amount of segments, and then specifies the distribution only for these segments, there's no way to specify the number of replicas per entry.
Theoretically, with specially forged keys and custom consistent hash table factory, you could achieve something similar even in one cache (certain sorts of keys would be replicated different amount of times), but that would require coding with deep understanding of the system.
Anyway, the reader would have to know the number of replicas in advance as this would be part of the routing information (cache in simple case, special keys as described above), therefore, it's not really practical unless the reader can know that.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation, uses a master slave(s) setup, so all reads will go through the master. But there is a special setting available, so you are also allowed to read from backups. So if you have a 10 node cluster, and have a backup count of 5, there will be in total 6 members that have the information stored. 5 members in the cluster will hit the master, and 5 members in the cluster will hit the backup (since they have the backup locally available).
There also is a fully replicated map available, here every item is send to every machine. So in a 10 node cluster, all reads will be local since every machine has the same data.
In case of the IMap, we don't provide control on the number of backups on the key/value level. So the whole map is configured with a certain backup-count.

Resources