How to determine the Cassandra partition for a given PK on the client?

I'm trying Cassandra as a replacement for MySQL on a large dataset (2.5 TB, 5 billion rows) that I can no longer scale on a single server.
I insert or update a few million rows every hour. Currently I'm inserting and querying rows one by one in Cassandra because I don't know which partition holds the data, and grouping them seems to be slower. But one by one, I can't match the speed of a single MySQL server even with 3 Cassandra nodes.
In MySQL I can batch because I know everything is stored on the same server. Is it possible to use the value of the primary key to determine the partition on the client side, so I can group the queries more effectively with BATCH or SELECT..IN?
I mean, given a group of PKs like 1, 2, 3, 4, 5, 6 ... and N servers, I'd like to know that, say, rows 1, 3, 5 are in the same partition, so I can group them in my queries. Is this possible with Cassandra?

If your queries have a WHERE clause on the partition key, the driver usually takes care of routing each request to the replicas that hold the data: it calculates the token for the given partition key and finds the replica(s) for it. This holds as long as you haven't changed the load balancing policy; by default all drivers use the so-called TokenAware policy.
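To make the token idea concrete, here is a minimal sketch of how keys could be grouped by owning node on the client. It is a toy model, not driver code: `toy_token` uses MD5 as a stand-in for the Murmur3 hash of the default partitioner, the three node tokens are made up, and `ToyRing` and `group_by_owner` are hypothetical names. In a real application the driver's cluster metadata would supply the ring.

```python
import bisect
import hashlib
from collections import defaultdict


def toy_token(partition_key):
    """Hash a partition key to a signed 64-bit token (MD5 stand-in, NOT Murmur3)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class ToyRing:
    """Toy token ring: each node owns the range ending at its own token."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; sorted so the ring can be binary-searched
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key):
        # First node whose token is >= the key's token; wrap around past the end.
        idx = bisect.bisect_left(self._tokens, toy_token(partition_key))
        return self._ring[idx % len(self._ring)][1]


def group_by_owner(ring, keys):
    """Return {node_name: [keys owned by that node]}."""
    groups = defaultdict(list)
    for key in keys:
        groups[ring.owner(key)].append(key)
    return dict(groups)


# Made-up tokens for a 3-node cluster (assumption for illustration only).
ring = ToyRing({"node-a": -(2**62), "node-b": 0, "node-c": 2**62})
groups = group_by_owner(ring, range(1, 7))
print(groups)
```

Note that consecutive keys such as 1, 2, 3 generally land on unrelated nodes, because the hash scatters them around the ring.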
If you need to fetch multiple entries, running N queries in parallel via the async API and merging the results on the client side will be more efficient than a single query with IN.
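The fan-out-and-merge pattern can be sketched without a cluster. In the snippet below, `fetch_row` is a hypothetical stand-in for one single-partition SELECT (here it just reads from an in-memory dict), and a thread pool plays the role of the driver's async API. It only illustrates the shape of the client-side code, under those assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the table (assumption for illustration only).
FAKE_TABLE = {1: "alice", 2: "bob", 3: "carol", 5: "eve"}


def fetch_row(pk):
    """Hypothetical single-partition lookup, i.e. SELECT ... WHERE pk = ?"""
    return FAKE_TABLE.get(pk)


def fetch_many(pks, max_in_flight=32):
    """Issue one lookup per key in parallel and merge the results client-side."""
    pks = list(pks)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map preserves input order, so results line up with pks
        results = list(pool.map(fetch_row, pks))
    return {pk: row for pk, row in zip(pks, results) if row is not None}


rows = fetch_many([1, 2, 3, 4, 5])
print(rows)
```

Capping the number of in-flight requests (`max_in_flight` here) keeps the client from flooding the cluster when the key list is large.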
P.S. In Cassandra, BATCH has slightly different semantics than in relational databases. Please check this documentation for recommended patterns.

Related

Client data isolation: can Cassandra store data in different partitions in separate file sets?

Suppose I have a Cassandra table with an integer partition key.
Question: is it possible to arrange for Cassandra to store the table data and indexes for that table in separate sets of files by partition value? Alternative approaches, like per-partition keyspaces or duplicated tables such as Account1 (for partition key 1) and Account2 (for partition key 2), are deemed to undercut Cassandra performance.
The desired outcome is to reduce the possibility that a query for sensitive client data in partition 1 also returns data from other partitions. If the data is kept separate (and searched separately) this risk is reduced, though obviously not eliminated. Essentially it shifts some of the responsibility for using the right partition key at the right time from the application code onto Cassandra.
This isn't possible in Cassandra itself unless you separate the data into different tables or keyspaces, but as you mentioned, that will lead to bad performance.
DataStax Enterprise (DSE) has functionality called Row Level Access Control that allows you to set permissions based on the value of the partition key (or part of the partition key).
If you need to stick to plain Cassandra, then you need to do it on the application level.

Cassandra Query Performance: Using IN clause for one portion of the composite partition key

I currently have a table set up in Cassandra that has either text, decimal or date type columns with a composite partition key of a business_date and an account_number. For queries to this table, I need to be able to support look-ups for a single account, or for a list of accounts, for a given date.
Example:
select x,y,z from my_table where business_date = '2019-04-10' and account_number IN ('AAA', 'BBB', 'CCC')
//Note: Both partition keys are provided for this query
I've been struggling to resolve performance issues related to accessing this data because I'm noticing latency patterns that I am having trouble trying to understand / explain.
In many scenarios, the same exact query can be run a total of three times in a short period by the client application. For these scenarios, I see that two out of three requests will have really bad response times (800 ms), and one of them will have a really fast one (50 ms). At first I thought this would be due to key or row caches, however, I'm not so sure since I believe that if this were true, the third request out of the three should always be the fastest, which isn't the case.
The second issue I believed I was facing was the actual data model itself. Although the queries are being submitted with all the partition keys being provided, since it's an IN clause, the results would be separate partitions and can be distributed across the cluster and so, this would be a bad access pattern. However, I see these latency problems when even single account queries are run. Additionally, I see queries that come with 15 - 20 accounts performing really well (under 50ms), so I'm not sure if the data model is actually an issue.
Cluster setup:
Datacenters: 2
Number of nodes per data center: 3
Keyspace Replication:local_dc = 2, remote_dc = 2
Java Driver set:
Load-balancing: DCAware with LatencyAware
Protocol: v3
Queries are still set up to use "IN" clauses instead of async individual queries
Read_consistency: LOCAL_ONE
Does anyone have any ideas / clues of what I should be focusing on in terms of really identifying the root cause of this issue?
Using IN on the partition key is generally a bad idea, even for composite partition keys. The value of the partition key defines the location of your data in the cluster, and different partition key values will most probably put the data on different servers. In that case, the coordinating node (the one that received the query) has to contact the nodes that hold the data, wait for them to deliver their results, and only then send the results back to you.
If you need to query several partition keys, it will be faster to issue individual queries asynchronously and collect the results on the client side.
Also, please note that the TokenAware policy works best when you use a PreparedStatement: in this case, the driver is able to extract the value of the partition key and find which server holds the data for it.

Replication without partitioning in Cassandra

In Mongo we can go for any of the models below:
1. Simple replication (without sharding, where one node works as the master and the others as slaves), or
2. Sharding (where data is distributed across different shards based on the partition key)
3. Both 1 and 2
My question: can't we have Cassandra with just replication and no partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing the data between nodes. But the resulting distribution depends on the number of nodes you have. For example, if you have a 3-node cluster and a replication factor of 3, then all nodes will get all the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key is also called the row key, and it is part of (in the simplest case, the whole of) the primary key, so you can see: you have one anyway. Case 1 of your Mongo example is not doable in Cassandra, mainly because Cassandra has no concept of masters and slaves. If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and performs the operation. To my knowledge, the default here would be round-robin load balancing between the two nodes, so each of them will receive roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little more tricky. The nice part, though, is that you can determine the set of nodes that hold your data in the client code, so you don't lose any performance by connecting to a "wrong" node.
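The node counts above can be checked with a small sketch. It assumes a simplified placement rule in the spirit of SimpleStrategy (take the owner of the token, then the next nodes clockwise on the ring), uses MD5 as a stand-in for the real partitioner hash, and invents the node tokens. `replicas` and `toy_token` are hypothetical helper names, not driver functions.

```python
import bisect
import hashlib


def toy_token(partition_key):
    """Hash a partition key to a signed 64-bit token (MD5 stand-in, NOT Murmur3)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def replicas(node_tokens, partition_key, replication_factor):
    """Owner of the key's token plus the next nodes clockwise (simplified rule)."""
    ring = sorted((token, node) for node, token in node_tokens.items())
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, toy_token(partition_key))
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Made-up tokens for a 3-node cluster (assumption for illustration only).
NODES = {"node-a": -(2**62), "node-b": 0, "node-c": 2**62}

# RF = 3 on 3 nodes: every node holds every partition.
print(replicas(NODES, "customer-42", 3))

# RF = 2 on 3 nodes: each partition lives on 2 of the 3 nodes.
print(replicas(NODES, "customer-42", 2))
```

With RF equal to the node count, the partition key still decides the order of the replicas, but not which nodes store the data.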
One more thing about partitions: you can configure some of this, but that would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to the default mechanism of Cassandra.
And one word about the secondary index thing: use materialized views.

Select All Performance in Cassandra

I'm currently using DB2 and planning to move to Cassandra because, as far as I know, Cassandra has better read performance than an RDBMS.
Maybe this is a stupid question, but I ran an experiment comparing read performance between DB2 and Cassandra.
I tested with 5 million records and the same table schema.
With the query SELECT * FROM customer, DB2 takes 25-30s and Cassandra takes 40-50s.
But with a WHERE condition, SELECT * FROM customer WHERE cusId IN (100,200,300,400,500), DB2 takes 2-3s and Cassandra takes 3-5ms.
Why is Cassandra faster than DB2 with a WHERE condition? And does this mean I can't judge which database is better using SELECT * FROM customer?
FYI.
Cassandra: RF=3 and CL=1 with 3 nodes, each node running on its own computer (VM-Ubuntu)
DB2: runs on Windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed database. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (i.e., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (a.k.a. multi-key queries) incur extra network time because a coordinator node is required. One node acts as the coordinator, queries all the other nodes, collates the data, and returns the result set. Specifying a WHERE clause (with at least a partition key) while using a "Token Aware" load balancing policy performs well for two reasons:
A separate coordinator node is not required.
The node primarily responsible for the range is queried, returning the result set in a single network hop.
tl;dr;
Querying Cassandra with an unbound query causes it to incur a lot of extra processing and network time that it normally wouldn't have to, had the query been specified with a WHERE clause.
Even for a troublesome query like a no-condition range query, 40-50s is pretty extreme for C*. Is the coordinator hitting GCs during the coordination? Can you include the code used for your test?
When you run a select * over millions of records, the driver won't fetch them all at once; it grabs fetchSize rows at a time. If you're just iterating through the result, the iterator will block even if you used executeAsync initially. This means that after every page (fetchSize rows) it issues a new query that you block on. The serialized nature of this takes time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use this to kick off the next page fetch while processing the current one, which would help.
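The "fetch the next page while processing the current one" idea looks roughly like this. The sketch is not driver code: `fetch_page` is a hypothetical blocking call over an in-memory list, and it uses a plain offset where a real driver would use an opaque paging state. Only the overlap of fetching and processing is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for a large result set (assumption for illustration only).
ROWS = list(range(25))
PAGE_SIZE = 10


def fetch_page(offset):
    """Hypothetical blocking page fetch: returns (rows, next_offset or None)."""
    page = ROWS[offset:offset + PAGE_SIZE]
    has_more = offset + PAGE_SIZE < len(ROWS)
    return page, (offset + PAGE_SIZE if has_more else None)


def iterate_with_prefetch():
    """Yield rows while the next page is already being fetched in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_page, 0)
        while future is not None:
            page, next_offset = future.result()
            # Request the next page before handing the current rows to the caller.
            if next_offset is not None:
                future = pool.submit(fetch_page, next_offset)
            else:
                future = None
            yield from page


processed = [row * 2 for row in iterate_with_prefetch()]
print(len(processed))
```

The caller still sees a simple iterator; the only change is that the wait for page N+1 overlaps with the work on page N.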
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges one at a time (parallelism is possible here, but its heuristic is not perfect) until it has read enough. If it has to walk too many nodes to respond, it will be slow. This is why a select * on an empty table can be very slow: it may serially walk every replica set. With 256 vnodes this can be very bad.

Cassandra partition keys organisation

I am trying to store the following structure in cassandra.
ShopID, UserID , FirstName , LastName etc....
Most of the queries on it are
select * from table where ShopID = ? and UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to the documentation, Cassandra's default partition key is the first column of the primary key, which in my case is ShopID. But I want to distribute the data uniformly across the Cassandra cluster; I can't have all the data for one ShopID stored in a single partition, because some shops have 10M records and some only 1k.
I can set (ShopID, UserID) as the partition key and get a uniform distribution of records across the cluster. But then I can't retrieve all the users that belong to a given ShopID:
select *
from table
where ShopID = ?
It's obvious that this query demands a full scan of the whole cluster, which I have no way to do. This looks like a very hard constraint.
My question is how to reorganize the data to solve both problems (uniform data partitioning and the ability to run full-scan queries) at the same time.
In general, you need to make the user ID a clustering column and add some artificial information to your table and partition key when saving. This lets you break a large natural partition into multiple synthetic ones. But now you need to query all the synthetic partitions when reading in order to reassemble the natural partition. So the goal is to find a reasonable trade-off between the number (and size) of synthetic partitions and the read queries needed to combine them.
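One common way to produce that artificial information is a bucket number derived from the user ID. The sketch below models it with an in-memory dict keyed by (shop_id, bucket), standing in for a table whose partition key is (ShopID, bucket) with UserID as the clustering column. The bucket count of 4, the CRC32 hash and all function names are assumptions chosen for illustration.

```python
import zlib

NUM_BUCKETS = 4  # assumed value; trades partition size against reads per shop

# {(shop_id, bucket): {user_id: row}} stands in for the bucketed table
table = {}


def bucket_for(user_id):
    # crc32 is stable across processes, unlike Python's built-in hash() for strings
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_BUCKETS


def save_user(shop_id, user_id, row):
    """Write to the synthetic partition (shop_id, bucket)."""
    table.setdefault((shop_id, bucket_for(user_id)), {})[user_id] = row


def get_user(shop_id, user_id):
    """Point lookup: the bucket is recomputed from the user ID, so one partition is read."""
    return table.get((shop_id, bucket_for(user_id)), {}).get(user_id)


def users_of_shop(shop_id):
    """Reassemble the natural partition by reading every synthetic partition."""
    merged = {}
    for bucket in range(NUM_BUCKETS):
        merged.update(table.get((shop_id, bucket), {}))
    return merged


for uid in range(100):
    save_user("big-shop", uid, {"first_name": f"user{uid}"})
save_user("small-shop", 7, {"first_name": "solo"})

print(len(users_of_shop("big-shop")), get_user("big-shop", 42))
```

The lookup by (ShopID, UserID) stays a single-partition read, while "all users of a shop" becomes NUM_BUCKETS reads that can be issued in parallel.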
A comprehensive description of possible implementations can be found here and here (Example 2: User Groups).
Also take a look at the solution in Example 3: User Groups by Join Date, where querying/ordering/grouping is performed on a clustering column of date type. It can be useful if you also have similar queries.
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from a row's partition key using hashing and sends the record to the node whose token range includes this token. Different records can have the same token, and they are grouped into partitions. For simplicity we can assume that each Cassandra node stores the same number of partitions. We also want partitions to be equal in size so that data is distributed uniformly between nodes. If one partition is too large, one of our nodes needs more resources to process it. But if we break it into multiple smaller ones, we increase the chance that they will be evenly distributed between all nodes.
However, the distribution of token ranges between nodes isn't related to the distribution of records between partitions. When we add a new node, it just takes over an even share of the token ranges from the other nodes and, as a result, an even share of the partitions. If we had 2 nodes with 3 GB of data each, after adding a third node each node stores 2 GB. That's why scalability isn't affected by partitioning, and you don't need to change your historical data after adding a new node.
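The effect of adding a node can be simulated with a toy ring. This sketch assumes 256 virtual nodes per server, derives the vnode tokens from made-up names, and again uses MD5 instead of the real partitioner hash, so the exact shares it prints are illustrative only. The point is that the share per node drops from roughly one half to roughly one third without touching the keys themselves.

```python
import bisect
import hashlib
from collections import Counter


def toy_token(value):
    """Hash a value to a signed 64-bit token (MD5 stand-in, NOT Murmur3)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def ownership(nodes, keys, vnodes=256):
    """Fraction of keys owned by each node on a ring with `vnodes` tokens per node."""
    ring = sorted(
        (toy_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
    )
    tokens = [token for token, _ in ring]
    counts = Counter()
    for key in keys:
        idx = bisect.bisect_left(tokens, toy_token(key)) % len(ring)
        counts[ring[idx][1]] += 1
    return {node: counts[node] / len(keys) for node in nodes}


keys = range(30000)
before = ownership(["n1", "n2"], keys)
after = ownership(["n1", "n2", "n3"], keys)
print(before)
print(after)
```

Because the tokens of n1 and n2 do not change when n3 joins, keys only ever move to the new node; nothing is reshuffled between the existing ones.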

Resources